Assignment 4 (TDTS11): Analysis of popular sites
By Niklas Carlsson, January 2013
________________________________________________________________________
IMPORTANT: We want to remind you to read the complete instructions before starting this assignment.
Motivated by the concept of problem-based learning, the final assignment involves a number of tasks of varying difficulty. Some tasks may require you to identify and learn tools and techniques that are not explicitly taught in the course, but which, once mastered (after some practice), can significantly simplify and speed up parts of this assignment.
Advice: After reading the instructions, we recommend that you first complete the first task and then make a careful game plan for attacking the remainder of the assignment. As you work through the tasks, keep an eye out for additional tools, knowledge, or information that would let you solve them more efficiently.
________________________________________________________________________
First task [MUST]: Learning the tools
In your own words, please (concisely) explain what the following tools, sites, and/or services do and what information they may provide:
Note that the 'man' pages on many systems can be valuable as an information source.
Recommendation: It is suggested that you each study the above tools on your own before you and your lab partner first start this assignment. When you first meet, you should spend a few hours going through each tool and completing the above task. The time invested in learning these tools is likely to be valuable for this assignment and, more importantly, for the rest of your education and careers.
________________________________________________________________________
Second task [2pt]: Data collection
In this task you have two options: (i) collect your own trace, or (ii) borrow a trace from a classmate who has collected one. If you select option (ii), clearly state in your report whom you borrowed the trace from. If you select option (i), please share the trace with your classmates. (Looking at each other's traces and/or comparing results may provide additional insight into observed differences, as well as help sanity check your results.)
The trace should be collected as follows:
1. Close down all applications that you have running.
2. Empty your browser's local cache and any cookies stored on the local machine.
3. Start Wireshark and begin a new capture.
4. Start your favorite Web browser.
5. For the next 25 minutes, at the beginning of each new minute, go to the main page of the i-th most popular site on the Internet according to alexa.com. In other words, you should visit the 25 most popular sites over the span of 25 minutes. (Do not click on any links; just go to the main page as listed at alexa.com.)
6. Allow a few minutes at the end of the trace period for outstanding transactions to complete.
7. Stop the Wireshark capture.
You should now have a trace in which the front pages of the 25 most popular sites on the Internet (according to alexa.com) have been visited.
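If you prefer not to watch the clock by hand, the per-minute browsing can be scripted. Below is a minimal Python sketch, under the assumptions that the 25 URLs are stored in popularity order in a file called top25.txt, that your default browser is available, and that Wireshark has already been started manually as described above; the file name and the 180-second tail are placeholders for this example, not part of the assignment.

    # Minimal sketch: open the i-th site at the start of minute i while
    # Wireshark is capturing. Assumes "top25.txt" holds one URL per line.
    import time
    import webbrowser

    with open("top25.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    start = time.time()
    for i, url in enumerate(urls):
        # Wait until the beginning of minute i (i = 0, 1, 2, ...).
        wait = start + 60 * i - time.time()
        if wait > 0:
            time.sleep(wait)
        webbrowser.open(url)  # open the front page in the default browser

    # Leave a few minutes at the end for outstanding transfers to finish.
    time.sleep(180)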
Third task [2pt]: High-level data summary
For the top 5 pages, please answer and discuss the following questions. Note that you may need to use additional tools or complementary information sources to answer some of the questions below. (Please also refer to the lecture notes for concepts such as RTT, hop count, etc.)
Pick five pages that have many different HTTP requests and answer the following questions.
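As a starting point for concepts such as RTT and hop count, one way (among several) to obtain rough estimates for a given site is to call ping and traceroute from a script. The sketch below assumes a Linux-like system where those tools are installed; the host name is a placeholder, and command-line flags and output formats differ between operating systems, so adjust it for your platform.

    # Minimal sketch: approximate RTT with ping and hop count with
    # traceroute by parsing the output of the command-line tools.
    import re
    import subprocess

    def rtt_ms(host, count=5):
        """Average RTT in ms, parsed from the ping summary line."""
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        m = re.search(r"= [\d.]+/([\d.]+)/", out)  # min/avg/max summary
        return float(m.group(1)) if m else None

    def hop_count(host, max_hops=30):
        """Highest hop number printed by traceroute."""
        out = subprocess.run(["traceroute", "-m", str(max_hops), host],
                             capture_output=True, text=True).stdout
        hops = [int(line.split()[0]) for line in out.splitlines()[1:]
                if line.strip() and line.split()[0].isdigit()]
        return max(hops) if hops else None

    for site in ["www.example.com"]:  # replace with your five sites
        print(site, rtt_ms(site), hop_count(site))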
Can you automate tasks 4 or 5, such that solving the task for 500 pages takes essentially the same time as doing it for 1 page? For this task you should provide your automated solution (e.g., the script(s) or the source code of the program used for the data analysis and processing) and provide the answers for all of the top-25 pages.
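As an illustration of the kind of automation asked for here, the sketch below uses tshark (the command-line companion to Wireshark) to pull every HTTP request out of the capture and aggregate simple per-host statistics. The file name trace.pcap and the chosen statistics (request count and number of distinct server IPs per Host header) are assumptions made for this example; the exact fields you extract should follow the questions in tasks 4 and 5.

    # Minimal sketch: count HTTP requests and distinct server IPs per
    # Host header in "trace.pcap", using tshark's field output.
    from collections import Counter, defaultdict
    import subprocess

    out = subprocess.run(
        ["tshark", "-r", "trace.pcap", "-Y", "http.request",
         "-T", "fields", "-e", "http.host", "-e", "ip.dst"],
        capture_output=True, text=True).stdout

    requests_per_host = Counter()
    servers_per_host = defaultdict(set)
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) < 2 or not parts[0]:
            continue
        host, ip = parts[0], parts[1]
        requests_per_host[host] += 1
        servers_per_host[host].add(ip)

    for host, n in requests_per_host.most_common():
        print(f"{host}: {n} requests, {len(servers_per_host[host])} server IP(s)")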
Ninth task [4pt]: Building your own Web crawler
In this task you should build your own Web crawler that explores the tree structure of the links (and objects that are used) at a given set of Web sites. Your crawler should take a list of URLs as input and generate, as output, a tree structure of the domains that are linked. For example, a site "www.foo.com" that has links to the domains "bar1.com" and "bar2.com" and uses an image from the domain "img.com" should result in the following relationships: "www.foo.com -> bar1.com", "www.foo.com -> bar2.com", and "www.foo.com -> img.com". You do not have to visualize the tree structure (although this can be a fun and interesting exercise in itself).
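The full crawler is left to you, but the following minimal Python sketch illustrates the intended input/output format: it reads a list of URLs, fetches each page once, and prints one "parent-domain -> linked-domain" edge per href/src attribute found. It only goes one level deep, uses no third-party libraries, and omits error handling; the file name urls.txt is a placeholder, and some sites may refuse requests that do not come from a regular browser.

    # Minimal one-level sketch: print parent-domain -> linked-domain edges
    # for each URL listed (one per line) in the file given on the command line.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen
    import sys

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            # Collect every href/src attribute, whatever the tag.
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.urls.append(value)

    def edges(url):
        """Yield (parent_domain, child_domain) pairs for one page."""
        parent = urlparse(url).netloc
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.urls:
            child = urlparse(urljoin(url, link)).netloc
            if child and child != parent:
                yield parent, child

    if __name__ == "__main__":
        # Usage: python crawler.py urls.txt
        with open(sys.argv[1]) as f:
            for url in (line.strip() for line in f if line.strip()):
                for parent, child in set(edges(url)):
                    print(f"{parent} -> {child}")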
Your answers should clearly explain what you learned and how you solved the questions. (Note that the steps taken to obtain an answer are in many cases more important than the answer itself.) Please explain if you found additional tools and information sources that helped you answer the above questions. Finally, if you could not solve a question, please explain why it was not possible and what information you would need to solve it.
Please structure your report so that your answers are clearly indicated for each question (and section of the assignment); it is not the TA's task to search for the answers. Both the questions themselves and the corresponding answers should be clearly stated in your report. Furthermore, your answers should be explained and supported with additional evidence, when applicable. During the demonstration the TA may ask similar questions to assess your understanding of the lab, and you are expected to clearly explain and motivate your answers. As the assignments are done in groups of two, both members of the group will be asked to answer questions.
Additional instructions and information about the reports can be found here. Please take this chance to read the guidelines carefully.