Assignment 5 (TDTS09): Analysis of popular sites

First task: Learning the tools

In your own words, please (concisely) explain what the following tools, sites, and/or services do and what information they may provide:

  1. traceroute (or potential Windows replacement?)
  2. nslookup (dig, host, or similar - depending on system)
  3. alexa.com
  4. compete.com
  5. arin.net, ripe.net, apnic.net, and lacnic.net
  6. other tools (including Wireshark) that you decide to use to solve this assignment.
Note that the 'man' pages on many systems can be valuable as an information source.

Second task: Data collection

In this task you have two options: (i) collect your own trace, or (ii) find a person in the class who has collected a trace and borrow that trace. If you select option (ii), your report should clearly state who you borrowed the trace from. If you select option (i), please share the trace with your classmates. (Looking at each other's traces and/or comparing results may provide additional insight into any differences observed, as well as help sanity-check your results.)

The trace should be collected as follows:

  1. Close down all applications that you have running.
  2. Try to empty your local cache and any cookies that you may have on the local machine.
  3. Start up Wireshark and begin a new capture.
  4. Start your favorite Web browser.
  5. For the next 25 minutes, at the beginning of the i-th minute, go to the main page of the i-th most popular site on the Internet according to alexa.com. In other words, you should visit the 25 most popular sites over the span of 25 minutes. (You should not click on any links, just go to the main page as defined at alexa.com.)
  6. Allow a few minutes at the end of the trace period for transactions to complete.
  7. Stop capturing with Wireshark.

You should now have a trace in which the "front" page of the 25 most popular sites on the Internet (according to alexa.com) was visited.

Third task: Data analysis

Note that you may need to use additional tools or complementary information sources to answer some of the questions below. (Please also refer to the lecture notes for concepts such as RTT, hop count, etc.)

  1. Please quantify how much additional content (both in terms of bytes and HTTP transactions) was downloaded in addition to the 25 pages that you as a user explicitly requested. Does it differ much between sites?
  2. Can any of the HTTP fields provide you with insight into what content was "dragged along" with the 25 visited sites? For simplicity, let us define "dragged along" as served by a different domain. (You may also want to quantify how many HTTP requests in total were caused by each request you made.)
  3. In addition to the 25 sites and the dragged along data, did you capture any other traffic to/from your computer? Please try to explain and categorize this traffic.
  4. What is the average response time of an HTTP request?
  5. What is the average response time of a DNS query?
  6. What is the average RTT for each type of content?
  7. Please comment on differences in RTT for different sites/content.
  8. Can you specify what type of content was "dragged along"?
  9. What is the average path length (e.g., hop count) of any of the drag-along content?
  10. What is the average path length (e.g., hop count) of any of the data initially requested?
  11. What are some of the most common "network owners" of the drag-along content and where are these domains located?
  12. Which of the top-25 sites have gained the most relative popularity over the last year?
  13. Based on this small sample, would you be able to give an estimate of the diameter of the internet? If so, what would it be and how did you obtain this estimate? If not, please explain your reasoning.
Your answers should clearly explain what you learned and how you solved the questions. (Note that the steps taken to obtain an answer are in many cases more important than the answer itself.) Please explain if you found additional tools and information sources which helped you answer the above questions. Finally, if you could not solve a question, please explain why it was not possible and what information you would need to solve it.
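
As an illustration of how questions 1 and 2 might be approached with a script, the Python sketch below counts requests that were "dragged along" by each visited site. It assumes that the Host and Referer header of every HTTP request have already been exported from the trace (for example, from the Wireshark fields http.host and http.referer) into (host, referer) pairs. The function names and the sample data are hypothetical, and the "same domain" test is a naive approximation that compares only the last two labels of each host name (it would misjudge names such as example.co.uk).

```python
from collections import Counter
from urllib.parse import urlparse


def base_domain(hostname):
    """Naive guess at the registered domain: the last two labels of the name."""
    name = hostname.split(":")[0].lower().rstrip(".")  # drop any port and trailing dot
    return ".".join(name.split(".")[-2:])


def count_dragged_along(requests):
    """Count requests whose Host domain differs from the Referer's domain.

    requests: iterable of (host, referer) pairs, one per HTTP request.
    Returns a Counter mapping each referring domain to the number of
    requests it caused that were served by a different domain.
    """
    dragged = Counter()
    for host, referer in requests:
        if not referer:
            continue  # no Referer: probably a page the user requested directly
        ref_host = urlparse(referer).hostname
        if ref_host is None:
            continue  # Referer value could not be parsed as a URL
        if base_domain(host) != base_domain(ref_host):
            dragged[base_domain(ref_host)] += 1
    return dragged


# Hypothetical sample: one explicit page request and three follow-up requests.
sample = [
    ("www.example.com", ""),
    ("static.example.com", "http://www.example.com/"),
    ("ads.example.net", "http://www.example.com/"),
    ("cdn.example.org", "http://www.example.com/"),
]
print(dict(count_dragged_along(sample)))  # prints {'example.com': 2}
```

Note that the Referer header only captures requests that the browser chose to label. Requests without a Referer are skipped here, so the counts should be read as a lower bound.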

For the third task you are expected to carefully answer at least 10 out of the 13 questions. You are, however, encouraged to answer all questions.

Note: For this third task you can choose between doing the work for the top-10 sites or the top-25 sites. Note that if you plan your analysis carefully (using command line and scripting, for example) it should not require much more work to do the analysis for 25 sites than for two sites. It is, however, okay to only analyze the top-10 sites.
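
As one example of such scripting, the hop-count questions (9 and 10) could be handled by saving the output of traceroute for each host and parsing it. The Python sketch below is only an illustration under stated assumptions: it expects output in the usual layout where each hop line starts with the hop number, it treats the last hop that answered as the path length, and the function name, host names, and addresses in the sample are hypothetical. If the destination never answers, the result is only a lower bound on the true path length.

```python
import re
from statistics import mean

# A hop line starts with the hop number, followed by the probe results.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")


def hop_count(traceroute_output):
    """Return the number of the last hop that answered at least one probe."""
    last = 0
    for line in traceroute_output.splitlines():
        m = HOP_LINE.match(line)
        if not m:
            continue  # header line or blank line
        if m.group(2).replace("*", "").strip():
            last = int(m.group(1))  # at least one probe got a reply
    return last


# Hypothetical saved outputs (addresses are documentation examples).
trace_a = """
traceroute to www.example.com (192.0.2.10), 30 hops max, 60 byte packets
 1  192.168.1.1  1.123 ms  0.998 ms  1.050 ms
 2  * * *
 3  198.51.100.1  9.871 ms  9.902 ms  10.113 ms
 4  192.0.2.10  15.332 ms  15.410 ms  15.298 ms
"""
trace_b = """
traceroute to cdn.example.net (198.51.100.20), 30 hops max, 60 byte packets
 1  192.168.1.1  1.020 ms  1.101 ms  0.987 ms
 2  203.0.113.1  7.450 ms  7.512 ms  7.498 ms
 3  * * *
"""

counts = [hop_count(trace_a), hop_count(trace_b)]
print(counts)        # prints [4, 2]
print(mean(counts))  # average path length over the sample
```

The same function can be applied once to the hosts you requested explicitly and once to the hosts serving drag-along content, which gives the two averages the questions ask for.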

Additional reporting

In addition to the regular reporting to Juha, I also want you to email me (Niklas) your final report for this assignment (as a PDF document).