Assignment 4 (TDTS11): Scripts, utility tools, and trace analysis

By Niklas Carlsson and Rahul Hiran, February 2015
_________________________________________________________

IMPORTANT: We want to remind you to read the complete instructions before starting this assignment. Also, note that this assignment is much more demanding than your previous assignments in this course; you should therefore make sure to get started right away.

Motivated by the concept of problem-based learning, the final assignment involves a number of tasks of varying difficulty that may require you to identify and learn tools and techniques that are not explicitly taught in the theory part of the course. If mastered (after some practice), these tools and techniques may significantly simplify and speed up some of the tasks in this assignment and help you in your future careers.

The assignment consists of a number of "tasks" and there are some pointers to help you on your way. However, to simulate real-world problems, the information is not spoon-fed, there is often not a single correct answer (but there are of course good and bad answers!), and you will be required to identify the relevant information that best helps you complete the tasks. (In fact, there may even be somewhat conflicting pieces of advice, which you must weigh against each other.)

Advice: After reading the instructions, we recommend that you first complete the "first" task and then make a careful game plan for attacking the remainder of the assignment. As you solve the various tasks you may also want to revise your plan (e.g., order of tasks), as well as see if there are additional tools/knowledge/information that you need to obtain to more efficiently solve the different tasks.

Requirements

For this assignment you must all do task 1. Among the other tasks, you should select enough tasks that the sum of the points (pts) associated with those tasks adds up to at least 18 (out of 38) points.

Task list

First task [MUST]: Learning the tools

In your own words, please (concisely) explain what the following tools, sites, and/or services do and what information they may provide:

  1. traceroute (or potential Windows replacement)
  2. nslookup (dig, host, or similar - depending on system)
  3. alexa.com, compete.com, etc.
  4. Google trends, etc.
  5. arin.net, ripe.net, apnic.net, and lacnic.net
  6. Shell/script programming (e.g., bash) together with Unix commands such as "grep", "sed", "sort", and "awk" may be useful here.
  7. Other tools (including Wireshark) that you decide to use to solve this assignment.

Note that Google and the 'man' pages on many systems can be valuable information sources. (For example, try typing "man sort" to find out how you can use "sort" with different arguments.)

Recommendation: We suggest that you each (on your own) study the above tools before you and your lab partner first start this assignment together. When you first meet, you should spend a few hours going through each tool and completing the above task. The time invested in learning the above tools is likely to be valuable for this assignment and, more importantly, for the continuation of your careers/education. However, also remember that the best way to learn these tools is to solve real tasks, so please do try to apply them to the tasks below.


Advice: When learning to use utility tools such as "grep", "sed", "sort", and "awk", we recommend that you create a text file with row-column format (e.g., a file that has 20 lines, where each line has 5 tab-separated numbers or words) and then try to extract information from this file. For example, given a file (that you may create yourself) with HTTP packet information such as (i) HTTP version, (ii) send time, (iii) file size, and (iv) object id: try to extract all lines that start with a particular word (say HTTP/1.1), sort all lines based on the numeric values in column 3 (in reverse order), remove duplicate object requests, and calculate the sum of all file sizes (with or without duplicates). In your assignment you can (briefly) explain what type of exercises you created for yourself and how you learned the different tools.
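
As a warm-up along these lines, here is a minimal sketch (the file name http.txt and the exact column layout are our own choices for illustration):

  # create a small test file with four tab-separated columns:
  # (i) HTTP version, (ii) send time, (iii) file size, (iv) object id
  printf "HTTP/1.1\t0.12\t4500\tobj1\nHTTP/1.0\t0.33\t900\tobj2\nHTTP/1.1\t0.47\t4500\tobj1\n" > http.txt

  grep "^HTTP/1.1" http.txt      # all lines that start with HTTP/1.1
  sort -rn -k 3 http.txt         # sort by column 3, in reverse numeric order
  sort -u -k 4,4 http.txt        # keep one line per object id (removes duplicates)
  awk '{ sum += $3 } END { print sum }' http.txt    # sum of all file sizes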


Advice: When playing around with the different utility tools it is important to note that you can link multiple commands using a pipe "|". For example, the command "cat file.txt | sort -rn -k 3 | awk '{print $3 " " $4}'" first reads a file, then sorts the lines in reverse numeric order based on the third column, and then prints columns 3 and 4. Again, create a file or two and play around with the various commands.
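
Another illustrative pipe in the same spirit (again assuming the hypothetical http.txt file from the previous advice, with the object id in column 4):

  # how often was each object requested, most frequent first?
  awk '{print $4}' http.txt | sort | uniq -c | sort -rn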


Advice: When using a scripting language (such as bash, for example) you can also create for loops, if statements, and use other standard programming syntax. We suggest that you try writing a for loop in which you create 5 different files named "site1.txt", "site2.txt", "site3.txt", "site4.txt", and "site5.txt". A useful thing to note here is that you can write to a file using the greater-than sign ('>'). For example, "echo "blah" > file.txt" writes "blah" into the file file.txt. Depending on your solution, using the dollar sign ('$') to access the value stored in a variable may also come into play here.
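
A minimal sketch of such a loop (the file contents here are arbitrary placeholders):

  for i in 1 2 3 4 5
  do
      echo "this is site number $i" > "site$i.txt"   # '>' writes the file, '$i' reads the loop variable
  done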


Second task [2pt]: Data collection

In this task you have two options: (i) collect your own trace, or (ii) find a person in the class who has collected a trace and borrow that trace. If you select option (ii), it should be clearly stated in your report whom you borrowed the trace from. If you select option (i), please share the trace with your classmates. (Looking at each other's traces and/or comparing results may provide additional insight into observed differences, as well as help you sanity check your results.)

The trace should be collected as follows: (1) Close down all applications that you have running. (2) Try to empty your local cache and any cookies that you may have on the local machine. (3) Start Wireshark and begin a new capture. (4) Start your favorite Web browser. (5) For the next 25 minutes, at the beginning of each new minute, go to the main page of the i-th most popular site on the Internet according to alexa.com. In other words, you should visit the 25 most popular sites over the span of 25 minutes. (You should not click on any links; just go to the main page as defined at alexa.com.) (6) Allow a few minutes for transactions to complete at the end of the trace period. (7) Stop capturing with Wireshark.

You should now have a trace in which the "front" page of the 25 most popular sites on the Internet (according to alexa.com) was visited.

Note: Analysis of traces from purely text-based browsers (that only display the text) will not be acceptable.

Advice: For this task, as Google does a lot of prefetching that may be confusing when analyzing the traces, we recommend that you (at least initially) use a regular Web browser other than Chrome; e.g., IE, Firefox, or similar.


Advice: The best way to solve this assignment is to use one-liners (sequences of commands on a single command line). Although temporary outputs are useful for debugging your commands, too many temporary files can easily mess up your file directory, and it is difficult to remember what is in each file.

Advice: The best way to solve this assignment is to break the tasks into smaller tasks (potentially creating temporary files with intermediate results, allowing you to keep track of what is achieved in each step), each bringing you one step closer to your answer. (Yes, this advice conflicts with the previous one; as noted in the introduction, weighing conflicting advice against each other is part of the assignment.)



Third task [4pt]: High-level data summary

  1. For the entire trace duration, as well as for each site individually, please determine how many (i) URLs, (ii) unique URLs, (iii) unique domains, and (iv) unique top-level domains (e.g., .com, .edu, .se) were contacted.
  2. For the entire trace duration, as well as for each site individually, please determine how many (i) unique TCP connections were established, (ii) unique servers were contacted, (iii) IP packets were sent/received, and (iv) bytes were sent/received.


Advice: You may want to use the export commands of Wireshark to create text files that may be easier to parse using some of the utility tools discussed above. (A brief overview of the export command is given at the bottom of this document.) Are there additional actions you can take to make these traces easier to parse?


Advice: You may want to analyze one site at a time. You may therefore want to consider splitting a trace into one sub-trace per site. Are there timestamps, for example, in the traces that allow you to easily break the traces into such smaller sub-traces? Perhaps conditional awk statements are worth considering here?
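
As a rough sketch of such a split (assuming, hypothetically, an exported summary file summary.txt in which column 2 is the capture time in seconds, and that you visited site i during minute i of the trace):

  # send each line to siteN.txt, where N is the minute in which it was captured
  awk '{ n = int($2/60) + 1; print > ("site" n ".txt") }' summary.txt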



Fourth task [2pt]: Drag along contents

  1. Please quantify how much additional content (both in terms of bytes and HTTP transactions) was downloaded in addition to the 25 pages that you as a user explicitly requested. Does it differ much between sites?
  2. Can any of the HTTP fields provide insight into what content was "dragged along" with the 25 visited sites? For simplicity, let us define "dragged along" as served by a different domain. (You may also want to quantify how many HTTP requests in total were caused by each request you made; see the sketch after this list.)
  3. In addition to the 25 sites and the dragged along data, did you capture any other traffic to/from your computer? Please try to explain and categorize this traffic.
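
As a starting point for the domain-level view, one possible sketch (assuming you have exported the HTTP details to a hypothetical file http.txt; the sed strips the literal "\r\n" that the export appends to each header line):

  # number of requests per Host header value, most frequent first;
  # hosts outside the visited site's own domain hint at drag-along content
  grep "Host:" http.txt | sed 's/\\r\\n//g' | awk '{print $2}' | sort | uniq -c | sort -rn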


Fifth task [4pt]: More drag along content

Pick five (5) information-rich pages that have many different HTTP requests and answer the following questions.

  1. Can you specify what type of content was "dragged along"?
  2. What is the average path length (e.g., hop count) to any of the drag-along content? (See the traceroute sketch after this list.)
  3. What is the average path length (e.g., hop count) to any of the data initially requested?
  4. What are some of the most common "network owners" of the drag-along content, and where are these domains located?
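
For the hop counts, one possible sketch (assuming a hypothetical file domains.txt with one domain per line; note that unanswered probes and incomplete traceroutes can distort the counts, so sanity check the output):

  # print each domain together with the hop number on the last traceroute line
  while read d
  do
      hops=$(traceroute -n "$d" 2>/dev/null | tail -1 | awk '{print $1}')
      echo "$d $hops"
  done < domains.txt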


Sixth task [2pt]: Yet more drag along content

Can you automate task 5, such that solving the task for 500 pages takes essentially the same time as doing it for 1 page? For this task you should provide your automated solution (e.g., the script(s) or the source code of the program used for the data analysis and processing) and provide the answers for all of the top-25 pages.

Seventh task [2/4/6pt]: Site performance of top 5/25 pages

For the top 5/25 pages, please answer and discuss the following questions. Note that you may need to use additional tools or complementary information sources to answer some of the questions below. (Please also refer to the lecture notes for concepts such as RTT, hop count, etc.)

  1. What is the average response time of an HTTP request?
  2. What is the average response time of a DNS query?
  3. What is the average RTT for each type of content?
  4. Please comment on differences in RTT for different sites/content.
  5. What is the average observed download throughput?
  6. Can you estimate the load time (observed by the user)?
If you do the question for 5 pages or use a manual (non-scripted) solution you will receive 2pt. If you do the question for 25 pages you receive 4pt if you have an automated solution for 25 pages, and 6pt if you have an automated solution that would easily solve the problem for an arbitrary number of pages (say 500) in a very short time.
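
For several of the averages asked for above, the last processing step is often just averaging a column of numbers. A minimal sketch (assuming you have already extracted one measurement per line into column 2 of a hypothetical file times.txt):

  # average of column 2
  awk '{ sum += $2; n++ } END { if (n > 0) print sum/n }' times.txt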

Advice: It may be worth mapping out which tools and data are best suited to solving each of the tasks in this assignment.


Eighth task [2pt]: Popularity trends

Which of the top-25 sites have gained the most relative popularity over the last year? Any other interesting trends on the World Wide Web? If possible, please provide both statistical and visual support for your answer.

Ninth task [2pt]: The size of the Internet?

  1. Based on the above trace, would you be able to give an estimate of the diameter of the Internet? If so, what would it be and how did you obtain this estimate? If not, please explain your reasoning.
  2. Can you collect some alternative set of data to help answer this question? If so, please explain your approach (and provide an estimate, if possible).


Tenth task [6pt]: Building your own Web crawler

In this task you should build your own Web crawler that explores the tree structure of the links (and objects that are used) at a given set of Web sites. Your crawler should take a list of URLs as input and generate a tree structure of linked domains as output. For example, a site "www.foo.com" that has links to the domains "bar1.com" and "bar2.com" and uses an image from the domain "img.com" should result in the following relationships: "www.foo.com -> bar1.com", "www.foo.com -> bar2.com", and "www.foo.com -> img.com". You do not have to visualize the tree structure (although this can be a fun and interesting exercise in itself).
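
As a starting point, the core extraction step could be sketched as follows (a sketch only, not a full crawler: www.foo.com is the placeholder from the example above, and the regular expression misses relative links and many real-world cases, so treat it as a first approximation):

  # list the domains that a page links to or loads objects from (one level only)
  wget -qO- "http://www.foo.com" | grep -oE '(href|src)="https?://[^/"]+' | sed 's@.*://@@' | sort -u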

  1. Develop and test the crawler
  2. Document the functionality and limitations of your crawler
  3. Pick some illustrative sites from the Alexa site and provide the tree structure of the domain relationships associated with these sites.

Eleventh task [8pt]: Netninny lab done in TDTS04

The instructions for this assignment can be found here. Note that this assignment can be time consuming in itself, and we therefore require that you treat this task as a bonus opportunity: you must complete the required 18 points (using the other tasks) before attempting this task.

Demonstration and Report

For this assignment you will need to write a report that carefully describes which tasks you have selected to solve, explains how you solved them, and the lessons and insights you have gained. As in previous assignments, you should also discuss (and demonstrate) your solutions with the TA.

Your answers should clearly explain what you learned and how you solved the questions. (Note that the steps taken to obtain an answer in many cases are more important than the answer itself.) Only giving your answer is not acceptable.

Please explain if you found additional tools and information sources which helped you answer the above questions. Finally, if you could not solve the question, please explain why it was not possible and what information you would need to solve the question.

Please structure your report such that your answers are clearly indicated for each question (and section of the assignment). It is not the TA's task to search for the answers. Both the questions themselves and the corresponding answers should be clearly stated (and indicated) in your report. Structure your report accordingly. Furthermore, your answers should be explained and supported using additional evidence, when applicable. During the demonstration the TA may ask similar questions to assess your understanding of the lab. You are expected to clearly explain and motivate your answers. As the assignments are done in groups of two, both members of the group will be asked to answer questions.

Additional instructions and information about the reports can be found here. Please take this chance to read the guidelines carefully.

Exporting Wireshark Traces into Text Files

One approach to creating pure text files from the Wireshark traces is to use the following menus: File -> Export -> as "plain text" file. You are then presented with a window in which you can choose "Displayed", "Packet details", and "As displayed", for example, before pressing "OK".

Figure 1. Wireshark export window.

Using the "Filter" window you can filter the packets/information that you want to export. For example, if you filter using "http", then only the HTTP packet information will be displayed. When you export using the "As displayed" option, only the displayed packets will be exported to the text file.

You may want to use different filters to answer different questions in the assignment. Please note that part of the assignment is to determine which information to extract, so as to simplify the processing of your exported traces. Here, it is important to keep track of which protocol information to filter for when answering each of the questions.

It is also possible to export only the "Packet summary line". The other packet details options are "All collapsed", "As displayed", and "All expanded".

Depending on what you intend to do, you should carefully select which information to extract: more information is not necessarily better, and you must select the appropriate information to answer each question. Of course, the utility tools above should also help you extract the information you are interested in from the exported files.

Below are some basic examples.

1. Packet summary:

No.     Time           Source                Destination           Protocol Length Sequence number Acknowledgement number Info
      1 0.000000000    10.0.1.82             205.251.219.181       HTTP     440    1               1                      GET /images/help/bubble.png HTTP/1.1 
      2 0.004133000    10.0.1.82             205.251.219.181       HTTP     447    1               1                      GET /images/help/bubble_filler.png HTTP/1.1 
      6 0.029374000    205.251.219.181       10.0.1.82             HTTP     740    1449            375                    HTTP/1.0 200 OK  (PNG)
      8 0.034654000    205.251.219.181       10.0.1.82             HTTP     754    1               382                    HTTP/1.0 200 OK (PNG)

2. HTTP with details:

Frame 1: 440 bytes on wire (3520 bits), 440 bytes captured (3520 bits)
Ethernet II, Src: QuantaMi_13:af:92 (aa:aa:aa:aa:aa:aa), Dst: Apple_b9:73:56 (aa:aa:aa:aa:aa:aa)
Internet Protocol Version 4, Src: 10.0.1.82 (10.0.1.82), Dst: 205.251.219.181 (205.251.219.181)
Transmission Control Protocol, Src Port: 38710 (38710), Dst Port: http (80), Seq: 1, Ack: 1, Len: 374
Hypertext Transfer Protocol
    GET /images/help/bubble.png HTTP/1.1\r\n
    Host: pcache.alexa.com\r\n
    Connection: keep-alive\r\n
    User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17\r\n
    Accept: */*\r\n
    Referer: http://www.alexa.com/topsites\r\n
    Accept-Encoding: gzip,deflate,sdch\r\n
    Accept-Language: en-US,en;q=0.8\r\n
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n
    \r\n
    [Full request URI: http://pcache.alexa.com/images/help/bubble.png]

Frame 2: 447 bytes on wire (3576 bits), 447 bytes captured (3576 bits)
Ethernet II, Src: QuantaMi_13:af:92 (aa:aa:aa:aa:aa:aa), Dst: Apple_b9:73:56 (aa:aa:aa:aa:aa:aa)
Internet Protocol Version 4, Src: 10.0.1.82 (10.0.1.82), Dst: 205.251.219.181 (205.251.219.181)
Transmission Control Protocol, Src Port: 38707 (38707), Dst Port: http (80), Seq: 1, Ack: 1, Len: 381
Hypertext Transfer Protocol
    GET /images/help/bubble_filler.png HTTP/1.1\r\n
    Host: pcache.alexa.com\r\n
    Connection: keep-alive\r\n
    User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17\r\n
    Accept: */*\r\n
    Referer: http://www.alexa.com/topsites\r\n
    Accept-Encoding: gzip,deflate,sdch\r\n
    Accept-Language: en-US,en;q=0.8\r\n
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n
    \r\n
    [Full request URI: http://pcache.alexa.com/images/help/bubble_filler.png]

3. IP packets with IP and HTTP details:

Frame 1: 440 bytes on wire (3520 bits), 440 bytes captured (3520 bits)
Ethernet II, Src: QuantaMi_13:af:92 (aa:aa:aa:aa:aa:aa), Dst: Apple_b9:73:56 (aa:aa:aa:aa:aa:aa)
Internet Protocol Version 4, Src: 10.0.1.82 (10.0.1.82), Dst: 205.251.219.181 (205.251.219.181)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
    Total Length: 426
    Identification: 0xbe51 (48721)
    Flags: 0x02 (Don't Fragment)
    Fragment offset: 0
    Time to live: 64
    Protocol: TCP (6)
    Header checksum: 0xc5f9 [correct]
    Source: 10.0.1.82 (10.0.1.82)
    Destination: 205.251.219.181 (205.251.219.181)
Transmission Control Protocol, Src Port: 38710 (38710), Dst Port: http (80), Seq: 1, Ack: 1, Len: 374
Hypertext Transfer Protocol
    GET /images/help/bubble.png HTTP/1.1\r\n
    Host: pcache.alexa.com\r\n
    Connection: keep-alive\r\n
    User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17\r\n
    Accept: */*\r\n
    Referer: http://www.alexa.com/topsites\r\n
    Accept-Encoding: gzip,deflate,sdch\r\n
    Accept-Language: en-US,en;q=0.8\r\n
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n
    \r\n
    [Full request URI: http://pcache.alexa.com/images/help/bubble.png]

4. Overview of all protocol headers:

No.     Time           Source                Destination           Protocol Length Sequence number Acknowledgement number Info
      1 0.000000000    10.0.1.82             205.251.219.181       HTTP     440    1               1                      GET /images/help/bubble.png HTTP/1.1 

Frame 1: 440 bytes on wire (3520 bits), 440 bytes captured (3520 bits)
Ethernet II, Src: QuantaMi_13:af:92 (20:7c:8f:13:af:92), Dst: Apple_b9:73:56 (24:ab:81:b9:73:56)
Internet Protocol Version 4, Src: 10.0.1.82 (10.0.1.82), Dst: 205.251.219.181 (205.251.219.181)
Transmission Control Protocol, Src Port: 38710 (38710), Dst Port: http (80), Seq: 1, Ack: 1, Len: 374
Hypertext Transfer Protocol

No.     Time           Source                Destination           Protocol Length Sequence number Acknowledgement number Info
      2 0.004133000    10.0.1.82             205.251.219.181       HTTP     447    1               1                      GET /images/help/bubble_filler.png HTTP/1.1 

Frame 2: 447 bytes on wire (3576 bits), 447 bytes captured (3576 bits)
Ethernet II, Src: QuantaMi_13:af:92 (20:7c:8f:13:af:92), Dst: Apple_b9:73:56 (24:ab:81:b9:73:56)
Internet Protocol Version 4, Src: 10.0.1.82 (10.0.1.82), Dst: 205.251.219.181 (205.251.219.181)
Transmission Control Protocol, Src Port: 38707 (38707), Dst Port: http (80), Seq: 1, Ack: 1, Len: 381
Hypertext Transfer Protocol

No.     Time           Source                Destination           Protocol Length Sequence number Acknowledgement number Info
      3 0.020771000    205.251.219.181       10.0.1.82             TCP      66     1               375                    http > 38710 [ACK] Seq=1 Ack=375 Win=89 Len=0 TSval=2765196414 TSecr=4294963077

Frame 3: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: Apple_b9:73:56 (24:ab:81:b9:73:56), Dst: QuantaMi_13:af:92 (20:7c:8f:13:af:92)
Internet Protocol Version 4, Src: 205.251.219.181 (205.251.219.181), Dst: 10.0.1.82 (10.0.1.82)
Transmission Control Protocol, Src Port: http (80), Dst Port: 38710 (38710), Seq: 1, Ack: 375, Len: 0
...
...
...

Finally, here are some example commands that may be worth playing around with:

cat httpAsdisplayed.txt | grep "Host" | sed 's/\\r\\n//g'| awk '{print $2}' | sort | uniq -c | wc
cat httpAsdisplayed.txt | grep -A 15 "Host"  | grep "Referer" | wc
cat httpAsdisplayed.txt | grep "Host" | wc