Instructions for Bonus Questions in TDTS06, fall 2016

For this assignment you are expected to solve one out of five exercises. Three of these exercises are fairly open ended and there are many ways they can be solved. One question is much more directed/guided, but leaves plenty of room for additional work, evaluation, and discussion (after the initial framework is established).

Note that these questions are for bonus marks, so you will likely need to spend some time figuring out how to best solve them. This may involve doing some detective work, and learning new tools and/or techniques. It is important that you clearly explain the problem you are solving and how you solved it, including any abstractions/assumptions that you make to capture the dynamics of the system in question, and what insights your solution provides.

At the risk of being repetitive, please clearly state and motivate any assumptions that you make. Explain and comment any code that you submit, and make sure that answers are clearly explained and motivated.

The assignment can give up to four (4) bonus marks, with marks assigned based on the quality of the solution. The bonus marks will only be valid for the original exam (on October 28, 2016). You may use either Swedish or English (English preferred) and are expected to solve the question individually.

The assignment is due on October 18 at 23:59 and should be submitted as a hardcopy (on paper) to Niklas (placed in his mailbox) by the time of the deadline. You should also email Niklas a .pdf file with the report and a .zip file with any code. (Also, please remember to have "TDTS06" and "Bonus assignment" in the subject field.)

1. Good vs Bad: File Sharing Simulations

While this question can be analytically modeled, you may want to use simulations to solve the problem. Consider a file-sharing community with M users. Assume that all M users want to download the same file and that N of these users make the file available at time zero. Furthermore, assume that n of these N users make available corrupt copies (malicious or fake copies, for example) and that the downloader of a file would not know whether a file is corrupt until the file is fully downloaded. (The other N-n copies are good copies of the file.) You can assume that it takes T time units to upload a file to a single user and that all users are willing to upload copies until every peer in the system, except the n bad peers, has a good copy of the file.

Please simulate and plot how the number of good (and bad) copies in the system changes with time for different initial values of n, N, and M. (Without loss of generality, you can pick T=1.) You can initially consider the case where the "good" peers always check whether the file is corrupt before uploading it to somebody else. In the case a peer discovers that its copy is corrupt, it deletes the file and tries to download a new copy at random. Also consider the more general case where peers only check the validity of a file with probability p each time they have either downloaded or uploaded the file. What happens when you change the probability p?
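
For illustration, here is a minimal discrete-time sketch in C++ of one way such a simulation could be set up, assuming T=1 and a round-based matching where each copy holder uploads to at most one randomly chosen downloader per round. The state encoding, the matching rule, and the parameter values (M, N, n, p) are my own assumptions; treat this as a starting point, not a reference solution. Checking is folded into the download/upload events with probability p, so p=1 reproduces the "always check" case.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

enum State { EMPTY, GOOD, BAD };  // what copy (if any) a peer currently holds

int main() {
    const int M = 1000, N = 20, n = 5;  // assumed example parameters
    const double p = 1.0;               // probability of checking a copy (p = 1: always)
    std::mt19937 rng(42);
    std::bernoulli_distribution check(p);

    std::vector<State> peer(M, EMPTY);
    for (int i = 0; i < N; ++i) peer[i] = (i < n) ? BAD : GOOD;  // peers 0..n-1 are bad seeds

    for (int t = 0; t < 10 * M; ++t) {  // round cap, in case small p lets bad copies linger
        long good = std::count(peer.begin(), peer.end(), GOOD);
        long bad = std::count(peer.begin(), peer.end(), BAD);
        std::printf("%d\t%ld\t%ld\n", t, good, bad);  // time, #good copies, #bad copies
        if (good >= M - n) break;                     // every honest peer has a good copy

        // One round of length T = 1: every copy holder uploads to at most one
        // peer that still lacks the file; downloaders are matched in random order.
        std::vector<int> seekers;
        for (int i = 0; i < M; ++i)
            if (peer[i] == EMPTY) seekers.push_back(i);
        std::shuffle(seekers.begin(), seekers.end(), rng);
        std::vector<State> next = peer;
        std::size_t s = 0;
        for (int i = 0; i < M; ++i) {
            if (peer[i] == EMPTY) continue;                // only holders act
            if (i >= n && peer[i] == BAD && check(rng)) {  // honest holder inspects its copy,
                next[i] = EMPTY;                           // finds it corrupt, and deletes it
                continue;                                  // (it re-downloads in a later round)
            }
            if (s >= seekers.size()) continue;             // nobody left to serve
            State got = peer[i];
            if (got == BAD && check(rng)) got = EMPTY;     // downloader checks and discards
            next[seekers[s++]] = got;
        }
        peer = next;
    }
    return 0;
}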

2. Cache Simulator in C or C++

For this question you are expected to write a trace-based simulator and analyze the performance of some simple cache policies and configurations for the delivery of video. Your simulator should read a trace (with a sequence of requests) from standard input (stdin), take a number of (optional) command-line input variables (using argc and argv), and write the results to stdout and result files. Two example runs that your program should support could look as follows:

$ cat input.txt | yourprogram option=0 X=5 Y=5 stats.txt > output.txt
$ cat input.txt | yourprogram option=1 X=5 T=100 stats.txt > output.txt

Each line in the trace file should have the information about a new request, with each line having seven (7) tab-separated fields: (i) the timestamp of the original request (decimal value), (ii) the client ID (string), (iii) the video duration (decimal value), (iv) the original server name (string), (v) the file name (string), (vi) the total file size (decimal value), and (vii) a priority value between 0 and 10 (decimal value).
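
As a sketch, one way to parse such lines in C++ could look as follows (the struct and function names are mine, not part of the assignment):

#include <iostream>
#include <sstream>
#include <string>

struct Request {
    double time, duration, size, priority;
    std::string client, server, file;
};

// Reads one trace line from `in`; returns false on end-of-input or a malformed line.
bool readRequest(std::istream& in, Request& r) {
    std::string line;
    if (!std::getline(in, line)) return false;
    std::istringstream f(line);
    std::string ts, dur, sz, pr;
    // getline with '\t' splits on the tab separators between the seven fields.
    if (!std::getline(f, ts, '\t') || !std::getline(f, r.client, '\t') ||
        !std::getline(f, dur, '\t') || !std::getline(f, r.server, '\t') ||
        !std::getline(f, r.file, '\t') || !std::getline(f, sz, '\t') ||
        !std::getline(f, pr, '\t'))
        return false;
    r.time = std::stod(ts); r.duration = std::stod(dur);
    r.size = std::stod(sz); r.priority = std::stod(pr);
    return true;
}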

Your simulator should take the trace as input (from stdin) and calculate various statistics, which you should plot and discuss in a report. You are expected to have two implementations: an LRU-based implementation and a TTL-based implementation.

To stdout you should write the input information of each request, together with the action taken for that object (i.e., whether it was cached) and the number of files and bytes stored in the cache after that action is taken. To the output file (stats.txt in the above examples), you should write summary statistics of interest, including (i) the hit rate, (ii) the fraction of files that was requested n times before cache eviction, and (iii) the number of files and bytes that were never stored in the cache.
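
For the LRU-based implementation, the cache bookkeeping itself is standard; a sketch using a doubly linked list plus a hash map, assuming the cache capacity is a byte budget (one plausible meaning of the X parameter in the example runs), could look as follows. The TTL-based variant would instead evict objects whose age exceeds a time limit (presumably the T parameter in the second example run), which this sketch does not show.

#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruCache {
    double capacity_, used_ = 0;                       // bytes
    std::list<std::pair<std::string, double>> lru_;    // front = most recently used
    std::unordered_map<std::string,
        std::list<std::pair<std::string, double>>::iterator> pos_;
public:
    explicit LruCache(double capacityBytes) : capacity_(capacityBytes) {}

    // Returns true on a cache hit; on a miss the file is inserted, evicting
    // files from the least-recently-used end until it fits.
    bool request(const std::string& file, double size) {
        auto it = pos_.find(file);
        if (it != pos_.end()) {                        // hit: move to front
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        while (used_ + size > capacity_ && !lru_.empty()) {  // make room
            used_ -= lru_.back().second;
            pos_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(file, size);
        pos_[file] = lru_.begin();
        used_ += size;
        return false;
    }
};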

For the purpose of evaluation and discussion, you will also need to generate a "reasonable" trace. For this step, please use your own judgment. For example, you can assume that requests arrive according to a Poisson process (i.e., exponential and independent inter-request times), that the file popularity follows a Zipf distribution, and that file sizes/durations/priorities are drawn from some reasonable distributions (based on reasonable assumptions, explained and motivated in your report).
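
For the Zipf part, a simple approach is to precompute the weights 1/i^alpha and use std::discrete_distribution; a sketch (the function and parameter names are my own):

#include <cmath>
#include <random>
#include <vector>

// Returns a sampler over indices 0..N-1 with P(index k) proportional to
// 1 / (k+1)^alpha, i.e., index k corresponds to file k + 1 under Zipf.
std::discrete_distribution<int> makeZipf(int N, double alpha) {
    std::vector<double> w(N);
    for (int i = 0; i < N; ++i) w[i] = 1.0 / std::pow(i + 1, alpha);
    return std::discrete_distribution<int>(w.begin(), w.end());
}

A draw from the returned distribution gives an index k in 0..N-1, which you can map to file k+1 (file 1 being the most popular).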

Note: For this assignment, the exponentially distributed inter-arrival times can be generated relatively easily as ((1/λ) * (-log(u))), where λ is the request rate, the log function is the natural logarithm (ln), and u is a uniformly distributed random number in (0, 1).
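
In C++, this hint translates directly into inverse-transform sampling (std::exponential_distribution would give the same result); note the 1 - u guard against log(0):

#include <cmath>
#include <random>

// One exponentially distributed inter-arrival time with rate lambda.
double nextInterArrival(std::mt19937& rng, double lambda) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    double u = U(rng);                    // uniform in [0, 1)
    return -std::log(1.0 - u) / lambda;   // (1/lambda) * (-log(u)); 1-u is also
}                                         // uniform in (0, 1], so log(0) is avoided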

3. Geographic Mapping of Services

For this question you may want to use a combination of Wireshark, scripts, gnuplot, and various utility tools available in Unix and/or on the Internet. Identify the top 100 most popular Web pages on the Internet (using alexa.com). If you access only the first page of each of these sites, how many servers do you contact in total? How many of the contacted servers (or hosts) are operated by a different company than the originally contacted host? How many of the sites are served by at least one server owned by Google? How many of the servers use gzip? How much bandwidth did gzip save in these cases? Where are the servers located? How far away are they? Can you visualize these results using graphs, plots, and/or tables?
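
For the "how far away" part, once each server IP has been geolocated to a (latitude, longitude) pair (for example, with a GeoIP database or a web service of your choice; that step is not shown here), the great-circle distance can be computed with the haversine formula. A sketch in C++:

#include <cmath>

// Great-circle distance in kilometers between two (latitude, longitude)
// points given in degrees, using the haversine formula.
double haversineKm(double lat1, double lon1, double lat2, double lon2) {
    const double kEarthRadiusKm = 6371.0;  // mean Earth radius
    const double kDegToRad = 3.14159265358979323846 / 180.0;
    const double dlat = (lat2 - lat1) * kDegToRad;
    const double dlon = (lon2 - lon1) * kDegToRad;
    const double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
                     std::cos(lat1 * kDegToRad) * std::cos(lat2 * kDegToRad) *
                     std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2 * kEarthRadiusKm * std::asin(std::sqrt(a));
}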

4. Cache Occupancy

For this question you are expected to use math (and possibly simulations for validation) to estimate the average time that different content spends in a Web cache. You can assume that the cache serves a very large number of users from a fixed set of objects. Assume that there are N objects, that all objects are of the same size L, that the cache has a total size B, and that the requests for each file i follow a Poisson process with request rate λi, where the rates λi follow a Zipf distribution with Zipf parameter α. You should also assume that the cache uses an FCFS (first-come, first-served) cache replacement policy.

Given the above information, you should determine a reasonable estimate of the time it takes for a file to get kicked out of the cache, as well as the fraction of time each object would spend in the cache if the system remained in steady state for a very long time. When interpreting your results, you may want to consider different B, N, λ0, and α. (Without loss of generality, you can pick L=1.)
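
As one possible starting point (my own sketch, not the required solution), you could combine the Zipf rates with a characteristic-time style approximation, in which the time from insertion to eviction is treated as a constant T_C:

\[
  \lambda_i = \frac{\lambda_0}{i^{\alpha}}, \qquad i = 1, \dots, N,
  \qquad C = \frac{B}{L} \ \text{(number of objects that fit in the cache)} .
\]

Under this approximation, object i is present for T_C after each insertion and then absent for an average of 1/λi until its next request, so its long-run fraction of time in the cache is

\[
  h_i \approx \frac{\lambda_i T_C}{1 + \lambda_i T_C},
  \qquad \text{with } T_C \text{ determined by } \sum_{i=1}^{N} h_i = C .
\]

How accurate this approximation is for FCFS, and how it compares against your simulations, would be part of your discussion.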

5. Current Research Topic

This exercise involves both reading and some exploration/implementation/analysis. First, carefully read a full-length research paper (not a poster or demo!) that has been published in the last 5 years (2012-2016) in IEEE/ACM Transactions on Networking or at one of the following conferences: ACM SIGCOMM, ACM SIGMETRICS, or ACM IMC. These papers are typically 12-14 pages long and can typically be accessed through the authors' websites, the conference website, and/or the ACM digital library. (The ACM digital library can be accessed for free from any computer on campus, thanks to the library; otherwise, author copies are typically easy to find by pasting the title into Google and letting Google do the job for you ...) Second, carefully summarize the paper and its key insights. Finally, build your own (relatively) simple simulator, mathematical model, or measurement framework to analyze/validate some smaller/specific aspect considered in the paper, or to further investigate some question(s) raised in the paper.