Fake News: A Content-Altering Web Proxy

Based on an assignment by Carey Williamson. Adapted by Carl Magnus Bruhner.
This assignment replaces Net Ninny (some content preserved).
Last updated January 2021.


Overview of the Assignment

The purpose of this assignment is to learn about the HyperText Transfer Protocol (HTTP) used by the World Wide Web. In particular, you will design and implement an HTTP proxy (i.e., Web proxy) with functionality that demonstrates both the simplicity and the power of HTTP as an application-layer protocol. Along the way, you will also learn a lot about socket programming, TCP/IP, network debugging, and more. You may use Java, Python, C, or C++ for the assignment.

The so-called "Fake News" phenomenon is rampant on the Internet, but it seems unfair to let Donald Trump have all the fun, so let's make some fake news of our own! We are going to make a Web proxy that alters certain content on simple Web pages before they are rendered by the Web browser, so that the user sees factually incorrect information without knowing it. To keep the assignment simple, we will restrict ourselves only to HTTP (not HTTPS), and consider only basic text and HTML pages with a few images.

For our purpose, the fake news will involve Smiley from Stockholm. Specifically, you need to change all occurrences of "Smiley" on a Web page into "Trolly", and all occurrences of "Stockholm" into "Linköping". And if you find any JPG images of Smiley (linked or embedded), then you should replace them with your favourite troll image file (JPG, GIF, or PNG) from the Internet.
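As a rough illustration of the substitution itself, here is a minimal Python sketch. The names `alter_text`, `alter_html`, and `is_smiley_image` are hypothetical, not part of the assignment. It assumes the page has already been decoded to a string, that text inside HTML tags (such as image URLs) should be left untouched so that links keep working, and that a "Smiley image" can be recognized by a URL path containing "smiley" and ending in `.jpg` or `.jpeg`. Your own rules may well differ.

```python
import re

# Keyword replacements required by the assignment.
REPLACEMENTS = {"Smiley": "Trolly", "Stockholm": "Linköping"}

# Matches one HTML tag; the capturing group makes re.split keep the tags.
_TAG = re.compile(r"(<[^>]*>)")


def alter_text(text: str) -> str:
    """Replace every keyword in a plain-text string."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


def alter_html(html: str) -> str:
    """Replace keywords in the visible text only, leaving tags as they are.

    With one capturing group, re.split returns text at even indices and
    the matched tags at odd indices.
    """
    parts = _TAG.split(html)
    return "".join(p if i % 2 else alter_text(p) for i, p in enumerate(parts))


def is_smiley_image(path: str) -> bool:
    """Assumed heuristic: a JPG whose URL path mentions 'smiley'."""
    lowered = path.lower()
    return "smiley" in lowered and lowered.endswith((".jpg", ".jpeg"))
```

Because `alter_html` leaves `src` and `href` attributes alone, the browser still requests the original image URL, and the proxy can decide at request time (for example with `is_smiley_image`) whether to fetch the troll image instead.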

 



Background

About Web Proxies

A Web proxy is a piece of software that functions as an intermediary between a Web client (browser) and a Web server. The Web proxy intercepts Web requests from clients and reformulates the requests for transmission to a Web server. When a response is received from the Web server, the proxy sends the response back to the client. From the server's point of view, the proxy is the client, since that is where the request comes from. Similarly, from the client's point of view, the proxy is the server, since that is where the response comes from. A Web proxy thus provides a single point of control to regulate Internet access between clients and servers. A lot of schools use Web proxies to limit the types of Web sites that students are allowed to access. Net Nanny and Barracuda are examples of commercially available Web proxies.

Socket Programming

A good resource for socket programming is Beej's Guide to Network Programming.

You will need at least one TCP (stream) socket for client-proxy communication, and at least one additional TCP (stream) socket for proxy-server communication. If you want your proxy to support multiple concurrent HTTP transactions (recommended), you will need to fork child processes for request handling as well. Each child process will use its own socket instances for its communications with the client and with the server. An example for such usage of the fork() function is demonstrated in Beej's Guide to Network Programming under "A Simple Stream Server".
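The fork-per-connection pattern described above could look roughly like the following Python sketch. It assumes a POSIX system (where `os.fork` is available); the names `make_listener` and `serve_forever`, the default port 8080, and the `handle_client` callback are illustrative choices only.

```python
import os
import signal
import socket


def make_listener(host: str = "127.0.0.1", port: int = 8080,
                  backlog: int = 16) -> socket.socket:
    """Create, bind, and listen on a TCP (stream) socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts of the proxy on the same port.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(backlog)
    return listener


def serve_forever(listener: socket.socket, handle_client) -> None:
    """Accept connections forever, handling each one in a forked child."""
    # Ignoring SIGCHLD lets the kernel reap finished children (POSIX).
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    while True:
        client, _addr = listener.accept()
        if os.fork() == 0:
            # Child: it does not accept new connections itself.
            listener.close()
            try:
                handle_client(client)
            finally:
                client.close()
                os._exit(0)
        # Parent: the child owns this connection now.
        client.close()
```

In a proxy, `handle_client` would read the browser's request from `client`, open its own second socket to the Web server, and relay the response back.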

About Hypertext Transfer Protocol (HTTP)

The Web and HTTP are discussed in Section 2.2 of the course textbook, Computer Networking: A Top-Down Approach. Read this section well before proceeding to implement your proxy server! Pay specific attention to the discussion of non-persistent and persistent connections (Section 2.2.2 of the textbook). You may need to refer to the HTTP/1.0 and HTTP/1.1 specifications as well.

Keep in mind that HTTP/1.1 uses persistent connections by default, so Web servers do not close a connection immediately after servicing the current request unless the HTTP request explicitly tells them to do so. If the server does not receive a new request over the current TCP connection, it closes the connection after a configured period of time, say 30 seconds. Since the length of the HTTP content sent by the server is not always stated in the HTTP headers (chunked transfer encoding, specified in HTTP/1.1, is one example), it is up to the HTTP client (i.e. the browser) to interpret the content and determine whether more data is coming from the server as part of the current HTTP response. Analyzing the HTTP content in this way is difficult for a proxy server, and not every type of proxy needs to do it. One way to avoid waiting for the server timeout (to determine the end of transmission) is to modify the HTTP request.
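One possible request modification, sketched here in Python under stated assumptions, is to ask the server for a non-persistent connection by sending `Connection: close`, so that the end of the response is signalled by the server closing the TCP connection. The function name `rewrite_request` is hypothetical. The sketch also turns the absolute URL that browsers send to a proxy into a plain path, and it assumes a complete request head and a host name or IPv4 address (no IPv6 literals).

```python
def rewrite_request(raw: bytes) -> tuple[str, int, bytes]:
    """Return (host, port, request) ready to be sent to the Web server."""
    head, _sep, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    method, target, version = lines[0].split(b" ", 2)

    host = None
    if target.lower().startswith(b"http://"):
        # Browsers send "GET http://host/path" to a proxy; servers expect
        # "GET /path".
        authority, _slash, path = target[len(b"http://"):].partition(b"/")
        host = authority
        target = b"/" + path

    headers = []
    for line in lines[1:]:
        name, _colon, value = line.partition(b":")
        lowered = name.strip().lower()
        if lowered in (b"connection", b"proxy-connection"):
            continue  # replaced by "Connection: close" below
        if lowered == b"host" and host is None:
            host = value.strip()
        headers.append(line)
    if host is None:
        raise ValueError("request names no host")

    hostname, colon, port_text = host.partition(b":")
    port = int(port_text) if colon else 80  # HTTP default port

    headers.append(b"Connection: close")
    request_line = b" ".join((method, target, version))
    new_head = b"\r\n".join([request_line] + headers)
    return hostname.decode("ascii"), port, new_head + b"\r\n\r\n" + body
```

With this change the proxy can simply read from the server socket until `recv` returns an empty result, instead of parsing the response body to find where it ends.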

 



Requirements Specification

In this assignment, you will implement your very own Web proxy, in either C, C++, Java or Python using TCP/IP socket programming. The goals of the assignment are to build a properly functioning Web proxy for simple Web pages, and then use your proxy to change some of the content before it is delivered to the browser.

There are two main pieces of functionality needed in your proxy. The first is the ability to intercept (and parse) HTTP requests and responses, so that the proxy can determine what changes (if any) need to be made to the requested content. The second is the ability to insert the false information into the page in some appropriate way so that the page still renders properly.

Note: As part of this assignment you should learn socket programming. You are expected to use only the basic libraries available for socket programming. If uncertain about what libraries you can use, we highly recommend that you check this with your TA before setting out to use non-basic libraries, as their use might violate the goals of the assignment, which are to learn about (1) HTTP and (2) socket programming. For example, using an HttpURLConnection Java class to fetch the data from the Web server is not allowed! The proxy should not impose any limit on the size of the transferred HTTP data, not even with realloc() or similar.
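One way to avoid any size limit for content that does not need to be altered is to forward it in small pieces as it arrives, rather than collecting the whole response first. The Python sketch below illustrates the idea using only the basic socket library; the name `relay` and the 4096-byte chunk size are arbitrary choices, and it assumes the sender closes the connection when it is done.

```python
import socket


def relay(src: socket.socket, dst: socket.socket, chunk_size: int = 4096) -> int:
    """Copy bytes from src to dst until src is closed; return the byte count.

    Only one chunk is held in memory at a time, so the size of the
    transferred data is not limited by the proxy.
    """
    total = 0
    while True:
        chunk = src.recv(chunk_size)
        if not chunk:  # empty result: the peer closed the connection
            break
        dst.sendall(chunk)
        total += len(chunk)
    return total
```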

You do not have to relay HTTPS requests through the proxy, and the browser can be configured to use the proxy only for HTTP.

 



Development Strategy

If you are not sure how to start developing your proxy server, you can use the following stepwise strategy:

  1. Consider that the proxy server has two parts, a server part that the browsers connect to, and a client part that connects to the Web servers. The server part and the client part are not isolated pieces of code. That is, the client part can be a class object instantiated from the server part, a function called from the server part, or even some lines of code embedded in the code of the server part. The server part receives the HTTP request from the browser and delivers that HTTP request to the client part (this delivery, in its simplest form, can be done by using the same variable). The client part then (based on the HTTP request) determines which Web server it should connect to and, after connecting to that server, sends the HTTP request to it and receives the HTTP response from it. The client part then delivers the received content back to the server part to be sent back to the browser.
  2. Read and understand the simple TCP server and the simple TCP client examples. Try to identify the steps taken in the TCP server (i.e. creating a socket, binding the socket to the desired address and port, listening for the connections, accepting connections, forking to handle concurrent connections, sending/receiving data, and closing the socket) and in the TCP client (i.e. creating a socket, connecting to a server, sending/receiving data, and closing the socket). A good understanding of these steps, and differences between a TCP server and a TCP client, will greatly help you in getting your proxy server up and running quickly.
  3. Start your coding by implementing the server part of your proxy. The server part should receive the HTTP request from the browser and deliver it to the client part. You may print out the HTTP request on the screen to make sure it is received and stored correctly.
  4. Add the code for the client part to your proxy server. The client part should receive the HTTP request from the server part and extract the information needed to carry out the request on behalf of the browser. The client part should also apply the required modifications to the HTTP request to make it ready to be sent to the Web server. If you are not sure what modifications should be done to the HTTP request, please read this part and this part more carefully! Once the required information is extracted and the request is appropriately modified, the client part should connect to the Web server, send the modified request to it, and receive the HTTP response from the server. You will implement the content altering in a later step.
  5. Add content altering to the client part. Note that, as part of the requirements, not all content should be searched for keywords. To search the content you need to store it in full (at least in a straightforward implementation), which severely limits your proxy's ability to deliver large files. Therefore, based on the content type, i.e. text or non-text (and compressed or non-compressed), you can use different approaches to send the Web server's response back to the browser. You are not required to search compressed content for the keywords.
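The decision in step 5 could be sketched in Python as follows. The helper names `parse_headers`, `should_alter`, and `set_content_length` are hypothetical, and treating every uncompressed `text/*` response as searchable is an assumption, not a requirement. Note also that "Linköping" takes one more byte than "Stockholm" when encoded as UTF-8, so a `Content-Length` header copied unchanged from the server would no longer match the altered body.

```python
def parse_headers(head: bytes) -> dict[str, str]:
    """Parse a response head (status line plus headers) into a dict.

    Header names are lower-cased because HTTP header names are
    case-insensitive.
    """
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(b":")
        if sep:
            key = name.strip().lower().decode("latin-1")
            headers[key] = value.strip().decode("latin-1")
    return headers


def should_alter(headers: dict[str, str]) -> bool:
    """Buffer and search only uncompressed text; stream everything else."""
    content_type = headers.get("content-type", "").lower()
    encoding = headers.get("content-encoding", "identity").lower()
    return content_type.startswith("text/") and encoding == "identity"


def set_content_length(head: bytes, length: int) -> bytes:
    """Return the head with Content-Length set to the altered body's size."""
    lines = [line for line in head.split(b"\r\n")
             if not line.lower().startswith(b"content-length:")]
    lines.append(b"Content-Length: %d" % length)
    return b"\r\n".join(lines)
```

Responses for which `should_alter` returns false can be passed straight through to the browser without being stored. This sketch does not handle chunked transfer encoding.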

 



Testing

Your proxy will be tested on the following 5 test cases:
  1. A simple ASCII text file
  2. A looong ASCII text file
  3. A simple HTML file
  4. An HTML file with link to a photo
  5. An HTML file with embedded photos
Once you have these cases working, you can try your proxy on other pages. However, it might be challenging to find non-HTTPS webpages, especially ones containing the given keywords. It is important that you understand (and explain) what your proxy can and cannot do, as well as why it has the limitations that it does have. This is expected to be described in your report.

 



What to Deliver

When you are finished, please create a single (g)zipped archive with your solution. Your file should include all the above-mentioned items.

 



Demonstration

The primary test of correctness for your proxy is a simple visual test. That is, for most Web pages, the content displayed by your Web browser should look the same regardless of whether you are using your Web proxy or retrieving content directly from the Web server. This mode of operation can be called "invisible" mode, since the presence of the proxy is invisible to the user. The only differences appear when you try to access content containing "Smiley" or "Stockholm". In this case, the keywords should be altered according to the instructions in the requirements specification.

The TA will ask you to demonstrate your Fake News Web proxy in action; e.g., by browsing the test links and possibly other HTTP sites as well. You should be ready to answer questions about the details of your code.

 



Important Hints

 



A Note on Wireshark Protocol Dissectors

You may find the following hint useful, especially when you use Wireshark to sniff data that is being exchanged between the browser and the proxy server.

Wireshark uses protocol dissectors to extract information from packets. For example, the information shown under the "Hypertext Transfer Protocol" node in the packet details pane is extracted using the HTTP protocol dissector. Wireshark, however, is not always able to choose the right dissector for a packet. This happens, for example, when an uncommon port is used for a common protocol, so that Wireshark cannot tell which dissector applies. Such an example is shown in Figure 1, where a proxy server on port 60000 is used to access web pages. It can be seen from Figure 1 that although the contents of the packets are HTTP data, the protocol is not detected as HTTP.

Figure 1: Surfing the Web through a proxy server on port 60000

Fortunately, it is possible to tell Wireshark which dissector to use for a given packet. Right-clicking on a packet and selecting "Decode As ..." opens a window that allows you to assign the desired protocol dissector to the selected packet, see Figure 2. In the example of Figure 2, after selecting the Transport tab, one can select the HTTP protocol dissector to be used for every packet with the source port of 60000. In the same manner, one can right-click on a packet with the destination port of 60000 and assign the HTTP dissector to it, so that both outbound and inbound packets to and from the proxy server are decoded as HTTP. Please note that such user-specified decodes cannot be saved and are lost upon exiting Wireshark.

Figure 2: The "Decode As" window