Fake News: A Content-Altering Web Proxy

Based on an assignment by Carey Williamson. Adapted by Carl Magnus Bruhner.
This assignment replaces Net Ninny (some content preserved).
Last updated January 2021.

Contents

Overview of the Assignment
Background
Requirements Specification
Development Strategy
Testing
What to Deliver
Demonstration
Important Hints
A Note on Wireshark Protocol Dissectors

Overview of the Assignment

The purpose of this assignment is to learn about the HyperText Transfer Protocol (HTTP) used by the World Wide Web. In particular, you will design and implement an HTTP proxy (i.e., Web proxy) with functionality that demonstrates both the simplicity and the power of HTTP as an application-layer protocol. Along the way, you will also learn a lot about socket programming, TCP/IP, network debugging, and more. You may use Java, Python, C, or C++ for the assignment.

The so-called "Fake News" phenomenon is rampant on the Internet, but it seems unfair to let Donald Trump have all the fun, so let's make some fake news of our own! We are going to make a Web proxy that alters certain content on simple Web pages before they are rendered by the Web browser, so that the user sees factually incorrect information without knowing it. To keep the assignment simple, we will restrict ourselves only to HTTP (not HTTPS), and consider only basic text and HTML pages with a few images.

For our purpose, the fake news will involve Smiley from Stockholm. Specifically, you need to change all occurrences of "Smiley" on a Web page into "Trolly", and all occurrences of "Stockholm" into "Linköping". And if you find any JPG images of Smiley (linked or embedded), then you should replace them with your favourite troll image file (JPG, GIF, or PNG) from the Internet.
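As a rough illustration of the text substitution described above, here is a minimal Python sketch. The names REPLACEMENTS and fake_news are made up for this example, and the sketch assumes the page text has already been decoded to a string. It does not treat URLs or image links specially, which a real proxy would have to think about.

```python
# Keyword pairs taken from the assignment text: (original, replacement).
# The names REPLACEMENTS and fake_news are illustrative only.
REPLACEMENTS = (("Smiley", "Trolly"), ("Stockholm", "Linköping"))


def fake_news(text: str) -> str:
    """Return text with every keyword replaced by its fake counterpart."""
    for original, replacement in REPLACEMENTS:
        text = text.replace(original, replacement)
    return text
```

For example, fake_news("Smiley moved to Stockholm.") gives "Trolly moved to Linköping.". Note that the match is case sensitive, so "smiley" in lower case is left alone.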

Background

About Web Proxies

A Web proxy is a piece of software that functions as an intermediary between a Web client (browser) and a Web server. The Web proxy intercepts Web requests from clients and reformulates the requests for transmission to a Web server. When a response is received from the Web server, the proxy sends the response back to the client. From the server's point of view, the proxy is the client, since that is where the request comes from. Similarly, from the client's point of view, the proxy is the server, since that is where the response comes from. A Web proxy thus provides a single point of control to regulate Internet access between clients and servers. A lot of schools use Web proxies to limit the types of Web sites that students are allowed to access. Net Nanny and Barracuda are examples of commercially available Web proxies.

Socket Programming

A good resource for socket programming is Beej's Guide to Network Programming.

You will need at least one TCP (stream) socket for client-proxy communication, and at least one additional TCP (stream) socket for proxy-server communication. If you want your proxy to support multiple concurrent HTTP transactions (recommended), you will need to fork child processes for request handling as well. Each child process will use its own socket instances for its communications with the client and with the server. An example of such usage of the fork() function is given in Beej's Guide to Network Programming under "A Simple Stream Server".
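The accept loop described above could look roughly like the following Python sketch. Note that it uses one thread per connection instead of fork(), purely to keep the example portable; the structure with fork() is the same. The function names make_listener and serve_forever, the loopback address, and the handle_client callback are assumptions made for this example.

```python
import socket
import threading


def make_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket; port 0 lets the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts of the proxy on the same port.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    return listener


def serve_forever(listener, handle_client):
    """Accept connections and give each one its own thread and socket."""
    while True:
        try:
            client_sock, _addr = listener.accept()
        except OSError:
            break  # the listening socket was closed
        worker = threading.Thread(
            target=handle_client, args=(client_sock,), daemon=True
        )
        worker.start()
```

Here handle_client is whatever function deals with one browser connection; in a proxy it would read the request, open a second socket to the Web server, and relay the response.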

About Hypertext Transfer Protocol (HTTP)

The Web and HTTP are discussed in Section 2.2 of the course textbook, Computer Networking: A Top-Down Approach. Read this section well before proceeding to implement your proxy server! Pay specific attention to the discussion of non-persistent and persistent connections (Section 2.2.2 of the textbook). You may need to refer to the HTTP/1.0 and HTTP/1.1 specifications as well.

Keep in mind that HTTP/1.1 uses persistent connections by default, so Web servers do not close a connection immediately after servicing the current request unless they are explicitly told to do so in the HTTP request. If the server does not receive a new request over the current TCP connection, it will close the connection after a configured period of time, say 30 seconds. Since the length of the HTTP content sent by the server is not always stated in the HTTP headers (an example is chunked transfer encoding, specified in HTTP/1.1), it is up to the HTTP client (i.e., the browser) to interpret the content and determine whether more data is coming from the server as part of the current HTTP response. It will, however, be difficult for proxy servers to analyze the HTTP content in this way, and not every type of proxy needs to. One way to avoid waiting for the server timeout to detect the end of transmission is to modify the HTTP request so that the server is told to close the connection once the response has been sent.
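One possible way to modify the request, sketched in Python below, is to replace any Connection or Proxy-Connection header with Connection: close, so that the server closes the TCP connection when the response is complete. The function name force_connection_close is made up for this example, and the sketch assumes a request without a body (such as a plain GET); anything after the blank line is dropped.

```python
def force_connection_close(request: bytes) -> bytes:
    """Return the request head with 'Connection: close' as the only
    connection-related header. Assumes the request has no body."""
    head = request.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")
    request_line, header_lines = lines[0], lines[1:]
    kept = []
    for line in header_lines:
        # Header names are compared case-insensitively.
        name = line.split(b":", 1)[0].strip().lower()
        if name in (b"connection", b"proxy-connection"):
            continue
        kept.append(line)
    kept.append(b"Connection: close")
    return b"\r\n".join([request_line] + kept) + b"\r\n\r\n"
```

With this in place the proxy can simply read from the server socket until recv() returns an empty result, which signals that the server has closed the connection.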

Requirements Specification

In this assignment, you will implement your very own Web proxy, in either C, C++, Java or Python using TCP/IP socket programming. The goals of the assignment are to build a properly functioning Web proxy for simple Web pages, and then use your proxy to change some of the content before it is delivered to the browser.

There are two main pieces of functionality needed in your proxy. The first is the ability to intercept (and parse) HTTP requests and responses, so that the proxy can determine what changes (if any) need to be made to the requested content. The second is the ability to insert the false information into the page in some appropriate way so that the page still renders properly.

The most important HTTP command for your Web proxy to handle is the "GET" request, which specifies the URL for an object to be retrieved. In the basic operation of your proxy, it should be able to parse, understand, and forward to the Web server a (possibly modified) version of the client HTTP request. Similarly, the proxy should be able to parse, understand, and return to the client a (possibly modified) version of the HTTP response that the Web server provided to the proxy. Please give some careful thought to how your proxy handles commonly occurring HTTP response codes, such as 200 (OK), 304 (Not Modified), and 404 (Not Found).
You will need at least one TCP socket (i.e., SOCK_STREAM) for client-proxy communication, and at least one additional TCP socket for each Web server you are talking to for proxy-server communication. If you want your proxy to support multiple concurrent HTTP transactions (not required), you will need to fork child processes for request handling as well. Each child process or thread will use its own socket instances for its communications with the client and with the server.
When implementing your proxy, feel free to compile and run your Web proxy on a university machine or your own computer. However, be aware that you will ultimately have to demo your proxy to your TA at some point. You should try to access your proxy from your favourite Web browser (e.g., Edge, Firefox, Chrome, Safari), and computer (either on campus or at home). To test the proxy, you will have to configure your Web browser to use your specific Web proxy (e.g., look for menu selections like Tools, Internet Options, Proxies, Advanced, LAN Settings).
As you design and build your Web proxy, give careful consideration to how you will debug and test it. For example, you may want to print out information about requests and responses received, processed, forwarded, redirected, or altered. Once you become confident with the basic operation of your Web proxy, you can toggle off the verbose debugging output. If you are testing on your home network, you can also use Wireshark to collect network packet traces. By studying the HTTP messages and TCP/IP packets going to and from your proxy, you might be able to figure out what is working, what isn't working, and why.
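To make the parsing step above more concrete, here is a small Python sketch that splits the head of an HTTP message (request or response) into its start line and headers, and extracts the status code from a response status line. The names parse_head and parse_status_code are assumptions for this example. Header names are stored in lower case, and a repeated header simply overwrites the earlier one, which is a simplification.

```python
def parse_head(head: bytes):
    """Split an HTTP message head into (start_line, headers).

    headers maps lower-cased header names to their values.
    """
    text = head.decode("iso-8859-1")  # maps every byte to one character
    lines = text.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # blank line marks the end of the headers
        name, _sep, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers


def parse_status_code(status_line: str) -> int:
    """Extract the numeric code from e.g. 'HTTP/1.1 404 Not Found'."""
    return int(status_line.split(" ", 2)[1])
```

For instance, parsing a response that begins with "HTTP/1.1 304 Not Modified" yields the status code 304, which tells the proxy that there is no body to alter.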

Note: As part of this assignment you should learn socket programming. You are expected to use only the basic libraries available for socket programming. If uncertain about what libraries you can use, we highly recommend that you check this with your TA before setting out to use non-basic libraries, as their use might violate the goals of the assignment, which are to learn about (1) HTTP and (2) socket programming. For example, using an HttpURLConnection Java class to fetch the data from the Web server is not allowed! The proxy should not impose any limit on the size of the transferred HTTP data, not even with realloc() or similar.

You do not have to relay HTTPS requests through the proxy, and the browser can be configured to use the proxy only for HTTP.

Development Strategy

If you are not sure how to start developing your proxy server, you can use the following stepwise strategy:

Consider that the proxy server has two parts, a server part that the browsers connect to, and a client part that connects to the Web servers. The server part and the client part are not isolated pieces of code. That is, the client part can be a class object instantiated from the server part, a function called from the server part, or even some lines of code embedded in the code of the server part. The server part receives the HTTP request from the browser and delivers that HTTP request to the client part (this delivery, in its simplest form, can be done by using the same variable). The client part then determines, based on the HTTP request, which Web server it should connect to. After connecting to that server, it sends the HTTP request and receives the HTTP response. The client part then delivers the received content back to the server part to be sent back to the browser.
Read and understand the simple TCP server and the simple TCP client examples. Try to identify the steps taken in the TCP server (i.e. creating a socket, binding the socket to the desired address and port, listening for the connections, accepting connections, forking to handle concurrent connections, sending/receiving data, and closing the socket) and in the TCP client (i.e. creating a socket, connecting to a server, sending/receiving data, and closing the socket). A good understanding of these steps, and differences between a TCP server and a TCP client, will greatly help you in getting your proxy server up and running quickly.
Start your coding by implementing the server part of your proxy. The server part should receive the HTTP request from the browser and deliver it to the client part. You may print out the HTTP request on the screen to make sure it is received and stored correctly.
Add the code for the client part to your proxy server. The client part should receive the HTTP request from the server part and extract the information needed to carry out the request on behalf of the browser. The client part should also apply the required modifications to the HTTP request to make it ready to be sent to the Web server. If you are not sure what modifications should be done to the HTTP request, please reread the background on HTTP and the Important Hints more carefully! Once the required information is extracted and the request is appropriately modified, the client part should connect to the Web server, send the modified request to it, and receive the HTTP response from the server. You will implement the content altering in a later step.
Add content altering to the client part. Please note that, as part of the requirements, not all content needs to be searched for keywords. The reason is that to search the content, you need to store the whole content (at least in a straightforward implementation), which severely limits the ability of your proxy to handle delivery of large files. Therefore, based on the content type, i.e., text or non-text (and compressed or non-compressed), you can use different approaches to send the Web server's response back to the browser. Please note that you are not required to search compressed content for the keywords.
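The decision in the last step could be sketched in Python as follows. This is only one possible approach: should_alter and with_updated_length are hypothetical names, and headers is assumed to be a dictionary with lower-cased header names. The second function matters because replacing "Stockholm" with "Linköping" changes the byte length of the body in UTF-8, so a Content-Length header copied from the original response would be wrong. The sketch assumes the response is not sent with chunked transfer encoding.

```python
def should_alter(headers: dict) -> bool:
    """Alter only uncompressed text; everything else is relayed as-is."""
    content_type = headers.get("content-type", "").lower()
    content_encoding = headers.get("content-encoding", "identity").lower()
    return content_type.startswith("text/") and content_encoding == "identity"


def with_updated_length(headers: dict, new_body: bytes) -> dict:
    """Return a copy of headers whose Content-Length matches new_body."""
    updated = dict(headers)
    updated["content-length"] = str(len(new_body))
    return updated
```

Responses for which should_alter returns False can be forwarded to the browser piece by piece as they arrive, without ever holding the whole body in memory.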

Testing

Your proxy will be tested on the following 4 test cases:

Once you have these cases working, you can try your proxy on other pages. However, it might be challenging to find non-HTTPS Web pages, especially ones containing the given keywords. It is important that you understand (and explain) what your proxy can and cannot do, as well as why it has the limitations that it does have. This is expected to be described in your report.

What to Deliver

The source code of your Web proxy in which the function of each block of code is described by a short comment.
A clear and concise user manual (at most 1 page) that describes how to compile, configure, and use your Web proxy. Make sure to indicate the required features that the proxy supports. Make sure to clarify where and how the testing was done (e.g., home or university), what works, and what does not. Be honest!

When you are finished, please create a single (g)zipped archive with your solution. Your file should include all the above-mentioned items.

Demonstration

The primary test of correctness for your proxy is a simple visual test. That is, for most Web pages, the content displayed by your Web browser should look the same regardless of whether you are using your Web proxy or retrieving content directly from the Web server. This mode of operation can be called "invisible" mode, since the presence of the proxy is invisible to the user. The only differences appear when you try to access content containing "Smiley" or "Stockholm". In this case, the keywords should be altered according to the instructions in the Requirements Specification.

The TA will ask you to demonstrate your Fake News Web proxy in action; e.g., by browsing the test links and possibly other HTTP sites as well. You should be ready to answer questions about the details of your code.

Important Hints

This is a very challenging assignment, so please get started early. It is planned to take 3 lab slots (i.e., weeks).
Focus on the basic HTTP proxy functionality first, by simply forwarding everything that you receive from the client directly to the server, and everything you receive from the server directly back to the client. Then add more functionality, such as text parsing, content alteration, request alteration, and/or HTTP redirection.
Start with very simple Web pages, such as those indicated above. Once you have these working, then you can try more complicated Web pages with lots of embedded objects, possibly from multiple servers.
HTTP data may contain null (i.e., '\0') characters. This may happen, for example, when the content is compressed with gzip (such as the search results returned by Google). Therefore, be aware that string manipulation functions that assume null-terminated strings as input, such as strlen(), strstr(), and strcat(), might not work on HTTP content as expected. These functions are, however, safe to use when processing HTTP headers.
If you need to manipulate HTTP headers, be advised that not all browsers and Web servers use the same case for the headers. For example, Connection: keep-alive and Connection: Keep-Alive are both valid HTTP headers. Your solution should therefore not be case sensitive.
Be advised that some Web browsers change the HTTP request when they are configured to send the request through a Web proxy. Whether you need to worry about this, depends on your implementation of the proxy server. For example, the Chrome browser sends the following requests in the absence and in the presence of a Web proxy, respectively (the requests are partially shown):
- GET / HTTP/1.1
  Host: www.google.com
  Connection: keep-alive
- GET http://www.google.com/ HTTP/1.1
  Host: www.google.com
  Proxy-Connection: keep-alive
Some Web servers do not respond as expected when the host information appears on the GET line of the HTTP request.
Sometimes, even if you have properly closed the listening socket when your proxy server quits, you will receive the Address already in use error message when you re-run your proxy server. The reason is that connections from the previous run can linger in the TCP TIME_WAIT state for a while, and by default the operating system does not allow the port to be bound again during that time. As a workaround you may want to use the setsockopt() function with the SO_REUSEADDR option on the listening socket.
Your proxy will need one socket for talking to the client, and another socket for talking to the server. Make sure to keep track of which one is which. This is very important to understand!
Your proxy will likely need to dynamically create a socket for every new server that it talks to. Most of the examples above involve only one server, which is easier. But you will likely need to generalize this to multiple servers. If so, make sure to manage these sockets properly.
You may find that network firewalls block certain ports, which may make configuration and use of your proxy tricky. A good Wireshark trace can help show you what is actually happening on the network.
Try to avoid servers that automatically redirect HTTP to HTTPS, since TLS handshakes and encrypted content are well beyond the intended scope of the assignment. Let's keep things simple with HTTP only.
Here is a generic debugging checklist that you might find helpful.
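Regarding the hint above about browsers putting the full URL on the GET line when a proxy is configured, the following Python sketch shows one way to turn such a request line into the host and port to connect to plus a request line with only the path. The function name to_origin_form is made up for this example. The sketch handles only http:// URLs, assumes port 80 when none is given, and does not handle IPv6 address literals.

```python
def to_origin_form(request_line: str):
    """Split 'GET http://host[:port]/path HTTP/1.1' into
    (host, port, 'GET /path HTTP/1.1')."""
    method, target, version = request_line.split(" ", 2)
    prefix = "http://"
    if not target.lower().startswith(prefix):
        raise ValueError("only absolute http:// URLs are handled here")
    rest = target[len(prefix):]
    # Everything up to the first '/' is host[:port]; the remainder is the path.
    authority, _slash, path = rest.partition("/")
    host, _colon, port_text = authority.partition(":")
    port = int(port_text) if port_text else 80
    return host, port, f"{method} /{path} {version}"
```

Applied to the Chrome example above, "GET http://www.google.com/ HTTP/1.1" becomes the host www.google.com, port 80, and the request line "GET / HTTP/1.1".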

A Note on Wireshark Protocol Dissectors

You may find the following hint useful, especially when you use Wireshark to sniff data that is being exchanged between the browser and the proxy server.

Wireshark uses protocol dissectors to extract information from packets. For example, the information shown under the "Hypertext Transfer Protocol" node in the packet details pane is extracted using the HTTP protocol dissector. Wireshark, however, is not always able to choose the right dissector for a packet. This happens, for example, when an uncommon port is used for a common protocol, so that Wireshark cannot tell which dissector to apply. Figure 1 shows such a case, in which a proxy server on port 60000 is used to access Web pages: although the contents of the packets are HTTP data, the protocol is not detected as HTTP.

Figure 1: Surfing the Web through a proxy server on port 60000

Fortunately, it is possible to tell Wireshark which dissector to use for a given packet. Right-clicking on a packet and selecting "Decode As ..." opens a window in which the desired protocol dissector can be assigned to the selected packet; see Figure 2. In the example of Figure 2, after selecting the Transport tab, one can select the HTTP protocol dissector to be used for every packet with the source port of 60000. In the same manner, one can right-click on a packet with the destination port of 60000 and assign the HTTP dissector to it, so that both outbound and inbound packets to and from the proxy server are decoded as HTTP. Please note that such user-specified decodes cannot be saved and are lost upon exiting Wireshark.

Figure 2: The "Decode As" window