Fake News: A Content-Altering Web Proxy
Based on an assignment by
Carey Williamson. Adapted by Carl Magnus Bruhner.
This assignment replaces
Net Ninny (some content preserved).
Last updated January 2021.
Overview of the Assignment
The purpose of this assignment is to learn about the HyperText Transfer
Protocol (HTTP) used by the World Wide Web. In particular, you will design and
implement an HTTP proxy (i.e., Web proxy) with functionality that demonstrates
both the simplicity and the power of HTTP as an application-layer protocol.
Along the way, you will also learn a lot about socket programming, TCP/IP,
network debugging, and more. You may use Java, Python, C, or C++ for the
assignment.
The so-called "Fake News" phenomenon is rampant on the Internet, but it seems
unfair to let Donald Trump have all the fun, so let's make some fake news of
our own! We are going to make a Web proxy that alters certain content on simple
Web pages before they are rendered by the Web browser, so that the user sees
factually incorrect information without knowing it. To keep the assignment
simple, we will restrict ourselves only to HTTP (not HTTPS), and consider only
basic text and HTML pages with a few images.
For our purpose, the fake news will involve Smiley from Stockholm.
Specifically, you need to change all occurrences of "Smiley" on a Web page
into "Trolly", and all occurrences of "Stockholm" into "Linköping".
And if you find any JPG images of Smiley (linked or embedded),
then you should replace them with your favourite troll
image file (JPG, GIF, or PNG) from the Internet.
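As a rough illustration of the two text substitutions described above, a minimal Python sketch might look like the following. The names alter_text and REPLACEMENTS are hypothetical, and the sketch deliberately ignores images and HTML structure; it only shows the keyword replacement on already-decoded text.

```python
# Hypothetical replacement table, taken from the assignment text.
REPLACEMENTS = {
    "Smiley": "Trolly",
    "Stockholm": "Linköping",
}


def alter_text(text: str) -> str:
    """Return the text with every keyword replaced by its fake counterpart."""
    for original, fake in REPLACEMENTS.items():
        text = text.replace(original, fake)
    return text


if __name__ == "__main__":
    print(alter_text("Smiley lives in Stockholm."))
    # Compare the UTF-8 byte lengths of the old and the new city name.
    print(len("Stockholm".encode("utf-8")), len("Linköping".encode("utf-8")))
```

Two things are worth noticing. A plain replace also rewrites keywords that occur inside URLs (for example in an image reference), which may or may not be what you want. And because "ö" takes two bytes in UTF-8, the altered body can be longer than the original, so a Content-Length header (if present) would no longer match.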
Background
About Web Proxies
A Web proxy is a piece of software that functions as an intermediary
between a Web client (browser) and a Web server. The Web proxy intercepts Web
requests from clients and reformulates the requests for transmission to a Web
server. When a response is received from the Web server, the proxy sends the
response back to the client. From the server's point of view, the proxy
is the client, since that is where the request comes from. Similarly,
from the client's point of view, the proxy is the server, since that is
where the response comes from. A Web proxy thus provides a single point of
control to regulate Internet access between clients and servers. A lot of
schools use Web proxies to limit the types of Web sites that students are
allowed to access. Net Nanny and Barracuda are examples of commercially
available Web proxies.
Socket Programming
Beej's Guide to Network Programming is a good resource for socket programming.
You will need at least one TCP (stream) socket for client-proxy communication,
and at least one additional TCP (stream) socket for proxy-server communication.
If you want your proxy to support multiple concurrent HTTP transactions
(recommended), you will need to fork child processes for
request handling as well. Each child process will use its own socket instances
for its communications with the client and with the server. An example for such
usage of the fork() function is demonstrated in Beej's Guide to Network
Programming under "A Simple Stream
Server".
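The accept loop described above could be sketched as follows. This is an illustration under stated assumptions, not a reference solution: it is written in Python and uses one thread per connection in place of fork() (a plain substitution, chosen because it behaves the same on every platform). The names make_listener, serve, and handle_client are hypothetical, and the handler only acknowledges the bytes it received; a real proxy would parse and forward the request instead.

```python
import socket
import threading


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a TCP (stream) socket, bind it, and start listening."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, port))  # port 0 lets the OS pick a free port
    listener.listen()
    return listener


def handle_client(conn: socket.socket) -> None:
    """Placeholder handler: read once and report how many bytes arrived."""
    with conn:
        data = conn.recv(4096)
        conn.sendall(b"received %d bytes\n" % len(data))


def serve(listener: socket.socket, max_clients=None) -> None:
    """Accept connections and hand each one to its own thread.

    max_clients is only there so the loop can be stopped in a test;
    None means "serve forever".
    """
    served = 0
    while max_clients is None or served < max_clients:
        conn, _addr = listener.accept()
        # Each connection gets its own socket object and its own thread.
        worker = threading.Thread(target=handle_client, args=(conn,), daemon=True)
        worker.start()
        served += 1
```

Calling serve(make_listener(port=8080)) would then accept browser connections on port 8080 (an arbitrary choice) until the process is stopped.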
About Hypertext Transfer Protocol (HTTP)
The Web and HTTP are discussed in Section 2.2 of the course textbook,
Computer Networking: A Top-Down Approach. Read this section
well before proceeding to implement your proxy server! Pay specific attention to
the discussion of non-persistent and persistent connections (Section 2.2.2 of
the textbook). You may need to refer to the HTTP/1.0 and HTTP/1.1 specifications as well.
Note that HTTP/1.1 uses persistent connections by default, so Web servers do
not close the connection immediately after servicing the current
request unless the HTTP request explicitly tells them to do so. If
the server does not receive a new request over the current TCP connection, it
will close the connection after a configured period of time, say 30 seconds.
Since the length of the HTTP content sent by the server is not always mentioned
in the HTTP headers (an example is Chunked Transfer specified in HTTP/1.1) it is
up to the HTTP client (i.e. the browser) to interpret the content and determine
whether more data is coming from the server as part of the current HTTP
response. It will, however, be difficult for (and not necessary for all types
of) proxy servers to analyze the HTTP content. One way to avoid waiting for the
server timeouts (to determine the end of transmission) is to modify the HTTP
request.
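One possible request modification (an assumption here, not something this text prescribes) is to ask the server to close the connection once the response has been sent, using the standard Connection: close header. The proxy can then treat the end of the TCP stream as the end of the response. A minimal Python sketch, with the hypothetical name force_connection_close:

```python
def force_connection_close(request: bytes) -> bytes:
    """Rewrite a request so the server closes the connection after replying.

    Assumes `request` contains a complete header section terminated by an
    empty line (CRLF CRLF), optionally followed by a body.
    """
    head, separator, body = request.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    request_line, header_lines = lines[0], lines[1:]

    kept = []
    for line in header_lines:
        # Header names are case-insensitive, so compare in lower case.
        name = line.split(b":", 1)[0].strip().lower()
        if name in (b"connection", b"proxy-connection"):
            continue  # drop any existing keep-alive style header
        kept.append(line)
    kept.append(b"Connection: close")

    return b"\r\n".join([request_line] + kept) + separator + body
```

The function works on bytes rather than text, so it does not depend on the character encoding of anything that follows the headers.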
Requirements Specification
In this assignment, you will implement your very own Web proxy, in either
C, C++, Java or Python using TCP/IP socket programming. The goals of the
assignment are to build a properly functioning Web proxy for simple Web pages,
and then use your proxy to change some of the content before it is delivered to
the browser.
There are two main pieces of functionality needed in your proxy. The first is
the ability to intercept (and parse) HTTP requests and responses, so that the
proxy can determine what changes (if any) need to be made to the requested
content. The second is the ability to insert the false information into the page
in some appropriate way so that the page still renders properly.
- The most important HTTP command for your Web proxy to handle is the "GET"
request, which specifies the URL for an object to be retrieved. In the basic
operation of your proxy, it should be able to parse, understand, and forward to
the Web server a (possibly modified) version of the client HTTP request.
Similarly, the proxy should be able to parse, understand, and return to the
client a (possibly modified) version of the HTTP response that the Web server
provided to the proxy. Please give some careful thought to how your proxy
handles commonly occurring HTTP response codes, such as 200 (OK), 304 (Not
Modified), and 404 (Not Found).
- You will need at least one TCP socket (i.e., SOCK_STREAM) for client-proxy
communication, and at least one additional TCP socket for each Web server you
are talking to for proxy-server communication. If you want your proxy to support
multiple concurrent HTTP transactions (not required), you will need to fork
child processes for request handling as well. Each child process or thread will
use its own socket instances for its communications with the client and with the
server.
- When implementing your proxy, feel free to compile and run your Web proxy on
a university machine or your own computer. However, be aware that you will
ultimately have to demo your proxy to your TA at some point. You should try to
access your proxy from your favourite Web browser (e.g., Edge, Firefox, Chrome,
Safari), and computer (either on campus or at home). To test the proxy, you will
have to configure your Web browser to use your specific Web proxy (e.g., look
for menu selections like Tools, Internet Options, Proxies, Advanced, LAN
Settings).
- As you design and build your Web proxy, give careful consideration to how you
will debug and test it. For example, you may want to print out information about
requests and responses received, processed, forwarded, redirected, or altered.
Once you become confident with the basic operation of your Web proxy, you can
toggle off the verbose debugging output. If you are testing on your home
network, you can also use Wireshark to collect network packet traces. By
studying the HTTP messages and TCP/IP packets going to and from your proxy, you
might be able to figure out what is working, what isn't working, and why.
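To make the request-parsing requirement above more concrete, here is a small Python sketch that takes the first line of a request as a browser sends it to a proxy, extracts the server to contact, and rebuilds the line in the form a Web server normally expects. The names split_request_target and to_origin_form are hypothetical, and the sketch assumes plain http:// URLs with a host name or IPv4 address (no IPv6 literals, no HTTPS).

```python
def split_request_target(request_line: bytes):
    """Return (method, host, port, path, version) for a proxy-style request line.

    Assumes an absolute http:// URL, e.g.
    b"GET http://example.com:8080/a.html HTTP/1.1".
    """
    method, target, version = request_line.strip().split(b" ", 2)
    if not target.lower().startswith(b"http://"):
        raise ValueError("expected an absolute http:// URL")

    rest = target[len(b"http://"):]
    authority, _slash, path = rest.partition(b"/")
    path = b"/" + path  # an empty path becomes "/"

    host, colon, port_text = authority.partition(b":")
    port = int(port_text.decode("ascii")) if colon else 80  # 80 is the HTTP default
    return method, host.decode("ascii"), port, path, version


def to_origin_form(request_line: bytes) -> bytes:
    """Rebuild the request line with only the path, as sent without a proxy."""
    method, _host, _port, path, version = split_request_target(request_line)
    return method + b" " + path + b" " + version
```

The host and port returned by split_request_target are what the proxy would pass to connect() on the server-side socket; the rewritten line is what it would send over that socket.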
Note:
As part of this assignment you should learn socket programming. You are expected
to use only the basic libraries available for socket programming. If
uncertain about what libraries you can use, we highly recommend that you check
this with your TA before setting out to use non-basic libraries, as their use
might violate the goals of the assignment, which are to learn about (1) HTTP and
(2) socket programming. For example, using an HttpURLConnection Java class to
fetch the data from the Web server is not allowed! The proxy should not impose
any limit on the size of the transferred HTTP data, not even with realloc() or
similar.
You do not have to relay HTTPS requests through the proxy, and the browser
can be configured to use the proxy only for HTTP.
Development Strategy
If you are not sure how to start developing your proxy server, you can
use the following stepwise strategy:
- Consider that the proxy server has two parts, a server part that the
browsers connect to, and a client part that connects to the Web servers. The
server part and the client part are not isolated pieces of code. That is,
the client part can be a class object instantiated from the server part, a
function called from the server part, or even some lines of code embedded in
the code of the server part. The
server part receives the HTTP request from the browser and delivers that
HTTP request to the client part (this delivery, in its simplest form, can be
done by using the same variable). The client part then (based on the HTTP
request) determines to which Web server it should connect and—after
connecting to that server—sends the HTTP request to and receives the
HTTP response from that server. The client part then delivers back the
received content to the server part to be sent back to the browser.
- Read and understand the simple TCP server and the simple TCP client
examples. Try to
identify the steps taken in the TCP server (i.e. creating a socket,
binding the socket to the desired address and port, listening for the
connections, accepting connections, forking to handle concurrent connections,
sending/receiving data, and closing the socket) and in the TCP client (i.e. creating
a socket, connecting to a server, sending/receiving data, and closing the
socket). A good understanding of these steps, and differences between a TCP
server and a TCP client, will greatly help you in getting your proxy server
up and running quickly.
- Start your coding by implementing the server part of your proxy. The
server part should receive the HTTP request from the browser and deliver it
to the client part. You may print out the HTTP request on the screen to make
sure it is received and stored correctly.
- Add the code for the client part to your proxy server. The client part
should receive the HTTP request from the server part and extract the
information needed to carry out the request on behalf of the browser. The
client part should also apply the required modifications to the HTTP request
to make it ready to be sent to the Web server. If you are not sure what
modifications should be done to the HTTP request, please read
this
part and this part more
carefully! Once the required information is extracted and the request is
appropriately modified, the client part should connect to the Web server,
send the modified request to it, and receive the HTTP response from the
server. You will implement the content altering in a later step.
- Add content altering to the client part. Please note that, as part of the requirements,
not all content should be searched
for keywords. The reason is that to search the content,
you need to store the whole content (at least in a straightforward
implementation) which severely limits the ability of your proxy to
handle delivery of large files.
Therefore, based on the content type, i.e. text or non-text (and compressed
or non-compressed), you can use different approaches to send the Web
server's response back to the browser. Please consider that it is not
required to search the compressed content for the keywords.
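The decision in the last step (search the body or just pass it through) can be based on the response headers alone. The following Python sketch shows one way to do that; parse_response and should_alter are hypothetical names, the input is assumed to start with a complete header section, and repeated headers simply overwrite each other, which is good enough for this illustration.

```python
def parse_response(response: bytes):
    """Split raw response bytes into (status_code, headers, body).

    Header names are lower-cased because they are case-insensitive.
    Everything stays in bytes until after the split, so NUL bytes in
    the body cause no problems.
    """
    head, _separator, body = response.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    status_code = int(lines[0].split(b" ", 2)[1])  # e.g. b"HTTP/1.1 200 OK"

    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(b":")
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("latin-1")
    return status_code, headers, body


def should_alter(status_code: int, headers: dict) -> bool:
    """Return True only for successful, uncompressed text responses."""
    content_type = headers.get("content-type", "").lower()
    encoding = headers.get("content-encoding", "identity").lower()
    return (
        status_code == 200
        and content_type.startswith("text/")
        and encoding == "identity"
    )
```

When should_alter returns False (images, compressed content, 304 or 404 responses), the proxy can forward the data to the browser as it arrives instead of storing the whole body first.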
Testing
Your proxy will be tested on the following 5 test cases:
- A simple ASCII text file
- A looong ASCII text file
- A simple HTML file
- An HTML file with link to a photo
- An HTML file with embedded photos
Once you have these cases working, you can try your proxy on other pages.
However, it might be challenging to find non-HTTPS Web pages, especially ones
containing the given keywords. It is important that you understand (and explain)
what your proxy can and cannot do, as well as why it has the limitations that it
does have. This is expected to be described in your report.
What to Deliver
- The source code of your Web proxy in which the function of each
block of code is described by a short comment.
- A clear and concise user manual (at most 1 page) that describes how
to compile, configure, and use your Web proxy. Make sure to indicate the
required features that the proxy supports. Make sure to clarify where
and how the testing was done (e.g., home or university), what works, and
what does not. Be honest!
When you are finished, please create a single (g)zipped archive with
your solution. Your file should include all the above-mentioned items.
Demonstration
The primary test of correctness for your proxy is a simple visual test. That is, for most Web pages,
the content displayed by your Web browser should look the same regardless of whether you are using
your Web proxy or retrieving content directly from the Web server. This mode of operation can be called
"invisible" mode, since the presence of the proxy is invisible to the user. The only differences appear
when you try to access content containing "Smiley" or "Stockholm". In this case, the keywords should
be altered according to the instructions in the
requirements specification.
The TA will ask you to demonstrate your Fake News Web proxy in action; e.g.,
by browsing the test links and possibly other HTTP sites as well. You
should be ready to answer questions about the details of your code.
Important Hints
- This is a very challenging assignment, so please get started early.
It is planned to take 3 lab slots (i.e., weeks).
- Focus on the basic HTTP proxy functionality first, by simply
forwarding everything that you receive from the client directly to the
server, and everything you receive from the server directly back to the
client. Then add more functionality, such as text parsing, content
alteration, request alteration, and/or HTTP redirection.
- Start with very simple Web pages, such as those indicated above.
Once you have these working, then you can try more complicated Web pages
with lots of embedded objects, possibly from multiple servers.
- HTTP data may contain null (i.e. '\0') characters. This may happen, for example, when the content is encoded
using the gzip algorithm (such as the search results returned by Google).
Therefore, be aware that string manipulation functions—such as strlen(), strstr(), strcat(),
etc.—that assume null-terminated strings as their input parameters might not work on HTTP
content as expected. These functions are, however, safe to be used when processing HTTP
headers.
- If you need to manipulate HTTP headers, be advised that not all browsers and Web servers use the
same case for the headers. For example, Connection: keep-alive and
Connection: Keep-Alive are both valid HTTP headers. Your solution should
therefore not be case sensitive.
- Be advised that some Web browsers change the HTTP request when they are
configured to send the request through a Web proxy. Whether you need to
worry about this depends on your implementation of the proxy server. For
example, the Chrome browser sends the following requests in the absence and
in the presence of a Web proxy, respectively (the requests are partially
shown):
- GET / HTTP/1.1
Host: www.google.com
Connection: keep-alive
- GET http://www.google.com/ HTTP/1.1
Host: www.google.com
Proxy-Connection: keep-alive
Some Web servers do not respond as expected when the host information appears on the GET line of the
HTTP request.
- Sometimes, even if you have properly closed the listening socket
when your proxy server quits, you
will receive the Address already in use error message when you re-run your proxy server.
The reason is that the operating system keeps the address and port reserved for
a while after the socket is closed (for example, while old connections linger in
the TCP TIME_WAIT state). As a workaround you may want to use the
setsockopt() function with the
SO_REUSEADDR option on the listening socket.
- Your proxy will need one socket for talking to the client, and
another socket for talking to the server. Make sure to keep track of
which one is which. This is very important to understand!
- Your proxy will likely need to dynamically create a socket for every
new server that it talks to. Most of the examples above involve only one
server, which is easier. But you will likely need to generalize this to
multiple servers. If so, make sure to manage these sockets properly.
- You may find that network firewalls block certain ports, which may
make configuration and use of your proxy tricky. A good Wireshark trace
can help show you what is actually happening on the network.
- Try to avoid servers that automatically redirect HTTP to HTTPS,
since TLS handshakes and encrypted content are well beyond the intended
scope of the assignment. Let's keep things simple with HTTP only.
- Here is a generic
debugging checklist that you might find helpful.
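The SO_REUSEADDR hint above translates into a single extra call on the listening socket. A minimal Python sketch follows; the function name make_reusable_listener and its default arguments are hypothetical.

```python
import socket


def make_reusable_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR enabled."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the option before bind(); setting it afterwards does not help
    # with the bind that has already been attempted.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))  # port 0 lets the OS pick a free port
    listener.listen()
    return listener
```

In C the equivalent is a setsockopt() call with SOL_SOCKET and SO_REUSEADDR placed between socket() and bind().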
A Note on Wireshark Protocol Dissectors
You may find the following hint useful, especially when you use Wireshark to sniff data
that is being exchanged between browser and the proxy server.
Wireshark uses protocol dissectors to extract information from packets. For
example, the information shown under the "Hypertext Transfer Protocol" node
in the packet details pane is extracted using the HTTP protocol
dissector. Wireshark, however, is not always able to choose the right
dissector for a packet. This happens, for example, when an uncommon port is used for a
common protocol, leaving Wireshark unable to choose the right
dissector. Such an example is shown in Figure 1 where a proxy server on port
60000 is used to access web pages. It can be seen from Figure 2 that although
the contents of the packets are HTTP data, the protocol is not detected as HTTP.
Fortunately, it is possible to instruct Wireshark what dissector to use for a given
packet. By right-clicking on a packet and selecting "Decode As ...", a window
opens which allows assigning the desired protocol dissector to the selected packet,
see Figure 6. In the example of Figure 6, after selecting the Transport tab, one
can select the HTTP protocol dissector to be used for every packet with the source
port of 60000. In the same manner, one can right-click on a packet with the
destination port of 60000 and assign the HTTP dissector to it, so that both
outbound and inbound packets to and from the proxy server are decoded as HTTP.
Please note that such user-specified decodes cannot be saved and are lost upon
exiting Wireshark.