Net Ninny: A Web Proxy Based Service

 

By Niklas Carlsson and Farrokh Ghani Zadegan, AG, and last modified January 2018
(Based on an assignment by Carey Williamson)

Overview of the Assignment

The purpose of this assignment is to learn about the World Wide Web and the HyperText Transfer Protocol (HTTP). Along the way, you will also learn about TCP/IP and socket programming in Java, C, or C++. You may use any of these languages for the assignment.

Imagine that you are a conscientious Internet user who wishes to protect your friends and family from viewing inappropriate Web content. In particular, you want them to avoid any Web pages that might insult their intelligence. Specific examples that may come to mind are Web pages that mention SpongeBob, Britney Spears, Paris Hilton, or Norrköping. You must do your best to prevent their Web browsers from viewing these sites.

Rather than purchasing commercial software, such as Net Nanny, to restrict access to Web content, you have decided to write your own solution. Your solution will build upon your knowledge of Web proxies and HTTP, coupled with a few simple rules for URL-based and content-based filtering of Web pages. Since the content filtering is not all that sophisticated, we will refer to this simple software as the "Net Ninny".

Note: As part of this assignment you should learn socket programming. You are expected to use only the basic libraries available for socket programming. If uncertain about what libraries you can use, we highly recommend that you check this with your TA before setting out to use non-basic libraries, as their use might violate the goals of the assignment, which are to learn about (1) HTTP and (2) socket programming. For example, using an HttpURLConnection Java class to fetch the data from the Web server is not allowed!

Preliminaries

About Web Proxies

A Web proxy is a software entity that functions as an intermediary between a Web client (browser) and a Web server. The Web proxy intercepts Web requests from clients and reformulates the requests for transmission to a Web server. When a response is received from the Web server, the proxy sends the response back to the client. While the presence of the proxy as an intermediary in the request-response interaction adds some overhead, one advantage of a proxy is that it conceals the identity of the client from the Web server. That is, from the server's point of view, the proxy is the client. Similarly, from the client's point of view, the proxy is the server. A Web proxy thus provides a single point of control to regulate Internet access between clients and servers. This is also a natural point to implement content filtering, if so desired.
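As an illustration of what "reformulating" a request can involve: a browser that is configured to use a proxy puts the full URL in the request line (for example "GET http://example.com/index.html HTTP/1.1"), whereas a Web server normally expects only the path there. The sketch below, in C, splits such a URL into host, port, and path. The function name split_http_url and its buffer-based interface are assumptions made for this example, and the parsing is deliberately simplified (lowercase "http://" only, no IPv6 literals).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the assignment text): splits an
 * absolute URL such as "http://example.com:8080/a/b" into host, port,
 * and path. Returns 0 on success, -1 if the URL does not start with
 * "http://" or one of the output buffers is too small. */
int split_http_url(const char *url,
                   char *host, size_t host_sz,
                   char *port, size_t port_sz,
                   char *path, size_t path_sz)
{
    const char *prefix = "http://";
    size_t plen = strlen(prefix);
    if (strncmp(url, prefix, plen) != 0)
        return -1;                                /* e.g. https is not handled */

    const char *authority = url + plen;           /* "host[:port]..." */
    size_t auth_len = strcspn(authority, "/");    /* up to first '/' or end */
    const char *rest = authority + auth_len;      /* "" or "/path..." */

    /* An optional ":port" may follow the host name. */
    const char *colon = memchr(authority, ':', auth_len);
    size_t host_len = colon ? (size_t)(colon - authority) : auth_len;
    if (host_len == 0 || host_len >= host_sz)
        return -1;
    memcpy(host, authority, host_len);
    host[host_len] = '\0';

    if (colon) {
        size_t port_len = auth_len - host_len - 1;
        if (port_len == 0 || port_len >= port_sz)
            return -1;
        memcpy(port, colon + 1, port_len);
        port[port_len] = '\0';
    } else {
        if (port_sz < 3)
            return -1;
        strcpy(port, "80");                       /* default HTTP port */
    }

    if (*rest == '\0') {
        if (path_sz < 2)
            return -1;
        strcpy(path, "/");                        /* empty path means "/" */
    } else {
        if (strlen(rest) >= path_sz)
            return -1;
        strcpy(path, rest);
    }
    return 0;
}
```

With the host and port in hand the proxy knows where to connect, and the path can be used to rebuild the request line before it is forwarded.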

Socket Programming

A good resource for socket programming is Beej's Guide to Network Programming, which is available in both HTML and PDF formats.

You will need at least one TCP (stream) socket for client-proxy communication, and at least one additional TCP (stream) socket for proxy-server communication. If you want your proxy to support multiple concurrent HTTP transactions (recommended), you will also need to fork child processes for request handling. Each child process will use its own socket instances for its communication with the client and with the server. An example of such usage of the fork() function is given in Beej's Guide to Network Programming under "A Simple Stream Server".
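The following C sketch shows one way the listening socket and the fork-per-connection loop described above might be arranged. It is loosely modeled on the pattern in Beej's guide but is not copied from it; the names open_listener, serve_forever, and handle_client are hypothetical, and ignoring SIGCHLD is just one simple way to avoid zombie child processes.

```c
#include <netdb.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns a listening TCP socket bound to `port`
 * (a decimal string such as "8080"), or -1 on failure. */
int open_listener(const char *port)
{
    struct addrinfo hints, *res, *p;
    int fd = -1, yes = 1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;       /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;   /* TCP */
    hints.ai_flags = AI_PASSIVE;       /* bind to the local wildcard address */
    if (getaddrinfo(NULL, port, &hints, &res) != 0)
        return -1;

    for (p = res; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0)
            continue;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);
        if (bind(fd, p->ai_addr, p->ai_addrlen) == 0 && listen(fd, 16) == 0)
            break;                     /* success */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Accept loop: the parent keeps listening while each child process
 * handles exactly one client connection via `handle_client`. */
void serve_forever(int listen_fd, void (*handle_client)(int client_fd))
{
    signal(SIGCHLD, SIG_IGN);          /* exited children are reaped automatically */
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0)
            continue;
        pid_t pid = fork();
        if (pid == 0) {                /* child */
            close(listen_fd);          /* the child does not accept connections */
            handle_client(client_fd);
            close(client_fd);
            _exit(0);
        }
        close(client_fd);              /* parent (also reached if fork failed) */
    }
}
```

Taking the port as a string makes it easy to pass a command-line argument straight through, which fits the requirement that the port must not be hard coded.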

About Hypertext Transfer Protocol (HTTP)

The Web and HTTP are discussed in Section 2.2 of the course textbook, Computer Networking: A Top-Down Approach, 5th Ed. Read this section well before proceeding to implement your proxy server! Pay specific attention to the discussion of non-persistent and persistent connections (Section 2.2.2 of the textbook). You may need to refer to the HTTP/1.0 and HTTP/1.1 specifications as well. Note that HTTP/1.1 uses persistent connections by default, so Web servers do not close a connection immediately after servicing a request unless the HTTP request explicitly tells them to. If the server does not receive a new request over the current TCP connection, it will close the connection after a configured period of time, say 30 seconds. Since the length of the HTTP content sent by the server is not always stated in the HTTP headers (one example is chunked transfer encoding, specified in HTTP/1.1), it is up to the HTTP client (i.e. the browser) to interpret the content and determine whether more data is coming from the server as part of the current HTTP response. Analyzing the HTTP content in this way is difficult for a proxy server, however, and is not necessary for every type of proxy. One way to avoid waiting for the server timeouts (to determine the end of transmission) is to modify the HTTP request.
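One possible request modification, offered here as an assumption rather than as the required solution, is to ask the server to close the connection after its response by sending a "Connection: close" header. The end of the response can then be detected when recv() returns 0. The C sketch below rewrites a request head accordingly; the function name force_connection_close is hypothetical, and the input is assumed to be a NUL-terminated request head (request line and headers only, ending in an empty line, no body).

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strncasecmp */

/* Hypothetical helper: copies the request head in `req` into `out`,
 * dropping any existing Connection or Proxy-Connection header and
 * appending "Connection: close" before the terminating empty line.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * `out` is too small or `req` lacks the terminating empty line. */
int force_connection_close(const char *req, char *out, size_t out_sz)
{
    static const char close_hdr[] = "Connection: close\r\n\r\n";
    size_t used = 0;
    const char *line = req;

    for (;;) {
        const char *eol = strstr(line, "\r\n");
        if (eol == NULL)
            return -1;                            /* incomplete head */
        if (eol == line)
            break;                                /* empty line: end of headers */

        size_t len = (size_t)(eol - line) + 2;    /* line plus its CRLF */
        int drop = strncasecmp(line, "Connection:", 11) == 0 ||
                   strncasecmp(line, "Proxy-Connection:", 17) == 0;
        if (!drop) {
            if (used + len >= out_sz)
                return -1;
            memcpy(out + used, line, len);
            used += len;
        }
        line = eol + 2;
    }

    if (used + sizeof close_hdr > out_sz)
        return -1;
    memcpy(out + used, close_hdr, sizeof close_hdr);  /* includes the NUL */
    return (int)(used + sizeof close_hdr - 1);
}
```

Header names are compared case-insensitively because HTTP header field names are not case sensitive.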

Requirements Specification

In this assignment, you will implement and test a simple Web proxy in C, C++, or Java using TCP/IP socket programming. The goals of the assignment are to build a properly functioning Web proxy for simple Web pages, and apply simple URL-based and content-based filtering techniques to restrict the Web pages that can be accessed by the user. You do not need to implement any caching (i.e., file storage) in your Web proxy.

As part of the assignment you are expected to design and implement a Web proxy having the following features:

  1. Supports both HTTP/1.0 and HTTP/1.1
  2. Handles simple HTTP GET interactions between client and server
  3. Blocks requests for undesirable URLs, using HTTP redirection to display the designated error page instead
  4. Detects inappropriate content bytes within a Web page before it is returned to the user, and redirects to the designated error page
  5. Imposes no limit on the size of the transferred HTTP data
  6. Is compatible with all major browsers (e.g. Internet Explorer, Mozilla Firefox, Google Chrome) without requiring the user to tweak any advanced feature
  7. Allows the user to select the proxy port (i.e. the port number should not be hard coded)
  8. Is smart in selecting which HTTP content should be searched for the forbidden keywords. For example, you will probably agree that it is not wise to search inside compressed content or other non-text content such as graphic files.
  9. (Optional) Supports file upload using the POST method
  10. Does not need to relay HTTPS requests
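To make features 3, 4, and 8 more concrete, here is a minimal C sketch of a case-insensitive keyword check and of a test that decides whether a response body is worth searching. All function names are hypothetical. The keyword list simply reuses the examples from the overview, and treating only uncompressed "text/..." content as searchable is an assumption about what a "smart" selection could look like, not a prescribed rule.

```c
#include <string.h>
#include <strings.h>   /* strncasecmp */

/* Example keywords taken from the assignment overview. Matching is
 * byte-wise, so a non-ASCII entry such as "Norrköping" only matches
 * when the page uses the same character encoding as this source file. */
static const char *const FORBIDDEN[] = {
    "SpongeBob", "Britney Spears", "Paris Hilton", "Norrköping"
};

/* Returns 1 if `word` occurs (ignoring ASCII case) in the first `len`
 * bytes of `data`. `data` does not need to be NUL-terminated. */
int contains_ci(const char *data, size_t len, const char *word)
{
    size_t wlen = strlen(word);
    if (wlen == 0 || wlen > len)
        return 0;
    for (size_t i = 0; i + wlen <= len; i++)
        if (strncasecmp(data + i, word, wlen) == 0)
            return 1;
    return 0;
}

/* Returns 1 if any forbidden keyword occurs in the buffer. */
int has_forbidden_word(const char *data, size_t len)
{
    for (size_t k = 0; k < sizeof FORBIDDEN / sizeof FORBIDDEN[0]; k++)
        if (contains_ci(data, len, FORBIDDEN[k]))
            return 1;
    return 0;
}

/* Returns a pointer to the value of header `name` inside the
 * NUL-terminated header block `head`, or NULL if it is absent. */
const char *find_header(const char *head, const char *name)
{
    size_t nlen = strlen(name);
    const char *line = head;
    while (line != NULL && *line != '\0') {
        if (strncasecmp(line, name, nlen) == 0 && line[nlen] == ':') {
            const char *value = line + nlen + 1;
            while (*value == ' ' || *value == '\t')
                value++;                          /* skip optional whitespace */
            return value;
        }
        line = strstr(line, "\r\n");              /* advance to the next line */
        if (line != NULL)
            line += 2;
    }
    return NULL;
}

/* Assumed policy: search only text content that is not compressed. */
int should_search_body(const char *head)
{
    const char *type = find_header(head, "Content-Type");
    if (type == NULL || strncasecmp(type, "text/", 5) != 0)
        return 0;
    return find_header(head, "Content-Encoding") == NULL;
}
```

The same has_forbidden_word() check can be applied to the requested URL (feature 3) and to a stored text response (feature 4).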

Preparation Questions

Before starting your implementation, please carefully discuss the following preparation questions with your lab partner, write a short (informal) summary with the answers and game plan, and discuss the answers and game plan briefly with your TA. The informal summary will not be graded, but may be helpful in discussing issues that arise when implementing your solution. It may also provide an opportunity for some self-reflection at the end of the assignment.

Important Hints

Development Strategy

If you are not sure how to start developing your proxy server, you can use the following stepwise strategy:

  1. Consider that the proxy server has two parts: a server part that the browsers connect to, and a client part that connects to the Web servers. The server part and the client part are not isolated pieces of code. That is, the client part can be a class object instantiated from the server part, a function called from the server part, or even some lines of code embedded in the code of the server part. The server part receives the HTTP request from the browser and delivers that HTTP request to the client part (this delivery, in its simplest form, can be done by using the same variable). The client part then determines, based on the HTTP request, which Web server it should connect to. After connecting to that server, it sends the HTTP request and receives the HTTP response. The client part then delivers the received content back to the server part to be sent back to the browser.
  2. Read and understand the simple TCP server and the simple TCP client examples. Try to identify the steps taken in the TCP server (i.e. creating a socket, binding the socket to the desired address and port, listening for the connections, accepting connections, forking to handle concurrent connections, sending/receiving data, and closing the socket) and in the TCP client (i.e. creating a socket, connecting to a server, sending/receiving data, and closing the socket). A good understanding of these steps, and differences between a TCP server and a TCP client, will greatly help you in getting your proxy server up and running quickly.
  3. Start your coding by implementing the server part of your proxy. The server part should receive the HTTP request from the browser and deliver it to the client part. You may print out the HTTP request on the screen to make sure it is received and stored correctly.
  4. Add the code for the client part to your proxy server. The client part should receive the HTTP request from the server part and extract the information needed to carry out the request on behalf of the browser. The client part should also apply the required modifications to the HTTP request to make it ready to be sent to the Web server. If you are not sure what modifications should be made to the HTTP request, please reread the Preliminaries above more carefully! Once the required information is extracted and the request is appropriately modified, the client part should connect to the Web server, send the modified request to it, and receive the HTTP response from the server. You will implement the content filtering in a later step, so for now you do not need to store the HTTP response in order to search it for the forbidden words. That is, instead of storing the HTTP response (which requires multiple recv() calls), you can forward the bytes returned by each recv() call to the browser immediately with a send() call.
  5. Add URL filtering to the server part. You may think about why we did not implement this step before step 4! You need to make sure that the HTTP 302 redirection response you are using is correctly formatted regarding the Carriage Return ('\r') and Line Feed ('\n') characters.
  6. Add content filtering to the client part. Please note that, as part of the requirements, not all content should be searched for the forbidden keywords. The reason is that to search the content you need to store the whole content (at least in a straightforward implementation), which severely limits the ability of your proxy to handle delivery of large files. Therefore, based on the content type, i.e. text or non-text (and compressed or non-compressed), you can use different approaches to send the Web server's response back to the browser. Please note that you are not required to search compressed content for the forbidden keywords.
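Steps 4 and 5 above can be sketched in C as follows. The first two functions forward a response to the browser piece by piece without storing it, and the third builds a 302 response with correct CR LF line endings. The names send_all, relay, and build_redirect are hypothetical, the Location value is supplied by the caller, and the relay loop assumes that the server closes the connection when the response is complete.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* send() may transmit fewer bytes than requested, so keep calling it
 * until the whole buffer is sent. Returns 0 on success, -1 on error. */
int send_all(int fd, const char *data, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, data, len, 0);
        if (n <= 0)
            return -1;
        data += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Forwards every chunk received from the server straight to the
 * browser. Returns 0 when the server closes the connection, -1 on
 * error. Assumes the server closes the connection after its response. */
int relay(int server_fd, int client_fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = recv(server_fd, buf, sizeof buf, 0);
        if (n == 0)
            return 0;                  /* orderly shutdown by the server */
        if (n < 0)
            return -1;
        if (send_all(client_fd, buf, (size_t)n) < 0)
            return -1;
    }
}

/* Writes a minimal 302 response that points the browser to `location`.
 * Every header line ends in "\r\n" and an empty line ends the head.
 * Returns the response length, or -1 if `buf` is too small. */
int build_redirect(char *buf, size_t buf_sz, const char *location)
{
    int n = snprintf(buf, buf_sz,
                     "HTTP/1.1 302 Found\r\n"
                     "Location: %s\r\n"
                     "Content-Length: 0\r\n"
                     "Connection: close\r\n"
                     "\r\n",
                     location);
    if (n < 0 || (size_t)n >= buf_sz)
        return -1;
    return n;
}
```

Because relay() holds only one 4096-byte chunk at a time, it places no limit on the size of the transferred data. Content that must be searched for keywords needs a different path that buffers the response first.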

What to Deliver

When you are finished, please create a single gzipped tar file with your solution. Your file should include all the above-mentioned items. As for all other assignments, you should print the code and present your solution to the TA. (See general instructions.)

Demonstration

The primary test of correctness for your proxy is a simple visual test. That is, for most Web pages, the content displayed by your Web browser should look the same regardless of whether you are using your Web proxy or retrieving content directly from the Web server. This mode of operation can be called "invisible" mode, since the presence of the proxy is invisible to the user. The only differences appear when you try to access inappropriate content. In this case, the offensive content is suppressed, and HTTP redirection can be used to show the appropriate error pages mentioned in the requirements specification.

The TA will ask you to demonstrate your Net Ninny Web proxy in action; e.g., by browsing websites such as those suggested as potential test sites above. You will also be asked to show the filtering capabilities of your proxy. You should be ready to answer questions about the details of your code.