Net Ninny: A Web Proxy Based Service
By Niklas Carlsson and Farrokh Ghani Zadegan; last modified August 2016
(Based on an assignment by Carey Williamson)
Overview of the Assignment
The purpose of this assignment is to learn about the World Wide Web and the HyperText Transfer
Protocol (HTTP). Along the way, you will also learn about TCP/IP and socket programming in
Java, C, or C++. Any of these programming languages may be used for the assignment.
Imagine that you are a conscientious Internet user who wishes to protect your friends and family
from viewing inappropriate Web content. In particular, you want them to avoid any Web pages that
might insult their intelligence. Specific examples that may come to mind are Web pages that mention
SpongeBob, Britney Spears, Paris Hilton, or
Norrköping. You must do your best to prevent their
Web browsers from viewing these sites.
Rather than purchasing commercial software, such as
Net Nanny, to restrict access to Web content,
you have decided to write your own solution. Your solution will build upon your knowledge of
Web proxies and HTTP, coupled with a few simple rules for URL-based and content-based filtering
of Web pages. Since the content filtering is not all that sophisticated, we will refer to this
simple software as the "Net Ninny".
Note:
As part of this assignment you should learn socket programming.
You are expected to use only the basic libraries available for socket programming.
If uncertain about what libraries you can use, we highly recommend that you check this
with your TA before setting out to use non-basic libraries, as their use might violate the
goals of the assignment, which are to learn about (1) HTTP and (2) socket programming.
For example, using an HttpURLConnection Java class to fetch the data from the Web
server is not allowed!
Preliminaries
About Web Proxies
A Web proxy is a software entity that functions as an intermediary between a Web client (browser)
and a Web server. The Web proxy intercepts Web requests from clients and reformulates the requests
for transmission to a Web server. When a response is received from the Web server, the proxy sends
the response back to the client. While the presence of the proxy as an intermediary in the
request-response interaction adds some overhead, one advantage of a proxy is that it conceals the
identity of the client from the Web server. That is, from the server's point of view, the proxy is
the client. Similarly, from the client's point of view, the proxy is the server. A Web proxy thus
provides a single point of control to regulate Internet access between clients and servers. This
is also a natural point to implement content filtering, if so desired.
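To make "reformulates the requests" a little more concrete, here is a minimal C sketch. It assumes the browser sends the full URL on the request line (for example GET http://www.google.com/ HTTP/1.1), which browsers typically do when configured to use a proxy. The helper name split_absolute_url and the buffer-size parameters are invented for this illustration and are not part of the assignment; IPv6 literals and uppercase schemes are deliberately not handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "http://host[:port]/path" into host, port and path.
 * Returns 0 on success, -1 if the URL is not a plain http:// URL or a part
 * does not fit into the caller's buffer. */
int split_absolute_url(const char *url,
                       char *host, size_t host_size,
                       char *port, size_t port_size,
                       char *path, size_t path_size)
{
    static const char prefix[] = "http://";
    size_t prefix_len = sizeof prefix - 1;

    if (strncmp(url, prefix, prefix_len) != 0)
        return -1;                      /* e.g. https:// or an origin-form path */

    const char *authority = url + prefix_len;
    const char *slash = strchr(authority, '/');
    size_t auth_len = slash ? (size_t)(slash - authority) : strlen(authority);

    /* Look for ':' only inside the host[:port] part, not in the path. */
    const char *colon = memchr(authority, ':', auth_len);
    size_t host_len = colon ? (size_t)(colon - authority) : auth_len;
    if (host_len == 0 || host_len >= host_size)
        return -1;
    memcpy(host, authority, host_len);
    host[host_len] = '\0';

    if (colon != NULL) {
        size_t port_len = auth_len - host_len - 1;
        if (port_len == 0 || port_len >= port_size)
            return -1;
        memcpy(port, colon + 1, port_len);
        port[port_len] = '\0';
    } else {
        if (port_size < 3)
            return -1;
        strcpy(port, "80");             /* default HTTP port */
    }

    /* A URL without a path refers to "/". */
    const char *p = slash ? slash : "/";
    if (strlen(p) >= path_size)
        return -1;
    strcpy(path, p);
    return 0;
}
```

With the three parts separated, the proxy can connect to host:port and send a request line that contains only the path, while the Host header carries the host name.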
Socket Programming
A good resource for socket programming is Beej's Guide to Network
Programming, which is available in HTML
and PDF formats.
You will need at least one TCP (stream) socket for client-proxy communication, and at least
one additional TCP (stream) socket for proxy-server communication. If you want your proxy to
support multiple concurrent HTTP transactions (recommended), you will need to fork child
processes for request handling as well. Each child process will use its own socket instances
for its communications with the client and with the server.
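As a rough illustration of the socket setup described above, the following C sketch opens a listening TCP socket on a caller-supplied port and forks one child process per accepted connection. The names open_listener and serve_forever, the handle_client callback, and the backlog of 16 are assumptions made for this example. It follows the general stream-server pattern and is not meant as a complete proxy.

```c
#include <netdb.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a TCP socket that listens on `port` (a decimal string) on all local
 * addresses. Returns the listening descriptor, or -1 on failure. */
int open_listener(const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;
    int yes = 1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */
    hints.ai_flags = AI_PASSIVE;        /* wildcard address for bind() */
    if (getaddrinfo(NULL, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        /* Allow quick restarts without "Address already in use". */
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);
        if (bind(fd, ai->ai_addr, ai->ai_addrlen) == 0 && listen(fd, 16) == 0)
            break;                      /* success */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Accept clients forever and handle each one in its own child process. */
void serve_forever(int listen_fd, void (*handle_client)(int client_fd))
{
    signal(SIGCHLD, SIG_IGN);           /* let the kernel reap finished children */
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0)
            continue;
        if (fork() == 0) {              /* child: serve this one connection */
            close(listen_fd);
            handle_client(client_fd);
            close(client_fd);
            _exit(0);
        }
        close(client_fd);               /* parent (or failed fork): drop its copy */
    }
}
```

The port is passed in as a string so that it can come straight from a command-line argument instead of being hard coded.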
An example of this use of the fork()
function is given in Beej's Guide to Network Programming under
"A Simple Stream Server".
About Hypertext Transfer Protocol (HTTP)
The Web and HTTP are discussed in Section 2.2 of the course textbook,
Computer Networking: A Top-Down Approach, 5th Ed. Read this
section well before proceeding to implement your proxy server! Pay specific
attention to the discussion of non-persistent and persistent connections
(Section 2.2.2 of the textbook). You may need to refer to the
HTTP/1.0 and
HTTP/1.1 specifications as well.
HTTP/1.1 uses persistent connections by default, so Web servers do not
close a connection immediately after servicing a request unless the HTTP
request explicitly tells them to do so. If the server does not
receive a new request over the current TCP connection, it closes the
connection after a configured period of time, say 30 seconds. Since the length
of the HTTP content sent by the server is not always stated in the HTTP
headers (an example is chunked transfer encoding, specified in HTTP/1.1), it is up to the
HTTP client (i.e. the browser) to interpret the content and determine whether
more data is coming from the server as part of the current HTTP response.
Analyzing the HTTP content in this way is difficult for a proxy server (and not
necessary for all types of proxy servers). One way to avoid waiting for the server timeout
(to determine the end of transmission) is to modify the HTTP request.
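One possible modification, sketched below in C, is to ask the server to close the connection once the response is complete by sending "Connection: close". This is only one way to approach the problem, under the assumption that the request head (request line plus headers, ending with an empty line) is available as a null-terminated string. The function name force_connection_close is made up for this example.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>    /* strncasecmp (POSIX) */

/* Copy the request head `in` to `out`, dropping any Connection or
 * Proxy-Connection header and appending "Connection: close".
 * Returns the length of the result, or -1 if the input is malformed
 * or the result does not fit into `out`. */
int force_connection_close(const char *in, char *out, size_t out_size)
{
    static const char *const drop[] = { "Connection:", "Proxy-Connection:" };
    static const char closing[] = "Connection: close\r\n\r\n";
    size_t used = 0;

    while (*in != '\0') {
        const char *eol = strstr(in, "\r\n");
        if (eol == NULL)
            return -1;                  /* line without CRLF terminator */
        size_t line_len = (size_t)(eol - in);
        if (line_len == 0)
            break;                      /* empty line: end of the header block */

        int skip = 0;
        for (size_t i = 0; i < sizeof drop / sizeof drop[0]; i++)
            if (strncasecmp(in, drop[i], strlen(drop[i])) == 0)
                skip = 1;               /* header names are case-insensitive */

        if (!skip) {
            if (used + line_len + 2 > out_size)
                return -1;
            memcpy(out + used, in, line_len + 2);   /* line plus its CRLF */
            used += line_len + 2;
        }
        in = eol + 2;
    }

    if (used + sizeof closing > out_size)           /* sizeof counts the '\0' */
        return -1;
    memcpy(out + used, closing, sizeof closing);
    return (int)(used + sizeof closing - 1);
}
```

With this header in place, the end of the response can be detected simply by recv() returning 0, at the cost of opening a new TCP connection for every request.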
Requirements Specification
In this assignment, you will implement and test a simple Web proxy in C, C++,
or Java using TCP/IP socket programming.
The goals of the assignment are to build a properly functioning Web proxy for simple Web pages,
and apply simple URL-based and content-based filtering techniques to restrict the Web pages that can be
accessed by the user. You do not need to implement any caching (i.e., file storage) in your Web proxy.
As part of the assignment you are expected to design and implement a Web
proxy having the following features:
- Supports both HTTP/1.0 and HTTP/1.1
- Handles simple HTTP GET interactions between client and server
- Blocks requests for undesirable URLs, using HTTP redirection to display
this error page instead
- Detects inappropriate content bytes within a Web page before it is returned to the user,
and redirects to
this error page
- Imposes no limit on the size of the transferred HTTP data
- Note: Using the
realloc()
function to increase the size of the buffer (allocated for receiving the HTTP data from the server)
also counts as imposing a limit. Instead, you must choose a size for the buffer and manage the data in it intelligently,
so that no received data is lost and no unsent data is overwritten when receiving responses.
It is also recommended that you avoid performing type casting operations when receiving and sending data from the buffer.
- Is compatible with all major browsers (e.g. Internet Explorer, Mozilla Firefox, Google Chrome, etc.) without requiring any advanced browser settings to be tweaked
- Allows the user to select the proxy port (i.e. the
port number should not be hard coded)
- Is selective about which HTTP content is
searched for the forbidden keywords. For example, it is not sensible to search
inside compressed or other non-text HTTP content such as image files.
- (Optional) Supports file upload using the POST method
- You do not have to relay HTTPS requests through the proxy
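The "no size limit" requirement can be met with a buffer of constant size if each chunk is forwarded before the next one is received. The C sketch below shows one way to do this for content that does not need to be searched. The names send_all and relay_stream and the 8192-byte buffer size are arbitrary choices for this example.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define RELAY_BUF_SIZE 8192     /* fixed size, chosen arbitrarily for this sketch */

/* Send exactly `len` bytes. A single send() call may accept fewer bytes than
 * requested, so keep calling it until everything has been handed over. */
int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

/* Forward everything readable on `from_fd` to `to_fd` until the sender closes
 * its side. Memory use is constant regardless of the amount of data.
 * Returns the number of bytes relayed, or -1 on error. */
long long relay_stream(int from_fd, int to_fd)
{
    char buf[RELAY_BUF_SIZE];
    long long total = 0;

    for (;;) {
        ssize_t n = recv(from_fd, buf, sizeof buf, 0);
        if (n == 0)
            return total;               /* orderly shutdown by the peer */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* The buffer is reused only after all n bytes have been sent. */
        if (send_all(to_fd, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
}
```

Because lengths are always passed explicitly, the loop works equally well for text and for binary data containing '\0' bytes.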
Preparation Questions
Before starting your implementation, please carefully discuss the following preparation questions with your
lab partner, write a short (informal) summary with the answers and game plan, and
discuss the answers and game plan briefly with your TA before starting the implementation of your solution:
-
How will you transfer text data over a socket? How will you transfer binary data, such as image files, over a socket?
You can assume the context of the language that you plan to use for the assignment (i.e., C, C++, or Java).
-
What do the "Connection: close" and "Connection: Keep-alive" header fields imply in the HTTP protocol?
When should one be used over the other? [See question 20 in assignment 1.]
For this question you may also want to go back to the traces used
in assignment 1 and see if you can find either (or both) of these types of connections.
-
Consider the use of a proxy server, through which the client sends its request.
Briefly explain (using a block diagram)
the HTTP request-response interaction between a client, proxy, and server.
Pay careful attention to the
TCP port numbers as you will use a similar setup in this assignment.
[See optional question 20 in assignment 21.]
-
Please outline a high-level algorithm that describes how you plan to implement your version of NetNinny.
The informal summary will not be graded, but may be helpful in discussing issues that arise when
implementing your solutions. It may also provide an opportunity for some self-reflection at the end of the assignment.
Important Hints
-
Make sure to break the assignment into smaller steps.
For example, first implement a simple proxy that only streams every content type without any modifications.
In this step you can, for example, assure yourself and the TA that the simple stream client/server parts work properly.
Then, in the next step, you can add filtering. Of course, each such step should be broken into even finer
sub-steps. For example, for the first step we recommend that you first implement a working client
and a working server, before you connect them to provide the proxy functionality. Again,
this allows for much easier debugging and testing.
- HTTP data may contain null (i.e. '\0') characters. This may happen, for example, when the content is encoded
using the gzip algorithm (such as the search results returned by Google).
Therefore, be aware that string manipulation functions that assume null-terminated strings as their
input parameters (such as strlen(), strstr(), strcat(), etc.) might not work as expected on HTTP
content. These functions are, however, safe to use when processing HTTP
headers.
- If you need to manipulate HTTP headers, be advised that not all browsers and web servers use the
same case for the headers. For example, Connection: keep-alive and
Connection: Keep-Alive are both valid HTTP headers. Your solution should
therefore not be case sensitive.
- Be advised that some Web browsers change the HTTP request when they are
configured to send the request through a Web proxy. Whether you need to
worry about this depends on your implementation of the proxy server. For
example, the Chrome browser sends the following requests in the absence and
in the presence of a Web proxy, respectively (the requests are partially
shown):
- GET / HTTP/1.1
Host: www.google.com
Connection: keep-alive
- GET http://www.google.com/ HTTP/1.1
Host: www.google.com
Proxy-Connection: keep-alive
Some Web servers do not respond as expected when the host information appears
on the GET line of the HTTP request.
- Sometimes, even if you have properly closed the listening socket
when your proxy server quits, you
will receive the Address already in use error message when you re-run your proxy server.
The reason is that the operating system keeps recently closed connections on that
port in the TIME_WAIT state for a while and, by default, refuses a new bind() to the
same address and port until they expire. As a workaround you may want to use the
setsockopt() function with the
SO_REUSEADDR option on the listening socket.
- As mentioned in
Section 1.4 of Beej's Guide to Network Programming, those of you who
are using C/C++ for your implementation may need to link to the following
libraries when compiling for Solaris or SunOS: -lnsl -lsocket -lresolv
-
In your testing of the proxy, you may want to go through incremental
steps similar to the following:
- Download a simple text file such as the
good text file test
- Download a simple HTML file such as the
good HTML file test
- Download an HTML file with a bad name such as the
bad URL HTML file test
- Download an HTML file with a good name but bad content such as the
bad content HTML file test
- Download various pages that you would expect a regular user to access.
For example, the TAs suggest that a reasonable list to test first could include
stackoverflow.com,
aftonbladet.se,
svd.se,
liu.se, qz.com, and bbc.com.
-
Some websites may force the use of HTTPS. Try to identify popular websites that do,
as well as websites that do not. Check how your proxy handles both types.
(Again, you do not have to relay requests over HTTPS through the proxy,
but your proxy needs to work properly and not crash when running into websites that use HTTPS.)
- Test if you can go to www.google.com
or www.google.se when the proxy is ON.
Type the words to be blocked in Google. Can your proxy filter the data? If not, why?
-
Test if you can visit www.youtube.com
and view the YouTube homepage as you would see it when the proxy is OFF.
Regardless of whether it works,
determine the reason why it does or does not work.
It is important to understand the limitations of your proxy.
-
Test other streaming services (e.g., Vimeo
and Dailymotion)
and see if you spot any differences.
Which streaming services, if any,
does your proxy handle properly?
Try to explain your findings by asking why it works or why it does not.
-
You could also visit www.wikipedia.com
and search for a blocked keyword. Can you filter the data? If not, why?
It is important that you understand (and explain) what your proxy can and cannot do,
as well as why it has the limitations that it does have. This is expected to be described
in your report.
- To set up your browser to fetch HTTP data through a
proxy server:
- In Windows, web browsers use the proxy settings configured in
Control Panel -> Internet Options -> Connections -> LAN Settings ->Proxy
Server where Use a proxy server for your LAN should be
checked and the Address and Port should be entered. To
fine-tune the proxy settings such that they are only applied to HTTP
(and not HTTPS, FTP, etc.) you can use the Advanced push
button.
- The proxy settings for Firefox can be configured regardless of the
proxy settings of the OS. Depending on the OS, you can find the
Options window for Firefox either under Tools->Options
(e.g. in Windows) or Edit->Preferences (e.g. in Solaris). In
the Options window, select Advanced->Network->Settings... and
choose the Manual proxy configuration option. Here you can
enter the HTTP Proxy address and Port. (NOTE: if you are using
localhost as the proxy address, remember to clear the contents of
the No proxy for text box)
- As you design and build your Web proxy, give careful consideration to how you will debug and
test it. For example, you may want to print out information about requests and responses received,
processed, forwarded, blocked, or redirected. Once you become confident with the basic operation of
your Web proxy, you can toggle off the verbose debugging output. If you are testing on your home
network, you can also use tools like Wireshark or tcpdump to collect network packet traces. By
studying the HTTP/TCP packets going to and from your proxy, you can convince yourself (and perhaps
your TA) that it is working properly.
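The hints about '\0' bytes and letter case can be combined into a search routine that takes explicit lengths instead of relying on null termination. The following C function is a minimal sketch of that idea; the name find_bytes_nocase is hypothetical, and a simple quadratic scan is used for clarity rather than speed.

```c
#include <ctype.h>
#include <stddef.h>

/* Find the first occurrence of `needle` (needle_len bytes) inside `hay`
 * (hay_len bytes), ignoring ASCII case. Because both lengths are explicit,
 * embedded '\0' bytes in `hay` do not end the search early.
 * Returns a pointer to the match, or NULL if there is none. */
const char *find_bytes_nocase(const char *hay, size_t hay_len,
                              const char *needle, size_t needle_len)
{
    if (needle_len == 0 || needle_len > hay_len)
        return NULL;

    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        size_t j = 0;
        /* tolower() requires a value representable as unsigned char. */
        while (j < needle_len &&
               tolower((unsigned char)hay[i + j]) ==
               tolower((unsigned char)needle[j]))
            j++;
        if (j == needle_len)
            return hay + i;
    }
    return NULL;
}
```

Note that a keyword split across two recv() chunks would be missed if each chunk were searched on its own, so a real implementation would need to account for that, for instance by keeping the last needle_len - 1 bytes of the previous chunk.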
Development Strategy
If you are not sure how to start developing your proxy server, you can
use the following stepwise strategy:
- Consider that the proxy server has two parts, a server part that the
browsers connect to, and a client part that connects to the Web servers. The
server part and the client part are not isolated pieces of code. That is,
the client part can be a class object instantiated from the server part, a
function called from the server part, or even some lines of code embedded in
the code of the server part. The
server part receives the HTTP request from the browser and delivers that
HTTP request to the client part (this delivery, in its simplest form, can be
done by using the same variable). The client part then (based on the HTTP
request) determines to which Web server it should connect and—after
connecting to that server—sends the HTTP request to and receives the
HTTP response from that server. The client part then delivers back the
received content to the server part to be sent back to the browser.
- Read and understand the
simple TCP server and
the
simple TCP client examples. Try to
identify the steps taken in the TCP server (i.e. creating a socket,
binding the socket to the desired address and port, listening for the
connections, accepting connections, forking to handle concurrent connections,
sending/receiving data, and closing the socket) and in the TCP client (i.e. creating
a socket, connecting to a server, sending/receiving data, and closing the
socket). A good understanding of these steps, and differences between a TCP
server and a TCP client, will greatly help you in getting your proxy server
up and running quickly.
- Start your coding by implementing the server part of your proxy. The
server part should receive the HTTP request from the browser and deliver it
to the client part. You may print out the HTTP request on the screen to make
sure it is received and stored correctly.
- Add the code for the client part to your proxy server. The client part
should receive the HTTP request from the server part and extract the
information needed to carry out the request on behalf of the browser. The
client part should also apply the required modifications to the HTTP request
to make it ready to be sent to the Web server. If you are not sure what
modifications should be done to the HTTP request, please read
this
part and this part more
carefully! Once the required information is extracted and the request is
appropriately modified, the client part should connect to the Web server,
send the modified request to it, and receive the HTTP response from the
server. You will implement the content filtering in a later step. This helps
you avoid storing the HTTP response to be searched for the forbidden words.
That is, instead of storing the HTTP response (which requires multiple
recv() calls), you can send every group of
bytes received from the server by each recv()
call, to the browser by using a send() call.
- Add URL filtering to the server part. You may think about why we did not
implement this step before step 4! You need to make sure that the
HTTP 302 redirection
response you are using is correctly formatted regarding the Carriage Return
('\r') and Line Feed ('\n')
characters.
- Add content filtering to the client part. Please note that, as part of the requirements,
not all content should be searched
for the forbidden keywords. The reason is that to search the content,
you need to store the whole content (at least in a straightforward
implementation) which severely limits the ability of your proxy to
handle delivery of large files.
Therefore, based on the content type, i.e. text or non-text (and compressed
or non-compressed), you can use different approaches to send the Web
server's response back to the browser. Please consider that it is not
required to search the compressed content for the forbidden keyword.
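The decision described in the last step can be made from the response headers alone. The C sketch below assumes the response head (status line plus headers) has been collected into a null-terminated string; the helper names header_value and should_filter are illustrative, and treating only text/* types without a Content-Encoding as searchable is one reasonable policy, not the required one.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>    /* strncasecmp (POSIX) */

/* Return a pointer to the value of header `name` in the response head,
 * or NULL if the header is absent. Header names are matched without
 * regard to case; leading blanks in the value are skipped. */
const char *header_value(const char *head, const char *name)
{
    size_t name_len = strlen(name);
    const char *line = strstr(head, "\r\n");        /* end of the status line */

    while (line != NULL) {
        line += 2;                                  /* start of the next line */
        if (line[0] == '\0' || (line[0] == '\r' && line[1] == '\n'))
            break;                                  /* end of the header block */
        if (strncasecmp(line, name, name_len) == 0 && line[name_len] == ':') {
            const char *value = line + name_len + 1;
            while (*value == ' ' || *value == '\t')
                value++;
            return value;
        }
        line = strstr(line, "\r\n");
    }
    return NULL;
}

/* Return 1 if the body should be buffered and searched for keywords,
 * 0 if it can be streamed straight through to the browser. */
int should_filter(const char *head)
{
    const char *type = header_value(head, "Content-Type");
    const char *encoding = header_value(head, "Content-Encoding");

    if (type == NULL || strncasecmp(type, "text/", 5) != 0)
        return 0;                       /* images, video, unknown types */
    if (encoding != NULL && strncasecmp(encoding, "identity", 8) != 0)
        return 0;                       /* gzip, deflate, etc.: compressed */
    return 1;
}
```

Responses for which should_filter returns 0 can be forwarded chunk by chunk as they arrive, which keeps large downloads working.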
What to Deliver
- The source code of your Net Ninny proxy in which the function of each
block of code is described by a short comment.
- Provide a clear and concise user manual (about 1 page) that describes how to compile, configure,
and use your Web proxy. Make sure to indicate the required features and optional features (if any)
that the proxy supports. In the manual, for the
required features 2, 3, 6, 7, and
8, refer to the part of the code that implements that feature. To make it
easy to follow, you can also annotate the code in a clear and easy to find
way, to state the feature requirement that is addressed there.
- Provide a careful description of the testing of the proxy, accompanied by documented evidence
(e.g., debug output, packet traces), where appropriate. The latter is particularly important if your
Web proxy is not fully working. Make sure to clarify where and how the testing was done (e.g., home,
university, work), and which cases were successful, and which ones were not. Again, the TA may ask
for these test cases to be shown (as part of illustrating a functioning implementation).
- Provide a summary of what your proxy-based service can and cannot do.
The limitations of the service should be clearly stated (with reference to the above testing).
-
Finally, please list, summarize, and discuss how your proxy handles different website types,
including both streaming and non-streaming websites that use (or do not use)
HTTPS and gzip, for example. (Include example websites here.)
When you are finished, please create a single gzipped tar file with your solution. Your file should
include all the above-mentioned items. As with all other assignments, you should
print the code and present your solution to the TA. (See general instructions.)
Demonstration
The primary test of correctness for your proxy is a simple visual test. That is, for most Web pages,
the content displayed by your Web browser should look the same regardless of whether you are using
your Web proxy or retrieving content directly from the Web server. This mode of operation can be called
"invisible" mode, since the presence of the proxy is invisible to the user. The only differences appear
when you try to access inappropriate content. In this case, the offensive content is suppressed,
and HTTP redirection can be used to show the appropriate error pages mentioned in the
requirement specification.
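A redirection response is short, but every line must end with a carriage return and line feed, and the header block must end with an empty line. The C sketch below formats such a response; the function name build_redirect and the example URL in the usage note are placeholders, since the actual error page addresses are given in the requirement specification.

```c
#include <stddef.h>
#include <stdio.h>

/* Write a minimal HTTP 302 response that sends the browser to `location`.
 * Returns the number of bytes written (not counting the final '\0'),
 * or -1 if the response does not fit into `out`. */
int build_redirect(char *out, size_t out_size, const char *location)
{
    int n = snprintf(out, out_size,
                     "HTTP/1.1 302 Found\r\n"
                     "Location: %s\r\n"
                     "Content-Length: 0\r\n"
                     "Connection: close\r\n"
                     "\r\n",            /* empty line ends the header block */
                     location);
    if (n < 0 || (size_t)n >= out_size)
        return -1;                      /* error or truncated output */
    return n;
}
```

A call such as build_redirect(buf, sizeof buf, "http://example.com/error.html") fills buf with a response that can be sent to the browser in place of the blocked page.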
The TA will ask you to demonstrate your Net Ninny Web proxy in action; e.g.,
by browsing websites such as those suggested as potential test sites above.
You will also be asked to
show the filtering capabilities of your proxy. You should be ready to answer
questions about the details of your code.