Saturday, March 7, 2009

HTTP Communication: A Closer Look

About four months ago, I wrote a very simple HTTP server in Python, since my thesis has a Python artifact and I wanted to integrate it with a server. I've known the basics of HTTP for a year and a half now, but actually writing a server is obviously another story.

For those who aren't familiar with HTTP (I mean the actual protocol itself... everyone knows what it does, but not everyone knows what it looks like), or who know the basics but need to see a few examples, "HTTP Made Really Easy" is a great place to start.

My very simple HTTP server worked nicely, but I soon ran into a problem. If the server was hosted on Windows and I accessed it from a browser on Linux, I wasn't getting the payload of the POST packets (I wasn't getting all of the header either, although I didn't notice at first). I got the payload for all other Windows/Linux combinations (server on Linux, browser on Windows; and both server and browser on same system).

Till today, I had no clue what could be (in my mind) causing Linux to send requests without the payload. Then I decided to use Wireshark to find out what exactly was happening to the packets. Wireshark is a great tool that lets you see the actual data in packets you send and receive.

By using Wireshark, I noticed that the HTTP requests were being split into multiple packets when sent from Linux, while a Windows would send the request as one whole packet. This means the payload would arrive in a second or third packet, and since I had only one send() call, I would not receive it.

The solution is to keep a buffer associated with each single connection (identified by ip:port), and append each packet to it. You know when you've reached the end from the Content-Length field in the HTTP header, which tells you the size of the payload. The payload starts after the first "\r\n\r\n" double-newline, so you can start counting from there.

Now, regarding connections... here's another thing I learned by poking around in Wireshark. I used to think that a browser keeps the same connection open for each website, until the browser is closed, so that it can use the same connection (for efficiency purposes) for the same website rather than opening new connections all the time. Well, that's not the case.

Apparently, each HTTP request starts a new connection, so if you're watching requests coming in from the server side, you'll see the client's port number increase by one each time. Connections are reused only to send multiple packets associated with the same request (as above). In other words, a Linux client would open a connection, split the request into a number of packets, send them via the same connection, and close the connection. Well, almost.

The browser actually doesn't close the connection. If your server just sends the data, the browser will keep waiting for data to arrive. So your server must close the connection immediately after sending the response.

No comments:

Post a Comment