Wednesday, October 14, 2009

Using wget

With Geocities going down in less than 2 weeks' time, I found myself needing to archive a number of websites hosted there that would otherwise disappear. For this purpose, one can go through the frustrating experience of saving a webpage's files one by one, but that would be stupid when there are tools that automate the whole process.

The tool for the job is GNU Wget. I've used it before for similar purposes, but Geocities has a few annoying quirks that forced me to learn the tool a bit better.

For starters, this is how to use the tool:

wget http://www.geocities.com/mitzenos/

Great, that downloaded index.html. But we want to download the whole site! So we use the -r option to make it recursive. This means that wget will follow references to files used by each webpage, via attributes such as href and src. While this recursion could potentially go on forever, it is limited by the default recursion depth (i.e. such references are only followed so many levels deep) and by the fact that wget will, by default, not follow links that span different hosts (i.e. jump to different domains). Here's how the recursion thing works:

wget -r http://www.geocities.com/mitzenos/
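
For the record, the default recursion depth is five levels. If that isn't enough for a deeply nested site, it can be raised with the -l (--level) switch; the depth of 10 below is just an arbitrary value for illustration:

wget -r -l 10 http://www.geocities.com/mitzenos/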

OK, that downloads an entire site. In the case of Geocities, which hosts many different accounts, wget may end up downloading other sites on Geocities as well. If /mitzenos/ links to /xenerkes/, for example, both accounts are technically on the same host, so wget will happily download them both. We can solve this problem by using the -np switch [ref1] [ref2]. Note that -r and -np cannot be bundled together as -rnp, since -np is not a single-letter option (it's shorthand for --no-parent), so it can't be combined with -r the way single-letter switches can.

wget -r -np http://www.geocities.com/mitzenos/
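
If the bundling issue ever trips you up, the long form of the switch works just as well and is harder to get wrong:

wget -r --no-parent http://www.geocities.com/mitzenos/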

So that solved most of the problems. But when we try downloading /xenerkes/ separately, Geocities ends up taking the site down for an hour because of bandwidth restrictions, and the wget output fills up with 503 Service Temporarily Unavailable errors. This is because Geocities imposes a 4.2MB hourly bandwidth limit (bastards). Since a Geocities account can hold up to 15MB, any site between 4.2MB and 15MB in size is difficult to download in one go.

The solution is to force wget to download files at a slower rate, so that if the site is, say, 5MB, the downloads are spread over more than one hour. This is done with the -w switch [ref: download options], which takes an argument in seconds by default (suffixes for minutes and so on are also accepted). For Geocities, waiting 40-60 seconds between files should be enough, provided the files aren't very large; back when Geocities was popular it wasn't normal to have very large files on the web anyway, so that isn't really an issue. This is the line that solves it:

wget -r -np -w 60 http://www.geocities.com/mitzenos/
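
If you prefer, the wait can also be given with a suffix (m for minutes, h for hours), and adding --random-wait varies the actual delay around that value, which makes the requests look a little less mechanical. Something along these lines should behave much the same as the command above:

wget -r -np --wait=1m --random-wait http://www.geocities.com/mitzenos/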

This command will obviously take several hours to download a site if there are a lot of files, so choose the download interval wisely. If you're exceeding the bandwidth limit then use a large interval (around 60 seconds); if there are lots of files and the download is too damn slow, then use a smaller interval (30-40 seconds).
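
One last tip: a download that runs for hours can easily get interrupted halfway through. If that happens, adding -nc (--no-clobber) on the next run should make wget skip the files it already saved, so you don't burn through the hourly allowance re-downloading them:

wget -r -np -nc -w 60 http://www.geocities.com/mitzenos/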