An improved wget / http-get function using archive.org

I’m posting this because it’s an obvious but, I suspect, often overlooked idea that may be useful to other internet programmers. As you may know, I’m writing a bot that converts old blogs into physical printed scrapbooks. As part of that, it has to download all of the images referenced in them. It should be no surprise that when the images are more than a few years old, the links are quite often dead, and there’s nothing to download but a 404 or a useless redirect. However, there is a very good chance that you can find the file on Archive.org’s supremely useful “Wayback Machine”, an ambitious attempt to build a digital time machine by archiving dated snapshots of the Internet. I decided to integrate this into my http-get function, and what surprised me was how easy it was to do.

Here’s the pseudocode:

   http_get($url)
      $data = retrieve($url)
      if $data is a 404 or otherwise invalid
         $data = retrieve("http://web.archive.org/web/19950101010101/" + $url)
      return $data

It’s really that easy to create a function that downloads a URL, and if it can’t find it, checks for it on the Wayback Machine!
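For anyone who wants something closer to runnable code, here is a minimal Python sketch of the same idea using only the standard library. The names `retrieve` and `http_get` come from the pseudocode; the `is_valid` and `fetch` parameters are my own hypothetical additions so the validity test and the downloader can be swapped out. It is an illustration under those assumptions, not the code my bot actually runs.

```python
import urllib.error
import urllib.request

# Prefix from the post: asks the Wayback Machine for the snapshot
# closest to January 1st, 1995 at 01:01:01.
WAYBACK_PREFIX = "http://web.archive.org/web/19950101010101/"


def retrieve(url, timeout=30):
    """Return the body of url as bytes, or None on an HTTP or network error.

    urlopen follows redirects (including 302) on its own and raises
    HTTPError for responses such as 404.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError):
        return None


def http_get(url, is_valid=bool, fetch=retrieve):
    """Download url; if the result is missing or invalid, try the Wayback Machine.

    is_valid is a hypothetical hook: any callable that takes the downloaded
    bytes and returns True when they are what you wanted. The default only
    rejects empty responses.
    """
    data = fetch(url)
    if data is not None and is_valid(data):
        return data
    # Fall back to the archived copy; this may also return None.
    return fetch(WAYBACK_PREFIX + url)
```

Passing the downloader in as `fetch` is only there to make the fallback logic easy to try without touching the network.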

Specifically, this first attempts to download your URL from its standard location. The data retrieved could be “good”, it could be a “404” or other standard error, or it could be “bad” data. You’ll find that expired links often do not fail in obvious ways. An expired link often redirects to the site’s main index, or it may point at a site that is gone or has been completely redesigned and redirects you to any number of locations. Sometimes the easiest way to determine whether the retrieved data is “valid” is to check the file type. For example, in my case I was downloading only images, and if the URL returned anything other than an image I could be pretty sure that it had expired (since few redirect systems are intelligent enough to respect file type). You’ll have to come up with your own way of determining whether the data you downloaded is what you wanted.
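As one possible version of the file-type check for images, the sketch below looks at the first few bytes of the download for the well-known JPEG, PNG and GIF signatures. The function name `looks_like_image` and the choice of these three formats are assumptions made for illustration; checking the Content-Type header would be another reasonable approach.

```python
# Leading bytes ("magic numbers") of a few common image formats.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",        # JPEG
    b"\x89PNG\r\n\x1a\n",   # PNG
    b"GIF87a",              # GIF, 1987 version
    b"GIF89a",              # GIF, 1989 version
)


def looks_like_image(data):
    """Return True if data starts with a known image signature.

    An HTML error page or a redirect to a site's index will normally
    start with markup instead, so it fails this test.
    """
    if not data:
        return False
    return data.startswith(IMAGE_SIGNATURES)
```

A check like this would not catch a redirect that serves a different image (a placeholder banner, say), so treat it as a first filter only.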

If the data downloaded from the original URL is determined to be invalid, you create a new URL that starts with “http://web.archive.org/web/19950101010101/” and is followed by your original URL. So for example, if you were trying to download http://www.zentastic.com/shannon-larratt-is-zentastic.gif and it failed, your Wayback URL would be http://web.archive.org/web/19950101010101/http://www.zentastic.com/shannon-larratt-is-zentastic.gif. Now, that doesn’t mean that the Wayback Machine has to have a version of the file from January 1st, 1995 at 01:01:01. When you request a date that it doesn’t have, it will try to redirect you to the closest one it does have, by giving you a “302 Moved Temporarily” with the “correct” URL (which your http-get function should already deal with anyway). Download the URL specified in the “Location” header, and you’ll get the first version of the file stored by the Wayback Machine.
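Building the Wayback URL is plain string concatenation, and once the redirect has been followed you can read the snapshot’s actual date back out of the final URL. The sketch below shows both. The function names are hypothetical, and the pattern assumes the redirected URL has the form /web/<14 digits>/ with an optional short suffix after the digits, so adjust it if your URLs look different.

```python
import re

WAYBACK_BASE = "http://web.archive.org/web/"

# Matches the 14-digit timestamp in a Wayback URL. The optional suffix
# after the digits is an assumption about how some snapshot URLs look.
SNAPSHOT_PATTERN = re.compile(r"/web/(\d{14})[a-z_]*/")


def wayback_url(url, timestamp="19950101010101"):
    """Return the Wayback Machine URL for url at the requested timestamp."""
    return WAYBACK_BASE + timestamp + "/" + url


def snapshot_timestamp(final_url):
    """Return the YYYYMMDDhhmmss timestamp in a Wayback URL, or None."""
    match = SNAPSHOT_PATTERN.search(final_url)
    return match.group(1) if match else None
```

With `urllib.request`, the URL after redirects is available from the response’s `geturl()` method, which is what you would pass to `snapshot_timestamp`.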

I should note that if you use the URL that I’ve specified above, what you’ll get is the oldest version of the file. I did that because I figured that if I’m trying to retrieve an old version of a file that is no longer at its URL, the oldest one has the best chance of being the correct one. If on the other hand you want to download the most recent version of the file, you can ask for it with a URL starting with “http://web.archive.org/web/20130101010101/” (i.e. January 1st, 2013 instead of 1995). However, depending on the type of file and type of redirect in use, there’s a good chance that the Wayback Machine has archived the same junk data that you got and are trying to avoid in the first place. Alternatively, if you know the date of the linking entity (for example if it’s a blog or forum post), you could use that date.
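If you go with the date of the linking post, the only work is formatting that date as the 14-digit YYYYMMDDhhmmss string the URL expects. Here is a small sketch; the function names are hypothetical.

```python
from datetime import datetime


def wayback_timestamp(when):
    """Format a datetime as the 14-digit YYYYMMDDhhmmss string used in Wayback URLs."""
    return when.strftime("%Y%m%d%H%M%S")


def wayback_url_near(url, when):
    """Return a Wayback URL asking for the snapshot closest to the datetime when."""
    return "http://web.archive.org/web/" + wayback_timestamp(when) + "/" + url
```

For a blog entry posted on January 26th, 2012 at 7:56 am, `wayback_timestamp(datetime(2012, 1, 26, 7, 56))` gives `20120126075600`.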

Anyway, I found this to be a very easy improvement to my standard http-get function that, at least in some cases, improves functionality dramatically! I hope this was helpful to someone.

3 Comments

  1. dAN wrote:

    Nifty tip with the date parameter. I love it when well engineered websites display this kind of emergent functionality

    As far as I knew archive.org tends not to scrape ‘n store images though? I hope I’m wrong for the sake of your project

    On a related note the founder of archive.org has an interesting blog
    http://brewster.kahle.org

    Thursday, January 26, 2012 at 7:56 am | Permalink
  2. Shannon wrote:

    Dan – You’re totally wrong thankfully. They store tons of images and I was able to restore 99% of the images that were missing from the data set I was working with. It was an optimal solution. Thanks for the tip on their blog, I wasn’t aware of it.

    Thursday, January 26, 2012 at 11:55 am | Permalink
  3. dAN wrote:

    Glad to hear it :)

    Thursday, January 26, 2012 at 12:06 pm | Permalink
