Link Rot: Scourge of Modern Self-Educators
June 28, 2008
Spend nearly any amount of time on the Internet and you’ll run across what’s called a 404 error message: a standard HTTP response code signifying that the server was reached but could not find the page you asked for. Many know it simply as “oops” or, more commonly, as a number of profane remarks. It is the result of the ironically natural decomposition of the digital world, a process that claims a number of valuable LIS resources every year.
Just have a look at one of the LIS directories out there. At the height of my initial, obsessive fervor, I once came across a wonderful introduction to collection development hosted by the Arizona state library. In my excitement, I bookmarked the site and read about half of the short articles before I began drifting off (it was already late when I found the site). A day later, I clicked on the link only to find a soft 404 error message.
The difference between a true 404 error message and a “soft” 404 message lies in the level of communication. A true 404 error message travels from the server to a person’s browser as a numeric status code in the HTTP response. Once the browser receives this code, it knows to display an error page explaining that the page requested could not be found. A “soft” 404 error message isn’t actually a 404 error at all. When the webmaster of a site no longer wants to host a specific page, he can put up another page (one the server reports with the status code 200 OK) that simply tells you directly in English (or whatever other human language) that the article you requested isn’t available anymore. It is this “soft” 404 error message that makes automated cleanup of link rot nearly impossible because, to a machine, the page that so artfully displays the responsible party’s apology appears to be just like any other functioning page.
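For the curious, here is a rough sketch in Python of how a link checker might tell the two apart. It is only an illustration under my own assumptions: the function names and the list of apology phrases are made up for this example, and any real checker would need a far better heuristic than matching a handful of English phrases.

```python
import urllib.error
import urllib.request

# Hypothetical phrases that often show up on "soft" 404 pages.
# This list is an assumption for illustration, not an established standard.
SOFT_404_PHRASES = (
    "page not found",
    "no longer available",
    "could not be found",
)


def classify_response(status_code, body_text):
    """Label a response using its status code and, for 200 OK, its text."""
    if status_code == 404:
        return "hard 404"
    if status_code == 200:
        lowered = body_text.lower()
        # The server says everything is fine, so the only clue is the wording.
        if any(phrase in lowered for phrase in SOFT_404_PHRASES):
            return "possible soft 404"
        return "ok"
    return "other"


def check_link(url, timeout=10):
    """Fetch a URL and classify what comes back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return classify_response(response.status, body)
    except urllib.error.HTTPError as err:
        # urllib raises an exception for error codes such as 404.
        return classify_response(err.code, "")
    except urllib.error.URLError:
        # The server could not be reached at all, so no status code exists.
        return "unreachable"
```

Notice that the hard 404 is settled by the number alone, while the soft 404 can only ever be a guess based on the wording of the page. That guesswork is exactly why automated cleanup struggles.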
Wake up! The boring part is over…
So how do we address this problem? To preserve a paper resource that may normally be discarded by the reader such as a journal or a newspaper, physical archives are created–the same can be done with web pages in a digital environment through web archiving. There are some formats that are more convenient than others and, when you get into matters of multimedia, there are some formats more capable of thorough preservation than others. The latter, however, is not typically required for the preservation of simple web pages. The most popular format choices are:
-HTML (HyperText Markup Language): Saves only what the HTML coding itself represents (text, color, underlining, text size, etc.).
-“Webpage, complete”: Found in some browsers on computers running a Windows operating system, this option also saves the HTML but additionally saves images from the page in a separate but linked folder on your hard drive.
-MHTML (MIME HTML, where MIME stands for Multipurpose Internet Mail Extensions): Also known on some Windows computers as “Web Archive”, this format includes the images and attachments of a page like “Webpage, complete” does but, instead of adding these images to a linked folder, it embeds them with the text in a single file, thereby eliminating clutter on the drive.
-PDF (Portable Document Format): While not technically a format for saving web pages, a lot of sites link to information in this format and it is, at least for multi-page documents, very convenient because it saves the time of storing several individual pages.
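To make the idea of embedding images alongside the text a little more concrete, here is a minimal sketch in Python that bundles a page and its images into one MHTML-style file using the standard email library. The function name build_mhtml and the example URLs are hypothetical, and I am assuming the usual layout of a multipart/related message in which each part carries a Content-Location header. Treat it as an illustration of the idea, since a file built this way is not guaranteed to open in every browser.

```python
from email.message import EmailMessage


def build_mhtml(page_url, html, images):
    """Bundle a page and its images into a single MIME message.

    images maps each image URL to a (raw_bytes, subtype) pair,
    for example {"http://example.com/a.png": (data, "png")}.
    """
    msg = EmailMessage()
    msg["Subject"] = "Saved copy of " + page_url
    # The HTML becomes the first part, labeled with the address it came from.
    msg.set_content(
        html, subtype="html", headers=["Content-Location: " + page_url]
    )
    for image_url, (data, subtype) in images.items():
        # Each image is embedded (base64 by default) instead of being
        # written to a separate linked folder.
        msg.add_related(
            data,
            maintype="image",
            subtype=subtype,
            headers=["Content-Location: " + image_url],
        )
    return msg.as_bytes()
```

Writing the returned bytes to a file ending in .mht gives you one self-contained file, in contrast to the page-plus-folder arrangement of “Webpage, complete”.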
The only other issue is the drive on which this information is saved. The hard drives built into the average computer have more than enough space, but even the relatively cheap 1 GB flash drives have many times the space that one person working on an amateur basis will ever use. These types of drives typically fit into USB ports for convenience and have a data retention life somewhere around a decade, so they’re a good investment for under twenty bucks apiece.
So I lied…the boring part wasn’t over but now it is–place a comment and put what you’ve learned into action!
Tagged: 404, Crawlers, Digital Preservation, error message, Flash drive, Library, Library Science, MHTML, Preservation, Web Archive, Web Archiving, web crawler