The future of history

Can the internet be classed as an historical document?

Written by Billy Aiken

History is a hard thing to preserve: the manuscripts of the Library of Alexandria burned some 1,600 years ago, and in the early days of printing much material perished because it wasn't considered valuable. But now we live in the information age and have a real appetite for knowledge.

Printed material is routinely scanned into digital form, and copies can be made and distributed easily. Old movie newsreels are transferred to VHS or digitised, and we are even rebuilding the Library of Alexandria in the shape of the Bibliotheca Alexandrina, which reflects the scope of the original, which boasted 700,000 indexed manuscripts from the third century BC.

The ancient library and museum was a meeting place for peoples and cultures, giving birth to a new intellectual dynamic. By gathering in one place all the known sources of knowledge, and indexing them for scholarly study, it created the world's first research institute.

Read that again and you could almost replace 'ancient library' with 'the internet': the meeting of cultures, the intellectual dynamic, the global source of knowledge. Which raises the question: is the internet an historical document?

The Wayback Machine

Hoping to answer this question by archiving the internet itself is Brewster Kahle, a man with a solid pedigree when it comes to web information services.

Kahle invented the Wide Area Information Servers (WAIS) system back in 1989, enabling users to find and retrieve documents from online databases by way of full-text searching before search engines as we know them even existed. (Yahoo was born in 1994, soon after the first graphical web browsers became available.)

"In 1996, we created the Internet Archive because we felt it was critical to preserve a permanent record of this historically significant new medium for the public," he said at the Wayback launch in October 2001.

"The Archive has catalogued over 10 billion web pages that might otherwise have been lost, giving us a record of the origins and evolution of the internet, as well as snapshots of our society as a whole around the turn of the century."

The Wayback Machine provides a free, public and powerful front end to the Internet Archive for the first time, and together this 'Internet Library' hopes to prevent a repeat of history, where early materials from any medium are lost and end up being partially restored through fragments.

But Kahle might already be too late, despite the impressive figures of 10 billion web pages archived and 16 million Usenet group postings consuming a huge 100TB of storage.

The problem is that the Internet Archive only stretches back to 1996, yet the web itself stretches back a few years further, and the internet adds a couple of decades more to the timeline.

So Wayback is only ever going to be a partial archive of material, which leads to a second problem: many sites won't be archived at all because they are password protected or simply don't allow the automated software processes that Wayback uses to make its sweeps of the web.

These sweeps take two months to complete, with the result that the archive is growing at a rate of an additional 10TB per month.

Yet at any time the publishers of archived material can request its removal. The request will be honoured, but that doesn't mean the information itself will disappear.
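For readers curious how a site can refuse automated sweeps of the kind described above, one common mechanism is a robots.txt exclusion file. The article does not say which mechanism Wayback's crawler respects, so treat the following as a minimal sketch under that assumption. The file contents, the example.com address and the agent name `ia_archiver` are illustrative placeholders, not details from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; not taken from any real server.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text so the example needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"
    for agent in ("ia_archiver", "some_other_bot"):
        verdict = "allowed" if may_archive(ROBOTS_TXT, agent, page) else "blocked"
        print(agent, verdict)
```

In this sketch the archiving agent is shut out of the whole site, while other crawlers are only kept away from the /private/ directory. A well-behaved crawler would run a check like this before fetching each page.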

Wayback when

While Wayback will delete pages at their publisher's request, it cannot delete all references to those pages. As many government agencies have discovered when trying to remove sensitive information from the web, copies proliferate to the point where removal is impossible.

Take the example of the federal Agency for Toxic Substances and Disease Registry (ATSDR) which pulled a sensitive report from its website entitled 'Industrial Chemicals and Terrorism' within days of 11 September.

All traces of the document, and its description of sources for home-brew nerve gases and explosives, were eliminated. Not that it mattered, as a search on Google will find 15,000 references, and the first page of results will have direct links to the report in full.

Or you could simply go to the Wayback Machine, enter the URL for the ATSDR and click on the 4 August version of the site which contains a full copy of the report.

Some agencies have got wise and are censoring retroactive access. Try looking for the Nuclear Regulatory Commission site and you get a 'Blocked Site Error' notice 'per the request of the site owner'.

While it is obvious that the operators of the Wayback Machine are treading on delicate legal and moral ground when it comes to censorship, they have little choice but to allow site owners the right of erasure, even if this will erode the historical value of the project.

However, some might say that the existence of a blocked site error is an historically important comment in itself. By using other internet research resources, primarily Google-type mega-search indices, it is easy to see why the site owner might feel such censorship was necessary, given the political and social climate of the time.

History in the making

There are billions of voices represented on the web, but only fleetingly: the average life span of web page information is between 75 and 100 days.

So the archiving of online material is becoming increasingly important if we are to provide future generations access to its historical treasure trove of information.

An internet archive will preserve how our culture responds to events at any given time, showing humanity at its best and worst, cutting across cultures and societies to provide the ultimate global cross-section reference point.

The Internet Archive's digital library at the Wayback Machine is now officially the world's largest known database, beating every other library in the world for digital data held, even eclipsing the Library of Congress and British Library collections.

Using the Wayback Machine is easy: go to the site, pop in a URL, select a date from the table that appears and up comes the site on that date - or rather a copy of it.

Some of these copies may be missing part or all of their graphical content, and many of the links won't work, but it's the closest thing to a microfiche of the internet that we are likely to get.

So if you want to research the 11 September attacks, document the website progressions of the 'Heaven's Gate' cult leading up to their mass suicide in 1997, or simply get a snapshot of the web at any point of time, you can do it here.
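The lookup described above (a URL plus a date) can also be expressed as a single web address. The sketch below assumes the `web.archive.org/web/<timestamp>/<url>` pattern, which is not stated in the article. The example.com address and the date are placeholders, and the function name `snapshot_url` is invented for illustration.

```python
from datetime import date

# Assumed address pattern; the article only says to visit the site and enter a URL.
WAYBACK_PREFIX = "https://web.archive.org/web"


def snapshot_url(page_url: str, when: date) -> str:
    """Build an address requesting the archived copy of page_url nearest to `when`."""
    timestamp = when.strftime("%Y%m%d")  # e.g. 15 January 2002 becomes 20020115
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    print(snapshot_url("http://www.example.com/", date(2002, 1, 15)))
```

The code only builds the address and fetches nothing. Whether a copy exists for a given date depends on what the archive actually captured.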

We asked writer and part-time historian Barnaby Page what constitutes a document of historical interest. "Any document has potential historical interest, particularly to social historians when often a collection of documents is of more interest than the sum of the individual documents," he said.

"A single list in isolation would tell us little beyond some bare facts. I can imagine that en masse, large numbers of personal home pages could tell future historians quite a bit about early 21st century life."

New pamphlets for old

Indeed, Miranda Mowbray, a researcher with the Hewlett Packard Labs in Bristol, sees a clear analogy between British 17th century pamphlets and self-published web pages on social and political topics.

"In the 17th century, pamphlets were tricky to censor and contained a large variety of alternative perspectives and theories," she said. "I imagine the same goes for 19th century personal letters. Because internet content is produced from a much wider variety of sources than in most other media, it probably provides a truer reflection of the things people currently believe."

But perhaps the most salient question at this moment is whether historians and researchers will accept the internet as a legitimate source of historical data.

Paul Harwood, a qualified librarian with more than 20 years' experience, and managing director of Swets Blackwell UK - one of the world's largest suppliers of subscription and information management tools working with 60,000 libraries worldwide - is cautious.

"The main difference between the published data and journals we handle and documentation available via the internet is that the printed matter will have been subjected to a certain amount of examination and peer review, and will be proven through research and experimentation," he explained.

"Information or documentation supplied over the internet is not subjected to the same process, which means people could take the information at face value or as proven valid historical fact when it could be little more than speculation."

Although there is some merit in the argument that every version of a document contributes to the historical big picture, and some may insist that documents which have been tampered with provide the most insight when compared to the originals, it is vital that there is some method of validating the integrity of internet documents in order to ensure a factual baseline from which to work.

Stanford Library in the US is collaborating with the National Science Foundation and Sun Microsystems to create LOCKSS (Lots Of Copies Keep Stuff Safe). LOCKSS is a Java/Linux-based server system designed to store electronic journals at library level.

These systems communicate with each other to preserve the integrity of their caches, and to ensure that a minimum number of copies of each article exist globally.
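One way to picture caches checking each other's integrity is to compare cryptographic digests of each copy and treat the most common digest as the good one. The sketch below is a deliberately simplified illustration of that idea and not the actual LOCKSS protocol. The function names, the library labels and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Fingerprint one cached copy of an article."""
    return hashlib.sha256(content).hexdigest()


def audit_copies(copies: dict[str, bytes], min_copies: int) -> tuple[list[str], bool]:
    """Compare every cache's copy against the majority version.

    Returns the names of caches whose copy disagrees with the majority,
    and whether at least min_copies matching copies remain.
    """
    if not copies:
        raise ValueError("no copies to audit")
    digests = {name: digest(data) for name, data in copies.items()}
    # The most common digest is treated as the intact version; a tie is
    # broken arbitrarily, which a real system would need to handle properly.
    majority_digest, good_count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(name for name, d in digests.items() if d != majority_digest)
    return damaged, good_count >= min_copies


if __name__ == "__main__":
    caches = {
        "library_a": b"article text",
        "library_b": b"article text",
        "library_c": b"article teXt",  # simulated corruption
    }
    print(audit_copies(caches, min_copies=2))
```

A cache flagged as damaged could then fetch a fresh copy from one of the caches that agree with the majority. That repair step is left out here to keep the example short.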

Mirror to society

Internet archives change the web from being a collection of ephemera into an historically important artefact of modern political, cultural and social life.

Historians need all the relevant information they can find in order to gain a worthwhile perspective on their particular subject matter. For the first time the source of this information, and the mechanism by which it can be effectively searched, are one and the same.

The internet, and the web in particular, allows everyone to exercise their right to remember as we move into an age of digital media. Whether you are an historian, journalist, student or just a 'seeker of the truth', an internet archive offers a mirror to society.

Sure, no such archive can ever be complete, but a partial archive is immensely valuable. After all, very few historians work on the 'complete history of everything' basis, and none believe they have to read every available piece of source material to arrive at a valid big picture.

Rachel Stewart, manager of the corporate archive for Intel, sums it up well. "Future researchers will become data analysers not simply looking for the information, but examining all of the data and making up their own minds," she said.

"The internet is a picture of humankind yesterday, today and tomorrow. More people will have the opportunity to review information, and a better informed world makes for one able to make clearer, more intelligent decisions."

Finding archived material online

Here are a few websites to get you started:

British Library
www.bl.uk

Library of Congress
www.loc.gov

Wayback Machine
www.archive.org

Google Groups
groups.google.com

Unesco Archives Portal
www.unesco.org/webworld/portal_archives
