A permanent record of every NZ website

This month, the National Library of New Zealand has taken up a seriously impressive challenge.  For the first time ever, they’re undertaking a comprehensive harvest of the New Zealand internet — that means making a permanently archived copy of every single website that falls under the .nz domain.

The 2008 Domain Harvest is part of the National Library’s National Digital Heritage Archive (NDHA) — a pioneering digital preservation programme that is being closely watched by archives and national libraries worldwide.

This effort is part of a growing recognition that the important cultural artefacts of our time are no longer finding their way into libraries.  Instead, they are languishing in increasingly obsolete formats (ever considered how difficult it is to retrieve data from a 5¼-inch floppy disk these days?) or simply vanishing in a puff of ones and zeroes.

In its introduction to the project, the Library states:

“The National Library of New Zealand has a legal mandate and a social responsibility to preserve New Zealand’s social and cultural history, be it in the form of books, newspapers and photographs, or of websites, blogs and YouTube videos…. We will be able to look back on internet documents as we do the printed words left to us by previous generations.”

Rather than hiding from the digital era, the NDHA tackles head-on the extremely complex technical problems involved in preserving the contents of our computers for posterity.

It’s worth pausing for a moment on what a thoroughly ambitious initiative this is. Consider this excerpt from the Library’s Web Harvest FAQ:

I run several large New Zealand websites in .co.nz containing literally tens of millions of pages plus unknown quantities of dynamic pages. Thousands of pages change daily and total content is several hundred gigabytes, and a lot of that is video and imagery. Do you intend to download all of the content from all of my sites?

“In principle, yes.

“In practice the internet is infinitely large, because of the large number of dynamic pages, and it is impossible to harvest everything. We will therefore have to stop harvesting at some point. Our initial target for this harvest is 100 million URLs (we may extend this to 150 million).”

These 100 million or so URLs will be copied, digitally preserved and eventually made accessible to the public as a snapshot of internet life in this year of the 21st century.
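The FAQ’s point about having “to stop harvesting at some point” boils down to a crawl with a scope rule and a hard cap. The sketch below is purely illustrative and does not reflect the Library’s actual harvesting software or settings; the .nz scope test, the URL_CAP constant and the one-second politeness delay are assumptions chosen simply to make the idea concrete.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import time

URL_CAP = 100          # stand-in for the harvest's 100-million-URL target
CRAWL_DELAY = 1.0      # assumed politeness delay between requests

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def in_scope(url):
    """Only URLs under the .nz top-level domain are in scope."""
    host = urlparse(url).hostname or ""
    return host == "nz" or host.endswith(".nz")

def harvest(seed_urls):
    """Breadth-first crawl of in-scope URLs, stopping at the cap."""
    seen, queue, archived = set(), deque(seed_urls), []
    while queue and len(archived) < URL_CAP:
        url = queue.popleft()
        if url in seen or not in_scope(url):
            continue
        seen.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                body = resp.read()
        except OSError:
            continue                      # skip unreachable pages
        archived.append((url, body))      # a real harvester writes archive files to disk
        parser = LinkExtractor()
        parser.feed(body.decode("utf-8", errors="replace"))
        for href in parser.links:
            queue.append(urljoin(url, href))
        time.sleep(CRAWL_DELAY)
    return archived
```

Seeded with a handful of .nz homepages, harvest() keeps following links until it has stored URL_CAP pages. The real harvest operates at a vastly larger scale with purpose-built software, but it faces the same basic trade-off the FAQ describes: the crawl has to be cut off somewhere, and the 100-million-URL target is where the Library has chosen to draw the line.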

This is the only planned “whole-of-domain” harvest in the near future.  Selective harvesting, or “web curating”, will continue indefinitely using new software designed by NLNZ and the British Library.

The National Library will release an update on the progress of the 2008 web harvest at the end of this week.