Harvest time

Archiving organizations take on the big job of preserving .gov sites before a new administration arrives

FIVE FEDERAL, educational and private archiving organizations
have partnered to crawl the Web, gathering data from sites in the
.gov domain to create an end-of-term snapshot for posterity.

The ambitious Web harvest is an effort to preserve millions of
pages of government Web sites that are in danger of disappearing,
or at least changing, when a new administration comes into office
Jan. 20.

'No matter who wins [the presidential election], we expect
there will be changes in the policies governing Web sites,'
said Abbie Grotke, coordinator of digital media projects for the
Library of Congress' Office of Strategic Initiatives.

Much of what exists online now is liable to disappear regardless
of policy, as sites are updated on a regular basis, sometimes

LOC has been doing monthly crawls of congressional Web sites
since late 2003. However, 'there is a bit of a gap in who is
responsible for archiving the executive and judicial branches at
the end of the term,' said Kris Carpenter, Web group director
at the Internet Archives.

To fill that gap, LOC, the Internet Archives, the California
Digital Library and the University of North Texas, with some help
from the Government Printing Office, have taken on the job.

'Nobody told us we needed to do it, but we realized there
was nobody else tasked to do this,' Grotke said. 'We
all thought it was important to do.'

'It would be a tragedy if we didn't attempt to
preserve this,' Carpenter said.

All of the organizations are members of the International
Internet Preservation Consortium and frequently cooperate on
similar projects.

The project is a big one. The Internet Archives estimates that
it will gather some 125 million pages from around 5,000 sites.
Estimates of the total volume of data to be collected range from 10
to 20 terabytes.
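As a rough sanity check on those figures, the estimates imply an average page size of somewhere between 80 and 160 kilobytes. The short Python calculation below uses only the numbers quoted above; treating a terabyte as 10^12 bytes is an assumption, and the names are illustrative.

```python
# Back-of-the-envelope check using only the figures quoted in the article.
PAGES = 125_000_000   # estimated number of pages to be gathered
TB = 10**12           # assumption: decimal terabytes (10^12 bytes)


def avg_bytes_per_page(total_terabytes, pages=PAGES):
    """Average page size in bytes implied by a total collection size."""
    return total_terabytes * TB / pages


low = avg_bytes_per_page(10)    # lower estimate: 10 terabytes
high = avg_bytes_per_page(20)   # upper estimate: 20 terabytes
print(f"{low / 1000:.0f} KB to {high / 1000:.0f} KB per page")
```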

Each organization is contributing according to its expertise.
LOC will focus on development of its archives for the project; the
Internet Archives began a comprehensive baseline crawl of the .gov
domain in August and will do a second crawl before inauguration
day; the University of North Texas and the California Digital
Library will focus on prioritizing sites that need more frequent
attention and doing more in-depth crawls; and the GPO federal
depository library program is offering advice on curating the

Now you see it ...

The online world presents a paradox. We often are warned that
once something appears online it never really disappears and that
incautious statements can come back to haunt us years after they
are made. But at the same time, online data is ephemeral,
constantly changing and moving even if we cannot be sure that it
has ever been expunged. This makes finding and documenting it for
future reference difficult.

Because they are dynamic, 'all Web sites are at
risk,' Carpenter said.

The Internet Archives was established in 1996 to make digital
material permanently available. The collection is not intended to
be comprehensive, Carpenter said. 'It doesn't include
everything. We're really focused on digital heritage,'
that is, how society manifests itself online.

The collection now consists of a petabyte of compressed data and
is expected to expand by multiple petabytes a year. To access the
data, the Internet Archives has developed the Wayback Machine, an
open-source online tool that retrieves an archived site when a user
enters its URL and a date. Once in a site, users can follow its hyperlinks.
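
The idea of a lookup keyed on a URL and a date can be illustrated with a small Python sketch. This is not the Wayback Machine's actual code or index format; the in-memory index, the sample address and the function name are all hypothetical, and the sketch simply returns the capture closest to the requested date.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical in-memory index: URL -> capture timestamps, sorted ascending.
captures = {
    "http://www.example.gov/": [
        datetime(2008, 8, 15),
        datetime(2008, 11, 5),
        datetime(2009, 1, 19),
    ],
}


def closest_capture(url, when):
    """Return the capture time nearest to `when`, or None if never archived."""
    stamps = captures.get(url)
    if not stamps:
        return None
    i = bisect_left(stamps, when)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = stamps[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda t: abs(t - when))


print(closest_capture("http://www.example.gov/", datetime(2008, 11, 1)))
```

Because the lookup starts from a known address, it behaves like browsing rather than searching: a reader who does not already have the URL has no way in.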

The University of North Texas has set a similar, if more limited,

'We've been involved in capturing public government
Web sites since 1997, when we started the CyberCemetery,'
said Cathy Hartman, assistant dean of digital and information
technologies. 'We are a federal depository library, and we
looked at this as part of the role that the department should be

The CyberCemetery originally archived Web sites from agencies
that were shutting down after hitting the end of life, or at least
the end of their funding. The project began by going out and
looking for these sites, and as word of its mission spread,
agencies began contacting the cemetery to contribute sites for
preservation. 'Now we are harvesting agencies that are not
dying,' Hartman said.

The CyberCemetery's collection is not officially
sanctioned, but the university and GPO have developed guidelines
specifically for collecting and preserving digital material.

The online environment has evolved rapidly, Hartman said.
'When Clinton came to office [in 1993], there was almost
nothing available on government Web sites.' Today, many
government services are offered online and millions of pages of
official information are maintained online.

The university did a fairly extensive harvest of .gov sites four
years ago, but this year's project is more extensive. One of
the first challenges it faced was coming up with a list of Web
sites to crawl and harvest. Compiling that list is an ongoing task,
but once the duplicates and inappropriate sites are removed, it
probably will be about 4,200 sites, said Mark Phillips, head of the
University of North Texas' digital projects unit.

'One of the problems we had is that there are multiple
organizations involved, and each of us had a different list of the
.gov domain,' Phillips said. 'But nobody had a
comprehensive list.'

The California Digital Library, for instance, had a broad list
that included many state government sites that are outside the
scope of this project. So the university developed the URL
Nomination Tool (UNT) to help create an authoritative list. It is a
Web-based tool developed with the Django open-source Web framework
with a MySQL open-source database on the back end.
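
The core of that list-building problem, merging overlapping lists while dropping duplicates and out-of-scope sites, can be sketched in a few lines of Python. This is not the university's tool or its data model; the partner labels, sample addresses, excluded suffix and function names are hypothetical, and the normalization is deliberately crude (it ignores ports and query strings).

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Nomination:
    url: str                                        # normalized seed URL
    nominated_by: set = field(default_factory=set)  # a little metadata


def normalize(raw):
    """Lower-case the host, default the path to '/', drop query and fragment."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw       # assumption: bare hosts mean http
    parts = urlsplit(raw)
    return f"{parts.scheme}://{parts.hostname or ''}{parts.path or '/'}"


def merge(lists, excluded_suffixes=()):
    """Combine partner lists into one de-duplicated, in-scope seed list."""
    merged = {}
    for partner, urls in lists.items():
        for raw in urls:
            url = normalize(raw)
            host = urlsplit(url).hostname or ""
            if not host.endswith(".gov"):
                continue            # outside the .gov domain
            if any(host == s or host.endswith("." + s) for s in excluded_suffixes):
                continue            # e.g. state government sites
            merged.setdefault(url, Nomination(url)).nominated_by.add(partner)
    return merged


# Made-up sample lists from two hypothetical partners.
lists = {
    "partner_a": ["http://www.Example.gov", "http://www.example.gov/"],
    "partner_b": ["www.example.gov/", "http://www.ca.gov/", "http://example.com/"],
}
seeds = merge(lists, excluded_suffixes=("ca.gov",))
print(sorted(seeds))
```

Recording who nominated each address keeps the merged list traceable back to the partner lists it came from.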

'It's a way to add a little bit of metadata'
about the URLs and Web sites being considered, Phillips said.
'The biggest challenge in building the tool was the data
model.' Creating a user interface was easy, but it took time
to come up with a common data model that would accommodate the
various types of information included in URL lists from different

The UNT was created specifically for the end-of-term harvest
because of the size of the project and the number of collaborators
involved. But creating a list of target sites is 'a process
that happens every time you do a big crawl,' Phillips said.
'This is a tool that we will be able to use in other
projects. We're hoping to release it as an open-source tool
for the Web harvest community.'

Cooperation not guaranteed

Crawling the Web and gathering pages from the sites are not
technically difficult. The Internet Archives is using its Heritrix
open-source crawler designed for massive Web-scale crawls. It
starts with the list of seed addresses from the URL Nomination Tool and
visits each of them, following up links within each domain and
crawling them as well. The job is complicated by the fact that many
sites are dynamic, with content for a single page hosted on
different servers. The activity of the crawler is monitored to
avoid getting caught in a loop of links or other mishaps.
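
The basic loop described here, starting from seeds, following links within each domain and avoiding endless cycles, can be sketched in Python. This is a toy illustration, not Heritrix: the link graph is an in-memory stand-in for real fetching, and the function names, sample addresses and page cap are assumptions.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl that stays on the seeds' hosts and never revisits a URL."""
    allowed_hosts = {urlsplit(s).hostname for s in seeds}
    seen = set(seeds)            # guards against loops of links
    frontier = deque(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if urlsplit(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for fetched pages; note the cycle
# between the home page and the "about" page, and the off-site link.
graph = {
    "http://a.example.gov/": ["http://a.example.gov/about", "http://other.example.com/"],
    "http://a.example.gov/about": ["http://a.example.gov/", "http://a.example.gov/contact"],
    "http://a.example.gov/contact": [],
}
pages = crawl(["http://a.example.gov/"], lambda url: graph.get(url, []))
print(pages)
```

A production crawler adds much more, such as politeness delays and handling of content spread across several servers, which is where the complications mentioned above come in.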

The cooperation of Web site owners is helpful when doing a crawl
to harvest data, and some provide site maps for crawlers to help
them on their way.

But 'we don't always get cooperation,'
Carpenter said. Whitehouse.gov is an example of an uncooperative
site. Although the Internet Archives generally respects the privacy
of site owners who do not want to participate in its harvests,
those sites that the Archives considers public, such as the White
House site, get included whether they like it or not.

Once the baseline material has been gathered, archivists and
researchers at the University of North Texas and California Digital
Library will identify sites likely to change frequently by January
and follow up on them with more frequent harvests.
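
The article does not say how those sites will be identified. One possible automated aid, sketched below in Python purely as an assumption, is to compare content digests from two crawls and rank sites by the share of pages that changed. The function names and sample data are hypothetical.

```python
import hashlib


def digest(content):
    """SHA-256 fingerprint of a page's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def change_rate(first, second):
    """Fraction of URLs present in both crawls whose content differs."""
    shared = first.keys() & second.keys()
    if not shared:
        return 0.0
    changed = sum(digest(first[u]) != digest(second[u]) for u in shared)
    return changed / len(shared)


def rank_sites(crawls):
    """Site names sorted by change rate, most volatile first."""
    return sorted(crawls, key=lambda site: change_rate(*crawls[site]), reverse=True)


# Made-up data: each site maps to (first crawl, second crawl) of path -> bytes.
crawls = {
    "stable": ({"/": b"a", "/x": b"b"}, {"/": b"a", "/x": b"b"}),
    "busy": ({"/": b"a", "/x": b"b"}, {"/": b"a2", "/x": b"b"}),
}
print(rank_sites(crawls))
```

A measure like this would only flag candidates; deciding which sites merit extra harvests remains a curatorial judgment.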

Nobody expects the results to be comprehensive. 'We
can't at this stage afford to capture everything as it
changes, in real time,' Carpenter said. 'But we do
invest heavily to create a representative snapshot at any given

Once the harvest is complete, each partner in the project will
get a complete copy of the material, although the Internet Archives
is expected to be the main point of access for the collection. It
is expected to be available online by March or April. The 10 to 20
terabytes of the .gov end-of-term collection will be just a small
part of the petabyte of data the Internet Archives already is
making available, but it will be a huge addition for the other

'I think we're going to end up with more data than
we're equipped to deal with,' the University of North
Texas' Phillips said.

Even the Internet Archives faces limits on how its large
collections can be used. Its current access technology, the Wayback
Machine, works more like a browser than a search engine, requiring
a URL to find material. So although the end-of-term collection will
be included in the archives' general collection, it probably
will also be broken out as a separate special collection that could
be searched.

But even as a separate collection, the end-of-term harvest might
be approaching the upper limit of what is feasible to search with
current search engines. Searching a group of pages that have
changed over time is different from using Google to do a search of
the live Web, which covers only current pages.

'We've been fairly happy with tools that have scaled
to the hundreds of millions of documents,' Carpenter said.
But once you get to a billion documents, the quality of search
results drops off quickly.

Mining such massive collections is the next big step in Internet
preservation, Hartman said. She said the university has not yet
decided whether it will use tools to pull together subject-specific
content in a search or to break the collection up into smaller sets
by subject matter.

'We may have to have separate collections,' she
said. 'Experiment and research will tell for sure.'

That could be the next project for this team. 'We are
hoping to find some funding for research in that area,'
Hartman said. 'There's a strong interest among our
partners in that.'
