Harvest time

 


Five federal, educational and private archiving organizations have partnered to crawl the Web, gathering data from sites in the .gov domain to create an end-of-term snapshot for posterity.



The ambitious Web harvest is an effort to preserve millions of
pages of government Web sites that are in danger of disappearing,
or at least changing, when a new administration comes into office
Jan. 20.


'No matter who wins [the presidential election], we expect
there will be changes in the policies governing Web sites,'
said Abbie Grotke, coordinator of digital media projects for the
Library of Congress' Office of Strategic Initiatives.


Much of what exists online now is liable to disappear regardless
of policy, as sites are updated on a regular basis, sometimes
daily.


LOC has been doing monthly crawls of congressional Web sites
since late 2003. However, 'there is a bit of a gap in who is
responsible for archiving the executive and judicial branches at
the end of the term,' said Kris Carpenter, Web group director
at the Internet Archives.


To fill that gap, LOC, the Internet Archives, the California
Digital Library and the University of North Texas, with some help
from the Government Printing Office, have taken on the job.


'Nobody told us we needed to do it, but we realized there
was nobody else tasked to do this,' Grotke said. 'We
all thought it was important to do.'


'It would be a tragedy if we didn't attempt to
preserve this,' Carpenter said.


All of the organizations are members of the International
Internet Preservation Consortium and frequently cooperate on
similar projects.


The project is a big one. The Internet Archives estimates that
it will gather some 125 million pages from around 5,000 sites.
Estimates of the total volume of data to be collected range from 10
to 20 terabytes.
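As a rough check on those figures, the short calculation below works out the average page size they imply. It is only a sketch: it assumes decimal units (1 terabyte = 10^12 bytes), and the function name is made up for illustration.

```python
def avg_kb_per_page(total_terabytes, pages):
    """Average size of one page in kilobytes, using decimal units."""
    total_bytes = total_terabytes * 10**12  # assumption: 1 TB = 10^12 bytes
    return total_bytes / pages / 1000


PAGES = 125_000_000  # the estimate of pages to be gathered

low = avg_kb_per_page(10, PAGES)   # lower volume estimate
high = avg_kb_per_page(20, PAGES)  # upper volume estimate
print(low, high)
```

On these assumptions the estimates work out to an average of roughly 80 to 160 kilobytes per page.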


Each organization is contributing according to its expertise.
LOC will focus on development of its archives for the project; the
Internet Archives began a comprehensive baseline crawl of the .gov
domain in August and will do a second crawl before inauguration
day; the University of North Texas and the California Digital
Library will focus on prioritizing sites that need more frequent
attention and doing more in-depth crawls; and the GPO federal
depository library program is offering advice on curating the
collection.


Now you see it ...


The online world presents a paradox. We often are warned that
once something appears online it never really disappears and that
incautious statements can come back to haunt us years after they
are made. But at the same time, online data is ephemeral,
constantly changing and moving even if we cannot be sure that it
has ever been expunged. This makes finding and documenting it for
future reference difficult.


Because they are dynamic, 'all Web sites are at
risk,' Carpenter said.


The Internet Archives was established in 1996 to make digital
material permanently available. The collection is not intended to
be comprehensive, Carpenter said. 'It doesn't include
everything. We're really focused on digital heritage,'
that is, how society manifests itself online.


The collection now consists of a petabyte of compressed data and
is expected to expand by multiple petabytes a year. To access the
data, the Internet Archives has developed the Wayback Machine, an
open-source online tool that retrieves an archived site when a user
enters its URL and a date. Once inside a site, users can follow its
hyperlinks.
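A lookup by URL and date can be pictured with the minimal sketch below. It is not the Wayback Machine's code. It assumes a hypothetical in-memory index that maps each URL to a sorted list of capture dates, and it assumes the tool should return the capture closest to the requested date.

```python
from bisect import bisect_left
from datetime import date


def closest_capture(captures, url, wanted):
    """Return the capture date for `url` nearest to `wanted`, or None.

    `captures` is a hypothetical index: {url: [sorted capture dates]}.
    """
    dates = captures.get(url)
    if not dates:
        return None
    # Position of the first capture on or after the wanted date.
    i = bisect_left(dates, wanted)
    # Only the neighbors on either side of that position can be nearest.
    candidates = dates[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda d: abs(d - wanted))


# Made-up capture dates for an example URL.
index = {
    "http://example.gov/": [date(2008, 8, 15), date(2008, 11, 1), date(2009, 1, 19)],
}
print(closest_capture(index, "http://example.gov/", date(2008, 12, 20)))
```

Because the list is sorted, the search needs to compare only the two captures on either side of the requested date rather than scanning every capture.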


The University of North Texas has set a similar, if more limited,
mission.


'We've been involved in capturing public government
Web sites since 1997, when we started the CyberCemetery,'
said Cathy Hartman, assistant dean of digital and information
technologies. 'We are a federal depository library, and we
looked at this as part of the role that the department should be
filling.'


The CyberCemetery originally archived Web sites from agencies
that were shutting down after hitting the end of life, or at least
the end of their funding. The project began by going out and
looking for these sites, and as word of its mission spread,
agencies began contacting the cemetery to contribute sites for
preservation. 'Now we are harvesting agencies that are not
dying,' Hartman said.


The CyberCemetery's collection is not officially
sanctioned, but the university and GPO have developed guidelines
specifically for collecting and preserving digital material.


The online environment has evolved rapidly, Hartman said.
'When Clinton came to office [in 1993], there was almost
nothing available on government Web sites.' Today, many
government services are offered online and millions of pages of
official information are maintained online.


The university did a fairly extensive harvest of .gov sites four
years ago, but this year's project is more extensive. One of
the first challenges it faced was coming up with a list of Web
sites to crawl and harvest. Compiling that list is an ongoing task,
but once the duplicates and inappropriate sites are removed, it
probably will be about 4,200 sites, said Mark Phillips, head of the
University of North Texas' digital projects unit.


'One of the problems we had is that there are multiple
organizations involved, and each of us had a different list of the
.gov domain,' Phillips said. 'But nobody had a
comprehensive list.'


The California Digital Library, for instance, had a broad list
that included many state government sites that are outside the
scope of this project. So the university developed the URL
Nomination Tool (UNT) to help create an authoritative list. It is a
Web-based tool developed with the Django open-source Web framework
with a MySQL open-source database on the back end.


'It's a way to add a little bit of metadata'
about the URLs and Web sites being considered, Phillips said.
'The biggest challenge in building the tool was the data
model.' Creating a user interface was easy, but it took time
to come up with a common data model that would accommodate the
various types of information included in URL lists from different
sources.
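The kind of common data model Phillips describes might look something like the sketch below. It is an illustration only, not the university's tool: the class, the function names and the fields are assumptions, and it uses plain Python rather than Django and MySQL so that it runs on its own. Each partner's list is reduced to host names, duplicates are merged, and each entry records which partners nominated it.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Nomination:
    """One candidate site and a little metadata about it (hypothetical model)."""
    host: str
    nominated_by: set = field(default_factory=set)  # partners that listed it


def normalize(url):
    """Reduce a URL to a lowercase host so near-duplicate entries match."""
    # Some lists include the scheme and some do not.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def merge_lists(lists, out_of_scope=frozenset()):
    """Merge per-partner URL lists into one de-duplicated set of nominations."""
    merged = {}
    for partner, urls in lists.items():
        for url in urls:
            host = normalize(url)
            # Skip anything outside .gov or flagged as out of scope.
            if not host.endswith(".gov") or host in out_of_scope:
                continue
            merged.setdefault(host, Nomination(host)).nominated_by.add(partner)
    return merged


# Made-up lists from two hypothetical partners.
lists = {
    "partner_a": ["http://www.epa.gov/", "nasa.gov"],
    "partner_b": ["https://EPA.gov/air", "http://www.ca.gov", "example.com"],
}
merged = merge_lists(lists, out_of_scope={"ca.gov"})
print(sorted(merged))
```

State sites often sit in the .gov domain too, so the sketch takes an explicit out-of-scope set instead of trying to spot them from the address alone.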


The UNT was created specifically for the end-of-term harvest
because of the size of the project and the number of collaborators
involved. But creating a list of target sites is 'a process
that happens every time you do a big crawl,' Phillips said.
'This is a tool that we will be able to use in other
projects. We're hoping to release it as an open-source tool
for the Web harvest community.'


Cooperation not guaranteed


Crawling the Web and gathering pages from the sites are not
technically difficult. The Internet Archives is using its Heritrix
open-source crawler designed for massive Web-scale crawls. It
starts with the list of seed addresses from the URL Nomination Tool and
visits each of them, following up links within each domain and
crawling them as well. The job is complicated by the fact that many
sites are dynamic, with content for a single page hosted on
different servers. The activity of the crawler is monitored to
avoid getting caught in a loop of links or other mishaps.
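The basic pattern of such a crawl can be sketched in a few lines. This is not Heritrix, which is a far more capable crawler. It is a minimal illustration that assumes a hypothetical get_links function returning the links found on a page, and it uses a made-up link graph in place of real network requests.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl that stays on the seeds' hosts and never revisits a URL."""
    allowed_hosts = {urlsplit(seed).hostname for seed in seeds}
    seen = set(seeds)      # every URL already queued, which prevents link loops
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link in seen:
                continue   # already queued or visited
            if urlsplit(link).hostname not in allowed_hosts:
                continue   # link leaves the seed domains
            seen.add(link)
            queue.append(link)
    return visited


# Made-up link graph standing in for fetched pages.
graph = {
    "http://a.gov/": ["http://a.gov/x", "http://other.com/"],
    "http://a.gov/x": ["http://a.gov/", "http://a.gov/y"],
    "http://a.gov/y": ["http://a.gov/x"],
}
pages = crawl(["http://a.gov/"], lambda url: graph.get(url, []))
print(pages)
```

The set of seen URLs is what keeps the crawler out of a loop when pages link back to one another, and the max_pages cap is a crude stand-in for the monitoring described above.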


The cooperation of Web site owners is helpful when doing a crawl
to harvest data, and some provide site maps for crawlers to help
them on their way.


But 'we don't always get cooperation,'
Carpenter said. Whitehouse.gov is an example of an uncooperative
site. Although the Internet Archives generally respects the privacy
of site owners who do not want to participate in its harvests,
those sites that the Archives considers public, such as the White
House site, get included whether they like it or not.


Once the baseline material has been gathered, archivists and
researchers at the University of North Texas and California Digital
Library will identify sites likely to change frequently before
January and follow up on them with more frequent harvests.


Nobody expects the results to be comprehensive. 'We
can't at this stage afford to capture everything as it
changes, in real time,' Carpenter said. 'But we do
invest heavily to create a representative snapshot at any given
time.'


Once the harvest is complete, each partner in the project will
get a complete copy of the material, although the Internet Archives
is expected to be the main point of access for the collection. It
is expected to be available online by March or April. The 10 to 20
terabytes of the .gov end-of-term collection will be just a small
part of the petabyte of data the Internet Archives already is
making available, but it will be a huge addition for the other
institutions.


'I think we're going to end up with more data than
we're equipped to deal with,' the University of North
Texas' Phillips said.


Even the Internet Archives faces limits on how its large
collections can be used. Its current access technology, the Wayback
Machine, works more like a browser than a search engine, requiring
a URL to find material. So although the end-of-term collection will
be included in the archives' general collection, it probably
will also be broken out as a separate special collection that could
be searched.


But even as a separate collection, the end-of-term harvest might
be approaching the upper limit of what is feasible to search with
current search engines. Searching a group of pages that have
changed over time is different from using Google to do a search of
the live Web, which covers only current pages.


'We've been fairly happy with tools that have scaled
to the hundreds of millions of documents,' Carpenter said.
But once you get to a billion documents, the quality of search
results drops off quickly.


Mining such massive collections is the next big step in Internet
preservation, Hartman said. She said the university has not yet
decided whether it will use tools to pull together subject-specific
content in a search or to break the collection up into smaller sets
by subject matter.


'We may have to have separate collections,' she
said. 'Experiment and research will tell for sure.'


That could be the next project for this team. 'We are
hoping to find some funding for research in that area,'
Hartman said. 'There's a strong interest among our
partners in that.'


