It Pays to be Persistent

'Document not found' messages, such as those generated by government Web sites, often frustrate users. Persistent identifiers can help.

'Document Not Found.' Is there a browser message more annoying?

With its Information Bridge program, the Energy Department's Office of Scientific and Technical Information is trying to do away with such messages. The agency is giving its research documents permanent addresses on the Web so they can always be found. Ascribing permanence in an online world is no easy feat, but it may go a long way toward minimizing 'Document Not Found' messages.

The first decade of the Web was a time of fluidity for government agencies. Early adopters posted agency material, only to have it shuffled around as new IT initiatives and enterprise architectures uprooted the order of documents. These days, when someone types in an older Web address for some agency page, chances are they'll see an error message. Equally problematic is the fact that, as copies of documents proliferate across the Web, updates go unnoticed. And these sorts of problems will only grow worse over time.

To address these concerns, Energy and other agencies are embracing the idea of 'persistent identifiers,' which assign documents permanent online addresses. No matter how many changes the agency makes to its IT architecture, documents with persistent identifiers will always be accessible through the same address.

Technically, persistence is easy. It mostly involves recognizing that someone within the agency needs to manage the task. Energy maintains a set of servers that hold a list of all the assigned permanent addresses, written in the Persistent Uniform Resource Locator (PURL) format. Those addresses are mapped to the current addresses where the documents reside.

'In the old days, everybody relied on a report number to find a document. Now people use PURLs,' said Sharon Jordan, assistant director at OSTI in the Office of Program Integration.

So far, it has been mostly the scientific and library communities that have embraced permanent document identifiers. Eventually, all agencies may have to grapple with them.

Forever Marked

Energy's Information Bridge program gives users the ability to search Energy research and development documents, including papers published in scientific journals as well as 'gray literature,' or working papers, informal presentations and other unpublished but still pertinent material.

The material runs the gamut of Energy research, covering physics, chemistry, materials, biology, environmental sciences, energy technologies, engineering, computer and information science, renewable energy and more. Operational since 1997, Information Bridge holds material stretching back to 1995.

'When we began dealing with digital documents, we realized that we needed a locator or identifier,' Jordan said. Because OSTI houses only some of the documents, while others are hosted by the originating facilities themselves, the office risked setting up a system that would point to documents that their custodians could later move.

To assign permanence to the documents, OSTI uses the PURL naming system developed by Online Computer Library Center Inc., a nonprofit research organization for the library community. A PURL looks like a Uniform Resource Locator, the format used for Web addresses. Unlike URLs, however, PURLs come with at least an implicit guarantee from the custodian of a document that the document will always be available at that address.

So far, the Information Bridge has attached PURLs to about 110,000 documents, said Jeff Given, OSTI IT services project manager for Information International Associates Inc. of Oak Ridge, Tenn. OSTI set up a dedicated server, called a resolution service, that keeps track of all PURL addresses, along with their current locations.

When a PURL-based request for a document comes in from the Internet, the software simply redirects that request to the server holding the document, said Stuart Weibel, a senior research scientist at the OCLC who created PURL. While someone can change what a PURL resolves to, they can't change the PURL itself.
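The resolution step Weibel describes can be illustrated with a small Python sketch. This is not OCLC's software; the class name, the paths and the example.* hosts are all hypothetical, and the 302 status code is an assumption about how such a redirect would typically be signaled. It shows only the core idea: the permanent name stays fixed while its target can be changed.

```python
from typing import Dict, Optional, Tuple


class PurlResolver:
    """Toy resolution table: permanent path -> current document URL.

    Hypothetical illustration only; not the OCLC PURL software.
    """

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, purl_path: str, target_url: str) -> None:
        # A PURL is registered once; the name itself never changes afterward.
        if purl_path in self._targets:
            raise ValueError(f"{purl_path} is already registered")
        self._targets[purl_path] = target_url

    def retarget(self, purl_path: str, new_target_url: str) -> None:
        # Only the target may change, and only for a PURL that already exists.
        if purl_path not in self._targets:
            raise KeyError(purl_path)
        self._targets[purl_path] = new_target_url

    def resolve(self, purl_path: str) -> Tuple[int, Optional[str]]:
        # Return an HTTP-style (status, Location) pair.
        target = self._targets.get(purl_path)
        if target is None:
            return 404, None
        return 302, target


resolver = PurlResolver()
resolver.register("/doc/12345", "http://lab-a.example/reports/12345.pdf")
print(resolver.resolve("/doc/12345"))

# The custodian moves the file; the PURL stays the same.
resolver.retarget("/doc/12345", "http://archive.example/2005/12345.pdf")
print(resolver.resolve("/doc/12345"))
```

A reader who bookmarked /doc/12345 before the move would still reach the document afterward, because only the table entry changed.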

Redirecting incoming requests for pages is nothing new. It's a standard feature of Web server software. But OCLC's open-source program makes the task of managing PURLs much easier, Weibel said. You can download the software at http://purl.oclc.org.

While the technology may be a snap, the key to running a permanent identifier system is administrative oversight. An agency needs to understand the importance of setting up a system that will keep track of where all its documents, old and new, are located. It also needs to organize its workflow so new documents are registered through the service.

OSTI's task was relatively easy, because Energy already had procedures in place to register scientific documents with the office. Each Energy office designates an individual who manages its documents. When researchers create a new artifact, they must alert their office, which in turn alerts OSTI to the new document.

'It used to be a paper process. In the 1990s, we just updated our procedures to do it all electronically,' Jordan said.

A handle on Defense docs

Like Energy, the Defense Department is using permanent identifiers for its scientific literature, but it took a slightly different course.

The Defense Technical Information Center, based in Fort Belvoir, Va., serves as a centralized repository for Defense scientific and technical information. Like OSTI, DTIC has long had in place procedures for researchers to keep DTIC notified of their documents. To provide a public conduit to these documents, DTIC set up the Public Scientific and Technical Information Network, a Web portal through which users can search the 200,000 technical reports under DTIC's care (http://stinet.dtic.mil).

But DTIC does not use PURLs. It uses a technique called the Handle System to keep track of the documents, said James Erwin, director of information science and technology for the Defense Technical Information Center. The Corporation for National Research Initiatives oversees the Handle System (www.handle.net), and offers software that can be integrated into browsers so they recognize handles.

Like PURLs, handles can be inserted into standard URLs. A search for the term silicon nitride at the StiNet site will return a document with the address http://handle.dtic.mil/100.2/ADA428642. The first part of the address is a standard URL; the second part, 100.2/ADA428642, is the permanent handle.
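As a rough illustration of that two-part address, the Python sketch below separates the resolver host from the handle and then splits the handle at its first slash. The function and field names are hypothetical, and reading the piece before the slash as a prefix (naming authority) and the piece after it as a locally assigned suffix is an assumption based on the general Handle System convention, not on DTIC's own software.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class HandleUrl(NamedTuple):
    resolver: str  # host of the Web proxy that resolves the handle
    prefix: str    # assumed: naming-authority part of the handle
    suffix: str    # assumed: locally assigned document name


def split_handle_url(url: str) -> HandleUrl:
    """Split a handle embedded in a URL into resolver, prefix and suffix."""
    parts = urlsplit(url)
    handle = parts.path.lstrip("/")
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"no handle found in {url!r}")
    return HandleUrl(parts.netloc, prefix, suffix)


parsed = split_handle_url("http://handle.dtic.mil/100.2/ADA428642")
print(parsed.resolver)
print(f"{parsed.prefix}/{parsed.suffix}")
```

The point of the split is that only the handle needs to stay fixed; the resolver host in front of it is an ordinary Web address.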

Though they serve their immediate communities, these persistent-identifier networks are starting to be adopted by other groups as well. For instance, Defense Department librarians have incorporated DTIC's Handle System addresses into their own reference systems, Erwin said. In addition, DTIC is working with the Defense Department's Advanced Distributed Learning Office, which is building a repository of e-learning modules. The modules will be tagged with handles so they will be searchable through StiNet.

Information Bridge is also expanding. Earlier this year, Energy joined an academic linking service called CrossRef, run by a nonprofit industry coalition called the Publishers International Linking Association (www.crossref.org). CrossRef assigns permanent digital object identifiers to scientific reports, allowing them to be located even if their Web addresses change. OSTI will place CrossRef identifiers on about 90,000 Energy articles. This will allow CrossRef users, mostly academics who may not know about Information Bridge, to find Energy Department data.

Erwin and other persistent identifier gurus, however, are thinking more broadly. They want to see a global identifier system for the world's documents.

Coming soon to your agency

Organizations like OSTI and DTIC are addressing a problem that all agencies will soon face. Section 207(d) of the 2002 E-Government Act calls for 'the adoption of standards, which are open to the maximum extent feasible, to enable the organization and categorization of Government information ... in a way that is searchable electronically, including by searchable identifiers; and in ways that are interoperable across agencies.'

Erwin is co-chair of the Categorization of Government Information Working Group, part of the Interagency Committee on Government Information. This group created a set of recommendations for the Office of Management and Budget on how federal agencies could meet the E-Gov mandate. Persistent identifiers are a key component.

The group recommends that the federal government stick close to standards proposed by the Internet Engineering Task Force, which shepherded the protocols that led to the Internet's ubiquity. IETF's proposed identification scheme is called Uniform Resource Names. While today's identifier schemes, such as PURL and the Handle System, piggyback on Web URLs to identify a document's whereabouts, the URN system would create an entirely separate information space on the Internet. The URN creators assume that Web addresses will always be in turmoil, so they propose creating an entirely different naming scheme solely dedicated to permanently placed documents. Under the scheme, the address for a permanent document would start with urn: instead of the common http://.

A governmentwide group formerly called the Commerce, Energy, NASA, Defense Information Managers Group (now known simply as CENDI) is working on a prototype of how such a system would work, which it plans to demonstrate this month. VeriSign Inc. of Mountain View, Calif., will contribute an open-source browser plug-in that will identify URNs as hyperlinks.

And just as the Internet bound together many individual networks, URN can bind together different identifier schemes, including the Handle System and PURL. This means that if wide-scale adoption of URN occurs, existing identifier systems will not have to change their addresses.

The URN identification will be a multipart address, said Michael Mealling, who wrote the IETF requests for comments that explain the workings of a URN resolution service, called the Dynamic Delegation Discovery System. Mealling is CEO of Refactored Networks LLC of Kennesaw, Ga.

The first part of the address after the urn: prefix will identify the type of naming system in place. The Internet Assigned Numbers Authority will manage these namespaces, as they are called.

'It is a system that allows opaque persistent identifiers to be used generically. You can have a URN contain an ISBN number just as easily as it contains a product code or a handle,' Mealling said. While the supply chain community could use URNs to keep track of RFID-tagged items, the library community could use the same URN system to track unique ISBN numbers for books.

Playing the name game

After the namespace is identified, the resolution process would hand off the request to the particular organization overseeing the requested document, much as the Domain Name System identifies only the top-level domain names, leaving organizations to manage their own Web spaces.

A global URN resolution system would 'not contain data about the document. It points you to the server that can then give you data about the document,' Mealling said.
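The hand-off Mealling describes can be sketched in a few lines of Python. This is a simplified illustration, not the Dynamic Delegation Discovery mechanism itself: the lookup table, the example.* hosts and the x-handle namespace are hypothetical, and the sketch assumes the colon-separated form urn:namespace:name used in the IETF's URN syntax. The top-level step reads only the namespace and points to a server; that server is the one that knows about the document.

```python
from typing import Dict, Tuple


def parse_urn(urn: str) -> Tuple[str, str]:
    """Split 'urn:<namespace>:<name>' into (namespace, name)."""
    scheme, sep, rest = urn.partition(":")
    if scheme.lower() != "urn" or not sep:
        raise ValueError(f"not a URN: {urn!r}")
    namespace, sep, name = rest.partition(":")
    if not sep or not namespace or not name:
        raise ValueError(f"URN needs a namespace and a name: {urn!r}")
    # Namespace identifiers are compared case-insensitively here.
    return namespace.lower(), name


# Hypothetical table: namespace -> server run by the organization
# responsible for that naming system. It holds no document data.
NAMESPACE_RESOLVERS: Dict[str, str] = {
    "isbn": "http://books.example/lookup/",
    "x-handle": "http://handles.example/",
}


def delegate(urn: str) -> str:
    """Return the address of the server that can answer for this URN."""
    namespace, name = parse_urn(urn)
    try:
        base = NAMESPACE_RESOLVERS[namespace]
    except KeyError:
        raise LookupError(f"no resolver for namespace {namespace!r}") from None
    return base + name


print(delegate("urn:isbn:0451450523"))
print(delegate("urn:x-handle:100.2/ADA428642"))
```

Because the table is keyed only by namespace, adding a new identifier scheme means adding one entry rather than changing any existing identifiers.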

If the system works and becomes widely adopted by agencies, it could represent a major step not only in e-government, but also in information sharing in general. In the meantime, experts say, IT shops should brush up on the subject of persistent identifiers. CENDI put out a white paper last year titled Persistent Identification: A Key Component of an E-Government Infrastructure. You can read it by visiting www.gcn.com and typing 474 in the GCN.com/box.
