The Web's next act: A worldwide database
Almost 20 years ago, Sir Tim Berners-Lee, then a contractor for the European Organization for Nuclear Research, invented a document hypertexting format that became the basis for the World Wide Web. He now hopes to advance this technology another step by building a web of data. And he wants government to lead the charge.
"Now I want you to put your data on the Web," Berners-Lee said at a talk hosted by the Technology, Entertainment, Design organization earlier this year, where he introduced his concept of Linked Data. He identified the U.S. government as a candidate for early use of this format.
President Barack Obama "said American government data would be available on the Internet in an accessible format, and I hope they will put it up as Linked Data," Berners-Lee said.
For many, Linked Data is still a difficult concept to understand. After all, isn't data already on the Web, in terms of text on Web pages? Berners-Lee told the TED crowd, "You can read [documents] and follow links from them, and that's [about] it.… There is still huge unlocked potential."
Efforts such as the Office of Management and Budget's Data.gov have opened the doors to wider use of government data, though it is still not enough, Berners-Lee said at the conference. Posting the application programming interfaces and comma-separated values (CSV) files extracted from databases requires work by other programmers before the information becomes fully useful to citizens. The data should be encoded in HTML itself, he recommended. In doing so, the entire World Wide Web can host its own database.
The idea seems to be taking off, at least in pockets across the Web. A Web site that documents where Linked Data can be found online, called Linked Data, at last count found more than 4.2 billion assertions encoded in the Resource Description Framework across a variety of different data-annotating projects. The British Broadcasting Corporation has tested RDF to augment searches of its huge program guide. Best Buy and eBay have encoded their commercial listings in RDF.
Some government data has already made it to the Linked Data cloud — outside parties have rendered data feeds from Data.gov and the federal enterprise architecture into RDF. But will the government embrace this new format?
Even more open?
At the recent International Semantic Web Conference in Chantilly, Va., many of the discussions were devoted to better understanding Berners-Lee's concept of Linked Data. The phrase Semantic Web was first coined to describe Berners-Lee's vision of a data Web, and much of the conference was dedicated to refining advanced concepts of the Semantic Web, such as ontologies. But others focused on the simpler goal of getting Linked Data onto the Web.
ISWC attendees said machine-readable data can be a hard sell to people who are unfamiliar with the concept. Linked Data, like the World Wide Web when it was first introduced, "solves a problem we didn't know we had," said Ronald Reck, head of consulting firm Rrecktek.
In other words, many of the benefits offered by the then-nascent Web, such as the ability to share documents, were already offered through other technologies, such as the File Transfer Protocol. Likewise, it is difficult to understand the concept of a single format for Web-based data when plenty of formats such as relational databases and spreadsheets already annotate data in ways that make it reusable by other systems.
How is Linked Data different from other data on the Web? In short, it is annotated with RDF.
In a talk at ISWC, Berners-Lee made the point that simply making data available through application programming interfaces or CSV files would not make the data fully available to others. "When you look at putting government data on the Web, one of the concerns is…to not just put it out there on Excel files on Data.gov," he said. "You should put these things in RDF."
Indeed, some have called on the government to make its data available for processing, not just available for public access. Recently, the Sunlight Foundation, a nonprofit organization dedicated to increasing government transparency, criticized the House and Internal Revenue Service for releasing their public documents as PDFs. Although PDF is an open format — meaning Adobe publishes the specifications for rendering PDF documents — such documents cannot be easily parsed by a computer program written to harvest data.
"Government releasing data in PDF tends to be catastrophic for open-government advocates, journalists and our readers because of the amount of overhead it takes to get data out of it," Sunlight Foundation Director Clay Johnson wrote in a blog post. "Most earmark requests by members of [Congress] are published as PDF files of scanned letters, leading the Sunlight Foundation and others to write custom parsers for each letter."
What about APIs? An agency could set up an API that would allow organizations such as the Sunlight Foundation to write a program that could pull data from the agency documents by using commands that the API provides to access data.
However, APIs pose their own problems. During the question-and-answer period after Berners-Lee's talk at ISWC, an audience member asked why exposing the APIs isn't sufficient for exposing data, a technique used by Data.gov. Berners-Lee said that to use an API, a systems administrator or developer must write a program for the data to be accessible. With RDF, a Web browser should be able to reuse the data, requiring no additional work on the part of users.
Berners-Lee said that if the Web manager uses common uniform resource identifiers to identify people, cities or countries in the data, the browser could automatically pull information from other Web sites about those entities. "So there is very much more value to data for me, if I'm just browsing," he said.
At ISWC, Dean Allemang, chief scientist at Semantic Web consulting firm TopQuadrant, offered an example of how a machine-readable Web would help everyone involved. His example was work-related: booking hotels.
Say you want to attend a conference at an out-of-town location. The conference site probably has a Web site, so you copy its physical address from the site and go to an online hotel broker site, such as Hotels.com, to find a nearby hotel. By entering that address into the search criteria, you do a search for hotels within a certain geographical radius. Or you just get a list of hotels and go to a third Web site, a mapping site such as MapQuest, and enter hotel addresses and the conference center address to see which hotels are close to the conference center.
In Allemang's view, this is crazy. Why copy some information from one page and paste it to another using the same computer? Why can't the computer itself do the work?
The trick would be to get all the sites to agree on how to represent an address, Allemang said. Then the addresses can be passed from one site to the next through your browser automatically, without you having to do anything. The mapping site could check your cache and list any addresses found there, offering you the option of mapping them.
Automating such a task — and countless others that users do on computers — is the point of creating a machine-readable Web. If computer programs can read the Web pages and carry out tasks, users won't have to.
Relational databases make the prospect feasible. With databases, you can structure data so each data element is slotted into a predictable location. You can query a database of personnel data to find the birth date of a particular person because the row of data with that person's information has a column dedicated to birth dates.
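The lookup described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, the column names and the sample record are hypothetical and do not come from any real personnel system.

```python
import sqlite3

# In-memory database; the "personnel" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (name TEXT PRIMARY KEY, birth_date TEXT)")
conn.execute("INSERT INTO personnel VALUES (?, ?)", ("Jane Doe", "1970-01-31"))

def birth_date_of(name):
    """Return the birth date stored for `name`, or None if there is no such row."""
    row = conn.execute(
        "SELECT birth_date FROM personnel WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

print(birth_date_of("Jane Doe"))
```

The query works only because the program already knows which table and which column hold the answer, which is the kind of advance agreement that is hard to reach across independent Web sites.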
However, that approach wouldn't work so well for data beyond a single database. "The problem is that everyone assumes you will need to build a huge data warehouse where everything can be compared," Allemang said. "This will never happen."
In addition, on the Web, data is not structured in such a way that it can be retrieved with any consistency, and many people who design and maintain Web sites would not agree on the same format for structuring data.
The key to making this new format work is RDF.
A new proposal
Overseen by the World Wide Web Consortium, the organization that maintains the Web's standards, RDF is a way of making data available by encoding it so that external IT systems can understand it.
RDF is based on making associations. It describes data by breaking each data element into three nodes: a subject, a predicate and an object. For example, consider the fact that Yellowstone National Park offers camping. "Yellowstone" would be the subject, "offers" would be the predicate and "camping" would be the object. All three elements get uniform resource identifiers, or globally recognized Internet addresses.
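As a rough sketch of that structure, the Yellowstone fact can be written as a three-part record and printed in N-Triples, one of the plain-text serializations of RDF. The http://example.org/parks# namespace and the three identifiers built from it are placeholders invented for this example, not identifiers that any agency actually publishes.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# All three URIs are hypothetical placeholders, not real published identifiers.
EX = "http://example.org/parks#"
fact = Triple(EX + "Yellowstone", EX + "offers", EX + "Camping")

def to_ntriples(t: Triple) -> str:
    """Serialize one triple of URIs as a single N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."

print(to_ntriples(fact))
```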
A query against a triple store, which is what an RDF database is called, can link together disparate facts. If another triple, perhaps located in another triple store, contains the fact that the Mammoth Hot Springs are located in Yellowstone, a single search across multiple triple stores can return both facts.
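A toy version of such a cross-store search might look like the following, with each triple store modeled as nothing more than a Python set of tuples. Real triple stores are far more capable; the store names, the namespace and the locatedIn property are assumptions made for illustration.

```python
EX = "http://example.org/parks#"  # hypothetical namespace

# Two independent "triple stores", modeled here as plain sets of tuples.
store_a = {(EX + "Yellowstone", EX + "offers", EX + "Camping")}
store_b = {(EX + "MammothHotSprings", EX + "locatedIn", EX + "Yellowstone")}

def facts_about(resource, *stores):
    """Return every triple, from any store, that has `resource` as subject or object."""
    return sorted(
        t for store in stores for t in store
        if t[0] == resource or t[2] == resource
    )

for s, p, o in facts_about(EX + "Yellowstone", store_a, store_b):
    print(s, p, o)
```

Because both stores use the same identifier for Yellowstone, the two facts line up without anyone having merged the underlying databases.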
Additional standards can further refine the precision of the data definition. For instance, two parties can agree that the term "Yellowstone" refers to "Yellowstone National Park" by using a shared, controlled vocabulary, which can be referenced through an RDF schema called RDFS. RDFS also allows for inferencing — in RDFS, you can state that Yellowstone is a type of national park. So a search for national parks that offer camping would return "Yellowstone."
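One way to picture the inferencing is the sketch below, which hand-rolls a single RDFS rule: if a resource has a type, and that type is declared a subclass of a broader class, the resource also belongs to the broader class. The rdf:type and rdfs:subClassOf identifiers are the standard W3C ones; the parks vocabulary and the ProtectedArea class are assumptions added for the example, and a real RDFS reasoner would apply many more rules than this.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/parks#"  # hypothetical vocabulary

triples = {
    (EX + "Yellowstone", RDF_TYPE, EX + "NationalPark"),
    (EX + "NationalPark", RDFS_SUBCLASS, EX + "ProtectedArea"),  # assumed for illustration
    (EX + "Yellowstone", EX + "offers", EX + "Camping"),
}

def infer_types(graph):
    """Apply the RDFS subclass rule until no new rdf:type triples appear."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for c, q, parent in list(inferred):
                if q == RDFS_SUBCLASS and c == o:
                    new = (s, RDF_TYPE, parent)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

def instances_offering(graph, cls, service):
    """Find subjects typed as `cls` that also offer `service`."""
    return sorted(
        s for s, p, o in graph
        if p == RDF_TYPE and o == cls and (s, EX + "offers", service) in graph
    )

closed = infer_types(triples)
print(instances_offering(closed, EX + "ProtectedArea", EX + "Camping"))
```

Without the inference step, a search for protected areas that offer camping finds nothing, because no one ever stated directly that Yellowstone is a protected area.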
Of course, the Interior Department could build a list of all the national parks and include the services that each park offers. But with the Semantic Web approach, such a single database would never be needed. The services for each park could maintain their own data, and the results could be compiled only when someone posts some piece of specific data, Allemang said. In essence, with RDF, a user can build a set of data from various sources on the Web that might not have been brought together before.
How do you use these triples? One way is through the query language for RDF, called SPARQL, an acronym for the humorously recursive SPARQL Protocol and RDF Query Language. With Structured Query Language (SQL), you can query multiple database tables through a join operation. With a SPARQL query, you specify all the triples you would need, and the query engine will deliver the answers that fit all of your criteria.
For example, let's say you are looking for a four-star hotel in New York. You write a query that looks for triples specifying that something is a hotel, that it has four stars and that it is in New York. The query engine would find all the triples for hotels in New York, in addition to all the triples for four-star hotels, and filter the set down to four-star hotels in New York.
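The hotel search can be approximated in a few lines of Python. The sketch below is not SPARQL and not a real query engine; it is a minimal pattern matcher that treats any term starting with "?" as a variable and keeps only the bindings that satisfy every pattern, which is the core idea behind a SPARQL WHERE clause. The hotel names, properties and namespace are all invented.

```python
EX = "http://example.org/travel#"  # hypothetical vocabulary

triples = {
    (EX + "GrandHotel", EX + "type", EX + "Hotel"),
    (EX + "GrandHotel", EX + "starRating", "4"),
    (EX + "GrandHotel", EX + "locatedIn", EX + "NewYork"),
    (EX + "BudgetInn", EX + "type", EX + "Hotel"),
    (EX + "BudgetInn", EX + "starRating", "2"),
    (EX + "BudgetInn", EX + "locatedIn", EX + "NewYork"),
    (EX + "RitzyPlace", EX + "type", EX + "Hotel"),
    (EX + "RitzyPlace", EX + "starRating", "4"),
    (EX + "RitzyPlace", EX + "locatedIn", EX + "Boston"),
}

def match(patterns, graph, binding=None):
    """Yield variable bindings that satisfy every pattern in `patterns`."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the same value everywhere it appears.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, graph, candidate)

# Roughly what a SPARQL query would express as three triple patterns
# sharing the variable ?hotel.
query = [
    ("?hotel", EX + "type", EX + "Hotel"),
    ("?hotel", EX + "starRating", "4"),
    ("?hotel", EX + "locatedIn", EX + "NewYork"),
]
results = sorted(b["?hotel"] for b in match(query, triples))
print(results)
```

Only the one hotel that is both four-star and in New York survives all three patterns; the two-star New York hotel and the four-star Boston hotel each fail one of them.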
Even more sophisticated interpretations of RDF triples can be done through another W3C Web standard called the Web Ontology Language (OWL).
The logical chain of reason within an RDF triple is relatively static and can vary according to who does the encoding. One triple might say that Yellowstone "offers" camping as a service, but another triple might state that camping "is offered" at Acadia National Park. Although it might seem obvious to us that both Acadia and Yellowstone offer camping, it wouldn't be to the computer. An RDF query engine, perhaps one embedded in a Web application, could consult OWL and return both entries.
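A simplified picture of how an ontology bridges the two phrasings: OWL lets a publisher declare that one property is the inverse of another, and software can then derive the missing direction. The sketch below hand-codes that one rule. The owl:inverseOf identifier is the standard one; the parks vocabulary and the SecondPark resource, which stands in for the second park in the example, are assumptions, and a real OWL reasoner does considerably more.

```python
OWL_INVERSE_OF = "http://www.w3.org/2002/07/owl#inverseOf"
EX = "http://example.org/parks#"  # hypothetical vocabulary

triples = {
    (EX + "Yellowstone", EX + "offers", EX + "Camping"),
    (EX + "Camping", EX + "isOfferedAt", EX + "SecondPark"),
    (EX + "offers", OWL_INVERSE_OF, EX + "isOfferedAt"),
}

def apply_inverses(graph):
    """Add the mirrored triple for every pair of properties declared owl:inverseOf."""
    inverse = {}
    for s, p, o in graph:
        if p == OWL_INVERSE_OF:
            inverse[s] = o
            inverse[o] = s
    expanded = set(graph)
    for s, p, o in graph:
        if p in inverse:
            expanded.add((o, inverse[p], s))
    return expanded

expanded = apply_inverses(triples)
parks = sorted(s for s, p, o in expanded if p == EX + "offers" and o == EX + "Camping")
print(parks)
```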
RDF at work
For anyone familiar with HTML, RDF could be thought of as an extension to the metatag, which developers use to describe the contents of Web pages for search engines.
A Web site can host an RDF document that contains a list of terms, called a namespace, that can be used to tag different bits of data across its Web pages. All the pertinent data on the site's Web pages can then be tagged with terms in this namespace. For example, <album:name>Abbey Road</album:name> indicates that the name of a music album in the text is called "Abbey Road." As long as the Web page of RDF names is formatted using a standard W3C RDF namespace, a link between the organization's own namespace and the rest of the Web is established.
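To see why the namespace matters to software, consider how a parser reads the album tag. In the sketch below, which uses Python's standard XML library, the short prefix "album" is expanded into the full namespace address, so two sites that both use a tag called "name" cannot be confused with each other. The namespace address and the wrapping item element are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# The namespace URI below is a hypothetical placeholder for a site's own vocabulary.
ALBUM_NS = "http://example.org/album#"
snippet = f'<item xmlns:album="{ALBUM_NS}"><album:name>Abbey Road</album:name></item>'

root = ET.fromstring(snippet)
# ElementTree addresses namespaced tags as "{namespace}localname".
name_element = root.find(f"{{{ALBUM_NS}}}name")

print(name_element.tag)
print(name_element.text)
```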
As Berners-Lee states in a document that describes how Linked Data works, RDF identifiers also could use hash marks (#), which would give any data elements on a Web page their own Web addresses. For instance, a Web page with some RDF-tagged information about someone named Albert could identify him like this: "http://example.org/smith#albert".
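The mechanics are simple enough to show with Python's standard URL library: everything before the hash mark names the document to fetch, and the part after it names one thing described inside that document. The same_document helper is a hypothetical convenience written for this sketch, and the second identifier, for someone named mary, is invented.

```python
from urllib.parse import urldefrag

uri = "http://example.org/smith#albert"

# Split the identifier into the page to retrieve and the item within it.
document, fragment = urldefrag(uri)
print(document)
print(fragment)

def same_document(uri_a, uri_b):
    """Return True if two identifiers point into the same Web document."""
    return urldefrag(uri_a).url == urldefrag(uri_b).url

print(same_document(uri, "http://example.org/smith#mary"))
```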
"This is a valuable thing to do, as anyone on the planet can now use that global identifier to refer to Albert and give more information," Berners-Lee writes, adding that additional information could be mapped to Albert through RDF, such as who his children are.
Although the idea of a machine-readable Web sounds great, it still requires data holders to render material in RDF, a tall order for already-overworked Web managers. Fortunately, the W3C has been working to develop standards that would make embedding RDF into Web pages easier.
Recently, the consortium published the first draft of HTML+RDFa, which the standards body describes as "a mechanism for embedding RDF in HTML." The advantage of HTML+RDFa is that it allows Web managers to directly embed RDF into an HTML document rather than create a separate file, said Tim Finin, a computer science professor at the University of Maryland at Baltimore County who has done a lot of work on artificial intelligence and the Semantic Web.
"You could have a document with text for people to look at and also data that a machine could extract that would say the same thing," Finin said. "Whereas before, you would have to publish an HTML document and an RDF document and somehow link them. And if you changed one, you would have to change the other. It's better to have just one."
Presumably, HTML+RDFa could speed this development because it eases the burden of creating separate RDF annotations for data on Web pages.
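A rough sense of what "one document for people and machines" means can be had from the sketch below, which reads a fragment of HTML and pulls out the values of elements marked with a property attribute, the attribute RDFa uses to label data. This is not a conforming RDFa processor: it ignores prefixes, nesting and most of the specification, and the page fragment and the ex: property names are invented.

```python
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements carrying a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop

    def handle_data(self, data):
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        self._current = None

# Hypothetical page fragment; a real page would also declare the "ex" prefix.
page = (
    '<p about="http://example.org/parks#Yellowstone">'
    '<span property="ex:name">Yellowstone National Park</span> offers '
    '<span property="ex:offers">camping</span>.</p>'
)
parser = PropertyExtractor()
parser.feed(page)
print(parser.pairs)
```

A person viewing the page sees an ordinary sentence, while the program recovers the labeled values from the same markup, so there is no second file to keep in sync.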
HTML+RDFa "is a promising development," said Michael Daconta, chief technology officer at Accelerated Information Management, former metadata program manager at the Homeland Security Department and a GCN columnist.
Daconta cautioned that the field of semantic markup still faces a chicken-or-egg problem in which Web managers need tools to embed RDF on their pages and organizations need tools to parse RDF information for their own services. But such tools for either party probably won't be created until RDF starts to become more widely used.
Few Web site managers are trained in RDF, and not many Web development applications use the standard, Berners-Lee said at ISWC. "I'm not sure we have a grasp of our needs for the next phase of products," he said.
Part of the issue is the inherent complexity of the Semantic Web concept. Even simple sets of data linked by RDF, which was one simple component of Berners-Lee's grand vision, "is still remarkably difficult as a paradigm shift," he said.
He said the use of RDF should not require building new systems or changing the way site administrators work, reminiscing about how many original Web sites linked to mainframe systems. Instead, scripts can be written in Python, Perl or other languages that can convert data in spreadsheets or relational databases into RDF for users. "You will want to leave the social processes in place, leave the technical systems in place," he said.
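A conversion script of the kind he describes need not be long. The sketch below, offered only as an illustration, reads rows from a CSV export and writes each one as an N-Triples line, using the column name as the predicate. The namespace, the column names and the sample rows are assumptions, and a production script would also need to escape values that contain spaces or other characters not allowed in identifiers.

```python
import csv
import io

BASE = "http://example.org/parks#"  # hypothetical namespace

# Stand-in for a spreadsheet exported as CSV; the columns are invented.
sheet = io.StringIO("park,offers\nYellowstone,Camping\nYellowstone,Fishing\n")

def csv_to_ntriples(handle):
    """Turn each CSV row into N-Triples lines, using the column name as the predicate."""
    lines = []
    for row in csv.DictReader(handle):
        subject = BASE + row["park"]
        for column, value in row.items():
            if column == "park":
                continue  # the park column supplies the subject, not a fact
            lines.append(f"<{subject}> <{BASE}{column}> <{BASE}{value}> .")
    return lines

ntriples = csv_to_ntriples(sheet)
for line in ntriples:
    print(line)
```

The spreadsheet itself is left untouched, which matches the advice to keep existing systems and processes in place.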
Though nascent, the trend of linking data on Web sites could grow exponentially as more organizations — including government agencies — get involved, just as the original Web did in the last decade. "The more things you have to connect together, the more powerful it is," Berners-Lee said.
Joab Jackson is the senior technology editor for Government Computer News.