The Web's next act: A worldwide database

 

Sir Tim Berners-Lee, founder of the Web, wants to create a worldwide Web database, and he wants government agencies to lead the way.

Almost 20 years ago, Sir Tim Berners-Lee, then a contractor for the European Organization for Nuclear Research, invented a document hypertexting format that became the basis for the World Wide Web. He now hopes to advance this technology another step by building a web of data. And he wants government to lead the charge.

"Now I want you to put your data on the Web," Berners-Lee said at a talk hosted by the Technology, Entertainment, Design organization earlier this year, where he introduced his concept of Linked Data. He identified the U.S. government as a candidate for early use of this format.

President Barack Obama "said American government data would be available on the Internet in an accessible format, and I hope they will put it up as Linked Data," Berners-Lee said.

For many, Linked Data is still a difficult concept to understand. After all, isn't data already on the Web, in terms of text on Web pages? Berners-Lee told the TED crowd, "You can read [documents] and follow links from them, and that's [about] it.… There is still huge unlocked potential."

Efforts such as the Office of Management and Budget's Data.gov have opened the doors to wider use of government data, though it is still not enough, Berners-Lee said at the conference. Posting the application programming interfaces and comma-separated values (CSV) files extracted from databases requires work by other programmers before the information becomes fully useful to citizens. The data should be encoded in HTML itself, he recommended. In doing so, the entire World Wide Web can host its own database.

The idea seems to be taking off, at least in pockets across the Web. A Web site that documents where Linked Data can be found online, called Linked Data, at last count found more than 4.2 billion assertions encoded in the Resource Description Framework across a variety of different data-annotating projects. The British Broadcasting Corporation has tested RDF to augment searches of its huge program guide. Best Buy and eBay have encoded their commercial listings in RDF.

Some government data has already made it to the Linked Data cloud — outside parties have rendered data feeds from Data.gov and the federal enterprise architecture into RDF. But will the government embrace this new format?

Even more open?

At the recent International Semantic Web Conference in Chantilly, Va., many of the discussions were devoted to better understanding Berners-Lee's concept of Linked Data. The phrase Semantic Web was first coined to describe Berners-Lee's vision of a data Web, and much of the conference was dedicated to refining advanced concepts of the Semantic Web, such as ontologies. But others focused on the simpler goal of getting Linked Data onto the Web.

ISWC attendees said the idea of machine-readable data can be a hard sell to people who are unfamiliar with the idea. The idea of Linked Data, like the idea of a World Wide Web when it was first introduced, "solves a problem we didn't know we had," said Ronald Reck, head of consulting firm Rrecktek.

In other words, many of the benefits offered by the then-nascent Web, such as the ability to share documents, were already offered through other technologies, such as the File Transfer Protocol. Likewise, it is difficult to understand the concept of a single format for Web-based data when plenty of formats such as relational databases and spreadsheets already annotate data in ways that make it reusable by other systems.

How is Linked Data different from other data on the Web? In short, it is annotated with RDF.

At a talk at ISWC, Berners-Lee made the point that simply making data available through application programming interfaces or CSV files would not make the data fully available to others. "When you look at putting government data on the Web, one of the concerns is…to not just put it out there on Excel files on Data.gov," he said. "You should put these things in RDF."

Indeed, some have called on the government to make its data available for processing, not just available for public access. Recently, the Sunlight Foundation, a nonprofit organization dedicated to increasing government transparency, criticized the House and Internal Revenue Service for releasing their public documents as PDFs. Although PDF is an open format — meaning Adobe publishes the specifications for rendering PDF documents — such documents cannot be easily parsed by a computer program written to harvest data.

"Government releasing data in PDF tends to be catastrophic for open-government advocates, journalists and our readers because of the amount of overhead it takes to get data out of it," Sunlight Foundation Director Clay Johnson wrote in a blog post. "Most earmark requests by members of [Congress] are published as PDF files of scanned letters, leading the Sunlight Foundation and others to write custom parsers for each letter."

What about APIs? An agency could set up an API that would allow organizations such as the Sunlight Foundation to write a program that could pull data from the agency documents by using commands that the API provides to access data.

However, APIs pose their own problems. During the question-and-answer period after Berners-Lee's talk at ISWC, an audience member asked why exposing the APIs isn't sufficient for exposing data, a technique used by Data.gov. Berners-Lee said that to use an API, a systems administrator or developer must write a program for the data to be accessible. With RDF, a Web browser should be able to reuse the data, requiring no additional work on the part of users.

Berners-Lee said that if the Web manager uses common uniform resource identifiers to identify people, cities or countries in the data, the browser could automatically pull information from other Web sites about those entities. "So there is very much more value to data for me, if I'm just browsing," he said.

At ISWC, Dean Allemang, chief scientist at Semantic Web consulting firm TopQuadrant, offered an example of how a machine-readable Web would help everyone involved. His example was work-related: booking hotels.

Say you want to attend a conference at an out-of-town location. The conference site probably has a Web site, so you copy its physical address from the site and go to an online hotel broker site, such as Hotels.com, to find a nearby hotel. By entering that address into the search criteria, you do a search for hotels within a certain geographical radius. Or you just get a list of hotels and go to a third Web site, a mapping site such as MapQuest, and enter hotel addresses and the conference center address to see which hotels are close to the conference center.

In Allemang's view, this is crazy. Why copy some information from one page and paste it to another using the same computer? Why can't the computer itself do the work?

The trick would be to get all the sites to agree on how to represent an address, Allemang said. Then the addresses can be passed from one site to the next through your browser automatically, without you having to do anything. The mapping site could check your cache and list any addresses found there, offering you the option of mapping them.

Automating such a task — and countless others that users do on computers — is the point of creating a machine-readable Web. If computer programs can read the Web pages and carry out tasks, users won't have to.

Relational databases make the prospect feasible. With databases, you can structure data so each data element is slotted into a predictable location. You can query a database of personnel data to find the birth date of a particular person because the row of data with that person's information has a column dedicated to birth dates.
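The birth-date lookup described above can be sketched in a few lines of Python using the built-in sqlite3 module. This is only an illustration: the table name, column names and the two people in it are invented for the example and do not come from any agency system.

```python
import sqlite3

# Hypothetical personnel table; the names and dates are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (name TEXT PRIMARY KEY, birth_date TEXT)")
conn.executemany(
    "INSERT INTO personnel (name, birth_date) VALUES (?, ?)",
    [("Alice Example", "1970-03-14"), ("Bob Example", "1982-11-02")],
)

def birth_date_of(name):
    """Look up one person's birth date by reading the dedicated column."""
    row = conn.execute(
        "SELECT birth_date FROM personnel WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

print(birth_date_of("Alice Example"))
```

The query works only because the program already knows which table and which column to ask for, which is the kind of shared structure that is hard to assume across independent Web sites.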

However, that approach wouldn't work so well for data beyond a single database. "The problem is that everyone assumes you will need to build a huge data warehouse where everything can be compared," Allemang said. "This will never happen."

In addition, on the Web, data is not structured in such a way that it can be retrieved with any consistency, and many people who design and maintain Web sites would not agree on the same format for structuring data.

The key to making this new format work is RDF.

A new proposal

Overseen by the World Wide Web Consortium, the organization that maintains the Web’s standards, RDF is a way of making data available by encoding it so that external IT systems can understand it.

RDF is based on making associations. It describes data by breaking each statement into three parts: a subject, a predicate and an object. For example, consider the fact that Yellowstone National Park offers camping. "Yellowstone" would be the subject, "offers" would be the predicate and "camping" would be the object. All three elements get uniform resource identifiers, or globally recognized Internet addresses.
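As a rough sketch of the Yellowstone statement, the Python below models a triple as three URIs and prints it in N-Triples, one of the plain-text notations for RDF. The http://example.org/parks# namespace and the names under it are assumptions made up for this example, not identifiers published by any agency.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

# Hypothetical namespace; a real publisher would use addresses it controls.
EX = "http://example.org/parks#"

yellowstone_camping = Triple(EX + "Yellowstone", EX + "offers", EX + "Camping")

def to_ntriples(t):
    """Serialize one all-URI triple as an N-Triples line: three bracketed URIs and a period."""
    return f"<{t.subject}> <{t.predicate}> <{t.object}> ."

print(to_ntriples(yellowstone_camping))
```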

A query against a triple store, which is what an RDF database is called, can link together disparate facts. If another triple, perhaps located in another triple store, contains the fact that the Mammoth Hot Springs are located in Yellowstone, a single search across multiple triple stores can return both facts.
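To make the cross-store idea concrete, here is a minimal sketch in which two Python sets stand in for two separately maintained triple stores. A real deployment would query remote stores over the network; the namespace and the locatedIn and offers names are hypothetical.

```python
EX = "http://example.org/parks#"  # hypothetical namespace

# Two independently maintained stores, each a set of (subject, predicate, object).
services_store = {(EX + "Yellowstone", EX + "offers", EX + "Camping")}
geography_store = {(EX + "MammothHotSprings", EX + "locatedIn", EX + "Yellowstone")}

def facts_about(resource, *stores):
    """Return every triple, from any store, that names the resource as subject or object."""
    return sorted(
        t for store in stores for t in store if resource in (t[0], t[2])
    )

for fact in facts_about(EX + "Yellowstone", services_store, geography_store):
    print(fact)
```

Because both stores use the same URI for Yellowstone, the two facts line up without either publisher having to know about the other.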

Additional standards can further refine the precision of the data definition. For instance, two parties can agree that the term "Yellowstone" refers to "Yellowstone National Park" by using a shared, controlled vocabulary, which can be referenced through an RDF schema called RDFS. RDFS also allows for inferencing — in RDFS, you can state that Yellowstone is a type of national park. So a search for national parks that offer camping would return "Yellowstone."
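The search for national parks that offer camping might look something like the sketch below. The rdf:type and rdfs:subClassOf addresses are the standard W3C ones; everything under example.org, including the ProtectedArea class and the CityPlaza entry, is invented here to show one simple inference rule at work.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/parks#"  # hypothetical vocabulary

triples = {
    (EX + "Yellowstone", RDF_TYPE, EX + "NationalPark"),
    (EX + "NationalPark", RDFS_SUBCLASS, EX + "ProtectedArea"),  # assumed for illustration
    (EX + "Yellowstone", EX + "offers", EX + "Camping"),
    (EX + "CityPlaza", EX + "offers", EX + "Camping"),  # offers camping but is not a park
}

def infer_types(graph):
    """Apply the RDFS subclass rule repeatedly until no new type statements appear."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for (x, p1, c) in list(inferred):
            if p1 != RDF_TYPE:
                continue
            for (c2, p2, d) in list(inferred):
                if p2 == RDFS_SUBCLASS and c2 == c:
                    new = (x, RDF_TYPE, d)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

def instances_offering(graph, cls, service):
    """Find resources typed as cls that also offer the given service."""
    typed = {s for (s, p, o) in graph if p == RDF_TYPE and o == cls}
    return sorted(
        s for (s, p, o) in graph
        if p == EX + "offers" and o == service and s in typed
    )

graph = infer_types(triples)
print(instances_offering(graph, EX + "NationalPark", EX + "Camping"))
print(instances_offering(graph, EX + "ProtectedArea", EX + "Camping"))
```

The second search finds Yellowstone even though no one ever stated directly that it is a protected area; that statement is derived from the subclass relationship.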

Of course, the Interior Department could build a list of all the national parks and include the services that each park offers. But with the Semantic Web approach, such a single database would never be needed. The services for each park could maintain their own data, and the results could be compiled only when someone posts some piece of specific data, Allemang said. In essence, with RDF, a user can build a set of data from various sources on the Web that might not have been brought together before.

How do you use these triples? One way is through the query language for RDF, called SPARQL, an acronym for the humorously recursive SPARQL Protocol and RDF Query Language. With Structured Query Language (SQL), you can query multiple database tables through a JOIN operation. With a SPARQL query, you specify all the triples you would need, and the query engine will deliver the answers that fit all of your criteria.

For example, let's say you are looking for a four-star hotel in New York. You have a query to look for triples specifying four-star hotels, hotels and New York. The query search engine would find all the triples for hotels in New York, in addition to all the triples for four-star hotels, and filter the set down to four-star hotels in New York.
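The hotel search could be approximated as follows. The SPARQL text is shown only as a string for comparison and is not executed; the small matcher beneath it is a naive stand-in for a query engine, and the hotels, property names and ex: prefix are all hypothetical.

```python
# What the query might look like in SPARQL (illustrative text only, not run here).
SPARQL_TEXT = """
PREFIX ex: <http://example.org/travel#>
SELECT ?hotel WHERE {
  ?hotel a ex:Hotel ;
         ex:locatedIn ex:NewYork ;
         ex:starRating 4 .
}
"""

# Hypothetical data; short names stand in for full URIs to keep the sketch readable.
triples = [
    ("hotelA", "type", "Hotel"), ("hotelA", "locatedIn", "NewYork"), ("hotelA", "starRating", 4),
    ("hotelB", "type", "Hotel"), ("hotelB", "locatedIn", "NewYork"), ("hotelB", "starRating", 3),
    ("hotelC", "type", "Hotel"), ("hotelC", "locatedIn", "Boston"), ("hotelC", "starRating", 4),
]

def is_var(term):
    """Variables are strings that start with a question mark, as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")

def match(patterns, bindings=None):
    """Yield variable bindings that satisfy every triple pattern (a naive join)."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        trial = dict(bindings)
        ok = True
        for want, have in zip(first, triple):
            if is_var(want):
                # Bind the variable, or check it against an earlier binding.
                if trial.setdefault(want, have) != have:
                    ok = False
                    break
            elif want != have:
                ok = False
                break
        if ok:
            yield from match(rest, trial)

query = [
    ("?hotel", "type", "Hotel"),
    ("?hotel", "locatedIn", "NewYork"),
    ("?hotel", "starRating", 4),
]
results = [b["?hotel"] for b in match(query)]
print(results)
```

Each pattern narrows the candidates, so only the hotel that satisfies all three conditions survives.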

Even more sophisticated interpretations of RDF triples can be done through another W3C Web standard called the Web Ontology Language (OWL).

The chain of reasoning within an RDF triple is relatively static and can vary according to who does the encoding. One triple might say that Yellowstone "offers" camping as a service, but another triple might state that camping "is offered" at Acadia National Park. Although it might seem obvious to us that both Acadia and Yellowstone offer camping, it wouldn't be to the computer. An RDF query engine, perhaps one embedded in a Web application, could consult OWL and return both entries.
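One way OWL can bridge the two phrasings is by declaring that one property is the inverse of the other. The sketch below assumes that approach; owl:inverseOf is a standard OWL term, while the example.org names, including the second park, are placeholders invented for illustration.

```python
OWL_INVERSE_OF = "http://www.w3.org/2002/07/owl#inverseOf"
EX = "http://example.org/parks#"  # hypothetical vocabulary

triples = {
    (EX + "Yellowstone", EX + "offers", EX + "Camping"),
    (EX + "Camping", EX + "isOfferedAt", EX + "AnotherPark"),  # the reversed phrasing
    (EX + "offers", OWL_INVERSE_OF, EX + "isOfferedAt"),       # the bridging statement
}

def apply_inverses(graph):
    """For each declared inverse pair, add the mirrored form of every matching triple."""
    result = set(graph)
    for (p, rel, q) in graph:
        if rel != OWL_INVERSE_OF:
            continue
        for (s, pred, o) in graph:
            if pred == p:
                result.add((o, q, s))
            elif pred == q:
                result.add((o, p, s))
    return result

expanded = apply_inverses(triples)
parks_with_camping = sorted(
    s for (s, p, o) in expanded if p == EX + "offers" and o == EX + "Camping"
)
print(parks_with_camping)
```

After the expansion, a single search on "offers" returns both parks, regardless of which direction the original publisher chose.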

RDF at work

For anyone familiar with HTML, RDF could be thought of as an extension to the metatag, which developers use to describe the contents of Web pages for search engines.

A Web site can host an RDF document that contains a list of terms, called a namespace, that can be used to tag different bits of data across its Web pages. All the pertinent data on the site's Web pages can then be tagged with terms in this namespace. For example, <album:name>Abbey Road</album:name> indicates that the name of a music album in the text is called "Abbey Road." As long as the Web page of RDF names is formatted using a standard W3C RDF namespace, a link between the organization's own namespace and the rest of the Web is established.

As Berners-Lee states in a document that describes how Linked Data works, RDF identifiers also could use hash marks (the "#" portion of a Web address), which would give any data elements on a Web page their own Web addresses. For instance, a Web page with some RDF-tagged information about someone named Albert could be rendered like this: "http://example.org/smith#albert".
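The structure of such an identifier can be seen with Python's standard urllib.parse module, which splits the address into the document that would be fetched and the fragment that names one thing described in it. The helper name is made up for this sketch.

```python
from urllib.parse import urldefrag

def split_hash_uri(uri):
    """Separate the document address from the fragment that names one item in it."""
    document, fragment = urldefrag(uri)
    return document, fragment

print(split_hash_uri("http://example.org/smith#albert"))
```

A browser or program retrieves only the document part; the fragment then points at Albert within whatever data that document contains.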

"This is a valuable thing to do, as anyone on the planet can now use that global identifier to refer to Albert and give more information," Berners-Lee writes, adding that additional information could be mapped to Albert through RDF, such as who his children are.

Although the idea of a machine-readable Web sounds great, it still requires data holders to render material in RDF, a tall order for already-overworked Web managers. Fortunately, the W3C has been working to develop standards that would make embedding RDF into Web pages easier.

Recently, the consortium published the first draft of HTML+RDFa, which the standards body describes as "a mechanism for embedding RDF in HTML." The advantage of HTML+RDFa is that it allows Web managers to directly embed RDF into an HTML document rather than create a separate file, said Tim Finin, a computer science professor at the University of Maryland at Baltimore County who has done a lot of work on artificial intelligence and the Semantic Web.

"You could have a document with text for people to look at and also data that a machine could extract that would say the same thing," Finin said. "Whereas before, you would have to publish an HTML document and an RDF document and somehow link them. And if you changed one, you would have to change the other. It's better to have just one."
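A toy illustration of the single-document idea: the page fragment below carries readable text plus about and property attributes, and a few lines of Python pull triples back out of it. This is a deliberately tiny subset, not a conforming RDFa processor, and it uses full URIs under a hypothetical example.org namespace rather than the compact prefixed names RDFa pages typically use.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: people read the text, programs read the attributes.
HTML_DOC = """
<div about="http://example.org/parks#Yellowstone">
  <span property="http://example.org/parks#name">Yellowstone National Park</span>
  offers <span property="http://example.org/parks#offers">camping</span>.
</div>
"""

class TinyRDFaReader(HTMLParser):
    """Collects (subject, property, text) from a very small subset of RDFa."""

    def __init__(self):
        super().__init__()
        self.subject = None   # set by the nearest preceding 'about' attribute
        self.pending = None   # property whose text is currently being read
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            self.pending = attrs["property"]

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None

    def handle_endtag(self, tag):
        self.pending = None

reader = TinyRDFaReader()
reader.feed(HTML_DOC)
reader.close()
for triple in reader.triples:
    print(triple)
```

Editing the visible text in the page changes the extracted data at the same time, which is the maintenance advantage Finin describes.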

Presumably, HTML+RDFa could speed this development because it eases the burden of creating separate RDF annotations for data on Web pages.

Challenges ahead

HTML+RDFa "is a promising development," said Michael Daconta, chief technology officer at Accelerated Information Management, former metadata program manager at the Homeland Security Department and a GCN columnist.

Daconta cautioned that the field of semantic markup still faces a chicken-or-egg problem in which Web managers need tools to embed RDF on their pages and organizations need tools to parse RDF information for their own services. But such tools for either party probably won't be created until RDF starts to become more widely used.

Few Web site managers are trained in RDF, and not many Web development applications use the standard, Berners-Lee said at ISWC. "I'm not sure we have a grasp of our needs for the next phase of products," he said.

Part of the issue is the inherent complexity of the Semantic Web concept. Even simple sets of data linked by RDF, which was one simple component of Berners-Lee's grand vision, "is still remarkably difficult as a paradigm shift," he said.

He said the use of RDF should not require building new systems or changing the way site administrators work, reminiscing about how many original Web sites linked to mainframe systems. Instead, scripts can be written in Python, Perl or other languages that can convert data in spreadsheets or relational databases into RDF for users. "You will want to leave the social processes in place, leave the technical systems in place," he said.
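A conversion script of the kind Berners-Lee describes might look something like this sketch, which turns rows of a spreadsheet export into N-Triples lines. The column names, the sample rows and the example.org namespace are assumptions for illustration; a real script would follow whatever vocabulary the agency chose.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/parks/"  # hypothetical namespace
# Invented sample data standing in for a spreadsheet export.
CSV_TEXT = "park,service\nYellowstone,camping\nYellowstone,fishing\n"

def csv_to_ntriples(text):
    """Convert each CSV row into one N-Triples line with a plain literal object."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + quote(row["park"])  # percent-encode spaces and the like
        # Escape backslashes and double quotes so the literal stays well-formed.
        service = row["service"].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{BASE}offers> "{service}" .')
    return lines

for line in csv_to_ntriples(CSV_TEXT):
    print(line)
```

The spreadsheet stays where it is and keeps its owners; the script simply publishes a second view of the same rows.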

Though nascent, the trend of linking data on Web sites could grow exponentially as more organizations — including government agencies — get involved, just as the original Web did in the last decade. "The more things you have to connect together, the more powerful it is," Berners-Lee said.
