Databases tag along with XML

New breed broadens the Web horizon by storing documents in their native format

Extensible Markup Language has become the data format of choice for many emerging applications in e-commerce and e-government.

It underlies such emerging Web services standards as the World Wide Web Consortium's Simple Object Access Protocol and new electronic data interchange standards such as ebXML. It's also supported by new XML versions of the Accredited Standards Committee's X12 format. Increasing numbers of federal documents are being generated in XML.

All this XML should make e-government easier, right? Well, sort of. XML makes it easier to move data around between applications. But managing all that XML content is something of a challenge.

XML documents can be transformed, in some cases, into data fields in the columns, rows and tables of conventional relational databases through so-called XML shredding. But the processing required to transform to and from XML drastically slows down database applications. For more complex documents, the relationships between data elements become exceedingly complex when converted to relational databases, making such a task impractical and expensive.
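To make the idea concrete, here is a minimal sketch of shredding in Python, using only the standard library. The order document, its element names and the two-table layout are invented for illustration; they are not taken from any product discussed here, and a real shredding tool would be driven by a mapping or schema rather than hand-written code.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative.
ORDER_XML = """<order id="42">
  <customer>Ada</customer>
  <item sku="A1" qty="2">Widget</item>
  <item sku="B7" qty="1">Gadget</item>
</order>"""


def shred(xml_text, conn):
    """Split one order document into rows of two relational tables."""
    root = ET.fromstring(xml_text)
    order_id = int(root.get("id"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, customer TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(order_id INTEGER, sku TEXT, qty INTEGER, name TEXT)"
    )
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)", (order_id, root.findtext("customer"))
    )
    # Each repeated child element becomes one row in a child table.
    for item in root.findall("item"):
        conn.execute(
            "INSERT INTO items VALUES (?, ?, ?, ?)",
            (order_id, item.get("sku"), int(item.get("qty")), item.text),
        )
    return order_id


conn = sqlite3.connect(":memory:")
shred(ORDER_XML, conn)
rows = conn.execute("SELECT sku, qty, name FROM items ORDER BY sku").fetchall()
```

Even this tiny document needs two tables and a join key. Every additional level of nesting typically adds another table, which is where the complexity described above comes from.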

Shredding data

There's another problem with relational storage of XML: the fidelity of XML documents. Once XML data is shredded, it isn't always put back together in the same way. It may end up out of order, or with formatting changes. In fact, relational databases that handle XML data often add their own XML to the documents to help index them, which increases the size and creates even more of a storage headache.

If a document's integrity must be preserved, storing it in a relational format is probably not an option.
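The fidelity problem can be sketched in a few lines of Python. The memo document and the shred and rebuild helpers below are hypothetical, and the sort stands in for the fact that rows in a relational table have no inherent order. No actual database is involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a comment and interleaved elements.
ORIGINAL = "<memo><!-- draft --><to>Bob</to><body>Hi</body><to>Eve</to></memo>"


def shred(xml_text):
    """Reduce the document to (tag, text) rows, as a relational store might."""
    root = ET.fromstring(xml_text)  # the default parser discards comments
    # Sorting simulates rows coming back without the original document order.
    return sorted((child.tag, child.text) for child in root)


def rebuild(root_tag, rows):
    """Reassemble a document from the stored rows."""
    root = ET.Element(root_tag)
    for tag, text in rows:
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


rebuilt = rebuild("memo", shred(ORIGINAL))
# The comment is gone and the elements are in a different order.
```

The rebuilt memo carries the same data, but it is no longer the same document: the comment has vanished and the recipients have moved.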

That's where a new breed of software comes in: native XML databases. They store XML documents in their original formatting, without altering them.

Real standards for XML databases have emerged over the past two years. They have helped these hybrids, descended from hierarchical, object and relational databases, take advantage of XML properties to create relationships and indices within documents. The databases do this with data management and querying tools similar to those in the relational databases familiar to most IT hands.

Like hierarchical databases, XML databases rely on an upside-down tree view of data. Like object databases, the types of data associated with each data element can be easily modified or extended. And like relational databases, XML databases offer a standard set of querying tools that is, or at least will be, universal across all XML database implementations.

But there's no reason to buy an XML database if you don't use XML documents, and the standards for this growing new species are still evolving rapidly.

Big vendors coming

Nevertheless, there's enough demand for XML databases that large vendors are entering the market. Microsoft Corp. and Oracle Corp. have announced intentions to ship their own native XML databases soon.

The first widely available commercial XML database was Tamino, from Software AG Inc. Over the past three years, other companies have brought out their commercial XML databases, and open-source efforts are delivering native XML databases as well, including the Apache Software Foundation's Xindice, which can be downloaded from the Web site xml.apache.org/xindice. Many applications that use XML documents, such as Web content management systems, now include integrated XML databases.

Rather than require the use of modeling tools to build a definition of the structure, or schema, of the database, XML databases leverage the data definition and relationship standards of XML itself, using the W3C's document type definitions (DTDs) or XML Schema standards to build the indices and structures.

Some databases even build these on the fly as documents are added; this ad hoc schema building dramatically reduces the amount of work required to get an XML database up and running. Unlike relational databases, XML databases can accept changes to their schemas on the fly, as required by the applications that access them. Changing a relational schema usually requires rebuilding and reloading the entire database.
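One way to picture ad hoc schema building is an index of element paths that grows as documents arrive. The Python sketch below is an assumption made for illustration only; the PathIndex class and the sample forms are hypothetical and do not describe how any particular XML database works internally.

```python
import xml.etree.ElementTree as ET


class PathIndex:
    """Hypothetical index mapping each element path to the documents using it."""

    def __init__(self):
        self.paths = {}  # path string -> set of document ids

    def add(self, doc_id, xml_text):
        """Index a new document, learning any paths not seen before."""
        self._walk(ET.fromstring(xml_text), "", doc_id)

    def _walk(self, elem, prefix, doc_id):
        path = f"{prefix}/{elem.tag}"
        self.paths.setdefault(path, set()).add(doc_id)
        for child in elem:
            self._walk(child, path, doc_id)


index = PathIndex()
index.add("d1", "<form><name>Ada</name></form>")
# The second document introduces a new element; no rebuild or reload is needed.
index.add("d2", "<form><name>Bob</name><phone>555</phone></form>")
```

When the second form arrives with a phone element the first one lacked, the index simply gains a new path. Nothing already stored has to be touched.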

Without a standard query and transaction language, there wouldn't be much appeal to XML databases. Until recently there wasn't one; there were only three proposed standards, called XQL, XML Query and XPath. Fortunately, all of these efforts have been consolidated into the W3C's XQuery standard.

XQuery has two parts: the XQuery database query language, which has features similar to those of the Structured Query Language, and XPath, a language for navigating the structure of XML documents themselves.
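The division of labor can be sketched in Python, whose standard library supports a limited subset of XPath but has no XQuery processor. In the example below, the path expressions do the navigating, and plain Python stands in for the filtering and aggregation an XQuery engine would perform in a single query. The filings document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tax-filing data; element names are illustrative only.
doc = ET.fromstring("""<filings>
  <filing year="2002"><payer>Acme</payer><tax>120.50</tax></filing>
  <filing year="2003"><payer>Bolt</payer><tax>75.00</tax></filing>
  <filing year="2003"><payer>Acme</payer><tax>98.25</tax></filing>
</filings>""")

# XPath-style navigation: select filings by attribute value.
payers_2003 = [
    f.findtext("payer") for f in doc.findall("./filing[@year='2003']")
]

# Select filings by the text of a child element, then aggregate in Python.
acme_total = sum(
    float(f.findtext("tax")) for f in doc.findall("./filing[payer='Acme']")
)
```

The bracketed predicates play roughly the role of a WHERE clause in SQL, which is why the comparison to relational querying comes up so often.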

Because the databases store documents in their original XML form, developers don't need to do much differently to read data. Many XML databases support connections to applications through a Java Database Connectivity (JDBC) driver or directly through XML interfaces such as the Simple API for XML (SAX) or the Document Object Model (DOM).
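The two XML interfaces differ in style. SAX streams through a document and fires events, while DOM loads the whole document into a tree that can be navigated. The Python sketch below shows both against a small in-memory document using the standard library's parsers; the DocCounter handler and the sample data are hypothetical, and the code does not connect to any database.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document standing in for one returned by a database.
XML = b"<docs><doc>a</doc><doc>b</doc><doc>c</doc></docs>"


class DocCounter(xml.sax.ContentHandler):
    """SAX handler that counts doc elements as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "doc":
            self.count += 1


# SAX: event-driven, nothing is kept in memory beyond the counter.
counter = DocCounter()
xml.sax.parseString(XML, counter)

# DOM: the whole document becomes a tree that can be queried afterward.
dom = minidom.parseString(XML)
texts = [node.firstChild.data for node in dom.getElementsByTagName("doc")]
```

SAX tends to suit very large documents that only need one pass. DOM is more convenient when an application has to move around inside a document or modify it.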

A growing number support connections through XML Web services protocols such as SOAP and XML-RPC, making it easier to build distributed XML applications.

Some can even make direct connections to Web search engines such as Verity to return XML documents.

Integrated into apps

Software AG and Ipedo Inc. are trying to build support for large volumes of transactions directly into XML databases, rather than using a transaction or application server to do the job for them. And application server vendors are starting to integrate XML databases into their products. That's important for large-scale e-commerce and e-government systems using XML.

The most obvious application of XML databases is in the realm of document and content management. And that's where they've had much of their early success.

Several large U.S. publishers have begun content management initiatives based on XML databases, some of them with document stores in the terabyte range.

In the government sector, the United Kingdom's Public Records Office uses Ixia Inc.'s TEXTML software as part of its document management efforts. Ixia also claims the Air Force as a customer. XML databases have had success in other document-centric e-government applications, too. For example, the California State Board of Equalization uses Tamino as part of an application for online filing of sales and use taxes.

Eventually, XML databases could be at the heart of nearly every document-driven government process. But for now, the market is a risky one.

Software AG's Tamino holds more than half the market, and the remainder is divided among a host of smaller companies and recent software start-ups.

The entry of Microsoft and Oracle into the pool will create a lot of waves and likely sink a number of the current players. Because standards are just starting to solidify, it may take a healthy dose of paranoia to manage the potential risks associated with committing to any one product.

But that should change quickly. As the W3C solidifies the XQuery specification, and as open-source and commercial XML databases mature, there will be less risk in selecting an XML database product than in selecting a relational one.

In view of the explosive growth of XML as a data format, that can't happen fast enough.

Kevin Jonah, a Maryland network manager, writes about computer technology.
