Get ready for the next big thing: XML

Extensible Markup Language's tag sets will give Web developers greater access to databases

By Shawn P. McCarthy

Special to GCN

Most Web developers know Extensible Markup Language looms large in their future, but few know whether it will be next month or next year.



Hypertext Markup Language will be used for a long time because it does basic file presentation so well. XML becomes necessary only when you need better access to, or control of, data embedded in office files and databases.

Like HTML, XML is a streamlined version of the Standard Generalized Markup Language, which makes it possible to use and display information in different ways by defining its structure and elements. The International Organization for Standardization's SGML specification is posted on the Web at www.iso.ch/cate/d16387.html.

XML is designed specifically for Web presentation. Its big advantage is that groups of developers can collaborate using their own customized tags to exploit functions that aren't possible with HTML.

As XML evolves, professional groups will establish specific XML tag sets to use in education, commerce, science and other fields. The sets likely will evolve in much the same way as OFX, the Open Financial Exchange format used by the banking industry.

Flexible format

Although XML is for presentation, think of it as a data format, not a document format. Straight XML documents are already common on the Web, but the language's great power lies in generating documents on the fly from databases.

An XML Glossary
Attribute: A property that can be assigned a value associated with an element. Hyperlinks and embedded images are attributes.

CDF: Channel Definition Format, a push technology used in XML.

DTD: Document Type Definition, a set of rules governing the tags in an XML document, set at the top of the document.

DSSSL: Document Style Semantics and Specification Language, an SGML style sheet standard.

Element: The key word that starts a declaration of element type.

Entity: Phrase or character that represents text or data stored elsewhere.

Parser: A program that checks an XML document to ensure it is valid.

Style sheets: Can be associated with an XML document to control information display.

Well-formed: An XML document whose open and close tags match and are nested correctly, and whose entities and attributes are properly declared.

XLL: Extensible Linking Language, the linking standard for XML.

XSL: Extensible Style Language, the style standard for XML.


XML has two remarkably different purposes. It can locate specific types of data embedded in documents, and it also can generate temporary documents, such as Web pages, from databases. Many sites have begun storing their information in databases and generating pages dynamically on demand.

XML can easily personalize content views if the data sources are properly tagged. The pages convert to straight HTML after the necessary data is culled; that's how leading Web search engines produce their customizable start pages.

Visitors do not need an XML-capable browser to see the data, which passes behind the scenes as XML and changes into HTML only for display.
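That behind-the-scenes flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `parts` table, the tag names (`inventory`, `part`, `name`, `category`) and the helper functions are invented for the example and do not come from any real tag set or product.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data source: an in-memory table of warehouse parts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [("Left-handed widget", "Garden"), ("Right-handed widget", "Shop")],
)

def rows_to_xml(rows):
    """Wrap database rows in invented XML tags (the behind-the-scenes format)."""
    root = ET.Element("inventory")
    for name, category in rows:
        part = ET.SubElement(root, "part")
        ET.SubElement(part, "name").text = name
        ET.SubElement(part, "category").text = category
    return root

def xml_to_html(root, category=None):
    """Render the XML as plain HTML, optionally filtered to one category."""
    items = [
        "<li>%s</li>" % escape(part.findtext("name"))
        for part in root.findall("part")
        if category is None or part.findtext("category") == category
    ]
    return "<ul>" + "".join(items) + "</ul>"

rows = conn.execute("SELECT name, category FROM parts ORDER BY name").fetchall()
inventory = rows_to_xml(rows)
garden_page = xml_to_html(inventory, category="Garden")
print(garden_page)
```

The visitor's browser receives only the HTML string; the XML exists just long enough to select and arrange the data.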

Conceptually, XML is a trio of specifications:

• The XML 1.0 recommendation explains the syntax of the metalanguage.

• The XML Linking Language and XPointer are the World Wide Web Consortium's working drafts that describe ways to link relationships between documents.

• The Extensible Style Language, now a W3C Note, describes how to render XML using different style sheets for various types of display devices.

XML also can issue commands. If Microsoft Internet Explorer encounters a particular software-update tag while you browse, it starts a function that lets you update installed software.

The main page for the W3C's XML efforts is at www.w3.org/XML/Activity. XML only recently became a W3C recommendation and is not yet an official standard.

Internet Explorer 5.0 so far is the only browser that understands XML elements, based on the draft specification. The parts Microsoft adopted for Explorer will likely be part of the official XML standard. Netscape Communications Corp. is taking a wait-and-see attitude and likely will not release an XML-ready Communicator 5 until the specification becomes official.

Here's how XML makes documents readable by users and by browsers and other software programs. Say you want to create a document about some machine parts stored in a warehouse. In the HTML world, you would start with a document that looked something like this:

Machine Parts

Left-handed widgets

Then you would add more lines of description to produce a basic Web page. If your colleagues later wanted to put the information into a report or add it to a database, they would gather the page, strip out the font and alignment tags, and then reformat the information.

Now here's how the same document might look in XML:

Bob Smith

Machine parts

Left-handed widgets

In XML, tags can be invented to describe data types. Anyone searching for an occurrence of left-handed widgets within a recognized tag would have a good chance of finding the widget entry.
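A search restricted to one tag might look something like the following Python sketch. The tag names (`document`, `author`, `category`, `partname`) are assumptions made up for this illustration, as is the `find_in_tag` helper; a real project would use whatever tag set its group had agreed on.

```python
import xml.etree.ElementTree as ET

# Invented tag names for illustration; a real project would use an agreed tag set.
document = """
<document>
  <author>Bob Smith</author>
  <category>Machine parts</category>
  <partname>Left-handed widgets</partname>
</document>
"""

def find_in_tag(xml_text, tag, phrase):
    """Return the text of every <tag> element that contains the phrase."""
    root = ET.fromstring(xml_text)
    return [
        element.text
        for element in root.iter(tag)
        if element.text and phrase.lower() in element.text.lower()
    ]

matches = find_in_tag(document, "partname", "widget")
print(matches)
```

Because the search looks only inside the named tag, a stray mention of widgets in an author or category field would not produce a false hit.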

Once tags are generally recognized, software can deal with them automatically. It's simple to tell a program to look at a specific directory and pull in the contents of the tags from all documents within the directory. Then the contents can be imported into a database field, output to other documents, updated for reinsertion in the original document or held for other uses such as building new pages.

If every bit of information is properly tagged, you can pull all of it into a database. At that point you no longer need to maintain the original document, just the database.
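As a rough sketch of that directory sweep, the Python below reads every .xml file in a folder and copies the contents of one tag into a database table. The `partname` tag, the `extracted` table and the `import_tag` helper are hypothetical names chosen for the example, and the two sample files are throwaway stand-ins for real documents.

```python
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def import_tag(directory, tag, conn):
    """Copy the text of every <tag> element in directory/*.xml into a table."""
    conn.execute("CREATE TABLE IF NOT EXISTS extracted (source TEXT, value TEXT)")
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(path).getroot()
        for element in root.iter(tag):
            conn.execute(
                "INSERT INTO extracted VALUES (?, ?)", (path.name, element.text)
            )
    conn.commit()

# Demonstration with two throwaway files; the tag names are invented.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.xml").write_text(
        "<document><partname>Left-handed widget</partname></document>"
    )
    Path(folder, "b.xml").write_text(
        "<document><partname>Cotter pin</partname></document>"
    )
    conn = sqlite3.connect(":memory:")
    import_tag(folder, "partname", conn)
    imported = conn.execute(
        "SELECT source, value FROM extracted ORDER BY source"
    ).fetchall()

print(imported)
```

Recording the source file name alongside each value keeps the option of writing updated contents back to the original document.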

Get together

But you cannot keep adding new tag names, especially if you share data with other offices. How would they know what your tags meant? That's why groups have gotten together to develop standardized tag sets.

Given the appropriate tags, you can stack an enormous amount of data into an XML document. Anything becomes a data field just by tagging it, including the document itself. Take a look at this XML document:

Bob Smith

Machine parts

Left-handed widget

Roto Tiller

Garden

Detailed description

XYZ 186

Remove the cotter pin. Remove old widget. Install new widget. Replace cotter pin.

It looks like HTML, but it has no presentation data. That comes from another source, such as a style sheet. It's like having your word processor import addresses or names via mail merge rather than typing and formatting them directly in the document. Many systems merge XML and style data back into a presentational language such as HTML for easier reading.
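The mail-merge idea can be illustrated with a small Python sketch. It is not XSL: the two dictionaries below are simplified stand-ins for style sheets, and the tag names (`part`, `partname`, `partnumber`) and the reading of "XYZ 186" as a part number are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented tags: the XML carries only data, with no fonts or alignment.
data = ET.fromstring(
    "<part><partname>Left-handed widget</partname>"
    "<partnumber>XYZ 186</partnumber></part>"
)

# Two stand-in "style sheets": each maps a tag name to an HTML wrapper.
screen_style = {"partname": "<h1>%s</h1>", "partnumber": "<p><b>%s</b></p>"}
print_style = {"partname": "<p>%s</p>", "partnumber": "<p>Part no. %s</p>"}

def render(element, style):
    """Merge data and style into HTML, skipping tags the style omits."""
    return "".join(
        style[child.tag] % escape(child.text or "")
        for child in element
        if child.tag in style
    )

screen_html = render(data, screen_style)
print_html = render(data, print_style)
print(screen_html)
print(print_html)
```

The same data yields two different pages, and changing the look of either one never touches the XML.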

The downside of XML's flexibility is that it is less forgiving than HTML. Browsers ignore HTML commands they fail to understand. If items aren't properly nested, it's no big deal to the browsers. But in XML, an improperly formatted file creates a fatal error. Applications will refuse to process the file.

That means a document must be what XML experts call well-formed to work right. It has to be ready for a computer program to read, and thus ready to be used in multiple ways for network delivery.

In a well-formed document:

• All begin tags and end tags match up.

• Empty tags use the special XML syntax <tag/>.

• All the attribute values are properly quoted, for example: <tag attribute="value">.

• All the entities, or reusable data chunks, are declared.

Checking for code errors across thousands of documents is tough, so XML users turn to automated tools such as the Lark parser. An online demo of Lark at xml.com/xml/pub/tools/ruwf/check.html can check whether your document is well-formed.
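The same kind of check can be approximated with the parser in Python's standard library, which rejects any document that is not well-formed. This sketch does not use Lark; the `is_well_formed` helper and the sample tags are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as error:
        return False, str(error)

# One sound document and three that each break a rule of well-formedness.
good = '<part id="186"><name>Widget</name><instock/></part>'
bad_nesting = "<part><name>Widget</part></name>"
bad_quotes = "<part id=186><name>Widget</name></part>"
bad_entity = "<part>&maker;</part>"

for sample in (good, bad_nesting, bad_quotes, bad_entity):
    ok, message = is_well_formed(sample)
    print(ok, message)
```

Run over a whole directory, a loop like this would flag the files that applications will refuse to process.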

XML designers recognized that document authors sometimes omit important information or include extraneous text. The document type definition, or DTD, makes sure that XML coding will do what was intended.

For the parts file above, a DTD might work like this:

The first declaration names the root element, which contains all other elements.

The second simply says to expect a standalone tag.

The third defines a container tag. Within the parentheses of the declaration are additional sets of tags. They must appear inside the container tags, in the same order.

An XML document can have an internal or an external DTD. It must be external if the DTD applies to multiple XML files.
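To make the idea concrete, here is a hedged Python sketch of a document with an internal DTD. The element names (`part`, `name`, `number`, `steps`) are hypothetical and are not the declarations from the original parts file. Python's standard parser reads past a DTD without validating against it, so the ordering rule is restated by hand in `follows_rules`, a helper invented for this example; a real project would use a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal DTD: a <part> must hold name, number, then steps.
document = """<!DOCTYPE part [
  <!ELEMENT part (name, number, steps)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT number (#PCDATA)>
  <!ELEMENT steps (#PCDATA)>
]>
<part>
  <name>Left-handed widget</name>
  <number>XYZ 186</number>
  <steps>Remove the cotter pin. Remove old widget.</steps>
</part>
"""

# The same rule restated by hand, because ElementTree does not
# validate a document against its DTD.
EXPECTED_CHILDREN = {"part": ["name", "number", "steps"]}

def follows_rules(root):
    """Check that the root's children match the expected names and order."""
    expected = EXPECTED_CHILDREN.get(root.tag)
    return expected is not None and [child.tag for child in root] == expected

root = ET.fromstring(document)
valid = follows_rules(root)

# Swapping two children keeps the document well-formed but breaks the rule.
root[0], root[1] = root[1], root[0]
valid_after_swap = follows_rules(root)
print(valid, valid_after_swap)
```

The swapped version still parses, which is the difference between a well-formed document and one that obeys its DTD.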

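The elements can get more complex. For example, in an element declaration that contains the term #PCDATA, the term stands for parsed character data: nonbinary information such as raw text, which the parser reads and checks for markup. You could designate "author" this way to hold the author's name; binary data such as a photo would have to be stored elsewhere and referenced through an entity.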

The DTD checks to confirm that items within the tags follow its rules. For details about how DTDs are constructed, visit www.w3.org/TR/REC-xml#dt-doctype. But an XML document need not have a DTD to function. If the document is well-formed, it requires no special rules to tell a browser or other device how to read it.

Any XML parser can tell whether a document is well-formed; a validating parser also checks the document against its DTD. For a quick, simple check, save a document with an .xml extension, then view it in Internet Explorer 5.0, which will show whether anything is incomplete.

The key to writing successful XML is to do a great deal of advance planning. Decide how documents will be stored and served, how databases will be accessed, what tag sets will be used, and how they will nest so that the resulting documents are not only well-formed but also make sense to readers.

Decide whether you will need a DTD. If so, should it be internal or external, and how should it be structured? Don't worry about style sheets until you have everything else in place.

Above all, learn what others in your agency are doing about tag set creation. Because the government shares so much information, it needs a governmentwide tag set.

Then set up some experiments with a few dozen documents. Check the resources in this article to get started.

You can read the full XML specification at www.w3.org/TR/REC-xml.
