Documents in the Portable Document Format are about as common on the Web as celebrity photos, but does the format help or hinder the dissemination of information? Readers were of two minds in responding to our report on the Sunlight Foundation’s criticisms of PDF. Several stressed the need for better-educated users. Others contended that data shouldn't always be easy to get at.
Sunlight Labs Director Clay Johnson argued that PDF works against government transparency because the format makes it difficult for computers to parse information. An architect for Adobe – although PDF is an open standard, Adobe has built a substantial business around PDF – responded that it is fairly easy to incorporate Extensible Markup Language into a PDF, although most people don’t know how to do it.
“So the real problem is not the feature set of the Adobe products, but how government officials use them,” wrote a reader named Mike. “Any shift in applications will result in even less transparency for a while until users become familiar with the new applications. I know our organization specially prohibits the use of advanced Adobe features and scans in the documents. This is because our document control gatekeepers are stuck in the 1970s. Even more to the point for transparency is access to the documents to begin with. Look at the health care debate: Every discussion on the current bill is confronted by advocates on both sides with ‘but there is no final bill yet.’ Transparency and access extend beyond the feature set of Adobe products.”
“I agree with Mike,” wrote Buddy of Somewhere in the USA, who suggested that to “most users who use Acrobat, PDF means let's scan this in ‘image’ format and make it a PDF when they need to learn how to use the software! Where I work at there was never any training on Adobe, let alone work for a place where software features are not disabled by administrative personnel under the guise of security. As for a worker who has to get items out, it’s just as fast to make a SCAN PDF and go from there.... Job done.”
“What's important is how you create/manipulate the documents,” added Kelvyn in Philadelphia. “I work in a municipal agency with a hybrid paper/electronic record system, and, for the moment at least, PDF remains the best bridge between those worlds. I can scan a document, add fields that push info into our database and publish a fully-searchable PDF on our Web site. The problem is that many PDFs on government sites are simple scanned images of text, and not easily searchable. I spend a lot of time trying to educate my staff on the need to print directly to PDF from other apps in order to preserve full-text searching.”
At least one reader, signed Anonymous, offered some practical advice: “All you have to do to get a working copy of the PDF doc is to right-click on it, and hit, ‘Select all.’ Then do a ‘Ctrl-C’ to copy it to either a Word or WP doc. Then hit ‘Ctrl-V’ to paste it on the page. Once it's in WP or Word, you can work with it.”
A reader named Ed, however, said that doesn’t always work. “I tried to comment on an environmental impact statement that the Maryland State Highway Administration put out in PDF and it was a nightmare,” he writes. “They locked the document so you could not cut and paste into another document. So the practice of putting their statement into another document and then questioning or refuting the statement was almost impossible. And this is exactly what they wanted because they did not want dissent.”
But is making it difficult to extract data from a document always a bad thing? Stanley Baranowski wrote that transparency might not be the real, or even the only, issue. “It seems to me one issue is ‘data extraction’ and the difficulty of ‘others’ extracting certain information but not necessarily all of the data, only what ‘they’ want you to see, not the entire document -- that ‘taken out of context’ thing. I think that the difficulty of extracting pieces of the entire document actually reinforces the idea of transparency. Someone cannot cut and paste just what they want to show you, but there's the entire document -- no manipulation -- read it all and decide for yourselves.”
“My initial reaction is against this idea [of easily parsed data],” wrote Charles, of Hollywood, Fla. “Our goal is making the information available to the largest number of *people* and PDF is an excellent way to do that. I say ‘our’ because I work for the city in which I live and one part of my job is making the information available. The tone of the article seems, to me at least, to be that I need to spend more time making the information we provide in such a way as to allow *them* to just cut and paste into whatever they are doing with said information. I use the 'old lady' test (my apologies to the old ladies) -- can the oldest and most technologically inept citizen in my city find, and then read, the information? If the answer is yes, then I have done a good job. If you want to do something else with that information, then you are probably tech-savvy enough to figure out how to get the information out of my PDF files.”
“I believe transparency must also include the ability to ensure accuracy of presentation,” added another reader. “Too many individuals I work with do not verify what they read but take it as fact [that] it is accurate. If someone can extract my data and manipulate it and then reproduce it as mine, that is worse than it being easy to parse. One of the biggest reasons I use PDF is because it is protected from the average users’ exploits.”
Posted on Nov 05, 2009 at 2:06 PM | 2 comments
For some time now, people have been wondering what will succeed silicon-based processor technology. Once Moore's Law inevitably hits the wall, as the limit on how many transistors can be packed onto a silicon wafer is reached, what new technologies will continue the march toward ever-more-powerful computers?
One possible answer comes in the form of an emerging science called plasmonics, reports Science News.
Plasmonics, studied by the National Institute of Standards and Technology and others, is a newly understood technique for compressing light into conduits just a few nanometers wide. Tiny plasmonic lasers could do the work that transistors do today. Because such "plasers" are smaller than silicon-based transistors, much more computing power can be packed into a smaller surface.
Plasmonics exploits a strange nano-level phenomenon known as the surface plasmon wave, caused when light hits a very small scrap of metal. The "light can set off a wave in the free electrons hanging out on the metal’s surface. This wave carries the light along like a surfer riding on an electron sea," the publication reports.
All the usual disclaimers apply when talking about far-term technology: The devil remains buried in the details. Converting light into plasmon waves hasn't been sussed out yet, and how to encode data on a wave remains a challenge as well. Also, plasmon waves don't travel very far before dying out.
Like quantum computing — that other great promise for post-silicon processing — plasmonic computers remain decades away. Still worth watching, though.
Posted on Nov 02, 2009 at 2:03 PM | 0 comments
Tim Berners-Lee, the inventor of the Web, and the World Wide Web Consortium, the organization that maintains Web standards, have recently been promoting the idea of making the Web machine-readable, or a Web of data. What does that mean? After all, at least in one sense, the Web is already being read by a machine -- namely your own computer -- when you surf the Web.
At the International Semantic Web Conference, being held this week in Chantilly, Va., Dean Allemang, chief scientist at Semantic Web consulting firm TopQuadrant, offered a solid example of how a machine-readable Web would help us all, in theory anyway.
His example was work-related: booking hotels. Say you wanted to attend a conference at some out-of-town location. The conference venue itself probably has a Web site.
You copy its physical address from its site, and go to an online hotel broker site, such as Hotels.com, to find a nearby hotel. You do a search on hotels, say, by entering that address into the search criteria, to seek hotels within a certain radius. Or you just get a list of hotels and go to a third Web site, a mapping site such as MapQuest, and enter hotel addresses and the conference center address to see if any hotel is close to the conference center.
In Allemang's view, this really is crazy. Why copy some information from one page and paste it to another, using the same computer? Why can't the computer itself do the work?
The trick would be to get all the sites to agree on how to represent an address, Allemang said. Then, the addresses can be passed from one site to the next through your browser, automatically, without you having to do anything. The mapping site could check your cache and list any addresses found there, offering you the option of mapping them.
Automating such a task (and the countless others we do by hand on our computers), is the point of creating a machine-readable Web. If computer programs can read the Web pages and carry out tasks, we won't have to.
Relational databases show how this can work, at least within a single system. With databases, you can structure data so each data element is slotted into a predictable location. You can query a database of personnel data to return the birth date of a particular person, because the row of data with that person's info has a column dedicated to the birth date.
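To make the point concrete, here is a minimal sketch of such a lookup using Python's built-in sqlite3 module. The table name, column names and people are invented for illustration and do not come from any real personnel system.

```python
import sqlite3

# Hypothetical personnel table; the names and dates are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (name TEXT PRIMARY KEY, birth_date TEXT)")
conn.executemany(
    "INSERT INTO personnel VALUES (?, ?)",
    [("Alice Example", "1970-03-14"), ("Bob Sample", "1982-11-02")],
)


def birth_date_of(name):
    """Return the birth_date column from the row for `name`, or None."""
    row = conn.execute(
        "SELECT birth_date FROM personnel WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


print(birth_date_of("Alice Example"))  # 1970-03-14
```

The query works only because the program already knows which table and column hold the birth date, which is exactly the kind of agreement that is missing across the open Web.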
This approach wouldn’t work so well for data beyond a single database, however. "The problem is that everyone assumes you will need to build a huge data warehouse, where everything can be compared. This will never happen," Allemang said. Another factor: On the Web, data is not structured in such a way that it can be retrieved with any consistency, and the vast number of people who design and maintain Web sites would not all agree on the same format for structuring data.
The W3C's answer is a set of interrelated standards that can be used to embed data on Web sites, as well as to interpret the data that is found there. One standard is the Resource Description Framework, or RDF. Another is the Web Ontology Language, or OWL.
RDF is a way of encoding data so that external IT systems, and through them a wider audience, can understand it. It is based on making associations. It describes data by breaking each statement into three parts: a subject, a predicate and an object. For example, consider the fact that Yellowstone National Park offers camping. "Yellowstone" would be the subject, "offers" would be the predicate and "camping" would be the object. (All three elements get uniform resource identifiers, or globally recognized Internet addresses.)
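As a rough sketch, a triple can be modeled in Python as a three-field record. The example.org namespace and the helper name `uri` are placeholders we made up for illustration; a real deployment would use its own published identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Placeholder namespace (an assumption, not a real vocabulary).
EX = "http://example.org/parks/"


def uri(local_name):
    """Build a full identifier from a short local name."""
    return EX + local_name


# "Yellowstone offers camping" expressed as a single triple.
fact = Triple(uri("Yellowstone"), uri("offers"), uri("Camping"))

print(fact.subject)    # http://example.org/parks/Yellowstone
print(fact.predicate)  # http://example.org/parks/offers
print(fact.obj)        # http://example.org/parks/Camping
```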
A query against a triple store, which is what an RDF database is called, can link together disparate facts. If another triple, perhaps located in another triple store, states that Yellowstone contains the Mammoth Hot Springs, a single search across multiple triple stores can return both facts.
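The idea of one search spanning several stores can be sketched in a few lines of Python. Here each "store" is just a list of tuples, and `None` acts as a wildcard; the function names and short identifiers are our own invention, and a real triple store would expose a query endpoint instead.

```python
def match(store, pattern):
    """Return triples in `store` that fit `pattern`; None matches anything."""
    return [
        triple
        for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


def search_all(stores, pattern):
    """Run the same pattern against every store and pool the results."""
    results = []
    for store in stores:
        results.extend(match(store, pattern))
    return results


# Two independent, hypothetical triple stores.
store_one = [("Yellowstone", "offers", "Camping")]
store_two = [("Yellowstone", "contains", "MammothHotSprings")]

facts = search_all([store_one, store_two], ("Yellowstone", None, None))
for fact in facts:
    print(fact)
# ('Yellowstone', 'offers', 'Camping')
# ('Yellowstone', 'contains', 'MammothHotSprings')
```

Because both stores use the same identifier for Yellowstone, the pooled results line up without anyone having built a shared database first.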
Additional standards can further refine the precision of the data definition. For instance, two parties can agree that the term "Yellowstone" refers to "Yellowstone National Park" by using a shared, controlled vocabulary, which can be referenced through the Resource Description Framework Schema, or RDFS. RDFS also allows inferencing. In RDFS, you can state that Yellowstone is a type of national park. So a search for national parks that offer camping would return Yellowstone.
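A toy version of that kind of inference might look like the following. It applies one RDFS-style rule (if X is of type C, and C is a subclass of D, then X is also of type D). The "Park" and "CityPark" classes and the CentralPark entry are our own additions for illustration, and the short names stand in for full identifiers.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical data; only the Yellowstone facts come from the example above.
triples = {
    ("Yellowstone", TYPE, "NationalPark"),
    ("Yellowstone", "offers", "Camping"),
    ("CentralPark", TYPE, "CityPark"),
    ("CentralPark", "offers", "Boating"),
    ("NationalPark", SUBCLASS, "Park"),
    ("CityPark", SUBCLASS, "Park"),
}


def infer_types(facts):
    """Add inferred type triples until nothing new can be derived."""
    inferred = set(facts)
    while True:
        new = {
            (x, TYPE, d)
            for (x, p, c) in inferred if p == TYPE
            for (c2, p2, d) in inferred if p2 == SUBCLASS and c2 == c
        } - inferred
        if not new:
            return inferred
        inferred |= new


def instances_offering(facts, cls, service):
    """Subjects that are of type `cls` and offer `service`."""
    return sorted(
        s for (s, p, o) in facts
        if p == TYPE and o == cls and (s, "offers", service) in facts
    )


closed = infer_types(triples)
print(instances_offering(closed, "NationalPark", "Camping"))  # ['Yellowstone']
print(instances_offering(closed, "Park", "Camping"))          # ['Yellowstone']
print(instances_offering(triples, "Park", "Camping"))         # []
```

The last line shows the difference inference makes: without it, nothing is known to be a "Park" at all.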
Of course, the Interior Department could build a list of all the national parks and include which services each park offers. But with the semantic Web approach, such a single database would never be needed. Each park could maintain its own data, and the results could be compiled only when someone posts a query for some specific piece of data, Allemang pointed out. In essence, with RDF, a user can build a set of data from various sources on the Web that may not have been brought together before.
How do you use these triples? One way is through the query language for RDF, called SPARQL (an abbreviation for the humorously recursive SPARQL Protocol and RDF Query Language). With Structured Query Language (SQL), you can query multiple database tables through the JOIN operation. With a SPARQL query, you specify all the triple patterns you need, and the query engine will filter down to the answers that fit all of your criteria.
For instance, say you are looking for a four-star hotel in New York. You write a query that looks for triples specifying four-star hotels, and for triples specifying hotels in New York. The query engine would find all the triples for hotels in New York, as well as all the triples for four-star hotels, and filter the set down to four-star hotels in New York.
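The filtering step can be imitated in plain Python, as a sketch of what a query engine does rather than how any particular engine is built. The hotel names, predicates and the `ex:` prefix in the comment are all hypothetical.

```python
# A SPARQL query for this search might read roughly as follows
# (the ex: vocabulary is an assumption made for illustration):
#
#   SELECT ?hotel WHERE {
#     ?hotel ex:locatedIn  ex:NewYork .
#     ?hotel ex:starRating 4 .
#   }

triples = [
    ("HotelA", "locatedIn", "NewYork"),
    ("HotelA", "starRating", "4"),
    ("HotelB", "locatedIn", "NewYork"),
    ("HotelB", "starRating", "3"),
    ("HotelC", "locatedIn", "Boston"),
    ("HotelC", "starRating", "4"),
]


def subjects_matching(facts, predicate, obj):
    """All subjects that have the given predicate and object."""
    return {s for (s, p, o) in facts if p == predicate and o == obj}


def query_all(facts, patterns):
    """Subjects that satisfy every (predicate, object) pattern."""
    result = None
    for predicate, obj in patterns:
        matches = subjects_matching(facts, predicate, obj)
        result = matches if result is None else result & matches
    return sorted(result or [])


print(query_all(triples, [("locatedIn", "NewYork"), ("starRating", "4")]))
# ['HotelA']
```

Each pattern yields its own set of candidates, and intersecting the sets plays the part of the engine narrowing down to answers that fit every criterion.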
Even more sophisticated interpretations of RDF triples can be done through OWL.
The logical chain of reasoning within an RDF triple is relatively static, and can vary according to who does the encoding. One triple may say that Yellowstone "offers" camping as a service, but another triple may state that camping "is offered by" Acadia National Park. While it may seem obvious to us that both Acadia and Yellowstone offer camping, it wouldn't be to the computer. A SPARQL query engine, perhaps one embedded in a Web application, could consult an OWL ontology and return both entries, though.
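OWL lets an ontology declare that one property is the inverse of another, which is one way the two phrasings above could be reconciled. The sketch below mimics that with a small lookup table; "ParkA" and "ParkB" stand in for the two parks, and the property names are our own shorthand rather than terms from a published ontology.

```python
# Declares the two properties as inverses of each other, in the
# spirit of owl:inverseOf. The names are hypothetical.
INVERSE_OF = {"offers": "isOfferedBy", "isOfferedBy": "offers"}

facts = {
    ("ParkA", "offers", "Camping"),       # encoded one way
    ("Camping", "isOfferedBy", "ParkB"),  # encoded the other way
}


def with_inverses(triples):
    """Return the triples plus the statements implied by inverse properties."""
    expanded = set(triples)
    for s, p, o in triples:
        if p in INVERSE_OF:
            expanded.add((o, INVERSE_OF[p], s))
    return expanded


def parks_offering(triples, service):
    """Subjects that 'offer' the given service."""
    return sorted(s for (s, p, o) in triples if p == "offers" and o == service)


print(parks_offering(facts, "Camping"))                 # ['ParkA']
print(parks_offering(with_inverses(facts), "Camping"))  # ['ParkA', 'ParkB']
```

Without the inverse declaration, the query sees only the park whose data happened to be written with "offers."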
While the idea of a machine-readable Web sounds great, it still requires data holders to render their material in RDF, a tall order for already-overworked Web managers. But the benefits may be worth it: Once online, data can be reused in ways that government managers may never have considered.
Posted on Oct 27, 2009 at 12:00 PM | 0 comments
By now the IT trade press is so awash with breathlessly anticipatory articles and reviews of the upcoming Windows 7 that we've stopped scanning them, for the most part. But this one, from the humor magazine Cracked, caught our eye: "A Review of the Pirated Copy of Windows 7 I Bought On eBay." This was the funniest thing we've read in a while, in fact. To hear this reviewer tell it, Windows 7 has limited hardware support, few decent built-in applications (Microsoft Fax notwithstanding) and is terribly unstable!
Keep in mind that this article comes from a magazine specializing in low-brow humor, so some of the language (at least in the background of some screenshots) is not work-safe. And we won't give away the punch line. But suffice it to say, it is a good one, and speaks volumes about the fruits of our relentless march of technological progress.
Now, to dig up that old floppy drive…
Posted on Oct 09, 2009 at 1:21 PM | 0 comments
Server-huggers aren't the only ones wary of the lure of cloud computing. Software vendors also seem reluctant to hitch onto this latest trend.
Earlier this week, Doug Bourgeois, director of the Interior Department's National Business Center, talked about how NBC is ramping up a set of infrastructure services for other agencies to use. He spoke as part of a cloud-computing panel at the Virtualization, Cloud Computing and Green IT Summit, held by the 1105 Government Information Group, publishers of GCN.
The idea is that NBC can offer other agencies such infrastructure at a lower cost and with all the necessary government security compliance already in place, largely because it had already built out much of the infrastructure in the process of delivering its own services. But in the course of ramping up the center's offering, Bourgeois did come across one stumbling block: Software licensing.
NBC plans to offer infrastructure service on a pay-as-you-go basis (at least, initially, in monthly increments). But much of the software needed to supply this infrastructure-as-a-service (server software, databases and such) can only be procured via old-fashioned enterprise licenses. This means all the software that NBC might use must be purchased beforehand.
"The traditional enterprise license agreement that software providers want to bring to the table requires the service provider to outlay the money up front for the entire enterprise license, and then you have the ability to provision those licenses as clients accessing your system," Bourgeois said in a subsequent interview with GCN. "That just doesn't work in a cloud model. The service providers are taking all the risk and paying up front" for services that may or may not be actually used.
This is especially problematic, Bourgeois explained, insofar as the projected use of NBC's cloud services, being not only a new service but a new type of service, can vary wildly. And because much of the cost-savings is based on a shared-usage model, charging full price for each copy of a program that might be used, and/or for every customer that might use that program, would cut into the cost-savings that cloud computing could bring about.
Oddly enough, hardware vendors seem to have come to terms with the pay-as-you-go route. For its own cloud services, the Defense Information Systems Agency hammered out an agreement with Hewlett-Packard and Sun Microsystems wherein each company would outfit DISA with fleets of servers within the agency's data centers, but only charge for those servers that were actually used. NBC struck a similar deal with its own vendors.
Yet many software companies seem loath to offer a similar deal, Bourgeois said, adding that NBC is currently talking with a number of vendors to see if any deals can be worked out.
Bourgeois didn't name the vendors he was speaking with, though we queried a few of the biggest enterprise software companies, including Microsoft, Oracle, Red Hat and RightNow Technologies, to find out if they offer any sort of usage-based licensing, or if they would be willing to do so. Thus far, one company has responded to our request: customer relationship management software provider RightNow Technologies. (We'll keep you updated with responses from the other companies.)
RightNow currently does not offer usage-based pricing, but is open to the idea, said Kevin Paschuck, the company's vice president of public sector operations. In fact, the company already is in discussion with DISA on establishing a monthly payment based on actual consumption.
"Our typical contract aligns with the industry standard of an annual commitment with the opportunity to tune up or down the licenses at the end of the contract based on amount used," Paschuck said in an e-mail. "However, we are open to monthly usage based contracts."
True, vendors have long padded their bottom lines with the inherent inefficiencies of government IT purchasing — by making small but expensive sales to branch agencies, or by selling more seats on agency-wide enterprise licenses than ever get used. From behind the procurement officer's desk, a bounty of cost-savings can be glimpsed. But to be fair, it is obvious why some software companies may be reluctant to go to a usage-based pricing model. Software sales are what keeps software companies in business. It is a core asset. Duh! You can't return a partially-eaten half smoke to Ben's Chili Bowl and expect an incremental refund of some sort.
As with transparency efforts stifling frank vendor-agency talk, usage-based pricing could ultimately spur some serious fiscal introspection on the part of vendors. And rethinking how a company's primary source of revenue would be generated under a cloud model is not a task to be undertaken lightly. Add into this muddy mathematics the fact that many software companies, such as Microsoft or RightNow, are ramping up or already have their own software-as-a-cloud offerings, thereby making a government service cloud provider a potential competitor, in addition to being a potential customer. In short, asking for a new type of pricing is a big request.
Still, the current reluctance could be problematic for nascent government cloud offerings. "The standard license agreement puts too much risk on the service provider," Bourgeois said.
Posted on Oct 09, 2009 at 2:03 PM | 0 comments