Is PDF hurting transparency?

Computers cannot easily parse government documents published in the Portable Document Format (PDF), according to the Sunlight Foundation, a nonprofit organization dedicated to government transparency. The group argues that, as a result, the widely used document standard is actually detrimental to government transparency efforts.

The difficult parsing means that people have to work harder to reuse government data, the organization asserts.

Although PDF is an open standard, it's closely associated with Adobe, which makes popular free software for reading PDF documents and more sophisticated software for creating them. Adobe representatives dispute the foundation's claim, saying a PDF can contain parsable data in the form of XML datasets, but they admit that not enough of the company's users know how to use the feature.

In a blog entry provocatively entitled "Adobe is Bad for Government," posted last week, Sunlight Labs head Clay Johnson bemoaned the difficulties of extracting data from PDFs.

Johnson, who is not the same person as the former deputy director of management at the Office of Management and Budget, points out a number of specific examples in which the government's use of PDFs has made data hard to extract. The examples include House of Representatives bills, the Internal Revenue Service's Political Action Committee filings, and earmark requests from members of Congress.

"It is a misunderstanding about the capabilities of PDF," said Bobby Caudill, who is the government solutions architect for Adobe Systems. Caudill pointed out that it is possible to load the documents used to create a PDF directly into the PDF file. An XML document could be incorporated in such a way, for instance. So all an end user would need to do is extract the XML document from the PDF and then parse away as usual.

However, Caudill admitted, most users of Adobe Acrobat don't know they can do this. "It is quite easy to do, but most people aren't aware of this capability," he said.
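As a rough illustration of the "parse away as usual" step Caudill describes, the short Python sketch below reads a small XML dataset with the standard library. It assumes the XML has already been pulled out of the PDF with some PDF tool, a step not shown here. The element names and values are hypothetical, invented purely for illustration; they are not taken from any actual government file or Adobe schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML standing in for a dataset already extracted from a PDF.
# The element names and values are invented for illustration only.
SAMPLE_XML = """\
<requests>
  <request>
    <member>Example Member A</member>
    <amount>250000</amount>
  </request>
  <request>
    <member>Example Member B</member>
    <amount>100000</amount>
  </request>
</requests>
"""


def parse_requests(xml_text):
    """Return a list of (member, amount) tuples found in the XML text."""
    root = ET.fromstring(xml_text)
    rows = []
    for request in root.findall("request"):
        member = request.findtext("member")
        amount = int(request.findtext("amount"))
        rows.append((member, amount))
    return rows


rows = parse_requests(SAMPLE_XML)
total = sum(amount for _, amount in rows)
print(rows)
print(total)
```

Once the data is in this form, totals, filters and comparisons take a few lines of code. The same figures locked in a page layout would first have to be recovered by text extraction or retyping.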

He also noted that, contrary to Johnson's assertion that PDF is a proprietary format, it actually is a standard controlled not by Adobe, but by the International Organization for Standardization (ISO). "The process of development now belongs to them," Caudill said.

"It is easy to oversimplify the technology choices. There's this perception that there is this false choice between providing openness for people and openness for machines," added Robert Pinkerton, director of government solutions for Adobe. "You need to do both."

Pinkerton pointed to Adobe's one-day conference Nov. 5 on open government, at which agencies could explore these issues further.

Johnson also took aim at Adobe Flash, or rather the users of Flash, noting that in many cases government agencies assume that rendering data into a visually appealing format is the best way to achieve transparency. Instead of pie charts and dashboards, agencies should concentrate on providing the data in easily parsable formats, he said.

Reader Comments

Mon, Nov 16, 2009 Bruce W. Fowler, Ph.D.

PDF Archive, which is in part a USG standard, alleviates almost all of these problems. To the best of my knowledge its use on documents is policy. This indicates that the question should perhaps be why are so few documents being generated in accordance with policy?

Tue, Nov 10, 2009

Check out today's Tech Tuesday on WAMU radio: http://thekojonnamdishow.org/shows/2009-11-10/obstacles-open-government. Great discussion between Sunlight, Adobe and others on Open Government.

Mon, Nov 9, 2009 Fred Wagner

I always thought PDF was an acronym for Published Document Format - designed so you could view the information as the author intended, but so you could NOT alter it to republish it any other way. It's a way to view a report or document without having to own the software that created it. If you wanted to extract the content of a PDF for re-use, you needed special software, or at least Acrobat Standard or Professional, so you could save it in an editable format for your own purposes. Those who are complaining seem to relish complaining, not understanding the situation, which is completely solvable, without lawsuits.

Fri, Nov 6, 2009

There are several ways of creating PDF documents. If text and images are scanned and stored as image files, it is very problematic to convert them into Word or other such documents. Also, electronic documents (Word, Excel) can be structured in a way that a reader cannot save, copy or even print such documents. I have come across documents which are a mix of all three. So how is a typical user of such a document able to use the information for research or other work?

Fri, Nov 6, 2009

"Government users of Adobe Acrobat software are minimally, if not in-, competent in their jobs."...Let's be honest now, most people in the general population do not use Adobe Acrobat for parseable XML data. That's just not how the program is presented to people. It offends me as a government worker that this person can make such offensive and generalizing statements here over something as small and understandable as people not using the PDF format to its full potential. His last sentence betrays his political agenda: he's really here to complain about public health care. People should feel free to express their opinion, but when a guy starts his post with "Why did I even waste time reading this article" and isn't even here to talk about it so much as insult people, I have to wonder why the editor let that through.
