ISO: Open-source tools to liberate data from PDFs

The government creates mountains of information, much of it in PDF form, yet agencies can find it difficult to extract and work with that data. One group hopes to crack open PDFs with a community-developed, open-source solution.

PDFs have many advantages: They’re easy to create, they can compress large files, and they can be secured by adding watermarks, encrypting data or using passwords. Because they are "read only" documents that cannot be altered without leaving an electronic footprint, they meet legal requirements in a court of law. Another benefit is that a PDF looks the same no matter what computer or mobile device it is viewed on, making the format ideal for forms such as IRS Form 990. And Adobe Acrobat Reader, the viewing tool, is free and easily available, making it easy to share PDFs.

The downside is also well known: PDF data, particularly information in images, cannot be easily extracted, analyzed and modeled. 

Options available to large organizations include purchasing an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from PDFs into a database. Individual users can use online tools or applications to convert PDFs to Word or Excel files.

There are, however, only a handful of open-source PDF conversion tools. 

In an effort to expand open-source PDF conversion options, the Sunlight Foundation, a nonprofit founded in 2006 to encourage greater government openness and transparency, is hosting what it calls the PDF Liberation Hackathon, dedicated to improving open-source tools for PDF extraction. The hackathon will run January 17-19, 2014, at Sunlight offices in Washington, D.C., and San Francisco, and at other locations around the world.

The hackathon will encourage developers to build upon existing open-source PDF extraction solutions by creating additional features, extensions and plugins to the software to make PDFs more flexible and useful. In addition, hackers will have the option of using licensed PDF software libraries as long as the implementation cost of these libraries is less than $1,000. 

Beyond the Washington and San Francisco locations, teams can participate, in person or remotely, from anywhere in the world.

Solutions will be judged on creativity, implementation cost, flexibility and user friendliness. Winning entries will be awarded prizes of up to $500, and those that are 100 percent open source will be featured on Sunlight’s API Community portal page. To receive a prize, a team must publish its source code in a public GitHub repository.

Hackathons have become a popular way for government and other entities to crowdsource innovative solutions at a low cost. NASA’s International Space Apps Challenge, designed to contribute to space exploration missions and help improve life on Earth, was billed as the largest hackathon ever, with 9,000 entrants. Last month Atlanta hosted Govathon, a hackathon to create mobile apps and interactive websites to address municipal government challenges, and LinkedIn hosted a hackathon for immigration reform in Silicon Valley.

About the Author

Kathleen Hickey is a freelance writer for GCN.

