ISO: Open-source tools to liberate data from PDFs

The government creates mountains of information, much of it in PDF form, yet agencies can find it difficult to extract and work with that data. One group hopes to crack open PDFs with a community-developed, open-source solution.

PDFs have many advantages: They’re easy to create, they can compress large files, and they can be secured by adding watermarks, encrypting data or using passwords. Because they are "read only" documents that cannot be altered without leaving an electronic footprint, they meet legal requirements in a court of law. Another benefit is that a PDF looks the same no matter what computer or mobile device it is viewed on, making the format ideal for forms such as the IRS Form 990. And Adobe Acrobat Reader, the viewing tool, is free and easily available, which makes PDFs easy to share.

The downside is also well known: PDF data, particularly information in images, cannot be easily extracted, analyzed and modeled. 

Options available to large organizations include purchasing an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from PDFs into a database. Individual users can use online tools or applications to convert PDFs to Word or Excel files.

There are, however, only a handful of open-source PDF conversion tools. 

In an effort to expand open-source PDF conversion options, the Sunlight Foundation, a nonprofit founded in 2006 to encourage greater government openness and transparency, is hosting what it calls the PDF Liberation Hackathon, dedicated to improving open-source tools for PDF extraction. The hackathon will run Jan. 17-19, 2014, at Sunlight offices in Washington, D.C., and San Francisco, with participants also joining from around the world.

The hackathon will encourage developers to build upon existing open-source PDF extraction solutions by creating additional features, extensions and plugins that make the data in PDFs easier to extract and reuse. Hackers will also have the option of using licensed PDF software libraries, as long as the implementation cost of these libraries is less than $1,000.

Teams are not limited to the Washington and San Francisco locations: They can participate from anywhere in the world, either in person or remotely.

Solutions will be judged on creativity, implementation cost, flexibility and user friendliness. Winning entries will be awarded prizes of up to $500, and those that are 100 percent open source will be featured on Sunlight’s API Community portal page. To receive a prize, a team must publish its source code in a public GitHub repository.

Hackathons have become a popular way for government and other entities to crowdsource innovative solutions at a low cost. NASA’s International Space Apps Challenge, designed to contribute to space exploration missions and help improve life on Earth, was billed as the largest hackathon ever, with 9,000 entrants. Last month Atlanta hosted Govathon, a hackathon to create mobile apps and interactive websites to address municipal government challenges, and LinkedIn hosted a hackathon for immigration reform in Silicon Valley.

About the Author

Kathleen Hickey is a freelance writer for GCN.