ISO: Open-source tools to liberate data from PDFs

The government creates mountains of information, much of it in PDF form, yet agencies can find it difficult to extract and work with that data. One group hopes to crack open PDFs with a community-developed, open-source solution.

PDFs have many advantages: They’re easy to create, they can compress large files, and they can be secured by adding watermarks, encrypting data or using passwords. Because they are "read only" documents that cannot be altered without leaving an electronic footprint, they meet legal requirements in a court of law. Another benefit is that a PDF looks the same no matter what computer or mobile device it is viewed on, making the format ideal for forms such as IRS Form 990s. And Adobe Acrobat Reader, the viewing tool, is free and widely available, which makes PDFs easy to share.

The downside is also well known: PDF data, particularly information in images, cannot be easily extracted, analyzed and modeled. 

Options available to large organizations include purchasing an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from PDFs into a database. Individual users can use online tools or applications to convert PDFs to Word or Excel files.
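To make the Extract-Transform-Load idea concrete, here is a minimal Python sketch of the transform and load steps only. It assumes the extract step has already happened, so that some PDF tool has produced plain text. The sample text, the organization names and figures, the row pattern and the `filings` table are all hypothetical and chosen purely for illustration; real PDF output is usually far messier.

```python
import re
import sqlite3

# Hypothetical text, as a PDF text extractor might emit it for a small table.
# The organization names and figures are made up for illustration.
EXTRACTED_TEXT = """\
Organization          Year   Revenue
Example Fund A        2012   1,250,000
Example Fund B        2012     310,500
Page 1 of 1
"""

# A data row: a name, at least two spaces, a four-digit year, then an amount.
ROW_PATTERN = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<year>\d{4})\s+(?P<revenue>[\d,]+)$"
)


def transform(text):
    """Turn extracted text lines into (name, year, revenue) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # skip headers, page footers and blank lines
        rows.append((
            match.group("name"),
            int(match.group("year")),
            int(match.group("revenue").replace(",", "")),
        ))
    return rows


def load(rows, connection):
    """Load the rows into a SQLite table named filings (assumed schema)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS filings "
        "(name TEXT, year INTEGER, revenue INTEGER)"
    )
    connection.executemany("INSERT INTO filings VALUES (?, ?, ?)", rows)
    connection.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(transform(EXTRACTED_TEXT), conn)
    for row in conn.execute("SELECT name, year, revenue FROM filings"):
        print(row)
```

The sketch leans on a fixed column layout, which is exactly what tends to break on real documents; that fragility is part of why dedicated extraction tools exist.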

There are, however, only a handful of open-source PDF conversion tools. 

In an effort to expand open-source PDF conversion options, the Sunlight Foundation, a nonprofit founded in 2006 to encourage greater government openness and transparency, is hosting what it calls the PDF Liberation Hackathon, dedicated to improving open-source tools for PDF extraction. The hackathon will run January 17-19, 2014, at Sunlight offices in Washington, D.C., and San Francisco, with teams also taking part from around the world.

The hackathon will encourage developers to build upon existing open-source PDF extraction solutions by creating additional features, extensions and plugins to the software to make PDFs more flexible and useful. In addition, hackers will have the option of using licensed PDF software libraries as long as the implementation cost of these libraries is less than $1,000. 

Teams can participate in person at the Washington and San Francisco locations or remotely from anywhere in the world.

Solutions will be judged on creativity, implementation cost, flexibility and user friendliness. Winning entries will be awarded prizes of up to $500, and those that are 100 percent open source will be featured on Sunlight’s API Community portal page. To receive a prize, a team must publish its source code in a public GitHub repository.

Hackathons have become a popular way for government and other entities to crowdsource innovative solutions at a low cost. NASA’s International Space Apps Challenge, designed to contribute to space exploration missions and help improve life on Earth, was billed as the largest hackathon ever, with 9,000 entrants. Last month Atlanta hosted Govathon, a hackathon to create mobile apps and interactive websites to address municipal government challenges, and LinkedIn hosted a hackathon for immigration reform in Silicon Valley.

About the Author

Kathleen Hickey is a freelance writer for GCN.

