Pittsburgh's Data Rivers project speeds data integration and cleanup and improves troubleshooting of the city's data pipelines.
To improve the flow of data in Pittsburgh, the city’s Open Data program is getting ready to launch Data Rivers, an upgraded data delivery system that speeds data integration and cleanup and improves troubleshooting of the city's data pipelines.
When the city “flips the switch” in a few weeks, Data Rivers will launch first with 311 data, the most commonly asked-for information, said Tara Matthews, senior digital services analyst at Pittsburgh’s Department of Innovation and Performance.
The new system connects to the application programming interface the 311 center uses to manage its call intake, said Nick Hall, the city’s digital services manager who has overseen the yearlong project. It reads the raw data in and cleans it -- standardizing date and address formats, for instance. Finally, the records are stripped of personally identifying information to create "safe data" that can be published.
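As a rough illustration of the cleaning and "safe data" steps Hall describes, here is a minimal Python sketch. The field names (`created_on`, `address`, `caller_name` and so on), the list of accepted date formats and the helper functions are all hypothetical; the article does not describe the actual 311 record layout or the city's code.

```python
from datetime import datetime

# Hypothetical PII field names; the real 311 schema is not given in the article.
PII_FIELDS = {"caller_name", "caller_phone", "caller_email"}

# Assumed input date formats that a raw feed might mix together.
DATE_FORMATS = ("%m/%d/%Y %H:%M", "%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")


def standardize_date(raw):
    """Return an ISO 8601 timestamp, trying each known input format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def standardize_address(raw):
    """Collapse repeated whitespace and upper-case the address."""
    return " ".join(raw.split()).upper()


def to_safe_record(record):
    """Drop PII fields, then normalize the date and address fields."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    cleaned["created_on"] = standardize_date(cleaned["created_on"])
    cleaned["address"] = standardize_address(cleaned["address"])
    return cleaned
```

Calling `to_safe_record` on a raw record returns a copy without the caller fields and with the date and address in one consistent format, which is the shape of data that could then be published.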
Data Rivers -- whose name refers to the city’s three rivers and to the idea of flowing data vs. a data lake -- also standardizes city data, streamlining and speeding the processes for city analysts.
“The data that we have in the city comes in a bunch of different formats -- a database, an API or a spreadsheet sitting on somebody’s desktop -- but the issue is … evaluating what was needed to actually get that data to the data center,” Matthews said.
The original extract, transform and load (ETL) system was basic. It enabled the Open Data program to host datasets from a variety of systems on the Western Pennsylvania Regional Data Center, a collaboration among the city, Allegheny County and the University of Pittsburgh.
In 2015, the city's “first priority was to make sure that we could get datasets onto the data center as quickly as possible,” Matthews said. “We didn’t really want to drag our feet with publishing things, so we focused on a method that would get our data online quickly.” That resulted in what she described as a data delivery model “built out of duct tape and bubble gum” in a recent blog post.
The “basic guts” of the original open data system were a set of scripts that ran on a schedule to take in the data and send it to the Western Pennsylvania Regional Data Center, Hall said. Data Rivers “adds a bunch of structural design decisions that make it easier for people to maintain the pipelines and much more difficult for things to go wrong.”
The city adopted Apache Kafka, an open-source, distributed event streaming platform that stores data in an immutable log, and added tools on top of it, he said. First was a user interface or a developer environment for creating new data pipelines -- in other words, a way for someone to configure the chain of events that starts with pulling data out of an SQL database or a vendor’s API, cleaning it and applying administrative business requirements such as implementing privacy rules for removing personally identifiable information.
The second capability Data Rivers needed was data validation, Hall said. For help, he turned to Confluent, a company that supports and expands on Kafka. “They have something called the Schema Registry, which allows you to define schemas that describe the data and store them in a centralized registry,” he said. It automatically “checks data as it comes in against predefined schema that will throw a flag if, say, something about the format of the data has changed or if the source of the data has had an outage,” he said. Ultimately, it allows for automated notifications to be sent when issues in the data systems are detected.
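The idea of checking incoming records against a predefined schema and raising a flag on a format change or an apparent outage can be sketched in a few lines of plain Python. This is not the Confluent Schema Registry API and not the city's implementation; the schema, the field names and the `notify` callback are hypothetical stand-ins used only to show the pattern.

```python
# Hypothetical schema: field name -> expected Python type.
REQUEST_SCHEMA = {"request_id": int, "request_type": str, "created_on": str}


def validate(record, schema):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def check_batch(records, schema, notify):
    """Validate a batch, calling notify() for each flagged condition.

    An empty batch is treated as a possible source outage (an assumption
    made for this sketch). Returns only the records that match the schema.
    """
    if not records:
        notify("no records received; the source may have had an outage")
        return []
    valid = []
    for record in records:
        problems = validate(record, schema)
        if problems:
            notify(f"schema mismatch: {problems}")
        else:
            valid.append(record)
    return valid
```

In this sketch `notify` can be any callable, such as a function that sends an email or posts to a chat channel, which mirrors the automated notifications described above.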
Once Data Rivers launches, Matthews will turn her attention to getting more high-level datasets up and running, while Hall moves on to building consumer-facing applications.
“Ultimately this is a product that can serve analysts working with all of the different departments within the city, [and save] those analysts hundreds of hours over the course of a year on integrations and cleanup and dealing with an outage in one of the data pipelines,” Hall said of Data Rivers. “As we’re able to publish data more effectively, these tools become not only something that can serve the public but can serve users internally.”