Scraping actionable intelligence from Word docs
- By Matt Leonard
- Jun 07, 2017
Kevin Roney, chief strategic analyst at the Department of Homeland Security’s Science & Technology Directorate, started working with Immigration and Customs Enforcement about a year ago on ways to optimize the layout of relocation centers where detained immigrants await their hearings. He knew ICE would be providing some historical data on the demographics of recent detainees. What he got, however, was not what he expected: about 1,500 Word documents.
“What are we going to do with Word documents?” Roney recalled thinking.
These files were daily reports listing the immigrants’ country of origin, age, gender, health information and security classification, but the Word format made extracting insights difficult.
“I knew the only way we were going to do this was we were actually going to have to write some code,” Roney told GCN at the June 6 Qlik Federal Summit.
A custom Java program helped analyze the text and extract the relevant data from the Word documents. The team then used MATLAB to write a predictive simulation that, based on the population’s characteristics, could tell ICE how to set up its facilities most efficiently -- how many beds to have and where to put walls, for example -- while meeting detention safety and security standards.
Dan Tangherlini, president of SeamlessGov Federal and former administrator at the General Services Administration, said this was an example of taking useless data and pulling actionable information out of it.
“There is a difference between data and information,” Tangherlini said, speaking on the same panel as Roney. “They had data on a Word document, but his team had to sort through 1,500 Word documents in order to provide some analysis and yield some information that decision-makers could use.”
Every step of the data analysis process -- from gathering the data to visualizing it -- must be constantly improved, Tangherlini said, or an organization can end up with bad business intelligence.
Matt Leonard is a reporter/producer at GCN.
Before joining GCN, Leonard worked as a local reporter for The Smithfield Times in southeastern Virginia. In his time there he wrote about town council meetings, local crime and what to do if a beaver dam floods your back yard. Over the last few years, he has spent time at The Commonwealth Times, The Denver Post and WTVR-CBS 6. He is a graduate of Virginia Commonwealth University, where he received the faculty award for print and online journalism.
Leonard can be contacted at email@example.com or followed on Twitter @Matt_Lnrd.