Energy lab's Piranha puts teeth into text analysis
- By Rutrell Yasin
- Nov 29, 2012
The Energy Department’s Oak Ridge National Laboratory has pioneered a new approach to text analytics that uses software agents distributed over very large computer clusters. The agents can quickly filter through large volumes of documents, show relationships between them and present relevant information to business and government analysts.
The software, called Piranha, is designed to overcome the challenges most people face when they try to derive accurate and relevant information from large amounts of data on their computers. Piranha works faster than traditional approaches, clustering massive amounts of textual information in relatively short amounts of time because of the scalability of its agent architecture, ORNL officials said.
ORNL’s Computational Data Analytics Group has been working on the system for close to nine years, said Thomas Potok, senior scientist and group leader. “We are able to take pretty large collections of text, go through and group them, cluster them and show people things of interest and significance,” Potok said.
Text analytics has been attracting the attention of agencies that deal with large amounts of unstructured data. Examples include NASA’s analysis of airline safety reports and a Homeland Security Department-funded bio-preparedness collective.
Typical users of Piranha might be law enforcement or military analysts, health care workers or anyone who has a large collection of text documents and needs help figuring out what they have, Potok said.
At one time, researchers or investigators might have had a hundred documents to read, going through each one by one on a computer. Now researchers might have to find patterns among millions of documents. ORNL, in fact, is working with a law enforcement agency, helping investigators sift through millions of documents.
So far, ORNL has licensed Piranha to two companies, Pro2Serve and TextOre, Potok said. Pro2Serve, a Knoxville, Tenn.-based provider of technical and engineering services for critical infrastructure protection, will incorporate the software in the services it offers government agencies. TextOre, based in Fairfax, Va., is incorporating Piranha into the company’s suite of business analytical software and services to help analyze text data with greater speed and accuracy.
Analysts using Piranha can select a document and quickly find other documents that are a close match. If they select an e-mail message of interest, clustering allows them to quickly find similar e-mails on other computers, thus potentially establishing a link.
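The idea of selecting one document and finding its closest matches can be illustrated with a small sketch. This is not Piranha's code or its agent architecture, neither of which the article describes. It assumes a common stand-in technique, TF-IDF weighting with cosine similarity, and the function names (tokenize, tfidf_vectors, cosine, most_similar) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed inverse document frequency keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, query_index, top_k=3):
    """Return (index, score) pairs for the documents closest to the selected one."""
    vectors = tfidf_vectors(docs)
    scores = [
        (i, cosine(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

Given a list of e-mail bodies, a call such as most_similar(emails, 0) would rank the other messages by how closely their wording matches the first one. A production system would distribute this work across many machines, which this sketch does not attempt.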
Piranha also lets analysts perform document sampling. A set of documents typically will contain common themes or topics. Representative themes from these documents can be quickly found. A hard drive may store thousands of documents across many different topics, from finances to favorite restaurants. Ten or 20 representative documents from these themes can be found and used by an analyst to determine what they mean.
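Document sampling of this kind can be sketched as follows. The article does not say how Piranha picks representative documents, so the approach here is an assumption: a simple greedy "leader" clustering over Jaccard word overlap, where the first document of each group serves as its representative. The function names and the 0.2 threshold are hypothetical.

```python
import re


def word_set(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two word sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def sample_representatives(docs, threshold=0.2):
    """Group documents greedily and return {representative_index: member_indices}.

    Each document joins the most similar existing representative if the
    overlap is at least `threshold`; otherwise it starts a new group.
    """
    sets = [word_set(d) for d in docs]
    groups = {}
    for i, words in enumerate(sets):
        best, best_score = None, threshold
        for leader in groups:
            score = jaccard(words, sets[leader])
            if score >= best_score:
                best, best_score = leader, score
        if best is None:
            groups[i] = [i]  # this document represents a new theme
        else:
            groups[best].append(i)
    return groups
```

The keys of the returned dictionary are the indices of the sample an analyst would read first, and the values show which documents each one stands for. The result depends on document order and on the threshold, so this is a rough illustration rather than a tuned method.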
Piranha has a “recommender” capability that lets users filter documents that are related to the subject they are researching. These documents can form the basis for searching for related documents and can help reduce the number of documents the user has to sift through, Potok said.
Then analysts can begin to determine how the documents are related. “If I have to put these documents into folders and group them, how would they be grouped?” Users can start looking at the entities and words within the documents for connections. “What we do is go from millions of documents down to very relevant case information or intelligence and say, ‘This is what I can act on immediately,’” Potok said. “This whole process now takes a matter of days instead of months, which is typical.”