LOC amasses Twitter Archive, but where's the data mining tech?
Agencies have been making good use of social media, mining posts for early signs of earthquakes, disease outbreaks, criminal activity and other events.
But tracking what’s trending is one thing. Being able to search through terabytes of stored posts for deeper information and context is another, as the Library of Congress is discovering with its Twitter Archive.
LOC struck a deal with Twitter in April 2010 to build an archive of tweets, starting with public posts from 2006 to April 2010 and continuing with all subsequent public tweets. What started as a collection of 20 billion tweets has grown to about 170 billion, with about 500 million more coming in each day, LOC says in an update on the project published this month.
That’s a big pile of short missives, but one LOC deems important at a time when social media has become such a major channel of communication, “supplementing and in some cases supplanting letters, journals, serial publications and other sources routinely collected by research libraries,” says LOC, which has been archiving public documents of all kinds since 1800.
“Archiving and preserving outlets such as Twitter will enable future researchers access to a fuller picture of today’s cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes.”
The problem is how to enable researchers to separate the right tweets from the chaff — and Twitter has a lot of chaff. Calling it unstructured data would be an understatement, even if the use of hashtags could help in some circumstances.
LOC has received about 400 requests from researchers around the globe asking to peruse the archive on topics ranging from the rise of citizen journalism to predicting the stock market. But it has yet to approve any of them because of the difficulty of searching the archive. A single search of just the 20 billion tweets in the original archive from 2006 to 2010 could take 24 hours, which LOC says is “an inadequate situation in which to begin offering access to researchers.”
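To put that 24-hour figure in perspective, a rough back-of-the-envelope calculation helps. The sketch below uses only the numbers reported above and assumes, purely for illustration, that search time grows linearly with the number of tweets scanned. LOC has not published such a model, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures reported by LOC, as cited in this article.
ORIGINAL_TWEETS = 20_000_000_000   # 2006 to April 2010 archive
ARCHIVE_TWEETS = 170_000_000_000   # approximate current size
DAILY_NEW_TWEETS = 500_000_000     # approximate daily intake


def implied_scan_rate(tweets, hours):
    """Tweets scanned per second if `tweets` takes `hours` to search."""
    return tweets / (hours * 3600)


def full_scan_days(archive_tweets, rate_per_second):
    """Days needed to scan the archive at a constant rate (linear assumption)."""
    return archive_tweets / rate_per_second / SECONDS_PER_DAY


rate = implied_scan_rate(ORIGINAL_TWEETS, 24)
days = full_scan_days(ARCHIVE_TWEETS, rate)
backlog = days * DAILY_NEW_TWEETS  # tweets arriving while one scan runs

print(f"Implied scan rate: {rate:,.0f} tweets per second")
print(f"Full-archive scan: {days:.1f} days")
print(f"New tweets arriving during that scan: {backlog:,.0f}")
```

Under that linear assumption, one pass over the full archive would take more than a week, and billions of new tweets would arrive before it finished.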
The library has looked at distributed and parallel computing solutions designed to quickly divide a search of large data sets, but that would require a prohibitive investment in hundreds, or even thousands, of servers, LOC said. Existing commercial services that provide indexes of historic tweets operate on a small scale that wouldn’t allow the deep dive researchers would require.
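The idea behind dividing a search can be shown in miniature. The following is a minimal sketch, not a description of any system LOC evaluated: it splits a small list of made-up tweets into shards, searches each shard in its own worker, and merges the results. The function names and sample tweets are hypothetical, and the worker threads merely stand in for the separate servers a real deployment would need.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(tweets, shard_count):
    """Deal tweets round-robin into shard_count roughly equal lists."""
    return [tweets[i::shard_count] for i in range(shard_count)]


def search_shard(shard, keyword):
    """Return the tweets in one shard that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return [tweet for tweet in shard if needle in tweet.lower()]


def parallel_search(tweets, keyword, shard_count=4):
    """Search every shard concurrently, then merge the partial results."""
    shards = split_into_shards(tweets, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partials = list(pool.map(search_shard, shards, [keyword] * shard_count))
    # Sort so the merged result does not depend on how tweets were sharded.
    return sorted(match for partial in partials for match in partial)


# Made-up sample data for illustration only.
sample_tweets = [
    "Earthquake felt downtown",
    "Lunch was great",
    "Small earthquake near the coast",
    "Stock market rally today",
]

matches = parallel_search(sample_tweets, "earthquake", shard_count=2)
print(matches)
```

Each shard is searched independently, which is why the approach speeds up as machines are added, and also why the hardware bill grows with the size of the archive.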
“It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data,” LOC said. “Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity.”
For now, the library is talking to social media aggregation company Gnip about developing an interface to the archive, and is working with congressional researchers and scholars on a basic level of access that researchers could use until archival search improves, LOC said.
The library might not have to wait long. As data feeds have exploded in recent years, big data tools and analytics techniques have improved. Agencies from the Homeland Security Department to NASA, for example, have been using improved text analytics to scan social media for signs of terrorist activity and examine airline logs to improve safety. And a team at the Energy Department’s Oak Ridge National Laboratory recently unveiled Piranha, software that takes a clustering approach to speeding up search through large data sets.