What it takes to review 650,000 emails

On Oct. 28, FBI Director James Comey told Congress that bureau investigators had found and would analyze additional emails that may have relevance to the investigation into Hillary Clinton’s private email server. By Nov. 6 the FBI had concluded that the newly discovered emails did not affect the agency’s original conclusion that Clinton committed no criminal wrongdoing.

At a campaign stop in Michigan later that day, presidential candidate Donald Trump said the FBI’s expedited analysis simply isn’t possible: “You can’t review 650,000 new emails in eight days,” he said. “You can’t do it, folks.”

But it is possible, and researchers, lawyers and cyber-forensic experts have been doing it for years. Just look at the Enron case, said Ben Shneiderman, a computer science professor at the University of Maryland. There were more than a million emails released to the public during the Enron investigation in 2003. Since then, that database of emails has been used by researchers to study how people use email, he said.

“The capacity for people to explore and visualize these kinds of datasets is a great success story of the research field,” he said.

So when the FBI was asked to look into these emails, it wasn’t being asked to do anything revolutionary. It’s a fairly standard cyber-forensic skill, according to Mark Lanterman, the CTO of Computer Forensic Services and former senior computer forensic analyst for the U.S. Secret Service Electronic Crimes Task Force.

The first myth that needs to be dispelled, Lanterman said, is that the software used by the FBI is “special.”

“I saw in the media when the story first broke -- a number of references to ‘special software’ that the FBI is using to do this,” he said. “They just use commercially available software like just about anyone else.”

Based on his prior experience with federal law enforcement, Lanterman said the FBI would likely have used EnCase, Forensic Toolkit or dtSearch software to help analyze the email data.

In any forensic case, the investigators first create a perfect copy, or forensic image, of the hard drive. The software then makes a searchable index of everything on the hard drive, he said.
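To illustrate what a “searchable index” means in practice, here is a minimal sketch of an inverted index, which maps each word to the messages that contain it. It is an illustration only, not a description of how EnCase, Forensic Toolkit or dtSearch work internally; the function names, message IDs and sample text are all hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample messages, standing in for text pulled from a drive image.
documents = {
    "msg-001": "Quarterly budget attached for review",
    "msg-002": "Lunch on Friday?",
    "msg-003": "Budget review moved to Friday",
}

index = build_index(documents)
print(sorted(search(index, "budget review")))
```

Because the index is built once, each later search is a lookup rather than a fresh scan of every message, which is part of why large collections can be queried quickly.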

Using keywords and timelines, the software can filter the dataset based on a number of criteria. The software will also show duplicates, which it finds using an identifier known as a hash, a digital fingerprint computed from a file’s contents.
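The two steps described here, dropping duplicates by hash and filtering by keyword and date, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the article does not say which hash function or which message fields the commercial tools use, so SHA-256 over sender, subject and body is a choice made for this example, and the field names and sample messages are hypothetical.

```python
import hashlib
from datetime import date


def content_hash(message):
    """Fingerprint a message from its sender, subject and body (assumed fields)."""
    joined = "\n".join([message["sender"], message["subject"], message["body"]])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep the first copy of each message, judged by its content hash."""
    seen = set()
    unique = []
    for message in messages:
        digest = content_hash(message)
        if digest not in seen:
            seen.add(digest)
            unique.append(message)
    return unique


def filter_messages(messages, keyword, start, end):
    """Keep messages sent within [start, end] that mention the keyword."""
    keyword = keyword.lower()
    return [
        m for m in messages
        if start <= m["sent"] <= end
        and (keyword in m["subject"].lower() or keyword in m["body"].lower())
    ]


# Hypothetical sample data: the second message is a copy of the first.
messages = [
    {"sender": "a@example.com", "subject": "Budget", "body": "Draft attached.",
     "sent": date(2016, 3, 1)},
    {"sender": "a@example.com", "subject": "Budget", "body": "Draft attached.",
     "sent": date(2016, 3, 1)},
    {"sender": "b@example.com", "subject": "Lunch", "body": "Friday works.",
     "sent": date(2016, 5, 9)},
]

unique = deduplicate(messages)
matches = filter_messages(unique, "budget", date(2016, 1, 1), date(2016, 4, 1))
print(len(messages), len(unique), len(matches))
```

Deduplicating first shrinks the pile before any filtering or human reading happens, so a large share of a collection made up of copies never needs a second look.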

Depending on the state of the email, it’s not surprising that it took eight days, Lanterman said. It’s common for these cases to take longer, and he said he was surprised it didn’t take at least another week.

Shneiderman also noted that the FBI was likely not just working on searching databases and deduplicating data.

“Looking through it is one thing,” he said. “The political decision of deciding what’s important and the legal decision of deciding what is potentially a violation of law, potentially took longer.”

The FBI declined to comment for this article, instead referring to Comey’s letter.

About the Author

Matt Leonard is a former reporter for GCN.
