Researchers at Indiana University are working on knowledge graphs that use semantic relatedness between concepts to determine the likely truth of a statement.
Computers are very good at digesting massive amounts of data, from Facebook feeds to NASA satellite data streams to medical records and insurance claims. What they’re not good at is determining the accuracy of all that data.
Not that artificial intelligence designers aren’t trying. But building inference engines that can perform logical operations on new, unanticipated data sets is incredibly difficult.
Researchers at Indiana University decided to try a different approach to the problem. Instead of trying to build complex logic into a program, researchers proposed something simpler: Why not try to measure the likelihood that a statement is true by analyzing the proximity of its terms and the specificity of its connectors?
OK, I admit that this approach is not intuitive to non-mathematically inclined humans. But that's what Giovanni Luca Ciampaglia, a postdoctoral fellow at Indiana University Bloomington's School of Informatics and Computing, and his colleagues are researching, and it seems to work.
The first step is to create a knowledge graph for a given data set. When assessing the likely truth or falsity of a statement, the algorithms developed by Ciampaglia measure the number of steps required to get from the beginning data point in the statement to the end data point. Also, the algorithm factors in the “generic” quality of nodes that link the data points.
“Let’s say you want to check the statement, ‘Barack Obama is a socialist,’” said Ciampaglia. “Barack Obama is a person, and Joseph Stalin was also a person and was a communist, which is a special brand of socialist. So I found a connection there.” Of course, Ciampaglia said, “nobody would really buy this kind of argument, because you have to go to the concept of ‘human,’ and that concept is very general, because there are 6 billion humans.”
“Our algorithm isn’t really doing logic,” Ciampaglia explained. “It’s a measure of semantic relatedness. It’s about finding the shortest path between two data points.”
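The idea of scoring a claim by the shortest path between its endpoints, penalized by how generic the intermediate nodes are, can be sketched in a few lines of Python. This is a hedged illustration, not the team's actual implementation: the toy graph, the `truth_score` function and its scoring formula (path cost as the summed log-degree of intermediate nodes, mapped into (0, 1]) are my own simplifications of the approach described above.

```python
import math
from heapq import heappush, heappop

# Toy undirected knowledge graph: concept -> set of neighboring concepts.
# In the real system this would be millions of Wikipedia-derived links.
GRAPH = {
    "Barack Obama": {"person", "Democratic Party"},
    "Joseph Stalin": {"person", "communism"},
    "communism": {"Joseph Stalin", "socialism"},
    "socialism": {"communism"},
    "person": {"Barack Obama", "Joseph Stalin"},
    "Democratic Party": {"Barack Obama"},
}

def truth_score(graph, start, end):
    """Score a statement linking `start` and `end`.

    Dijkstra's algorithm, where passing *through* a node costs
    log(degree). Generic hubs like "person" have high degree, so
    paths routed through them are penalized. The cheapest path cost
    c is mapped to a score 1 / (1 + c) in (0, 1].
    """
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, node = heappop(heap)
        if node == end:
            return 1.0 / (1.0 + cost)
        if cost > dist.get(node, math.inf):
            continue  # stale heap entry
        for nbr in graph.get(node, ()):
            # The endpoint itself is free; intermediate nodes cost log(degree).
            step = 0.0 if nbr == end else math.log(len(graph[nbr]))
            new_cost = cost + step
            if new_cost < dist.get(nbr, math.inf):
                dist[nbr] = new_cost
                heappush(heap, (new_cost, nbr))
    return 0.0  # no connecting path: the statement finds no support

# A direct link scores 1.0; the roundabout path through the generic
# "person" node to "socialism" scores much lower.
print(truth_score(GRAPH, "Barack Obama", "Democratic Party"))
print(truth_score(GRAPH, "Barack Obama", "socialism"))
```

In this toy graph, "Barack Obama is a socialist" only connects via the generic "person" node and Joseph Stalin, so its score is dragged down, mirroring Ciampaglia's point that arguments routed through concepts as broad as "human" shouldn't count for much.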
One example traces the semantic steps between "Barack Obama" and "Islam." The sheer number of steps and the references to Canadians indicate that President Obama and Islam are not closely related concepts, making the statement unlikely to be true.
To test the algorithm, Ciampaglia’s team built a knowledge graph of Wikipedia data with 3 million concepts and 23 million links. The team’s automated fact-checking system – which assigned “truth scores” to statements being tested – consistently matched the results provided by human fact checkers.
Ciampaglia said his work is still a long way from being a generally useful product. For starters, he said, a fact-checking tool based on his algorithm would first need to be able to digest and understand the text in the data set. And while natural language processing is a hot area of research, the tools are not yet up to the job of managing large data sets accurately.
“It will not be tomorrow or next summer,” he said. “But we will eventually get there. All these technologies are coming together, and what we did provides a direction for this last step.”
Ciampaglia said the eventual product will be something like a spellchecker or a grammar checker: it would alert the user to a likely falsehood in a statement when there is too much distance between nodes or the connections are too generic.
It could be used for fraud detection and in screening data streamed during a disaster. “In a disaster or crisis, you want to get out only correct information,” he said.