OASIS ratifies open standard for accessing unstructured information
- By William Jackson
- Mar 20, 2009
A standard for accessing the unstructured data that makes up an estimated 80 percent of data generated by enterprises has been ratified by the Organization for the Advancement of Structured Information Standards (OASIS).
The Unstructured Information Management Architecture (UIMA) was developed by IBM Corp., which incorporates it in a number of products. The company has made the architecture available as an open-source technology to encourage development of interoperable analytics tools for unstructured text documents, including e-mails, blog entries, news feeds and notes, in addition to audio recordings, images and video.
According to OASIS, unstructured data of this kind is the largest, most current and fastest-growing category of data, accounting for 80 percent of the information that enterprises generate.
IBM has made the architecture an open-source project of the Apache Software Foundation, which hosts an incubator project for UIMA-based software. UIMA was developed under the Royalty Free on Limited Terms model of the OASIS Intellectual Property Rights Policy.
UIMA allows data to be broken down and identified by specific components such as language, sentences, and person and place names. “Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files,” Apache said. “The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.”
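The component model Apache describes can be illustrated with a small sketch. This is not the UIMA API: the class names, the `metadata` dictionaries (standing in for the XML descriptor files) and the tiny name lookup table are all hypothetical, and Python is used here only for brevity, although UIMA components themselves are written in Java or C++. It shows only the general idea of components that share one interface, describe themselves, and pass a common data structure along a pipeline.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # label such as "Sentence", "Person" or "Place"
    begin: int  # character offset where the span starts
    end: int    # character offset just past the span


@dataclass
class Document:
    """Shared structure that flows between components (hypothetical)."""
    text: str
    annotations: list = field(default_factory=list)

    def covered(self, ann):
        # Return the text that an annotation spans.
        return self.text[ann.begin:ann.end]


class SentenceAnnotator:
    # A plain dict stands in for a self-describing descriptor.
    metadata = {"name": "SentenceAnnotator", "outputs": ["Sentence"]}

    def process(self, doc):
        # Naive split: a run starting at a non-space character and
        # ending at the next '.', '!' or '?' (or at the end of the text).
        for m in re.finditer(r"\S[^.!?]*[.!?]?", doc.text):
            doc.annotations.append(Annotation("Sentence", m.start(), m.end()))


class NameAnnotator:
    metadata = {"name": "NameAnnotator", "outputs": ["Person", "Place"]}

    # Hypothetical lookup table; a real analysis engine would be far richer.
    KNOWN_NAMES = {"Ada": "Person", "Paris": "Place"}

    def process(self, doc):
        for m in re.finditer(r"\b[A-Za-z]+\b", doc.text):
            kind = self.KNOWN_NAMES.get(m.group())
            if kind is not None:
                doc.annotations.append(Annotation(kind, m.start(), m.end()))


def run_pipeline(components, text):
    """Play the framework's role: hand the same document to each component."""
    doc = Document(text)
    for component in components:
        component.process(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline(
        [SentenceAnnotator(), NameAnnotator()],
        "Ada met Bob. They talked in Paris.",
    )
    for ann in result.annotations:
        print(ann.kind, repr(result.covered(ann)))
```

Because every component exposes the same `process` method and adds to the same document, a new analysis step can be slotted into the list without changing the others.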
UIMA also allows applications to wrap components as network services. These can scale to large volumes by replicating processing pipelines over a cluster of networked nodes.
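The scaling idea, running identical copies of a pipeline side by side, can be sketched in a few lines. This example makes strong simplifying assumptions: local threads stand in for networked nodes, the `analyze` function is a hypothetical placeholder for a full pipeline, and nothing here reflects how UIMA itself distributes work.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(text):
    """Hypothetical stand-in for one complete analysis pipeline."""
    return len(text.split())  # here: simply count the words


def analyze_all(texts, replicas=3):
    """Spread documents over several identical pipeline replicas.

    Each worker thread runs the same analyze() function; map() returns
    the results in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        return list(pool.map(analyze, texts))


if __name__ == "__main__":
    documents = ["an e-mail message", "a short blog entry here", ""]
    print(analyze_all(documents))
```

Adding capacity in this model means raising the number of replicas rather than changing the pipeline.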
“The approval of UIMA as an OASIS standard represents a significant milestone in the areas of semantic analysis and search,” said David Ferrucci of IBM, who chairs the OASIS UIMA Technical Committee, which oversaw the development of the standard. “UIMA enables interoperability among a variety of application-specific analysis engines allowing the capture of a broad range of knowledge from unstructured sources.”
Results can be used by tools such as search engines, databases and knowledge bases. IBM already incorporates the architecture in a number of products, including its eDiscovery Analyzer, Content Analyzer, OmniFind Enterprise Edition and InfoSphere Warehouse.
Ratification as a standard required that the architecture be used successfully in practice. Organizations that have done so include IBM, Amsoft, Carnegie Mellon University, Thomson Reuters and the University of Tokyo.
William Jackson is a freelance writer and the author of the CyberEye blog.