NIH is seeking input on analysis methods related to compressing, formatting, mapping and visualizing data.
As biomedical research becomes more data-intensive, researchers are facing challenges working with increasingly large, complex and diverse data sets because they may lack the appropriate tools, accessibility and training.
Responding to these new challenges, the National Institutes of Health launched the trans-NIH Big Data to Knowledge (BD2K) initiative, which supports advances in data science, other quantitative sciences, policy and training that are needed for the effective use of big data in biomedical research.
As part of the BD2K initiative, NIH’s National Human Genome Research Institute is soliciting comments and ideas for the development of analysis methods and software tools to support bioinformatics and computational science. Specifically, NIH is seeking input on software and analysis methods related to data compression/reduction, data visualization, data provenance and data wrangling, according to a request for information document.
NIH wants the scientific and informatics research and user communities to identify and prioritize needs and gaps in the four areas by Sept. 6. The RFI defines the four areas and explains why they are important to researchers handling large volumes of data.
Data compression is important because it helps reduce resource usage. However, most compression techniques involve trade-offs among various factors, including the degree of compression, the amount of distortion induced and the computational resources required to compress and decompress the data, the RFI states. Data reduction aims to more dramatically reduce the data volume and, at the same time, reduce the complexity and dimensionality of data for easier analysis. It usually involves processing and/or reorganization of the information to minimize redundancy, eliminate noise and preserve signal and data integrity.
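The compression trade-off the RFI describes can be seen directly with Python's standard `zlib` module, whose compression levels trade CPU time against output size. This is a minimal sketch; the sample data is a hypothetical stand-in for a real biomedical data set.

```python
import zlib

# Hypothetical sample data: highly redundant sequence text, standing in
# for a much larger genomic or biomedical data set.
data = b"ACGTACGTACGT" * 10_000

# zlib levels 1 (fastest) through 9 (smallest output) expose the classic
# trade-off between degree of compression and computational cost.
for level in (1, 6, 9):
    compressed = zlib.compress(data, level)
    # Lossless round trip: decompression restores the data exactly,
    # i.e. no distortion is induced (unlike lossy schemes).
    assert zlib.decompress(compressed) == data
    print(f"level {level}: {len(compressed)} bytes, "
          f"ratio {len(data) / len(compressed):.1f}x")
```

Lossy schemes and data-reduction methods push the ratio further by discarding redundancy or noise, at the cost of an exact round trip.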
Data visualization lets researchers use graphics and interactivity to communicate complex information, helping them explore and gain insight and knowledge from the data. With big data, the challenge lies in interpreting complex, high-throughput data, especially in the context of other relevant, but often orthogonal, data.
Data provenance is useful for determining attribution, identifying relationships between objects, tracking back differences in similar results and guaranteeing the reliability of the data. Additionally, it lets researchers determine whether a particular data set can be used in their research by providing lineage information about the data, the RFI states.
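One common way to capture the lineage information the RFI describes is to fingerprint each data set with a content hash and have derived data sets record their parent's fingerprint. The sketch below illustrates the idea with Python's standard library; the record schema and field names are hypothetical, not an NIH standard.

```python
import hashlib
import json

def fingerprint(payload: bytes) -> str:
    """Content hash that uniquely identifies a data set's exact bytes."""
    return hashlib.sha256(payload).hexdigest()

def provenance_record(payload, source, derived_from=None, transform=None):
    """Attach lineage metadata to a data set (illustrative schema)."""
    return {
        "sha256": fingerprint(payload),
        "source": source,
        "derived_from": derived_from,  # parent data set's fingerprint, if any
        "transform": transform,        # how this data set was produced
    }

raw = b"sample_id,expression\nS1,0.42\nS2,0.57\n"
raw_rec = provenance_record(raw, source="hypothetical_lab_upload")

# A derived data set records its parent's fingerprint, so differences in
# similar results can be tracked back through the chain of transforms.
filtered = b"sample_id,expression\nS2,0.57\n"
filtered_rec = provenance_record(
    filtered,
    source="derived",
    derived_from=raw_rec["sha256"],
    transform="drop rows with expression < 0.5",
)

print(json.dumps(filtered_rec, indent=2))
```

Walking the `derived_from` links back to a trusted original is what supports attribution and lets a researcher judge whether a data set is reliable enough to reuse.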
Data wrangling applies to the conversion, formatting and mapping of data that lets researchers more easily share data, submit it to a database or expose it to the Internet. Researchers working with big data often need specialized informatics skills to format data, apply metadata, fill gaps, use ontologies, capture provenance, annotate features and apply other functions to reformat, manipulate, transform or process data, according to the RFI.
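A typical wrangling task of the kind the RFI lists is mapping a raw export's column names onto a target schema, filling gaps and converting types before submission to a database. This minimal sketch uses only Python's standard library; the input file, field names and placeholder value are hypothetical.

```python
import csv
import io
import json

# Hypothetical raw export with nonstandard headers and a missing value,
# the kind of input that needs wrangling before database submission.
raw = """Sample ID,Tissue Type,expr_level
S1,liver,0.42
S2,,0.57
"""

# Map the source column names onto the target schema's field names.
FIELD_MAP = {"Sample ID": "sample_id",
             "Tissue Type": "tissue",
             "expr_level": "expression"}

records = []
for row in csv.DictReader(io.StringIO(raw)):
    rec = {FIELD_MAP[k]: v for k, v in row.items()}
    rec["tissue"] = rec["tissue"] or "unknown"    # fill gaps explicitly
    rec["expression"] = float(rec["expression"])  # text -> numeric type
    records.append(rec)

print(json.dumps(records, indent=2))  # JSON ready for sharing or submission
```

Real pipelines layer ontology lookups, metadata annotation and provenance capture on top of this same convert-map-clean core.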
Visualization of big data might be the most significant problem organizations will face in the future, Richard Schaeffer, former information assurance director for the National Security Agency and head of the consulting firm Riverbank Associates, told an audience last year at a conference on national security and big data.
“We haven’t seen real innovation in visualization” tools to aid in the processing and analysis of information, Schaeffer said. But that will change as researchers learn more about how human beings process information and make it actionable, and as their research is incorporated into analytic and visualization tools, he noted. “In the next couple of years we will see unimaginable breakthroughs in analytics and visualization tools, enabling real decision-making,” Schaeffer said.