How science is cutting big data down to size
A group of scientific researchers who work with datasets in the terabyte range wants to develop a set of tools for sharing, analyzing and accessing data, aimed at data management challenges across the scientific community.
Under the auspices of the National Science Foundation, the group has begun a project called SciServer, whose mission is to “build a long term, flexible ecosystem” to provide access to datasets generated by astronomy and space science projects.
“By building a common infrastructure, we can create data access and analysis tools useful to all areas of science,” Alex Szalay of Johns Hopkins University, the leader of the NSF-funded project, told Phys.org.
SciServer grew out of work with the Sloan Digital Sky Survey (SDSS), an ongoing project to map the entire universe. SDSS, begun 15 years ago, now has over 70 terabytes in its database covering 220 million galaxies and 260 million stars.
Although the datasets are big, the researchers’ data management problems are similar to those found in many agency or business offices today, including:
- Preserving data as file formats change and scientists retire.
- Consistently storing and applying metadata for how the data should be used.
- Providing equal access to data and expertise among researchers.
- Encouraging opportunities for new insights enabled by combining data for joint analysis.
The SciServer team, which began working on solutions to these problems in 2013, said it would launch the project in phases over the next four years.
The tactics they will bring to the project include:
Bring the analysis to the data. “This means scientists can search and analyze big data without downloading terabytes of data, resulting in much faster processing times,” Szalay said in a statement.
Specify real-world use cases. The SciServer team is collaborating to ensure the system will be most helpful to working scientists.
Develop new tools. To help ease the burden on researchers, the team developed "SciDrive," a cloud data storage system that allows scientists to upload and share data using a Dropbox-like interface.
Adapt existing working tools. Adapting existing, successful tools is a key factor in ensuring the success of the project.
“The tools we build will create a fully-functional, user-driven system from the beginning, making SciServer an indispensable tool for doing science in the 21st century,” Szalay said.
As SciServer becomes more mature, the team will expand to other areas of science, including genomics and connectomics, which explores cellular connections across the structure of the brain, according to the researchers.
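As a rough illustration of the "bring the analysis to the data" tactic listed above, the following Python sketch compares downloading every row and counting locally with asking the database to do the counting and return only a summary. This is not SciServer code: the in-memory SQLite table stands in for a remote catalog, and its schema and the function names are hypothetical, chosen only for the example.

```python
import sqlite3


def build_catalog(n_rows=10000):
    """Create an in-memory table standing in for a remote sky catalog.

    The table name, columns and values are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects (id INTEGER PRIMARY KEY, kind TEXT, magnitude REAL)"
    )
    rows = [
        (i, "galaxy" if i % 2 == 0 else "star", 10.0 + (i % 100) / 10.0)
        for i in range(n_rows)
    ]
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)
    return conn


def count_by_download(conn):
    """Pull every row to the client, then aggregate locally."""
    rows = conn.execute("SELECT kind FROM objects").fetchall()
    counts = {}
    for (kind,) in rows:
        counts[kind] = counts.get(kind, 0) + 1
    # Second value: how many rows had to be transferred.
    return counts, len(rows)


def count_at_source(conn):
    """Let the database aggregate; only the summary rows travel back."""
    rows = conn.execute(
        "SELECT kind, COUNT(*) FROM objects GROUP BY kind"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = build_catalog()
    local_counts, rows_downloaded = count_by_download(conn)
    remote_counts, rows_returned = count_at_source(conn)
    print("same answer:", local_counts == remote_counts)
    print("rows moved when downloading:", rows_downloaded)
    print("rows moved when analyzing at the source:", rows_returned)
```

Both approaches give the same counts, but the second moves only one row per object type instead of the whole table. At terabyte scale, that difference in data moved is the kind of saving Szalay describes.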