DOE pilots big data infrastructure projects
Over the past few months, researchers at the Department of Energy have been exploring new approaches for collecting, moving, sharing and analyzing massive scientific datasets.
Researchers at Lawrence Berkeley National Laboratory recently led four science data pilot projects that demonstrated what can be gained when facilities and tools are linked to carry out specialized research, and that showed the potential of a highly focused science data infrastructure, according to a DOE statement.
“Science at all scales is increasingly driven by data, whether the results of experiments, supercomputer simulations or observational data from telescopes,” said Steven Binkley, director of DOE’s Office of Advanced Scientific Computing Research, which co-sponsored the projects.
“As each new generation of instruments and supercomputers comes on line, we have to make sure that our scientists have the capabilities to get the science out of that data and [that] these projects illustrate the future directions,” Binkley said.
All of the projects explored new, more efficient technologies, with the goal of reusing as many existing tools as possible and developing new software only as necessary, making it easier for scientists to examine their data in real time.
“In the past, a scientist used to save his or her data on an external device, then take it back to a PC running specialized software and analyze the results,” said Craig Tull, head of Physics and X-Ray Science Computing in the Computational Research Division. “With these projects, we demonstrated an entirely new model for doing science at larger scale more efficiently and effectively and with improved collaboration.”
The first project demonstrated the ability to use a central scientific computing facility – the National Energy Research Scientific Computing Center (NERSC) – to serve data from several experimental facilities in multiple formats using DOE’s ultrafast Energy Sciences Network (ESnet). The teams built tools to transfer the data from each site to NERSC and to automatically or semi-automatically analyze and visualize the information.
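The transfer-and-analyze loop described above can be sketched as a small pipeline: watch a facility's output directory, ship each new file to a central site, and kick off analysis automatically. This is an illustrative sketch only; the directory layout and the `analyze` step are hypothetical stand-ins, not the actual NERSC or ESnet tooling.

```python
import shutil
import tempfile
from pathlib import Path

def transfer(src: Path, dest_dir: Path) -> Path:
    """Ship one data file from a facility to the central site.
    (Local copy as a stand-in for a real wide-area transfer.)"""
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir))

def analyze(path: Path) -> dict:
    """Hypothetical automatic analysis step run at the central site."""
    return {"file": path.name, "bytes": len(path.read_bytes())}

def pipeline(facility_dir: Path, central_dir: Path) -> list[dict]:
    """Move every new file from the facility to the center, then analyze it."""
    results = []
    for src in sorted(facility_dir.glob("*.dat")):
        staged = transfer(src, central_dir)
        results.append(analyze(staged))
    return results

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        facility = Path(tmp) / "beamline"
        central = Path(tmp) / "central"
        facility.mkdir()
        (facility / "shot1.dat").write_bytes(b"\x00" * 128)
        print(pipeline(facility, central))
```

In a production setting the copy step would be a managed wide-area transfer and the analysis would be triggered by the arrival of each file, but the shape of the loop is the same.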
The second project illustrated the concept known as a “super facility,” which integrates multiple, complementary user facilities into a virtual facility offering fundamentally greater capability. The demonstration showed that researchers will soon be able, for the first time, to analyze their samples during preliminary or “beamtime” tests and to adjust their experiments for maximum scientific results.
In the third project, the teams built a “data pipeline” for moving and processing observational data from the Dark Energy Survey. Using Docker, container software that automates the deployment of applications inside lightweight software containers, they built self-contained packages that included all the applications necessary for analysis. The containers could then be pushed out from the National Center for Supercomputing Applications (NCSA) to supercomputers at the national labs and fired up on the various systems, pulling the data they needed for processing. The results were then pushed back to NCSA over ESnet.
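A container-based pipeline of this kind typically boils down to three Docker commands per site: build the self-contained image, push it to a registry, and run it on the target system with the input data mounted in. The sketch below only assembles those command lines (the image name, registry, and paths are hypothetical); it does not invoke Docker itself.

```python
def build_cmd(image: str, context: str = ".") -> list[str]:
    # Package the analysis applications into one self-contained image.
    return ["docker", "build", "-t", image, context]

def push_cmd(image: str) -> list[str]:
    # Publish the image so remote supercomputing sites can pull it.
    return ["docker", "push", image]

def run_cmd(image: str, data_dir: str) -> list[str]:
    # Start the container at a site, mounting the survey data it needs.
    return ["docker", "run", "--rm", "-v", f"{data_dir}:/data", image]

if __name__ == "__main__":
    image = "registry.example.org/des/pipeline:latest"  # hypothetical name
    for cmd in (build_cmd(image), push_cmd(image), run_cmd(image, "/scratch/des")):
        print(" ".join(cmd))
```

Because everything the analysis needs ships inside the image, each destination system only has to pull the image and the data, which is what makes this model portable across labs.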
The fourth project, the virtual data facility, was a multi-lab effort to create a proof of concept for some of the common challenges encountered across domains, including authentication, data replication, data publishing and a framework for building user interfaces. Data endpoints were set up at Argonne, Brookhaven, Lawrence Berkeley, Oak Ridge and Pacific Northwest labs, and the service demonstrated datasets being replicated automatically from one site to the other four. It also provided a metadata service that could be used to create a data catalog.
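The replicate-everywhere pattern plus a metadata catalog can be sketched in a few lines: publish a dataset at one endpoint, copy it to every other endpoint, and record its metadata so a catalog query can find it later. The endpoint names below mirror the five labs in the pilot, but the in-memory storage model is purely illustrative.

```python
SITES = ["argonne", "brookhaven", "berkeley", "oakridge", "pnnl"]

# endpoint name -> {dataset name -> raw bytes}
endpoints: dict[str, dict[str, bytes]] = {site: {} for site in SITES}
# dataset name -> metadata record for the catalog
catalog: dict[str, dict] = {}

def publish(site: str, name: str, data: bytes, meta: dict) -> None:
    """Store a dataset at one site, replicate it to the other four,
    and register its metadata in the shared catalog."""
    endpoints[site][name] = data
    for other in SITES:
        if other != site:
            endpoints[other][name] = data  # automatic replication
    catalog[name] = {"origin": site, "size": len(data), **meta}

def find(**query) -> list[str]:
    """Catalog lookup: return dataset names whose metadata matches."""
    return [name for name, meta in catalog.items()
            if all(meta.get(k) == v for k, v in query.items())]

if __name__ == "__main__":
    publish("berkeley", "run-001", b"\x01" * 64, {"instrument": "ALS"})
    print(find(instrument="ALS"))
    print(all("run-001" in endpoints[s] for s in SITES))
```

A real service would add authentication and durable storage behind the same two operations, but publish-with-replication and metadata lookup are the core of the proof of concept described above.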
Science teams across the DOE laboratory system are increasingly dependent on the ability to efficiently capture and integrate large volumes of data that often require computational and data services across multiple facilities. These projects demonstrate the scientific potential of big data infrastructure.
Connect with the GCN staff on Twitter @GCNtech.