Steering clear of ‘sneakernet’ at big-data scale
They don’t call it big data for nothing. A typical whole genome sample is about 150 GB, a day’s worth of HD surveillance video can run a terabyte or more, and yesterday’s weather constitutes some 20 terabytes of data from Doppler radar, weather satellites, buoy networks and other weather stations. And while there are plenty of tools that have scaled to collect and crunch such massive datasets, moving them around and making them available to researchers and the public is more problematic.
In the not-so-distant past, when even medium-sized files could tax a network, many organizations resorted to "sneakernet" -- manually walking a disk, or later a thumb drive, from one computer to the next. And while express mail and hard-drive arrays have entered the equation, some organizations are still “shuttling entire storage systems from one place to another just to be able to share data,” Sumit Sadana, executive vice president of flash memory provider SanDisk, wrote in an op-ed in Re/Code.
“Hyperscale data centers, software-defined networking and new storage technologies represent the first steps in what will be a tremendous cycle of innovation,” Sadana wrote. But government agencies, which often generate massive datasets, have sometimes had to build their own network infrastructure to support their research and open their data for sharing and collaboration.
The Energy Science Network, for example, directly connects more than 40 Department of Energy research sites and 140 DOE-funded partners in industry and academia, including a long list of national laboratories and research facilities. ESnet currently carries about 20 petabytes of data a month and is expected to grow to 100 petabytes a month by 2016. ESnet launched its fifth-generation network collaboration with Internet2 in 2012, making 100-gigabit/sec speeds available on an 8.8-terabit/sec network.
Last year, NASA’s High End Computer Network reached a 91-gigabit/sec transfer between facilities in Denver and Greenbelt, Md., across ESnet -- the “fastest end-to-end data transfer ever conducted under ‘real world’ conditions,” according to Wired.
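To put that 91-gigabit/sec figure in perspective, a short back-of-the-envelope calculation shows what it means for the dataset sizes mentioned at the top of this article. The sketch below assumes decimal units (1 GB = 10^9 bytes) and an idealized, uncontended link; real transfers would see protocol overhead.

```python
# Idealized transfer-time arithmetic for the figures in this article:
# the 91 Gbit/s rate from the NASA/ESnet demo, the 150 GB genome and
# the 20 TB of daily weather data from the opening paragraph.
# Decimal units (1 GB = 10**9 bytes) are assumed throughout.

def transfer_seconds(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Idealized transfer time: payload size divided by line rate."""
    return size_bytes * 8 / rate_bits_per_sec

RATE_91G = 91e9  # bits per second

genome = transfer_seconds(150e9, RATE_91G)       # ~13 seconds
weather_day = transfer_seconds(20e12, RATE_91G)  # ~29 minutes

print(f"150 GB genome:     {genome:.0f} s")
print(f"20 TB weather day: {weather_day / 60:.0f} min")
```

At that rate, a whole genome moves in seconds and a full day of Doppler, satellite and buoy data in under half an hour -- which is exactly the class of workload these research networks were built for.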
Another example is N-Wave, the 10-gigabit network connecting the National Oceanic and Atmospheric Administration, Internet2 and its partners in the national Research and Education network community. NOAA makes heavy use of N-Wave; the agency's research and development high-performance computing system alone moves up to 60 terabytes of data a day. And according to a recent newsletter, N-Wave is currently deploying 100-gigabit/sec connections in the Washington, D.C. area, connecting three sites in Maryland and one in Virginia.
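NOAA's 60-terabytes-a-day figure also hints at why N-Wave is moving from 10-gigabit to 100-gigabit connections. Averaged over a day (a simplifying assumption -- real traffic is bursty), that volume alone consumes more than half of a 10-gigabit link:

```python
# Back-of-the-envelope link utilization for NOAA's R&D HPC traffic:
# up to 60 TB/day, per the article. Averaging over 24 hours is a
# simplification; decimal units (1 TB = 10**12 bytes) are assumed.

SECONDS_PER_DAY = 86_400

def avg_gbps(tb_per_day: float) -> float:
    """Average sustained rate in Gbit/s for a given daily data volume."""
    return tb_per_day * 1e12 * 8 / SECONDS_PER_DAY / 1e9

rate = avg_gbps(60)  # ~5.6 Gbit/s sustained
print(f"Average rate:          {rate:.1f} Gbit/s")
print(f"10G link utilization:  {rate / 10:.0%}")    # ~56% -- little burst headroom
print(f"100G link utilization: {rate / 100:.1%}")   # ~5.6%
```

Roughly 56 percent average utilization of a 10-gigabit link leaves little room for peaks; on a 100-gigabit link the same traffic is a small fraction of capacity.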
Another approach is to put big data directly into the cloud. NOAA is in the research and development stages of a new public-private Big Data Project that builds on collaboration with industry-leading cloud providers, said Maia Hansen, a presidential innovation fellow with the National Oceanic and Atmospheric Administration.
According to Hansen, who spoke at the General Services Administration's DigitalGov Citizen Services Summit on May 21, NOAA will have stored approximately 300 petabytes of data by 2030 -- a resource that is extremely difficult to share if housed in a traditional NOAA data center. The NOAA Big Data Project is a structured agreement with Amazon Web Services, Google Cloud Platform, IBM, Microsoft and the Open Cloud Consortium to leverage private industry’s infrastructure and create a sustainable, market-driven ecosystem that lowers the cost barrier to data publication.
The National Institutes of Health, meanwhile, has created the Commons -- a storage framework built using the cloud and high performance computing to co-locate datasets with the analytical tools and workflow pipelines that use them, so that research results are accessible and shareable.
“The Commons is a scalable three-part environment that will provide computing and storage for sharing digital biomedical research products,” George Komatsoulis, senior bioinformatics specialist at NIH’s National Center for Biotechnology Information, told International Science Grid This Week.
The Commons rests on a computing platform designed for storage and computation at this scale. The data, or digital objects, come from individual investigators, existing databases and programs like Big Data to Knowledge. The objects follow specific guidelines for identification and citation so that they’re readily searchable.
“The Commons will make it easier to use (and reuse) the data and software contained within these well-curated resources,” Komatsoulis said.
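The article does not spell out the Commons' metadata format, but the identification-and-citation idea can be sketched in miniature. Every field name below is hypothetical and illustrative -- this is not an actual NIH schema -- the point is simply that a persistent identifier plus citation metadata is what makes a digital object findable and reusable.

```python
# Hypothetical sketch of a Commons-style "digital object" record.
# The article says only that objects carry identification and citation
# metadata; these field names are illustrative, not an NIH schema.

from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    identifier: str             # persistent ID, e.g. a DOI-style string
    title: str
    creators: list              # who gets cited
    source_program: str         # e.g. an individual lab or a funded program
    keywords: list = field(default_factory=list)  # for searchability

    def citation(self) -> str:
        """Render a minimal citation string from the metadata."""
        return f'{", ".join(self.creators)}. "{self.title}". {self.identifier}'

obj = DigitalObject(
    identifier="doi:10.0000/example.0001",  # placeholder, not a real DOI
    title="Example sequencing dataset",
    creators=["Doe J", "Smith A"],
    source_program="Big Data to Knowledge",
    keywords=["genomics", "sequencing"],
)
print(obj.citation())
```

Because the identifier is persistent, any researcher who finds the object through a search on its keywords can cite it -- and retrieve it -- without knowing which cloud or computing center actually stores the bytes.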
The Commons will serve as a hub of public and private cloud providers, academic national labs and various computing centers willing to follow NIH requirements and help create a democratized approach to these research datasets -- all while keeping the cost of big data storage and sharing down, according to Komatsoulis.
The physical shuttling of storage systems may never disappear entirely -- as computer scientist Andrew Tanenbaum once put it, "never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway" -- but cloud computing and high-speed connectivity are giving government more and more ways to leave the sneakernet behind.
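Tanenbaum's quip can even be quantified. The tape capacity, tape count and drive time below are illustrative assumptions, not figures from the article, but they show why bulk shipment still has its place -- and why it loses for everything else:

```python
# Tanenbaum's station wagon, quantified under stated assumptions.
# The tape capacity, tape count and drive time are illustrative
# numbers chosen for this sketch, not figures from the article.

def effective_gbps(total_bytes: float, seconds: float) -> float:
    """Effective throughput in Gbit/s for moving a payload in one trip."""
    return total_bytes * 8 / seconds / 1e9

# Assume 500 tapes at 6 TB each (roughly LTO-7 native capacity)
# and a 10-hour drive between data centers.
payload = 500 * 6e12   # 3 petabytes
drive = 10 * 3600      # seconds

wagon = effective_gbps(payload, drive)  # ~667 Gbit/s effective
print(f"Station wagon:         {wagon:.0f} Gbit/s")
print(f"vs. a 100 Gbit/s link: {wagon / 100:.1f}x")
```

The wagon wins on raw throughput -- but with a ten-hour first-byte latency and no way to share the data with anyone not at the destination, which is precisely the gap that networks like ESnet and projects like NOAA's cloud effort are built to close.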
Amanda Ziadeh is a former reporter/producer for GCN.