Big data must-haves: Capacity, compute, collaboration
- By Mark Pomerleau
- Apr 23, 2015
While big data researchers are pushing the boundaries of science, the less glamorous side of big data research – the network, computing and cloud architecture required to support their work – must be at the forefront of their minds.
At next week’s Internet2 Global Summit, researchers will come together with network engineers, CIOs and technology leaders to discuss ways they can collaborate to advance research capabilities in IT infrastructure and applications.
Clemson professor Alex Feltus will showcase how his research team is leveraging the Internet2 infrastructure, including its Advanced Layer 2 Service high-speed connections and perfSONAR network monitoring, to substantially accelerate genomic big data transfers and transform researcher collaboration.
As DNA datasets get bigger and research increases, Feltus sees a need to change the way data is stored and transferred. “Of course we need bigger boxes, but we also need faster ways to put stuff into them. There is a serious data transfer bottleneck at the network-hard-drive interface. Thus, we need faster, reasonably-priced storage that can keep up with the advanced networks such as the Internet2 network,” he said.
One of the problems researchers encounter is that while they might have access to supercomputers, information stored in remote data repositories is impossible to use without a network like Internet2. “You can process data on the fastest nodes in the world, but it's pointless for real-time applications if the supercomputer is hooked up to a slow pipe,” Feltus said.
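The bottleneck Feltus describes can be put in rough numbers. The sketch below is only a back-of-the-envelope illustration, not a model of any real system: it assumes the end-to-end transfer rate is capped by the slower of the network link and the storage, and the function name, the 10 TB dataset size and the 4 Gbps storage rate are all hypothetical figures chosen for the example.

```python
def transfer_hours(dataset_terabytes, network_gbps, storage_gbps):
    """Estimate hours to move a dataset when the slowest stage sets the pace."""
    # Assumption: the end-to-end rate is capped by the slower of the two stages.
    bottleneck_gbps = min(network_gbps, storage_gbps)
    dataset_gigabits = dataset_terabytes * 1000 * 8  # decimal terabytes -> gigabits
    seconds = dataset_gigabits / bottleneck_gbps
    return seconds / 3600


# Hypothetical case: 10 TB over a 100 Gbps link, with storage that sustains 4 Gbps.
print(round(transfer_hours(10, 100, 4), 2))    # 5.56 (storage-bound)
# Same link, but storage that keeps up with the network.
print(round(transfer_hours(10, 100, 100), 2))  # 0.22
```

Under these assumed figures, the fast link goes largely unused until the storage catches up, which is the point of Feltus's call for faster, reasonably priced storage.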
These advanced technology capabilities allow Feltus and his team to focus on their research, rather than attempting to build their own network and computing infrastructure to enable that work.
Arizona State University, which recently got 100 gigabit/sec connections to Internet2, has developed the Next Generation Cyber Capability, or NGCC, to respond to big data challenges. The NGCC integrates big data platforms and traditional supercomputing technologies with software-defined networking, high-speed interconnects and virtualization for medical research.
The NGCC provides three essential capabilities for big data research. The first is physical capacity – Internet2 connections, large-scale storage and multiple types of computation, including utility computing, traditional HPC and new big data computing. The second is advanced logical capabilities, such as software-defined storage and networking, metadata processing and semantics. The third is collaboration – transdisciplinary teams of researchers, network engineers and computing professionals working together on the system as a whole.
It's this last capability that Feltus said is essential to success in big data.
"A key aspect is the side of cyberinfrastructure that can't be coded: personal relationships," Feltus said. "Recently, my collaboration with network and storage researchers and engineers has opened my eyes to innovative possibilities that will impact my research via the human network."
Mark Pomerleau is a former editorial fellow with GCN and Defense Systems.