NCI cracks open cancer research by moving it to the cloud
- By Carolyn Duffy Marsan
- Jan 07, 2014
The National Cancer Institute (NCI) is at the bleeding edge of a trend toward using a cloud-based infrastructure to process, store and analyze its massive data sets.
6 Key Issues for NCI Cancer Clouds
The biomedical research community has identified six issues that need to be addressed in the NCI Cloud Initiative:
- Data access
- Computing capacity
- Data interoperability
In 2014, NCI plans to award three contracts for pilot projects to create cloud computing environments that will house a new data set — called the Cancer Genome Atlas — that is expected to be 2.5 petabytes in size. The so-called cancer clouds will not only serve as a central repository for the genomics data but also will provide a standard application programming interface (API) and tools for researchers to use in analyzing the data.
NCI’s goal for the cloud initiative is to democratize access to cancer genomics data to advance the treatment of cancer. The agency’s biomedical data sets are becoming so large that most universities and pharmaceutical companies cannot afford the computing power and network bandwidth needed to download the data sets and process them locally, as is common practice today.
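To give a rough sense of the bandwidth problem, the sketch below estimates how long it would take to download a 2.5-petabyte data set over a dedicated link. The link speeds are illustrative assumptions, not figures from NCI, and the calculation ignores protocol overhead, contention and storage costs.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link running at full speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds in a day


PETABYTE = 10**15  # decimal petabyte, in bytes
DATASET_BYTES = 2.5 * PETABYTE  # expected data set size cited in the article

# Assumed link speeds, chosen only for illustration.
for label, bits_per_second in [("1 Gbps", 1e9), ("10 Gbps", 1e10)]:
    days = transfer_days(DATASET_BYTES, bits_per_second)
    print(f"{label}: {days:.1f} days")
```

Under these assumptions the download alone takes months on a 1 Gbps link and weeks on a 10 Gbps link, before any analysis begins.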
Instead, NCI is looking for cloud-based architectures that would allow researchers to access the data sets via a Web browser and analyze them remotely. Researchers would then need only enough local bandwidth and processing power to download the results, which will be much smaller data sets.
"We want the smart graduate student or the smart postdoc who has an idea for a novel way of analyzing data to be able to relatively quickly write a piece of software to do analysis on the data and run it inexpensively," said George Komatsoulis, Interim Director of NCI’s Center for Biomedical Informatics and Information Technology. "Now this person would need $2.5 million of hardware and wouldn’t generate an answer for a year."
Fundamental shift in research
Komatsoulis said the bottom-line return on the NCI Cloud Initiative will be better outcomes for cancer patients. "By making this investment, we are making the data that we collect more useful, more widely available, and we will get a much larger group of people looking at it for clues to how we can improve the treatment of cancer," he said.
By creating a shared repository of data that is open to all researchers, the NCI Cloud Initiative represents a fundamental shift in cancer research.
"The mindset is really changing," said Subha Madhavan, Director of the Innovation Center for Biomedical Informatics at Georgetown University Medical Center. "What we’ve been doing for the last 10 years is bring the data to the tools, which were on client/server architectures at universities and pharmaceutical research labs. But you can’t do that anymore because the majority of clinical projects produce data sets that are terabytes in scale. So the mindset is changing to bring the tools and expertise to the data. It’s a shared computing model that’s emerging."
Jake Yue Chen, Associate Professor of Bioinformatics at Indiana University, said the NCI Cloud Initiative represents a significant advancement for biomedical research.
"This is very profound. It’s like putting genomic information that’s buried in the ivy towers of a few major research centers and putting it into the hands of every cancer researcher. The more people who can analyze this data, the more insights we are going to get into cancer," Chen said. "I would imagine that orders of magnitude of knowledge would be generated because more researchers will be able to analyze the data."
Bioinformatics infrastructure as a service
The driver behind the NCI Cloud Initiative is the Cancer Genome Atlas, which will provide in-depth data on about 11,000 cancer patients, with an average of 500 gigabytes of data per patient.
"We’re going to have the DNA sequence for their tumor and the matched normal control," Komatsoulis explained. "There is RNA sequencing, medical images and clinical data. In addition, there is epigenetics, which are modifications to the DNA itself that impact the way various genes that exist in these patients get turned on and turned off. By September 2014, we expect to have generated, if not fully received, 2.5 petabytes of data. This is the shape of things to come."
The NCI cancer clouds will provide a complete bioinformatics infrastructure as a service, with built-in compute, storage, security and analytics. NCI has not defined the cloud-based infrastructure that it wants; instead, it is looking for innovative architectures that will meet the needs of the biomedical research community.
"We really don’t know yet what is the best technology or what is the best way to structure the data so it can be computed on efficiently," Komatsoulis said. "What we’re looking for is innovation. This is one of those cases where the government has the opportunity to enable the scientific community to innovate to solve an important problem."
NCI plans to fund three different architectures for the pilot project, with three years of funding for each team. The agency is hoping that these architectures will scale, given that it expects data sets as large as 20 to 50 petabytes by 2019.
"The pilot projects will allow us to evaluate with a really big data set what is and what isn’t the most effective architecture for doing the kinds of analysis that scientists are interested in doing," Komatsoulis said. "It’s our intention to test these clouds to make sure they meet our performance requirements, but also to throw them open to cancer researchers who can vote with their feet."
Challenges in lowering the barriers to entry
Komatsoulis said a key challenge is creating an efficient API that preserves the security and integrity of the data. Experts said the data is likely to be encrypted both in transit and at rest, and that authentication and access controls must be applied to it.
"We’re looking to lower the barrier to entry for scientists," Komatsoulis said. "One of the purposes of having a solid API is that it gives us the opportunity to embed the security best practices in all of the programs."
The compute and storage capabilities required by the NCI Cloud Initiative are available on the market today, Chen said. But he added that it will be a challenge to integrate these technologies in a massive shared repository that can meet the needs of the biomedical research community.
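As one illustration of the earlier point about embedding security practices in the API, the sketch below wraps every analysis entry point in an access check, so that a researcher's code never touches data the caller is not cleared to read. It is a minimal sketch under stated assumptions: the token table, data set names and function names are all hypothetical and do not describe any NCI interface, and a real service would add encryption and proper authentication on top.

```python
from functools import wraps

# Hypothetical grants table: access token -> data sets that token may read.
GRANTS = {
    "token-alice": {"open-tier"},
    "token-bob": {"open-tier", "controlled-tier"},
}

# Stand-in data; a real repository would hold genomic records, not strings.
RECORDS = {
    "open-tier": ["r1", "r2", "r3"],
    "controlled-tier": ["r4", "r5"],
}


class AccessDenied(Exception):
    """Raised when a token is not authorized for the requested data set."""


def requires_access(func):
    """Check the caller's grant before any analysis code runs."""
    @wraps(func)
    def wrapper(token, dataset, *args, **kwargs):
        if dataset not in GRANTS.get(token, set()):
            raise AccessDenied(f"not authorized for {dataset!r}")
        return func(token, dataset, *args, **kwargs)
    return wrapper


@requires_access
def count_records(token, dataset):
    """Example analysis entry point; the access check has already passed."""
    return len(RECORDS[dataset])


print(count_records("token-bob", "controlled-tier"))
try:
    count_records("token-alice", "controlled-tier")
except AccessDenied as err:
    print("denied:", err)
```

Because the check lives in the decorator rather than in each analysis function, a researcher writing new analysis code gets the access control without having to implement it.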
Another issue is standardization of data and frameworks for analyzing data, as there is variability in the terminology that cancer researchers use.
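The terminology problem can be made concrete with a small sketch that maps differently named fields from separate labs onto one shared vocabulary. The field names and alias table here are invented for illustration; they are not drawn from the Cancer Genome Atlas or any real schema.

```python
# Hypothetical alias table: local field name -> shared canonical name.
FIELD_ALIASES = {
    "patientid": "patient_id",
    "pt_id": "patient_id",
    "patient_id": "patient_id",
    "dx": "diagnosis",
    "tumor_type": "diagnosis",
    "diagnosis": "diagnosis",
}


def normalize_record(record: dict) -> dict:
    """Rename a record's fields to canonical names, leaving values untouched."""
    normalized = {}
    for key, value in record.items():
        # Fold case and separators so "Pt-ID" and "pt_id" look the same.
        lookup = key.strip().lower().replace(" ", "_").replace("-", "_")
        # Unknown fields keep their folded name rather than being dropped.
        normalized[FIELD_ALIASES.get(lookup, lookup)] = value
    return normalized


lab_a = {"PatientID": "P001", "Dx": "breast"}
lab_b = {"Pt-ID": "P002", "Tumor Type": "colorectal"}

for record in (lab_a, lab_b):
    print(normalize_record(record))
```

Once both records use the same field names, a single analysis script can run over data from either lab without per-source special cases.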
"One challenge is usability and user friendliness," Madhavan said. "It’s one thing for computer scientists and biomedical informatics specialists to be able to hack our way through the data, but we also need bench researchers and physicians to be consumers of this data. We’ve got to deliver the information through easy-to-use mobile health and Web-based platforms."
The NCI Cloud Initiative is part of a trend in which biomedical research data is processed in public clouds. For example, the 1000 Genomes database is available via Amazon’s Elastic Compute Cloud. Similarly, Georgetown University Lombardi Cancer Center is using Amazon Web Services for gene sequencing related to breast and colorectal cancers.
"Our IT team is small. It would take years for us to set up an infrastructure to manage terabytes of data. But in a matter of weeks, we can set up our data on Amazon’s cloud," Madhavan said. "The cloud is a game changer for researchers like ours who want to do big data analysis."
Industry analysts said they expect to see more government-sponsored big data projects adopt a cloud infrastructure for compute, storage and analytics.
"This sounds like a perfect example where cloud computing is a better arrangement given the bandwidth limits associated with downloading large data sets," said Shawn McCarthy, research director at IDC Government Insights. "Putting data in a shared resource is becoming more popular because you can standardize the data. When everybody builds their own databases, you end up with different APIs and data name fields that are different. People spend more time normalizing the data than doing their analysis of it."