NCI Cloud Pilot program to boost cancer genomics data sharing, accessibility

UC Santa Cruz and UC Berkeley partner with Broad Institute to build a cloud computing environment for large-scale analysis of cancer genomics data

David Haussler, Director of the UC Santa Cruz Genomics Institute

The UC Santa Cruz Genomics Institute is part of a team led by the Broad Institute of Harvard and MIT that was awarded one of three National Cancer Institute (NCI) Cancer Genomics Cloud Pilot contracts. The goal of the project, which also involves scientists at UC Berkeley, is to build a system that will enable large-scale analysis of The Cancer Genome Atlas (TCGA) and other datasets by co-locating the data and the required computing resources in one cloud environment.

This co-location will enable researchers across institutions to bring their analytical tools and methods to bear on the data in an efficient, cost-effective manner, thereby promoting democratization and collaboration across the cancer genomics community. The two other awardees in the NCI Cloud Pilot program are Seven Bridges Genomics and the Institute for Systems Biology, the latter in collaboration with Google.

"Putting genomic data on the cloud for analysis and sharing is a great direction to go in," said David Haussler, professor of biomolecular engineering at UC Santa Cruz and director of the UCSC Genomics Institute.

CGHub experience

Gad Getz of the Broad Institute is the lead principal investigator of the Broad-University of California Cloud Pilot (BUCCP) and will lead the Broad team with Matthew Trunnell and Anthony Philippakis. Haussler brings to the project the UC Santa Cruz team's experience in building and operating NCI's Cancer Genomics Hub (CGHub), a secure repository for storing and accessing cancer genomic data from TCGA and related projects. The BUCCP also draws on the work of UC Berkeley researchers, led by David Patterson, to develop tools for efficient computing over genomics data. Patterson is also partnering with Haussler in the recently funded Center for Big Data in Translational Genomics, led by UC Santa Cruz.

The Cancer Genomics Cloud Pilot effort is firmly rooted in the data-sharing principles set forth by the Global Alliance for Genomics and Health (GA4GH), of which Haussler, Patterson, Getz, and Philippakis are working group members, making it both technically driven and mission driven from its inception. The pilot awardees will collaborate with each other, with the NCI Genomic Data Commons (GDC) at the University of Chicago, where the data will be hosted, and with NCI staff and leadership toward a shared vision of a cohesive data and analysis infrastructure to advance the understanding and treatment of cancer.

"We will be working with the Broad Institute and the other two cloud pilot operations and the GDC as part of the Global Alliance for Genomics and Health, which strongly endorses the cloud pilots," said Haussler, who cofounded GA4GH and co-chairs its Data Working Group.

Large-scale sequencing

Large-scale sequencing efforts are helping researchers understand the genetic changes that lead to cancer and have led to the development of several successful, targeted chemotherapies. These developments show that identifying mutations that drive cancer can translate into therapeutics. However, three main challenges remain. First, processing massive sequence datasets requires costly computational infrastructure for which few groups have the resources, and those that do often end up duplicating each other's engineering and analysis efforts. Second, data generation is outpacing the development of tools and methods that can be used on such large datasets: petabytes of data already exist, and exabytes (1,000 times a petabyte) are to come. Finally, data is being collected and stored in silos, limiting the potential for synergy, data sharing, and integrated analysis.

To put a petabyte in perspective: if MP3-encoded music takes about 1 MB per minute and the average song lasts about four minutes, a petabyte would hold roughly 250 million songs and take about 2,000 years to play continuously.
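The arithmetic behind this analogy can be checked with a short sketch. It uses only the 1 MB per minute and four-minute figures given above; the function name, the 365-day year, and the choice to show both decimal (10^15 bytes) and binary (2^50 bytes) definitions of a petabyte are assumptions made for illustration, not part of the original estimate.

```python
# Back-of-the-envelope check of the petabyte analogy.
# Assumptions (for illustration only): a 365-day year, and two common
# definitions of a petabyte (decimal and binary).

MB_PER_MINUTE = 1                  # from the text: about 1 MB per minute of MP3 audio
SONG_MINUTES = 4                   # from the text: average song length
MINUTES_PER_YEAR = 60 * 24 * 365   # assumed 365-day year


def playback(petabyte_in_mb):
    """Return (years of continuous playback, number of songs) for one petabyte."""
    minutes = petabyte_in_mb / MB_PER_MINUTE
    return minutes / MINUTES_PER_YEAR, minutes / SONG_MINUTES


# Decimal units: 1 PB = 10**15 bytes and 1 MB = 10**6 bytes, so 10**9 MB.
decimal_years, decimal_songs = playback(10**9)

# Binary units: 1 PB = 2**50 bytes and 1 MB = 2**20 bytes, so 2**30 MB.
binary_years, binary_songs = playback(2**30)

print(f"decimal: {decimal_years:,.0f} years, {decimal_songs:,.0f} songs")
print(f"binary:  {binary_years:,.0f} years, {binary_songs:,.0f} songs")
```

Under these assumptions the decimal definition gives a little over 1,900 years and the binary definition a little over 2,000, so the figure depends on which convention is used. The song length affects only the song count, not the playback time.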

The impetus for the cancer genomics cloud pilots grew from an inquiry the NCI posed in April 2013 asking the NCI grantee community to describe their most frequent computational challenges. From these responses, six general themes emerged: data access, computing capacity and infrastructure, data interoperability, training, usability, and governance. The BUCCP is addressing these challenges in cancer genome analysis by building a platform for data aggregation and analysis on a computing cloud. The platform will combine a production environment for running analyses, with robust security and access control, and a scalable paradigm for distributed data storage and computation. The BUCCP system will host TCGA data and will be pre-populated with commonly used computational tools to immediately empower the cancer genomics research and biomedical community. In addition, the team will develop strategies to engage the community and demonstrate the capabilities of the platform.

Exceptional opportunity

Benedict Paten, assistant director of the Center for Big Data in Translational Genomics at UC Santa Cruz, said that developing future cancer therapies based on whole genome sequencing is a major motivation for the center. "Four of our seven driving projects are focused on cancer genomics. The Cancer Genomics Cloud Pilots are an exceptional opportunity to bring the work of TCGA, the definitive cancer genomics project, into the age of the global information commons on the Internet, pioneered by our center and collaborators in the Global Alliance for Genomics and Health," Paten said.

"The Cancer Genomics Cloud Pilots will allow the cancer research community to collaborate in a way that has not been possible before," said Getz. "We'll now be able to share data and tools and jointly learn from the totality of cancer genomics data. Our cloud system will democratize access to computational tools for non-experts as well as empower developers with a platform for creating the next generation of analytical methods."

This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN261201400006C.