The life sciences are in the midst of a data revolution. Technologies such as genome sequencing, gene-expression analysis, and high-resolution imaging are generating vast amounts of data that hold great potential for biomedical breakthroughs, while posing enormous challenges for data management.
Bioinformatics experts now view cloud-based computing and data storage as the most effective way to manage and use these rapidly growing biomedical datasets. But they have not yet worked out exactly how to build an efficient cloud-based platform for biomedical research.
Leaders in the field at UC Santa Cruz, the Broad Institute of MIT and Harvard, and the University of Chicago have now formed a partnership called the Commons Alliance to address this issue, with major funding from the National Institutes of Health (NIH).
"This project represents the culmination of years of effort to finally bring biomedical research into the internet and cloud computing era," said David Haussler, director of the UC Santa Cruz Genomics Institute.
The project is one of several efforts funded through the $9 million NIH Data Commons Pilot Phase, intended to accelerate biomedical discoveries by making biomedical research data findable, accessible, interoperable, and reusable for more researchers. A data commons is a shared virtual space where scientists can work with the digital objects of biomedical research, such as data and analytical tools.
"Many of the important questions we're asking now require us to interrogate these very large datasets, but nobody has the time or money to download all that data to their host institution and analyze it in its entirety," explained Benedict Paten, assistant professor of biomolecular engineering at UC Santa Cruz. "The cloud provides co-located compute and storage resources. If we have everything in one place, researchers can just rent compute time and use the available tools to analyze the data."
Paten, who directs the Genomics Institute's Computational Genomics Lab, is one of three principal investigators of the Commons Alliance. The others are Anthony Philippakis at the Broad Institute and Robert Grossman at the University of Chicago. All three have extensive experience developing software platforms to support large-scale biomedical research efforts, including the All of Us research program, the Genomic Data Commons, and the Human Cell Atlas initiative.
The Commons Alliance Platform will be designed to handle a heterogeneous mix of data types, including genomics, transcriptomics, and image data, along with associated metadata. Ultimately, Paten said, the goal is not one monolithic system to handle all biomedical data, but rather a set of common software modules for creating interoperable systems, which could all reside within a common cloud-based research environment.
"What we're building is essential to the future of biomedical science, because it will allow us to ask questions we couldn't otherwise ask and do things on a scale we never could before," Paten said. "At the nuts-and-bolts level, it's a big software engineering project, but its impact will be completely transformational."
In addition to $917,000 in initial Data Commons Pilot Phase funding for the Commons Alliance, the partnership was awarded $5.8 million from the National Heart, Lung, and Blood Institute (NHLBI) to integrate NHLBI datasets with the NIH Data Commons, including data from the Trans-Omics for Precision Medicine (TOPMed) Program.
In a recent commentary published on Medium, Paten, Philippakis, Grossman, and others outlined their vision of a "data biosphere" for biomedical research, including four guiding principles. Platforms for biomedical data, they wrote, should be:
- modular, composed of functional components with well-specified interfaces;
- community-driven, created by many groups to foster a diversity of ideas;
- open, developed under open-source licenses that enable extensibility and reuse, with users able to add custom, proprietary modules as needed; and
- standards-based, consistent with standards developed by coalitions such as the Global Alliance for Genomics and Health (GA4GH).
The NIH Data Commons will be implemented in a four-year pilot phase to explore the feasibility of, and best practices for, making digital objects available through collaborative platforms. This will be done on public clouds, which are virtual spaces where service providers make resources, such as applications and storage, available over the internet. The program is funded through the NIH Common Fund, which supports biomedical research programs that cut across the missions of all NIH Institutes and Centers.