Despite some successes, predicting cancer outcomes based on the molecular signatures in cancer cells remains a major challenge. A new effort, funded by the National Cancer Institute and led by researchers at the University of California, Santa Cruz, aims to clear several key roadblocks that have stymied progress in this field.
The $3.5 million project will use the latest in "big data" technology to bridge the gap between the petabytes of raw genomic data in centralized repositories like UCSC's Cancer Genomics Hub (CGHub) and the higher levels of interpretive information that can lead to clinically useful predictions, such as which drugs are most effective against tumors with certain mutations. Project leader Joshua Stuart, an associate professor of biomolecular engineering at UCSC's Baskin School of Engineering, compares the raw genomic data to the binary code running on a computer.
"Your web browser doesn't understand zeros and ones. There are layers and layers of software programs between that and what you see on a web page. We need to do the same thing for DNA sequences to reach the higher levels of interpretation needed for scientific discovery," Stuart said.
Stuart's group will build a separate database, called the Biomedical Evidence Graph (BMEG), for storing and analyzing interpretive information derived from the raw sequence data stored in the CGHub. Like Facebook's social graph, the BMEG will use a graph database structure designed for lightning-fast access to complex, interconnected datasets.
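As a rough illustration of why a graph suits this kind of interconnected data, the sketch below links tumor samples that share mutated genes. It is a toy in-memory adjacency map, not the BMEG schema or any real graph database; the sample IDs, gene names, and the `build_sample_graph` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical samples and mutated genes, invented for illustration only.
profiles = {
    "sample_A": {"GENE_1", "GENE_2"},
    "sample_B": {"GENE_1", "GENE_3"},
    "sample_C": {"GENE_4"},
    "sample_D": {"GENE_2", "GENE_3"},
}

def build_sample_graph(profiles, min_shared=1):
    """Link every pair of samples that shares at least min_shared mutated genes."""
    graph = defaultdict(dict)
    for a, b in combinations(sorted(profiles), 2):
        shared = profiles[a] & profiles[b]
        if len(shared) >= min_shared:
            # Store the shared genes on the edge, in both directions.
            graph[a][b] = shared
            graph[b][a] = shared
    return graph

graph = build_sample_graph(profiles)
neighbors_of_a = sorted(graph["sample_A"])
```

Once the edges exist, a question such as "which samples resemble this one?" becomes a lookup of a node's neighbors rather than a scan of every record.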
"Our analyses can reveal connections between different tumor samples based on their molecular profiles, and the natural way to represent that in a database is with the graph structures used for Facebook and other social networks," Stuart said.
A UCSC team led by bioinformatics expert and BMEG co-investigator David Haussler established CGHub in 2012 to manage data from the Cancer Genome Atlas (TCGA) consortium and other NIH cancer genomics research programs. Because CGHub holds genome sequences from thousands of individual patients, access is strictly controlled and limited to researchers approved by NIH. But the BMEG will hold higher-level data derived from analyses of the raw genome sequences and will not require the same level of security restrictions.
"The idea is to build a shared knowledge base and create a playground where lots of researchers can interact, test their algorithms, and compare results," Stuart said. "TCGA researchers have built a lot of great tools for data analysis, and we need to get those installed in the BMEG so the rest of the world can engage in that higher level analysis. We want to establish the tools and data analysis pipelines that will be useful for current and future collections of data."
The BMEG complements a parallel project, called Medbook, which will link patients, biopsy samples, doctors, and researchers in a social network framework. Stuart and recent Ph.D. graduate Ted Goldstein created Medbook for their work on a prostate cancer project funded by Stand Up To Cancer (SU2C). "The BMEG will provide patient-level genome information to Medbook, creating a powerful partnership between the two efforts," Stuart said.
The BMEG will enable outside researchers to securely analyze CGHub-related data without needing to transmit vast amounts of data over the internet. Currently, TCGA data analysis centers download huge files of raw data from CGHub for analysis on their own computers. The BMEG will be co-located with the CGHub servers at the San Diego Supercomputer Center, and researchers will be able to run their analyses as apps on the BMEG platform.
Cancer genomics researchers involved in TCGA and related projects are still working to develop and refine the analytical tools needed to extract predictive evidence from their rapidly growing databases. This includes the very first layer of analysis, the identification of mutations in a patient's tumor genome. Different research groups use different algorithms for this, and CGHub scientists found that these algorithms give inconsistent answers.
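To make the inconsistency concrete, the sketch below measures how much two sets of mutation calls overlap and builds a simple majority-vote consensus. This is an assumed, generic approach for illustration, not the method TCGA adopted; the caller names, the variant tuples, and the `jaccard` and `consensus_calls` helpers are hypothetical.

```python
from collections import Counter

# Each variant is (chromosome, position, reference base, alternate base).
# All values are invented placeholders, not real mutation calls.
calls_by_caller = {
    "caller_1": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")},
    "caller_2": {("chr1", 100, "A", "G"), ("chr3", 300, "G", "A")},
    "caller_3": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")},
}

def jaccard(a, b):
    """Fraction of variants shared by two call sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def consensus_calls(calls_by_caller, min_support=2):
    """Keep variants reported by at least min_support callers."""
    counts = Counter(v for calls in calls_by_caller.values() for v in calls)
    return {v for v, n in counts.items() if n >= min_support}

agreement = jaccard(calls_by_caller["caller_1"], calls_by_caller["caller_2"])
consensus = consensus_calls(calls_by_caller)
```

In this toy data the first two callers agree on only one of three distinct variants, while the majority vote keeps the two variants that at least two callers report.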
"That was a big shocker to me when the TCGA and International Cancer Genome consortiums started comparing the algorithms used by different institutions," Stuart said. "You'd think we'd know which one to install, but there are more than a dozen of them and they all came up with different answers. Only in the last year have we sorted that out, and David Haussler has created a unified effort to identify mutations for TCGA."
The mutation data, together with information on gene locations, then feeds into higher-level analyses. Stuart's lab, for example, uses this information to identify the genetic pathways affected by a patient's mutations. Using knowledge about how genes work in groups by signaling and regulating each other, Stuart is able to find connections between different mutations that affect the same pathways. In previous work on TCGA data, results from Stuart's pathway analysis proved to be better predictors of overall survival in glioblastoma patients than a lower level of genomic analysis, and pathway signatures also revealed novel connections between mutations and drug response in breast cancers.
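The idea that different mutations can affect the same pathway can be sketched with plain set overlap. This is a deliberately simplified stand-in, not the pathway analysis used in Stuart's lab; the pathway definitions, patient IDs, gene names, and helper functions are hypothetical.

```python
# Hypothetical pathway membership and patient mutations, for illustration only.
pathways = {
    "pathway_1": {"GENE_A", "GENE_B", "GENE_C"},
    "pathway_2": {"GENE_D", "GENE_E"},
}
patient_mutations = {
    "patient_1": {"GENE_A"},
    "patient_2": {"GENE_C", "GENE_D"},
    "patient_3": {"GENE_E"},
}

def affected_pathways(mutated_genes, pathways):
    """Return the names of pathways containing at least one mutated gene."""
    return {name for name, genes in pathways.items() if genes & mutated_genes}

def shared_pathways(patient_a, patient_b):
    """Pathways hit in both patients, even if the mutated genes differ."""
    hits_a = affected_pathways(patient_mutations[patient_a], pathways)
    hits_b = affected_pathways(patient_mutations[patient_b], pathways)
    return hits_a & hits_b

common = shared_pathways("patient_1", "patient_2")
```

Here `patient_1` and `patient_2` have no mutated gene in common, yet both carry a mutation in `pathway_1`, which is the kind of connection a gene-by-gene comparison would miss.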
As with the mutation-finding algorithms, higher-level analyses are likely to differ among research groups. To find out which group has the best algorithm for any particular type of analysis, the algorithms have to be compared side by side in blinded competitions. For the BMEG project, Stuart has teamed up with Adam Margolin of Sage Bionetworks, a nonprofit bioinformatics company, to run a series of such competitions.
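A blinded comparison of this kind can be sketched as a small harness that runs each submitted algorithm on the same held-out inputs and scores it against labels the entrants never see. The scoring rule (plain accuracy), the team names, and the data are assumptions made for illustration, not details of the planned competitions.

```python
def score_submissions(submissions, test_inputs, hidden_labels):
    """Run every submitted predictor on the same inputs and rank by accuracy."""
    results = {}
    for name, predict in submissions.items():
        predictions = [predict(x) for x in test_inputs]
        correct = sum(p == y for p, y in zip(predictions, hidden_labels))
        results[name] = correct / len(hidden_labels)
    # Highest accuracy first.
    return sorted(results.items(), key=lambda item: item[1], reverse=True)

# Hypothetical test set: one numeric feature per case and a binary outcome.
test_inputs = [0.2, 0.7, 0.9, 0.4]
hidden_labels = [0, 1, 1, 0]

# Hypothetical entries; each is just a function from input to predicted label.
submissions = {
    "team_threshold": lambda x: int(x > 0.5),
    "team_always_one": lambda x: 1,
}

leaderboard = score_submissions(submissions, test_inputs, hidden_labels)
```

Because the organizers hold the labels and run the code themselves, no entrant can tune an algorithm to the test set.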
"We've done this before, and it will be fun. We'll run competitions and then hold a conference at the end to go over the results," Stuart said. "To participate, people will give us their code and we'll run it on a test dataset. That's how we'll build out this enterprise, starting with the first level of data analysis and building up to predictive algorithms for things like patient survival and response to drug treatments."
By establishing a robust pipeline for analysis of cancer genomics data, the BMEG will help researchers discover clinically useful molecular signatures. In addition to their work with TCGA, Stuart and Haussler are involved in several other collaborative cancer genomics projects that will feed into the BMEG. These include the Library of Integrated Network-based Cellular Signatures (LINCS), the Stand Up To Cancer (SU2C) Dream Teams, and the I-SPY breast cancer trial. These projects provide a wealth of curated genomic and clinical data that can be used for developing and assessing cancer genomics tools.
The lead architect on the BMEG project at UCSC is Kyle Ellrott, a software developer who worked closely with Stuart and Goldstein to craft the original proposal. Ellrott has been leading the coordination of datasets and results for TCGA's new "Pan-Cancer" project to compare many different forms of cancer. He has broad experience in applying cutting-edge computational tools to cancer genome analysis.
This research is supported by the National Cancer Institute of the National Institutes of Health under award number R01CA180778.