A massive international collaboration has enabled scientists to assign specific functions to 80 percent of the human genome, providing new insights into the mechanisms of gene regulation and giving biomedical researchers a solid genetic foundation for understanding how the body works in health and disease.
The results of the Encyclopedia of DNA Elements (ENCODE) project are described in a coordinated set of 30 papers published in several journals on September 5, 2012. Scientists at the University of California, Santa Cruz, have operated the Data Coordination Center for ENCODE since an initial pilot project began in 2003, and they have made all of the ENCODE data available for public use through the UCSC Genome Browser.
"Our job was to gather data from 32 labs running different types of experiments on a staggering array of cells and tissues, and we had to establish a common data language so we could get it all into a single database that scientists across the world could use. We also developed a lot of new ways of looking at the data, creating search and visualization tools so that people could find the data most relevant to them," said Jim Kent, director of the UCSC Genome Browser project and head of the ENCODE Data Coordination Center.
ENCODE is supported by the National Human Genome Research Institute (NHGRI), one of the National Institutes of Health. Hundreds of researchers across the United States, United Kingdom, Spain, Singapore, and Japan performed more than 1,600 sets of experiments on 147 types of tissue using technologies standardized across the consortium. In total, ENCODE generated more than 15 trillion bytes of raw data, and the data analysis consumed the equivalent of more than 300 years of compute time.
"We've come a long way, and we have learned an incredible amount by integrating the different types of data that ENCODE produced, which was done at a scale never before achieved in biology. This data integration was one of the keys to the success of the project," said Ewan Birney of the European Bioinformatics Institute in the United Kingdom, lead analysis coordinator of the ENCODE data.
For Kent and his data coordination team at UCSC's Center for Biomolecular Science and Engineering, the scale of the project presented many challenges. To start with, they had to coordinate a small army of researchers who were producing data in labs around the world. "We had five data wranglers who traveled around to the labs, probably four conference calls a week at the height of it, plus large group meetings twice a year, and countless emails and Skype calls," Kent said.
Researchers were able to map more than 4 million regulatory regions in the human genome where proteins specifically interact with the DNA. These findings represent a significant advance in understanding the precise and complex controls over how and when genes are active within a cell.
"The regulatory elements are responsible for ensuring that you get crystalline protein in the lens of your eye and hemoglobin in your blood, and not the other way around," Kent said. "It's quite complex. The information processing and the intelligence of the genome reside in the regulatory elements. With this project, we probably went from understanding less than five percent to now around 75 percent of them."
The ENCODE data are rapidly becoming a fundamental resource for researchers working to understand human biology and disease. More than one hundred papers using ENCODE data have already been published by investigators who were not part of the ENCODE project. For example, researchers studying the genetic basis of human diseases use genome-wide association studies to identify disease-associated variants, or markers, in the genome. They are now using the ENCODE resource to determine which of the many variants identified in a study actually contribute to disease. These variants map not only to protein-coding regions of the genome but, more often, to non-coding regions, the vast tracts of sequence between genes where ENCODE has identified many regulatory sites.
"As much as nine out of 10 times, disease-linked genetic variants are not in protein-coding regions," said Mike Pazin, an ENCODE program director at NHGRI. "Far from being 'junk' DNA, this regulatory DNA clearly makes important contributions to human disease."
The coordinated publication set includes one main integrative paper and five other papers in the journal Nature; 18 papers in Genome Research; and six papers in Genome Biology. The ENCODE data are so complex that the three journals have developed a pioneering way to present the information in an integrated form that they call "threads." Since the same topics were addressed in different ways in different papers, a new website will allow anyone to follow a topic through all of the papers in the ENCODE publication set in which it appears. In addition to the "threaded papers," six review articles are being published in the Journal of Biological Chemistry, and other affiliated papers in Science, Cell, and other journals.
Despite the immense size of the data set described in this historic set of publications, it does not comprehensively describe all of the functional elements in all of the different types of cells in the human body. Much additional work needs to be done, and ENCODE is about to be renewed for an additional four years. During the next phase, ENCODE will deepen the catalog by expanding both the types of functional elements and the cell types studied. It will also develop new tools for more sophisticated analyses of the data.