The University of California, Santa Cruz, has played a key role in an international project to catalog all of the biologically functional elements in 1 percent of the human genome. The results of the project, published today in a set of papers in two journals, promise to reshape our understanding of how the human genome functions.
The findings challenge the traditional view of our genetic blueprint as a tidy collection of independent genes, pointing instead to a complex network in which genes, along with regulatory elements and other types of DNA sequences that do not code for proteins, interact in overlapping ways not yet fully understood.
The UCSC Genome Bioinformatics Group, headed by David Haussler, professor of biomolecular engineering, and Jim Kent, associate research scientist, adapted the internationally recognized UCSC Genome Browser as the data repository for this project. The UCSC Genome Browser web site gives researchers unencumbered access to the wealth of data produced by the international consortium. It showcases this data so that genetic scientists can mine it for clues about how the body works in health and in disease.
In a group paper published in the June 14 issue of Nature and in 28 companion papers published in the June issue of Genome Research, the ENCyclopedia Of DNA Elements (ENCODE) consortium, which is organized by the National Human Genome Research Institute (NHGRI), reported results of its exhaustive, four-year effort to build a parts list of all biologically functional elements in 1 percent of the human genome. Carried out by 35 groups from 80 organizations around the world, the research served as a pilot to test the feasibility of a full-scale initiative to produce a comprehensive catalog of all components of the human genome crucial for biological function.
"The sheer number of ENCODE data providers and the diversity of experimental methods used to generate this data presented a challenge to the UCSC team," said Kate Rosenbloom, lead software developer on the UCSC ENCODE team. "We were continually customizing our software for effective visualization and efficient retrieval of new data types."
The Nature publication includes a pull-out poster that is a screenshot of the UCSC Genome Browser concisely displaying a broad range of the ENCODE data.
In addition to serving as a data repository, the UCSC team has provided programming for the comparative genomics aspect of the ENCODE project through the Multi-Species Sequence Analysis Group. By aligning the human genome with the genomes of other species, it is possible to glean an understanding of the relative importance and roles of different DNA sequences. The UCSC Genome Browser has been designed to provide such alignments, and it currently displays data for 38 species, from simple organisms such as yeast and worms to mice, chimps, and humans. Comparative genomics is a major focus of Haussler's research group.
Authors of the ENCODE papers include researchers from academic, government, and industry organizations located in Australia, Austria, Canada, Germany, Japan, Singapore, Spain, Sweden, Switzerland, the United Kingdom, and the United States. The ENCODE project has been open to all interested researchers who agree to abide by the consortium's guidelines.
The UCSC team, led by Haussler and Kent, includes software developers Kate Rosenbloom and Rachel Harte, project manager Donna Karolchik, quality assurance manager Robert Kuhn, graduate student Daryl Thomas, and postdoctoral scholar Ting Wang, along with the entire genome browser staff and other graduate students and postdoctoral researchers in the Haussler lab.
Several of the UCSC participants attended an "ENCODE analysis jamboree" in Washington, D.C., in July 2005, where they provided custom programming services to consortium members and trained them in the use of the UCSC ENCODE browser. The UCSC group also hosted two ENCODE analysis groups for several days of work focused on genes, gene transcription, and transcription regulation.
"This impressive effort has uncovered many exciting surprises and blazed the way for future efforts to explore the functional landscape of the entire human genome," said NHGRI director Francis Collins. "Because of the hard work and keen insights of the ENCODE consortium, the scientific community will need to rethink some long-held views about what genes are and what they do, as well as how the genome's functional elements have evolved. This could have significant implications for efforts to identify the DNA sequences involved in many human diseases."
The completion of the Human Genome Project in April 2003--aided by the bioinformatics contribution of Haussler and Kent--was a major achievement, but the sequencing of the genome marked just the first step toward the goal of using such information to diagnose, treat, and prevent disease. In recent years, researchers have made major strides in using DNA sequence data to identify genes, which are traditionally defined as the parts of the genome that code for proteins. The protein-coding component of these genes makes up just a small fraction of the human genome--1.5 percent to 2 percent.
Evidence exists that other parts of the genome also have important functions. Until now, however, most studies have concentrated on functional elements associated with specific genes and have not provided insights about functional elements throughout the genome. The ENCODE project represents the first systematic effort to determine where all types of functional elements are located and how they are organized.
In the pilot phase, ENCODE researchers devised and tested high-throughput approaches for identifying functional elements in the genome. Those elements included genes that code for proteins; genes that do not code for proteins; regulatory elements that control the transcription of genes; and elements that maintain the structure of chromosomes and mediate the dynamics of their replication.
The collaborative study focused on 44 targets, which together cover about 1 percent of the human genome sequence, or about 30 million DNA base pairs. The targets were strategically selected to provide a representative cross section of the entire human genome. All told, the ENCODE consortium generated more than 200 data sets and analyzed more than 600 million data points.
"Our results reveal important principles about the organization of functional elements in the human genome, providing new perspectives on everything from DNA transcription to mammalian evolution. In particular, we gained significant insight into DNA sequences that do not encode proteins, which we knew very little about before," said Ewan Birney, head of genome annotation at the European Molecular Biology Laboratory's European Bioinformatics Institute (EBI) in Hinxton, England, who led ENCODE's massive data integration and analysis effort.
The ENCODE consortium's major findings include the discovery that the majority of DNA in the human genome is transcribed into functional molecules, called RNA, and that these transcripts extensively overlap one another. This broad pattern of transcription challenges the long-standing view that the human genome consists of a relatively small set of discrete genes, along with a vast amount of so-called junk DNA that is not biologically active.
The new data indicate the genome contains very little unused sequence and is, in fact, a complex, interwoven network. In this network, genes are just one of many types of DNA sequences that have a functional impact.
"Our perspective of transcription and genes may have to evolve," the researchers state in their Nature paper, noting the network model of the genome "poses some interesting mechanistic questions" that have yet to be answered.
The main portal for ENCODE data is UCSC's ENCODE Genome Browser; the analysis effort is coordinated from Ensembl, a joint project of EBI and the Wellcome Trust Sanger Institute. Much of the primary data has been deposited in databases at NIH's National Center for Biotechnology Information and EBI. For more detailed information on the ENCODE project, including the consortium's data release and accessibility policies and a list of NHGRI-funded participants, go to the ENCODE project web site.