Genome researchers publish analysis of finished human genome sequence, plan next steps to figure out what it all means

A pair of papers published this week in the two leading scientific journals marks the completion of the Human Genome Project and the start of a new project to find all of the functional elements in human DNA. Researchers at the University of California, Santa Cruz, are involved in both projects.

In the October 21 issue of the journal Nature, the International Human Genome Sequencing Consortium published its scientific description of the finished human genome sequence, reducing the estimated number of human protein-coding genes from 35,000 to only 20,000 to 25,000, a surprisingly low number for our species. In the paper, researchers describe the final product of the Human Genome Project, the 13-year effort to read the information encoded in the human chromosomes that reached its culmination in 2003.

The Nature publication provides rigorous scientific evidence that the genome sequence produced by the Human Genome Project has both the high coverage and the accuracy needed to perform sensitive analyses, such as those focusing on the number of genes, segmental duplications involved in disease, and the "birth" and "death" of genes over the course of evolution.

"Obtaining the sequence recording our complete genetic heritage has been a huge step for humanity. There is no doubt that this will ultimately transform medicine," said David Haussler, professor of biomolecular engineering and a Howard Hughes Medical Institute investigator, who led UCSC's participation in the Human Genome Project.

The other major paper, published in the October 22 issue of the journal Science, outlines the plans of a research consortium organized by the National Human Genome Research Institute (NHGRI) to produce a comprehensive catalog of all parts of the human genome crucial to biological function. The ENCyclopedia Of DNA Elements (ENCODE) consortium has the ambitious goal of building a "parts list" of all sequence-based functional elements in the human DNA sequence.

"To really use the human genome sequence for medicine, we need to understand how it works--that is, what all the As, Cs, Gs, and Ts are actually doing in the cells in our bodies. This is much harder than reading the DNA sequence," Haussler said. "Through the ENCODE consortium, the same kind of team approach used in the Human Genome Project is being applied to address this much more difficult challenge."

The list of functional elements compiled by the ENCODE project will include: protein-coding genes; non-protein-coding genes; regulatory elements involved in the control of gene transcription; and DNA sequences that mediate chromosomal structure and dynamics. The ENCODE researchers also anticipate they may uncover additional functional elements that have yet to be recognized.

"Creating this monumental reference work will help us mine and fully utilize the human genome sequence. Such knowledge will lead to a far deeper understanding of human biology and stimulate the development of new strategies for improving human health," said NHGRI Director Francis S. Collins.

UC Santa Cruz researchers have been involved in the analysis of the human genome since late 1999. James Kent, then a graduate student in molecular, cell, and developmental biology working with Haussler, assembled the first working draft of the human genome in 2000 and created the UCSC Genome Browser, a widely used web-based tool for genomic research. Kent, now a research scientist in UCSC's Center for Biomolecular Science and Engineering (CBSE), which Haussler directs, is a coauthor on the Science and Nature papers, along with Haussler and other CBSE scientists and graduate students.

The UCSC researchers helped assemble the finished human genome sequence and made it publicly available to researchers worldwide through the UCSC Genome Browser. They also performed a key analysis of the coverage and accuracy of the finished sequence. The browser displays the finished genome in alignment with dozens of annotation tracks contributed by researchers at UCSC and collaborators worldwide.

One of the central goals of the effort to analyze the human genome is the identification of all genes, which are generally defined as stretches of DNA that code for particular proteins. According to the new findings, researchers have confirmed the existence of 19,599 protein-coding genes in the human genome and identified another 2,188 DNA segments that are predicted to be protein-coding genes.
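
As a quick arithmetic check, the two counts above can be added and compared with the 20,000 to 25,000 range quoted earlier. This is only an illustrative sketch; the variable names are made up for the example and do not come from the Nature paper.

```python
# Counts quoted in the text above.
confirmed_genes = 19_599   # confirmed protein-coding genes
predicted_genes = 2_188    # additional segments predicted to be genes

# Combined count if every prediction turned out to be a real gene.
total_genes = confirmed_genes + predicted_genes

# Range quoted for the revised estimate of human protein-coding genes.
estimate_low, estimate_high = 20_000, 25_000
within_estimate = estimate_low <= total_genes <= estimate_high

print(f"confirmed + predicted = {total_genes:,}")
print(f"within the {estimate_low:,} to {estimate_high:,} estimate: {within_estimate}")
```

The sum, 21,787, falls inside the quoted range, while the confirmed count alone sits slightly below its lower bound.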

"The analysis found that some of the earlier gene models were erroneous due to defects in the unfinished, draft sequence of the human genome," said Jane Rogers, head of sequencing at the Wellcome Trust Sanger Institute in Hinxton, England. "The task of identifying genes remains challenging, but has been greatly assisted by the finished human genome sequence."

The Nature paper also provides the scientific community with a peer-reviewed description of the finishing process, and an assessment of the quality of the finished human genome sequence, which was deposited into public databases in April 2003. The assessment confirms that the finished sequence now covers more than 99 percent of the euchromatic (or gene-containing) portion of the human genome and was sequenced to an accuracy of 99.999 percent, which translates to an error rate of only 1 base per 100,000 base pairs--10 times more accurate than the original goal.
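
The conversion from percent accuracy to an error rate can be sketched in a few lines. The example below uses exact fractions to avoid floating-point rounding. The original goal of 1 error per 10,000 bases is not stated directly in the text; it is derived here from the "10 times more accurate" comparison.

```python
from fractions import Fraction

# Accuracy quoted in the text, kept exact to avoid floating-point rounding.
accuracy_percent = Fraction("99.999")

# Fraction of bases expected to be wrong.
error_rate = 1 - accuracy_percent / 100

# Average number of bases per single error.
bases_per_error = 1 / error_rate

# The text says this is 10 times more accurate than the original goal,
# which implies a goal of one error per this many bases (derived, not quoted).
goal_bases_per_error = bases_per_error / 10

print(f"error rate: 1 base per {int(bases_per_error):,}")
print(f"implied original goal: 1 base per {int(goal_bases_per_error):,}")
```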

The contiguity of the sequence is also massively improved. The average DNA letter now sits on a stretch of 38.5 million base pairs of uninterrupted, high-quality sequence--about 475 times longer than the 81,500 base-pair stretch that was available in the working draft. Access to uninterrupted stretches of sequenced DNA can greatly assist researchers hunting for genes and the neighboring DNA sequences that may regulate their activity, dramatically cutting the effort and expense required to find regions of the human genome that may contain small and often rare variants involved in disease.
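
The "average DNA letter" description reads like the contiguity statistic that genome assemblers commonly call N50: the length such that at least half of all bases lie in uninterrupted stretches that long or longer. That reading is an assumption, not something stated above. The sketch below computes N50 for a small, hypothetical list of stretch lengths and also checks the ratio of the two figures quoted in the text.

```python
def n50(lengths):
    """Return the length L such that stretches of length >= L
    together contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest stretch down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical toy data, not real assembly figures.
toy_lengths = [40, 30, 20, 10]
toy_n50 = n50(toy_lengths)

# Figures quoted in the text: finished sequence vs. working draft.
finished_stretch = 38_500_000
draft_stretch = 81_500
fold_improvement = finished_stretch / draft_stretch

print(f"toy N50: {toy_n50}")
print(f"fold improvement: {fold_improvement:.1f}")
```

With the toy data, the longest stretch holds 40 of 100 bases, which is less than half, and adding the next stretch brings the total to 70, so the N50 is 30. The ratio of the quoted figures works out to roughly 472, in line with the "about 475 times" in the text.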

In addition to reducing the count of human genes, scientists reported that the improved quality of the finished human genome sequence, compared with earlier drafts, provides a much clearer picture of certain phenomena such as duplication of DNA segments and the "birth" and "death" of genes.

Segmental duplications are large, almost identical copies of DNA, which are present in at least two locations in the human genome. A number of human diseases are known to be associated with mutations in segmentally duplicated regions. Segmental duplications also provide a window into understanding how our genome evolved and is still changing.

The accuracy of the finished human genome sequence produced by the Human Genome Project has also given scientists some initial insights into the birth and death of genes in the human genome. Scientists have identified more than 1,000 new genes that arose in the human genome after our divergence from rodents some 75 million years ago. Most of these arose through recent gene duplications and are involved with immune, olfactory, and reproductive functions.

Additionally, researchers used the finished human genome to identify and characterize 33 nearly intact genes that have recently acquired one or more mutations, causing them to stop functioning, or "die."

More than 2,800 researchers who took part in the International Human Genome Sequencing Consortium share authorship on the Nature paper, which expands upon the group's initial analysis published in February 2001. In addition to Haussler and Kent, coauthors on the Nature paper who are affiliated with UCSC include Robert Baertsch, Hiram Clawson, Mark Diekhans, Terrence Furey, Angela Hinrichs, Fan Hsu, Yontao Lu, Kate Rosenbloom, Krishna Roskin, Adam Siepel, Charles Sugnet, Daryl Thomas, Heather Trumbower, and Ryan Weber.

The coauthors of the Science paper on the ENCODE project include Haussler, Kent, Daryl Thomas, Kate Rosenbloom, Hiram Clawson, and Adam Siepel.

Additional information about the National Human Genome Research Institute is available on the web at www.genome.gov.

_____

Note to reporters: You may contact Haussler at (831) 459-2105 or haussler@cse.ucsc.edu.