The advent of online databases to access the human genome has been a boon to biomedical research, and the usefulness of this information has just moved to a new level. Researchers at the University of California, Santa Cruz (UCSC), the European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI), and the Wellcome Trust Sanger Institute (WTSI) in Great Britain have released the results of a project to identify a core set of genes that can be located in the human genome and have been validated as coding for proteins.
After more than a year of work, the collaboration has released a set of 14,795 genes that can be reliably said to code for a protein. This gene set, called the Consensus Coding Sequence (CCDS) set, was posted today on the three major public human genome browsers: the UCSC Genome Browser, the Ensembl Browser at EBI and WTSI, and the NCBI web site.
The CCDS set is built by consensus among the collaborating members at UCSC, NCBI, EBI, and WTSI. UCSC's involvement in this international collaboration is led by David Haussler, professor of biomolecular engineering and a Howard Hughes Medical Institute investigator.
"Now that biomedical science has an internationally accepted human genome reference sequence to work from, it's time to identify a corresponding reference set of human genes from that genome," Haussler said.
The CCDS project addresses the fact that the genes listed in human genome databases often are not entirely validated, and the same gene may have different names in different databases. Since the data characterizing the genes come from a variety of sources, researchers are not always certain that a listed gene is real and that its stated function is accurate.
The CCDS genes have been given unique identifiers and version numbers to help locate them on genome maps. Each of the genome browser sites will receive regular updates as the collaboration continues to refine its knowledge of the protein-coding genes.
Until the Human Genome Project succeeded in sequencing and assembling the entire human genome, researchers could sequence the DNA in a gene, but had no way to accurately determine its location in the genome. Once the genome was sequenced, researchers began to note which parts of the genome contained known genes, a process known as genome annotation.
Haussler's group at UCSC pioneered the use of a mathematical approach known as hidden Markov models as a way to find genes in DNA sequences using automated computer programs. The technique is now widely used for this purpose, but Haussler said the problem of finding all the genes in the DNA sequence of the human genome has proven to be "much more difficult than we ever imagined."
"It will take the coordinated efforts of experimentalists and computational biologists many more years to complete this task," he said.
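To give a rough sense of how a hidden Markov model can label stretches of DNA, the sketch below runs the Viterbi algorithm over a toy two-state model ("coding" versus "noncoding"). It is only an illustration under stated assumptions: the states, the probabilities, and the function name are invented for this example and are not taken from the UCSC software, and real gene finders use far richer models.

```python
import math

# Toy two-state HMM. Every number below is an invented, illustrative
# assumption, not a parameter of any real gene-finding program.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state label for each base in seq."""
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


labels = viterbi("ATATATATATGCGCGCGCGC")
```

In this toy model the "coding" state simply favors G and C, so the labels change where the base composition changes. Models used in practice track features such as exons, introns, and codon position.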
The CCDS set is calculated following coordinated whole-genome annotation updates carried out by NCBI and Ensembl. Annotation updates represent genes that are defined by a mixture of manual curation, carried out by the WTSI Havana team and the NCBI RefSeq group, and automated computational processing performed by groups at Ensembl and NCBI.
"Resolving inconsistencies between gene structures generated by complementary methods of manual curation and automatic annotation is a major step towards providing stable and accurate annotation that can be relied on by researchers," said Tim Hubbard, head of human genome analysis at WTSI.
According to Mark Diekhans, the lead researcher on this project from the UCSC Genome Bioinformatics Group, inconsistencies arise because different centers have used different methods to identify where genes reside. "The names and locations do not always agree, especially in cases where the gene's function isn't well understood," Diekhans said.
As a result, the huge gene databases contain genes that appear to be duplicates or are quite similar, either in their DNA sequences or in their expression in the living organism. Having the entire human genome assembled and viewable with online browsers presents an opportunity for researchers to find each gene sequence in its correct location on the chromosomes and determine, for example, whether one gene has been given two different names or whether two similar entries are in fact separate genes.
The UCSC contribution to the CCDS project has been mostly quality control, Diekhans said. "We compared the gene sets postulated by NCBI and Sanger to find the intersections between them. Then we applied various bioinformatics approaches to find where the sequences in this intersecting set might not actually be protein-coding," he said.
One approach the UCSC group used was to compare the intersecting set to a list of likely pseudogenes, elements in the DNA sequence that resemble genes but do not code for functional proteins. These pseudogenes were predicted by software developed at UCSC by graduate student Robert Baertsch. Sequences that appeared to be pseudogenes were removed from consideration for the CCDS set, and any removed sequence will likely undergo more study before it is finally accepted as a gene or ruled out.
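The two quality-control steps described above, intersecting the gene sets and then setting aside likely pseudogenes, can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the `CodingRegion` type, the function names, the coordinates, and the rule that two annotations agree only when their coding exons match exactly are all assumptions made for the example.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class CodingRegion(NamedTuple):
    """Hypothetical record: one gene's coding exons on a chromosome."""
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs, half-open


def consensus(set_a: Iterable[CodingRegion],
              set_b: Iterable[CodingRegion]) -> Set[CodingRegion]:
    """Keep only coding regions annotated identically in both sets (assumed rule)."""
    return set(set_a) & set(set_b)


def overlaps(region: CodingRegion, chrom: str, start: int, end: int) -> bool:
    """True if any exon of region overlaps the interval [start, end) on chrom."""
    if region.chrom != chrom:
        return False
    return any(s < end and start < e for s, e in region.exons)


def drop_pseudogenes(regions: Iterable[CodingRegion],
                     pseudogenes: Iterable[Tuple[str, int, int]]) -> Set[CodingRegion]:
    """Remove regions that overlap any predicted pseudogene interval."""
    pseudo = list(pseudogenes)
    return {r for r in regions if not any(overlaps(r, *p) for p in pseudo)}


# Invented example data: two annotation sets and one predicted pseudogene.
gene_a = CodingRegion("chr1", "+", ((100, 200), (300, 400)))
gene_b = CodingRegion("chr1", "-", ((500, 650),))
gene_c = CodingRegion("chr2", "+", ((10, 90),))
gene_d = CodingRegion("chr2", "+", ((10, 95),))  # differs from gene_c by 5 bases

first_set = {gene_a, gene_b, gene_c}
second_set = {gene_a, gene_b, gene_d}
predicted_pseudogenes = [("chr1", 600, 700)]

agreed = consensus(first_set, second_set)             # gene_a and gene_b
kept = drop_pseudogenes(agreed, predicted_pseudogenes)  # gene_a only
```

Here `gene_c` and `gene_d` drop out because the two sets disagree on their boundaries, and `gene_b` is set aside because it overlaps a predicted pseudogene, mirroring the "when in doubt, leave it out" approach.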
The UCSC team also compared the intersecting set with analogous locations on the genomes of other organisms: chimpanzee, chicken, dog, mouse, rat, and rhesus monkey. These comparisons between species aid in gene validation. "When gene segments are conserved across multiple species, it indicates that they are likely to be real and not pseudogenes," Diekhans said.
The UCSC Genome Browser makes comparative genomics much simpler, because it allows side-by-side comparisons of analogous genome segments from various species.
"Comparing human genes to the genes of related species will be the key to finalizing the human gene set," Haussler said. "All of biology is the result of evolution. Genes cannot be fully apprehended outside of their evolutionary context."
The collaborating groups used a conservative process in establishing the CCDS set. "We were going for high quality and high confidence," Diekhans said. "When in doubt about a gene, we left it out of our set. This makes the CCDS a valuable reference set for disease research."
In addition to Diekhans and Haussler, the UCSC team includes Adam Siepel, Robert Baertsch, Fan Hsu, Chuck Sugnet, and the entire UCSC Genome Browser team, led by Jim Kent. Lead researchers at collaborating institutions include David Lipman, Jim Ostell, and Kim Pruitt at NCBI; Hubbard, Richard Durbin, Steve Searle, and Jennifer Ashurst at WTSI; and Ewan Birney at EBI.
"UCSC is very proud to be playing a role in this collaboration with such outstanding collaborators as NCBI, EBI, and the Wellcome Trust Sanger Institute," Haussler said.