Bioinformatics experts gain ground in protein sequence analysis

Proteins, with their extraordinary diversity of structure and function, pose some of the toughest problems in the field of bioinformatics, giving rise to a growing arsenal of computational tools for protein analysis. An array of computer-based strategies is now available to help molecular biologists who have found an unknown protein, determined its sequence of amino acid subunits, and want to know its three-dimensional structure and biological function.

Computational techniques alone may not provide all the answers, but they are powerful enough to have earned a place in the standard toolkit for protein research. The Sequence Alignment and Modeling System (SAM), introduced in the early 1990s by researchers at the University of California, Santa Cruz, has become one of the most popular software packages for the analysis of protein sequences.

SAM now faces stiff competition, but UCSC researchers keep improving the software and are working on other software programs to complement it. Both academic researchers and commercial companies are among the users of the SAM software.

"We have licensed the SAM software to more than 200 academic research groups and about 20 commercial companies. We also have a web server that sees over 1,000 uses per week for protein structure prediction," said Richard Hughey, professor and chair of computer engineering at UCSC.

The list of companies that have licensed the SAM software from UCSC reads like a Who's Who of the biotechnology industry: Affymetrix, Celera, Genentech, Novartis, Pfizer, and Pharmacia, among others. While commercial companies must pay a fee to use the software (as much as $125,000), academic licenses are free, Hughey said.

Proteins carry out most of the crucial functions of living cells. They are typically large molecules with very complex shapes. Their structural and functional diversity surpasses that of any other kind of molecule. Enzymes, antibodies, hormones, muscle, tendons, cartilage, hair, and feathers are all made of proteins.

At the simplest level, proteins are long chains of subunits called amino acids. There are 20 different amino acids, and their sequence in the linear chain of a protein molecule ultimately determines its structure. Sections of the molecule may twist into coils or fold into sheets, and the entire protein folds into a precise and often highly complex three-dimensional structure.

Software programs such as SAM take advantage of the structural similarities of related proteins and the existence of large databases of information on known proteins. Proteins that share a common ancestor have many similarities in their amino acid sequences. These similarities make it possible to create statistical models of families of related proteins. A software program can compare an unknown protein's sequence with such statistical models and may be able to predict the protein's structure based on its similarity to known proteins.
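The idea of scoring an unknown sequence against a statistical model of a protein family can be illustrated with a minimal Python sketch. It is not SAM's method: it assumes an ungapped alignment of equal-length sequences, a uniform background frequency, and a simple pseudocount, and the function names and the short family sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids


def build_profile(aligned_family, pseudocount=1.0):
    """Estimate per-column amino acid probabilities from an ungapped alignment.

    Assumes every sequence has the same length and uses only the 20 letters above.
    """
    length = len(aligned_family[0])
    total = len(aligned_family) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_family)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, query, background=1.0 / 20):
    """Sum of log(model probability / background probability) over positions.

    Positive totals mean the query fits the family model better than chance.
    """
    return sum(math.log(col[aa] / background) for col, aa in zip(profile, query))


# Hypothetical five-residue "family", invented for this example.
family = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
profile = build_profile(family)
related_score = log_odds_score(profile, "MKVLA")
unrelated_score = log_odds_score(profile, "WPDCE")
```

In this toy case the sequence that resembles the family receives a positive score and the unrelated one a negative score. Real tools must also handle insertions and deletions, which this sketch leaves out.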

SAM uses statistical models known as hidden Markov models (HMMs), first introduced to the field of bioinformatics by David Haussler, holder of the UC Presidential Chair in computer science and director of UCSC's Center for Biomolecular Science and Engineering. The SAM software was initially developed by Haussler, Anders Krogh (then a postdoctoral researcher, now at the University of Copenhagen), and others. Haussler later focused on DNA sequence analysis, and further development of the SAM software was taken over by Hughey and Kevin Karplus, professor of computer engineering.
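For readers unfamiliar with hidden Markov models, the following Python sketch shows the standard forward algorithm, which computes how likely a sequence is under a given model. It is a generic toy, not SAM's model architecture: the two states ("core" and "surface"), the reduced two-letter alphabet (H for hydrophobic, P for polar), and all probabilities are assumptions chosen to keep the example small.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, start, trans, emit):
    """Log-probability of a non-empty sequence under an HMM (forward algorithm).

    start[s]    : probability of beginning in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][x]  : probability that state s emits symbol x
    """
    states = list(start)
    # Initialise with the first symbol.
    alpha = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    # Extend one symbol at a time, summing over all possible previous states.
    for symbol in seq[1:]:
        alpha = {
            s: math.log(emit[s][symbol])
            + log_sum_exp([alpha[p] + math.log(trans[p][s]) for p in states])
            for s in states
        }
    return log_sum_exp(list(alpha.values()))


# Toy model with made-up numbers: "core" regions favour hydrophobic residues (H),
# "surface" regions favour polar residues (P), and states tend to persist.
START = {"core": 0.5, "surface": 0.5}
TRANS = {
    "core": {"core": 0.8, "surface": 0.2},
    "surface": {"core": 0.2, "surface": 0.8},
}
EMIT = {
    "core": {"H": 0.8, "P": 0.2},
    "surface": {"H": 0.2, "P": 0.8},
}
```

Under this toy model, a run such as "HH" is more probable than an alternating "HP", because the states tend to persist. Protein-family HMMs use a richer state layout that also models insertions and deletions.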

SAM has a history of success in an unusual series of group experiments performed every two years to establish the state of the art in protein structure prediction. The Fifth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP5) concluded in December 2002.

The top performers in one category of the CASP5 experiment were "metaservers" that combined several different servers, including SAM, and looked for agreement between different methods, Karplus said. "The success of the metaservers was somewhat unexpected--these automatic methods outperformed most human predictors," he said.
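The notion of looking for agreement between methods can be sketched as a simple majority vote. This is only an illustration of the consensus idea under stated assumptions: the server names and fold labels are hypothetical, and the actual CASP5 metaservers are not described in enough detail here to reproduce.

```python
from collections import Counter


def consensus_prediction(server_predictions):
    """Return the most frequently predicted label and the fraction of servers agreeing.

    server_predictions maps a server name to its top predicted fold label.
    """
    votes = Counter(server_predictions.values())
    fold, count = votes.most_common(1)[0]
    return fold, count / len(server_predictions)


# Hypothetical inputs, invented for this example.
predictions = {
    "server_a": "fold_1",
    "server_b": "fold_1",
    "server_c": "fold_2",
}
best_fold, agreement = consensus_prediction(predictions)
```

A higher agreement fraction can be read as a rough confidence signal: predictions that several independent methods share are less likely to be artifacts of any one method.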

UCSC entered two versions of SAM (SAM-T99 and a newer version, SAM-T02) in CASP5, as well as a new program Karplus is developing called Undertaker. Undertaker is designed to predict protein folding based on the tendency for parts of a protein molecule that are hydrophobic--literally "water-fearing"--to be buried inside the structure where they won't come in contact with water.
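Hydrophobicity can be estimated directly from sequence. The Python sketch below uses the widely published Kyte-Doolittle hydropathy scale and a sliding-window average to find the most hydrophobic stretch of a sequence. It is not Undertaker's algorithm, which is not described here; the window size, function names, and example sequence are assumptions for illustration, and the sequence is assumed to be at least as long as the window.

```python
# Kyte-Doolittle hydropathy values; higher means more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=5):
    """Average hydropathy of every window of consecutive residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def most_hydrophobic_window(sequence, window=5):
    """Return (start index, residues, average score) of the most hydrophobic window."""
    profile = hydropathy_profile(sequence, window)
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, sequence[start:start + window], profile[start]


# Made-up example sequence with a hydrophobic stretch in the middle.
start, segment, score = most_hydrophobic_window("DEKRSILVIVGDEK")
```

Stretches with high average hydropathy are candidates for burial in the interior of the folded protein, although sequence-based scores like this are only a rough guide.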

"The burial of hydrophobic residues is one of the main driving forces in protein folding, and Undertaker is an attempt to use that to predict new folds," Karplus said.

He found that the combination of Undertaker with SAM did not perform as well as SAM alone on the easier targets, where there was a good alignment of the unknown sequence with a known template. The combined programs did surprisingly well, however, on some of the hardest problems, Karplus said.

"Where our methods had started failing in the past, that's where we started succeeding," he said. "We still have a lot of work to do, but I think we can improve even more over the next year."

_____

Note to reporters: You may contact Hughey at (831) 459-2939 or rph@cse.ucsc.edu and Karplus at (831) 459-4250 or karplus@soe.ucsc.edu.