Study offers guidance on state-of-the-art long-read RNA sequencing techniques

Professor of Biomolecular Engineering Angela Brooks served as a lead organizer for the LRGASP consortium (photo by Carolyn Lagattuta).
The idea for the LRGASP Consortium was originally discussed among scientists at a conference in 2019.
The techniques used for genetic sequencing of DNA and RNA have been rapidly improving over the past decade, but different methods have costs and benefits, and the scientific community has yet to determine which techniques will yield the best results for a given research question. Optimizing sequencing methods can help researchers across a variety of biological fields, from conservation science to precision medicine. 

For several years, UC Santa Cruz’s Angela Brooks has served as a lead organizer in an effort to provide scientists with a fair process for evaluating which methods and technologies perform best in long-read RNA sequencing experiments. Now, a global consortium of scientists has published the results of this effort, offering guidance for the future of RNA sequencing experimentation and analysis.

Brooks, who is a professor of biomolecular engineering and an affiliated faculty member at the UCSC Genomics Institute and Center for Molecular Biology of RNA, is the co-senior author on a new paper published in the journal Nature Methods that evaluates the strengths and weaknesses of the two leading long-read RNA sequencing platforms, Oxford Nanopore Technologies and Pacific Biosciences, and the computational methods used to evaluate the raw sequencing data. The researchers detailed how well the sequencing platforms and computational tools could define the location of genes and their protein variants, and offered recommendations for how researchers should use these methods in their experiments.

“That was a question that the field as a whole really wanted to ask: which methods should we be using?” Brooks said. 

“I really think this will have a huge impact on the field because of the fact that it was a community effort with so many people coming together and saying, ‘let's do this together and try to make it fair.’ I think this will be a guide for the field — everyone we worked with wanted to advance the field as a community.”

Evaluating methods

RNA is the molecule in cells that carries the genetic instructions encoded in DNA: when a gene is copied into RNA, the resulting molecule is known as a transcript, and many transcripts are then translated into proteins, among RNA's other functions. Long-read sequencing of RNA lets researchers read long stretches of RNA, capturing the more complex changes in transcripts that arise from alternative RNA processing. Researchers use long-read RNA sequencing to find and define genes and their isoforms, the different transcript forms a single gene can produce through processes such as alternative splicing. These isoforms are critical to a wide range of biological research on many species.
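To make the idea of isoforms concrete, here is a minimal, hypothetical sketch (not drawn from the paper) showing how a single gene can give rise to several transcript isoforms, each represented as a different combination of exons. The gene name, transcript IDs, and coordinates are invented for illustration.

```python
# Illustrative sketch (not from the paper): isoforms of one gene represented
# as different combinations of exons produced by alternative splicing.
# All names and coordinates are made up.
from collections import defaultdict

# Each isoform is a tuple of (start, end) exon coordinates on the genome.
isoforms = {
    "GENE_X-201": ((100, 200), (500, 650), (900, 1000)),  # all three exons
    "GENE_X-202": ((100, 200), (900, 1000)),              # middle exon skipped
    "GENE_X-203": ((100, 250), (500, 650), (900, 1000)),  # alternative exon boundary
}

# Group isoforms by the gene they belong to (here, the prefix before "-").
by_gene = defaultdict(list)
for transcript_id, exons in isoforms.items():
    gene_id = transcript_id.split("-")[0]
    by_gene[gene_id].append(exons)

for gene_id, forms in by_gene.items():
    print(f"{gene_id}: {len(forms)} isoforms, "
          f"{len(set(forms))} distinct exon structures")
```

Because long reads can span an entire transcript, they make it easier to tell these exon combinations apart than short reads, which only sample small pieces of each transcript.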

Although the entire human genome has now been sequenced from end to end, researchers still face challenges in finding and defining all of the roughly 20,000 genes in the human genome and their various isoforms. Similar challenges crop up in the study of other species, the vast majority of which do not have a reference genome.

These challenges stem in part from the error rates of sequencing methods: although errors are relatively rare, and accuracy has improved steadily for years, no method is 100% accurate. While the various tools generally agreed on genes that had already been mapped and defined, the researchers found that the biggest discrepancies arose for genes without a large body of existing research.

Brooks and the research team convened their consortium, called the Long-Read RNA-Seq Genome Annotation Assessment Project (LRGASP), as a continuation of several previous efforts to evaluate RNA sequencing techniques since the early 2000s, all aimed at determining the best methods for defining genes. They took lessons learned from these past efforts to inform LRGASP. 

One major change from previous consortia was that the computational tool developers themselves ran their tools on data produced by the LRGASP consortium, ensuring that the people most familiar with each tool could configure it for its best performance. Another shift was that the consortium clearly defined its benchmarks at the beginning of the project.
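The consortium's actual evaluation criteria are more detailed than this, but the following simplified sketch conveys the general idea of a predefined benchmark: compare each tool's predicted transcript structures against a trusted reference and report sensitivity and precision. The data and the exact-match rule used here are assumptions for illustration only.

```python
# Simplified sketch of a transcript-detection benchmark (the consortium's
# real criteria are more detailed). A predicted transcript counts as correct
# if its chain of splice junctions exactly matches a reference transcript.
# All data below are invented.

def benchmark(predicted, reference):
    """Return (sensitivity, precision) for two sets of junction chains."""
    true_positives = predicted & reference
    sensitivity = len(true_positives) / len(reference) if reference else 0.0
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    return sensitivity, precision

# Junction chains written as tuples of (donor, acceptor) genome coordinates.
reference_transcripts = {
    ((200, 500), (650, 900)),
    ((200, 900),),
}
tool_a_predictions = {
    ((200, 500), (650, 900)),  # matches the reference
    ((200, 500), (700, 900)),  # unsupported junction -> false positive
}

print(benchmark(tool_a_predictions, reference_transcripts))
# -> (0.5, 0.5): half the reference found, half the predictions supported
```

Fixing metrics like these before any data are analyzed is what keeps the comparison fair across tools and platforms.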

Throughout the project, the LRGASP consortium invited the wider genomics community to participate as much as possible in data production, holding community calls publicized through word of mouth and on X (formerly Twitter).

“It was very much an organic process of recruiting people,” Brooks said. “People just wanted to be part of this effort, and put a lot of work into it.”  

After this massive logistical effort, the consortium ultimately generated more than 427 million long-read sequences using both the Pacific Biosciences and Oxford Nanopore Technologies platforms, and evaluated computational methods from various tool developers, such as Brooks and UCSC Associate Professor of Biomolecular Engineering Chris Vollmers, also a co-senior author.

The data came from three species: humans, mice, and manatees. Humans and mice were easy choices, but manatee data allowed the researchers to test the methods on a species without a well-established body of research or a well-annotated reference genome, making it what researchers call a “non-model species.” Testing the techniques on a non-model species was important because this is a common use case for RNA sequencing studies.

Guidance for future studies

After extensive data collection and analysis, the LRGASP consortium produced a set of recommendations for the RNA sequencing community, outlining different best practices for various research scenarios. 

Overall, the researchers found that long-read sequencing approaches perform much better than short-read sequencing. They found that the various platforms were able to pick up a surprising number of novel transcripts not documented in existing annotated references of the human and mouse genomes.

“It’s nice to see that the long-read technology has pushed the field forward to be more confident,” Brooks said.

The researchers concluded that there is no one-size-fits-all approach to long-read RNA sequencing; instead, the paper describes a series of best practices suited to the different goals individual studies might have. The various tools and methods were found to differ in error rate, sequencing depth, and read length, so researchers should decide which of these factors matters most for their area of study.

For example, some scientists may want to detect every transcript a cell produces. In the past, the common approach was simply to sequence more deeply in order to catch transcripts that might otherwise be missed. However, the researchers found that more sequencing is not necessarily effective for this, and that it is better to use an approach with a lower error rate and longer read length, since this enables scientists to discover longer regions of variation.
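A rough, back-of-the-envelope calculation (ours, not the consortium's) illustrates why simply sequencing deeper yields diminishing returns: if a rare transcript makes up a fraction p of all molecules in a sample, the chance of sampling it at least once among N reads is 1 - (1 - p)^N, which quickly saturates as N grows. The value of p below is an assumption chosen only to show the shape of the curve.

```python
# Back-of-the-envelope illustration (not from the paper) of diminishing
# returns from deeper sequencing: the chance of sampling a rare transcript
# at least once saturates as the number of reads grows.
p = 1e-6  # hypothetical fraction of molecules coming from one rare transcript

for n_reads in (1_000_000, 5_000_000, 25_000_000, 125_000_000):
    prob_detected = 1 - (1 - p) ** n_reads
    print(f"{n_reads:>11,} reads -> ~{prob_detected:.1%} chance of seeing it at least once")
```

Past a certain depth, most additional reads re-sample transcripts that have already been seen, which is why read accuracy and length, rather than sheer volume, become the more useful levers.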
 
Moving forward, the LRGASP team hopes their insights will help both sequencing technology developers and software tool developers continue to improve the ever-evolving field of long-read RNA sequencing.

“I think this will help a lot of people who want to further develop the technology — there's still room for improvement on a lot of these methods,” Brooks said. 

Additional UC Santa Cruz researchers involved in this project include paper co-first author Mark Diekhans, Matthew Adams, Amit Behera, Namrita Dhillon, Colette Felton, Cindy Liang, Dennis Mulligan, Brandon Saint-John, and Alison Tang.