UShER, a software tool developed at UC Santa Cruz to map the genomic evolution of the SARS-CoV-2 virus, is the most widely used tool for tracking COVID-19 worldwide. A new paper from the tool’s creators outlines the “game-time” decisions that the programmers made to keep up with an evolutionary tree of millions of genomic sequences, offering guidance for the future of web tools for tracking pathogen evolution.
The paper, which was recently published in the journal Nature Genetics, aims to serve as a resource for bioinformaticians and public health officials. It lays out the results obtained by the UShER team, and especially by UCSC Bioinformatics Programmer and “keeper of the tree” Angie Hinrichs, highlighting the most important innovations along the way.
“Angie is recognized in the field as one of the major heroes of the pandemic, and rightly so. [The work] is pretty special and really worth documenting, because it probably, hopefully, won’t happen again,” said UCSC Associate Professor of Biomolecular Engineering Russell Corbett-Detig, who is the paper’s senior author.
UShER, which stands for Ultrafast Sample Placement on Existing tRees, is an online tool that takes in genomic sequences of major human pathogens (such as the SARS-CoV-2 virus) that have been sampled in communities around the world and maps them onto an evolutionary tree. The software places each new sample onto its correct branch of the existing tree, rather than starting the analysis over and re-computing everything each time, which is how other tools function. The team calls this approach “online phylogenetics,” where “online” means that the analysis is continually extended to include new information, as opposed to the classical “offline” approach of starting over from scratch whenever new information becomes available.
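The idea of placing a sample onto an existing tree can be illustrated with a toy sketch. This is not UShER's actual code: the `Node` class, the `place_sample` function, and the scoring rule (count the mutations by which a sample differs from each existing node and attach it where that count is smallest) are simplifying assumptions made only for illustration, and the mutation names are just example labels.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical toy node: `mutations` are the changes on the branch leading here.
    name: str
    mutations: frozenset = frozenset()
    children: list = field(default_factory=list)


def place_sample(root, sample_name, sample_mutations):
    """Attach a new sample under the existing node it differs from least.

    The existing tree is kept as-is; only one new node is added.
    """
    sample_mutations = frozenset(sample_mutations)
    best_cost, best_node, best_genotype = None, None, None
    # One depth-first traversal; each stack entry carries the mutations inherited so far.
    stack = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        genotype = inherited | node.mutations
        # Cost = mutations present in one of (node, sample) but not the other.
        cost = len(genotype ^ sample_mutations)
        if best_cost is None or cost < best_cost:
            best_cost, best_node, best_genotype = cost, node, genotype
        for child in node.children:
            stack.append((child, genotype))
    # Simplification: only the sample's extra mutations are recorded on the new
    # branch; reversions are counted in the cost but not stored.
    new_node = Node(sample_name, sample_mutations - best_genotype)
    best_node.children.append(new_node)
    return best_node, best_cost


# Build a tiny existing tree, then place one new sample onto it.
root = Node("root")
lineage_a = Node("A", frozenset({"C241T"}))
lineage_b = Node("B", frozenset({"A23403G"}))
lineage_a1 = Node("A1", frozenset({"G28881A"}))
root.children.extend([lineage_a, lineage_b])
lineage_a.children.append(lineage_a1)

parent, cost = place_sample(root, "new_sample", {"C241T", "G28881A", "T100C"})
print(parent.name, cost)
```

The point of the sketch is the shape of the work: adding a sample costs one pass over the existing tree and one new node, whereas an “offline” analysis would rebuild the whole tree from every sequence.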
To date, the massive tree contains more than 15 million sequences, which are interpreted by public health officials to track the spread of strains in their communities.
“This is a really widely used resource — people use it hundreds of times daily,” Corbett-Detig said. “I think we owed the community an explanation of all the decisions that we had made behind the scenes, because we've just been changing things to make it work for a long time.”
In the earlier days of the pandemic, the research team joked that the diagram of the virus’s evolution was more of a phylogenetic “lawn” than a tree because it was very wide and not very tall, due to the large number of samples and relatively small number of mutations.
Now, about four years since the beginning of the SARS-CoV-2 pandemic, tracking the evolution of the virus looks very different from the early days. Simply put, there are many more samples and many more mutations, and sorting through them slowed UShER down.
In the paper, the authors took a detailed look at their server logs to describe problems that arose and the strategies they used to solve them.
Through “profiling” UShER, a computer science technique for identifying the slowest parts of a program, the team found that the algorithm was running more “tree traversals” than necessary, which was slowing down the software. This analysis, led by Cheng Ye, a student of co-author and UC San Diego Assistant Professor Yatish Turakhia, allowed the team to add further levels of parallelism to the code so that it runs more than an order of magnitude faster, drastically lowering UShER’s runtime and solving their largest problem.
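For readers unfamiliar with profiling, the sketch below shows the general technique using Python's standard `cProfile` and `pstats` modules. It is not the team's code or tooling: the `traverse`, `place_all`, and `profile_report` functions and the small dictionary tree are hypothetical, and `place_all` is written to be deliberately wasteful so that the repeated traversals show up in the report.

```python
import cProfile
import io
import pstats


def traverse(tree, node=0):
    """Visit every node at or below `node` and return how many were seen."""
    return 1 + sum(traverse(tree, child) for child in tree.get(node, []))


def place_all(tree, n_samples):
    # Deliberately wasteful: one full tree traversal for every sample.
    return [traverse(tree) for _ in range(n_samples)]


def profile_report(func, *args):
    """Run `func` under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Toy tree stored as {parent: [children]}; node 0 is the root.
toy_tree = {0: [1, 2], 1: [3, 4], 2: [5]}
counts, report = profile_report(place_all, toy_tree, 100)
print(report)
```

In the printed report, the call count for `traverse` grows with the number of samples, which is the kind of signal that points to redundant traversals as the place to optimize.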
These insights helped the team maintain software that is a crucial piece of the public health response to COVID-19, and could inform similar efforts in any future pandemic. Because commercially available sequencing technology is now very cheap and accessible nearly everywhere, the main limitations of this work lie on the analysis side rather than the data production side.
“We want to encourage doing [phylogenetics] this way for the next pandemic, or for any pathogen,” Hinrichs said. “We believe this can and should be applied to any pathogen as genome sequencing increases. We should be maintaining a resource where anybody can upload their new genome and instantly see what it's most closely related to.”
Going beyond just pandemic-level respiratory viruses, similar software is being developed to track other pathogens such as influenza and tuberculosis. Some of these other pathogens are harder to study than the relatively simple SARS-CoV-2 virus, for which scientists essentially have an ancestral reference point from when the virus first emerged in 2019. Other viruses and bacteria have larger, more complex genomes with more mutations, and so being able to quickly process data will be advantageous for handling these pathogens.
This paper offers insight into online analysis as a tool that can be used beyond phylogenetics, as biomedical data sets continue to grow. The team argues that for many data sets it will be more efficient to run online analyses, which fit new data into an established data set and structure, than to re-analyze the entire data set from scratch each time more data is added.
“I don't want people to think this is a fundamentally solved problem,” Corbett-Detig said. “I think that with the bigger idea of online phylogenetics, there's going to be lots and lots of innovation in this field going forward. It's really useful to have us sit down and say ‘here are the things we’ve tried, and we hope there are better things soon!’”