An analysis of millions of SARS-CoV-2 genomes finds that recombination of the virus is uncommon, but when it occurs, it is most often in the spike protein region, the part of the genome that allows the virus to attach to and infect host cells.
The study, led by scientists at UC Santa Cruz, was published August 11 in the journal Nature. It describes new software created by the researchers to search the COVID-19 phylogenetic tree, a diagram of the virus’s evolutionary history, for instances of recombination. The software is open source, so public health officials can use it to track instances of recombination within their communities.
Recombination occurs when two genetically distinct forms of the virus hybridize. This study focused on detectable recombination, when the hybridization results in a sequence that is genetically new, and not on instances where two sequences combine to form a sequence identical to an already existing one.
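The distinction between detectable and undetectable recombination can be illustrated with a toy example. The sketch below is not taken from the study; the short sequences and the function names `recombine` and `is_detectable` are hypothetical, chosen only to show that a hybrid counts as detectable when it differs from every sequence already known.

```python
def recombine(donor, acceptor, breakpoint):
    """Build a hybrid: donor left of the breakpoint, acceptor from it onward."""
    return donor[:breakpoint] + acceptor[breakpoint:]


def is_detectable(hybrid, known_sequences):
    """A hybrid is detectable only if it is not identical to a known sequence."""
    return hybrid not in known_sequences


# Hypothetical 8-base "genomes"; parent_b differs from parent_a at positions 2 and 5.
parent_a = "AAAAAAAA"
parent_b = "AACAAGAA"
known = {parent_a, parent_b}

# Breakpoint between the two differences: the hybrid is a genuinely new sequence.
hybrid_mid = recombine(parent_a, parent_b, 4)

# Breakpoint before both differences: the hybrid is identical to parent_b.
hybrid_edge = recombine(parent_a, parent_b, 1)

print(hybrid_mid, is_detectable(hybrid_mid, known))
print(hybrid_edge, is_detectable(hybrid_edge, known))
```

In this toy case, only the hybrid whose breakpoint falls between the two differing positions would be visible to any method that works from the sequences alone.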
“It's really important for reconstructing the virus’s evolutionary history,” said Russell Corbett-Detig, senior author on the study and an associate professor of biomolecular engineering at the Baskin School of Engineering. “When there's recombination it's not one tree, it's many trees, and being able to trace that accurately is really crucial for understanding evolution of the virus.”
Findings on recombination
The researchers analyzed a phylogenetic tree of 1.6 million COVID-19 samples and found 589 recombination events, which indicates that only about 2.7% of sequenced genomes have detectable recombinant ancestry. These sequences were sourced from the UC Santa Cruz SARS-CoV-2 Browser, a repository for COVID-19 genomic data, which is now the largest collection of genomic sequences of a single species ever assembled, currently at nearly 12 million sequences.
While results show that recombination occurs more frequently in the spike protein region, it is not yet known why. One possibility is a mechanistic bias, in which coronaviruses naturally tend to recombine toward the three-prime end of the viral genome, which contains the spike protein. Another is that positive natural selection is favoring recombinants whose breakpoints fall in this region.
While recombination does occur, there is no evidence that the resulting strains are more likely to be epidemiologically important. In fact, most recombinant variants die out, as do most of the thousands of mutated variants of COVID-19.
New software, written primarily by UC San Diego Assistant Professor Yatish Turakhia during his postdoctoral training in Corbett-Detig’s lab, enabled the computational feat required for the analysis of millions of genomes. The software, called Recombination Inference using Phylogenetic PLacEmentS (RIPPLES), can efficiently search a massive phylogenetic tree of COVID-19 genomes to find instances where a new sequence appears to be a combination of two distinct sections of the tree. The COVID-19 phylogenetic tree, maintained with a tool called UShER, was created by UCSC researchers and is the primary tool used by health officials worldwide to track the spread of variants in their communities.
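The idea of a sequence that is better explained by two parts of the tree than by one can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the RIPPLES implementation: it compares a query against just two hypothetical candidate parents and counts mismatches directly, rather than placing sequence segments on a full phylogenetic tree. The names `hamming` and `best_breakpoint` and all sequences are invented for the example.

```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def best_breakpoint(query, donor, acceptor):
    """Find the breakpoint k that minimizes mismatches between the query
    and the mosaic donor[:k] + acceptor[k:]. Returns (k, mismatches);
    ties go to the smallest k."""
    best = None
    for k in range(1, len(query)):
        cost = hamming(query[:k], donor[:k]) + hamming(query[k:], acceptor[k:])
        if best is None or cost < best[1]:
            best = (k, cost)
    return best


# Hypothetical 10-base sequences; the query matches the donor on the left
# and the acceptor on the right.
donor = "ACGTACGTAC"
acceptor = "TCCTAGGAAT"
query = "ACGTAGGAAT"

single_parent_cost = min(hamming(query, donor), hamming(query, acceptor))
breakpoint, mosaic_cost = best_breakpoint(query, donor, acceptor)

# A mosaic that explains the query with fewer mismatches than either parent
# alone is the kind of signal that suggests recombination.
print(single_parent_cost, breakpoint, mosaic_cost)
```

Here the query needs at least two mutations to be explained by either parent alone, but none when treated as a mosaic of the two, which is the kind of improvement a recombination search looks for.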
The researchers found that recombination most often shows up on the COVID-19 phylogenetic tree in the form of “long branches,” which make it appear that several mutations accumulated along a single branch, something that is quite rare.
“In a tree of millions of sequences, you find these long branches, which reduce the possible instances of detectable recombination down to only about tens of thousands of branches,” Turakhia said. “These long branches make recombination much easier to spot on the tree, which enables the efficient performance of the new software.”
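The filtering step Turakhia describes, narrowing the search to long branches, can be sketched as follows. This is a minimal illustration that assumes a tree is already summarized as a mapping from node name to the mutations on the branch leading to it. The node names, the mutation labels, and the `min_mutations` threshold are all hypothetical and are not values from the study.

```python
def long_branches(branch_mutations, min_mutations=3):
    """Return the nodes whose incoming branch carries at least
    min_mutations mutations (an illustrative threshold)."""
    return [
        node
        for node, mutations in branch_mutations.items()
        if len(mutations) >= min_mutations
    ]


# Hypothetical tree summary: node -> mutations on the branch above it.
branch_mutations = {
    "node_1": ["m1"],
    "node_2": [],
    "node_3": ["m2", "m3", "m4", "m5"],
    "node_4": ["m6", "m7"],
    "node_5": ["m8", "m9", "m10"],
}

candidates = long_branches(branch_mutations)
print(candidates)
```

Only the candidate nodes would then be passed to a more expensive recombination check, which is why shrinking millions of branches to a much smaller set makes the search tractable.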
Turakhia and his team aim to continue to improve RIPPLES’ speed and performance and to create visual tools to make it more accessible for a wider audience.
Use for public health
Knowing when recombination occurs is crucial for understanding the evolutionary lineage of a sequence of the virus. Recombination can complicate the process of tracing back the phylogenetic tree of a particular sequence because its genetic material is a result of two joining areas of the overall COVID-19 family tree.
Detecting recombination can help public health officials understand whether a lineage of COVID-19 that appears to be novel truly arose through independent mutation and was introduced for the first time, or is instead a combination of two lineages that already existed in the community. Understanding when recombination occurs is also important from a public health perspective, as it can potentially make the virus more adept at evading immunity.
Furthermore, the RIPPLES software’s availability and ease of use have positive implications for genomics experts and public health officials alike, who can search a set of COVID-19 genomic samples for recombination in just minutes.
This reflects a larger theme in the work of Corbett-Detig’s lab and the UCSC Genomics Institute: translating pathogen genomics data at scale. Researchers are focused on creating tools that let public health officials automate the questions they want to ask and receive answers that are dependable and easy to act on.
“A big part of the success of our work has been that the software is extremely accessible and computationally cheap in the grand scheme of things,” Corbett-Detig said. “Anybody could take their hundred new SARS-CoV-2 genome sequences and figure out if there were potentially recombinant samples in just minutes on a basic laptop. Global public health needs to be democratized, to the point that anyone can do it, even if they're not a super wealthy lab with giant servers.”