New AI tool detects hidden cancer mutations
UC Santa Cruz researchers unveil DeepSomatic, a deep learning method that will help make genomic sequencing a routine part of how cancer is diagnosed and treated
Key takeaways
- DeepSomatic uses AI to detect cancer-causing variants more accurately across all major DNA sequencing technologies, moving genome sequencing closer to being a standard part of cancer care.
- By comparing tumor and healthy DNA, the tool identifies the genetic changes that drive each patient’s cancer and can guide treatment decisions.
Every cancer carries a unique genetic fingerprint: variations in DNA known as “somatic variants” that occur in tumor DNA but are absent from the patient’s healthy cells. While some cancers may also have a hereditary component, it is somatic variants that largely drive a tumor’s growth and shape how it might respond to treatment. Despite major advances in sequencing, detecting these variants with high confidence has remained difficult, limiting the usefulness of genomic sequencing for diagnosing and treating cancer.
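To make the definition concrete, the idea of a somatic variant can be sketched as a simple set difference between the variants seen in a tumor sample and those seen in the same patient's healthy sample. This is only a naive illustration of the definition, not how DeepSomatic works: the function name, the tuple layout, and the example coordinates are all hypothetical, and a real caller must also handle sequencing errors, tumor purity, and uneven coverage.

```python
from typing import Set, Tuple

# Hypothetical representation: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def naive_somatic_candidates(tumor_calls: Set[Variant],
                             normal_calls: Set[Variant]) -> Set[Variant]:
    """Return variants present in the tumor but absent from the healthy sample."""
    return tumor_calls - normal_calls


# Made-up example data, for illustration only.
tumor = {
    ("chr1", 100, "A", "T"),   # also seen in healthy cells, so inherited
    ("chr2", 500, "G", "C"),   # tumor only
    ("chr7", 42, "C", "CT"),   # tumor only
}
normal = {("chr1", 100, "A", "T")}

somatic = naive_somatic_candidates(tumor, normal)
print(sorted(somatic))
```

The variant shared by both samples is filtered out as inherited, leaving the two tumor-only candidates.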
Now, a collaboration led by scientists at the UC Santa Cruz Genomics Institute and Google Research has developed DeepSomatic, a machine learning model that dramatically improves the accuracy of variant detection across all major sequencing technologies. The model has already been used to identify variants in real pediatric leukemia and glioblastoma samples. The study, published this week in Nature Biotechnology, marks a major step toward bringing the benefits of long-read sequencing into everyday cancer diagnostics and treatment.
“With this tool, we’re overcoming the technical barriers that limit the accuracy of genomic sequencing when used in cancer care,” said Benedict Paten, professor of biomolecular engineering and a core member of the UC Santa Cruz Genomics Institute. “Our goal is to improve genome sequencing for cancer to give a more complete picture of the mutations present, ultimately improving diagnostic power and helping uncover new molecular mechanisms.”
A breakthrough in reading cancer’s “missed mutations”
Most current tools for cancer analysis rely on short-read sequencing, a technique that is highly accurate for short segments of DNA but limited in its ability to map the complex and repetitive regions where many harmful variants reside. In the last decade, researchers have developed long-read sequencing techniques that are much better at mapping these complex regions, but the promise of long reads for detecting and studying variations in cancer cell DNA has so far remained untapped.
DeepSomatic bridges this gap by using deep learning to interpret data from both short- and long-read technologies and to cross-validate results between them. The result is a system that not only identifies known cancer-driving variants with greater precision but also uncovers new variants that were previously undetectable.
In benchmark tests, DeepSomatic outperformed all existing tools, achieving higher accuracy across all sequencing platforms for both single-nucleotide variants (when a single letter in a long string of DNA code is swapped out for another) and small insertions or deletions.
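The two variant classes named above can be told apart by comparing the lengths of the reference and alternate alleles. The sketch below assumes VCF-style allele strings, where an insertion or deletion keeps the shared base in front of the change. The function name and labels are hypothetical and are not taken from DeepSomatic's code.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Label a variant by comparing reference and alternate allele strings."""
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNV"        # one letter swapped for another
    if len(alt) > len(ref):
        return "insertion"  # extra bases added
    if len(alt) < len(ref):
        return "deletion"   # bases removed
    return "other"          # identical alleles or multi-base substitutions


print(classify_small_variant("A", "T"))    # single-letter swap
print(classify_small_variant("C", "CTG"))  # two bases inserted after C
print(classify_small_variant("GAT", "G"))  # two bases deleted after G
```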
Real-world implications for patient care
To test DeepSomatic in a clinical context, the researchers analyzed patient samples from pediatric blood cancer and glioblastoma cases in collaboration with Children’s Mercy and the Translational Genomics Research Institute (TGen). The tool accurately identified key cancer mutations even in samples that were stored in formalin, a common preservative for clinical tissue that often complicates sequencing. By increasing the sensitivity and confidence of mutation detection, DeepSomatic could help clinicians more reliably match patients to targeted therapies or clinical trials.
DeepSomatic builds on the DeepVariant framework, which was originally developed at Google and which the UC Santa Cruz team extended to recognize patterns unique to tumor DNA. Unlike traditional methods that rely on rigid statistical models, DeepSomatic learns directly from vast sets of labeled sequencing data, allowing it to distinguish true variants from noise even in complex regions of the genome.
Building on Severus and the next generation of cancer genomics
DeepSomatic follows another breakthrough from the UC Santa Cruz/NIH/Google collaboration: Severus, a complementary method for detecting larger, structural changes in cancer genomes that was also recently published in Nature Biotechnology. Together, Severus and DeepSomatic form an integrated toolkit that can analyze both small and large genetic alterations across sequencing platforms.
“These two tools provide a complete picture of cancer genomes,” Paten said. “With Severus, we can detect complex rearrangements, and with DeepSomatic, we can resolve the smaller but equally important variants. Together, they bring us closer to the comprehensive, multi-scale view of cancer genomes that will enable true precision oncology based on individual tumor genomes.”
Increasing impact through open science
Part of the success of the models comes from their training on a unique dataset. Unlike previous tools that relied on simulated or synthetic data, Severus and DeepSomatic have been trained on a collection of six matched tumor-healthy cell line pairs. These cell lines, which were generated from the tumor tissue and healthy tissue of six separate patients, allowed the models to learn how to distinguish cancer-specific somatic variations from the background of healthy variation. Each line was sequenced using short- and long-read methods, and the resulting data have been made openly available to the research community.
“The ability to train on real, multi-platform data rather than simulations is critical,” said Jimin Park, researcher at the UC Santa Cruz Genomics Institute and lead author of the study. “Cancers are incredibly diverse, so a model that performs well across multiple tumor types and sequencing technologies gives us a much stronger foundation for both research and clinical use.”
True to UC Santa Cruz’s long tradition of open genomics, all DeepSomatic models, code, and training data are being released publicly to encourage community use and further innovation. The team hopes that this open approach will accelerate progress toward clinical-grade long-read sequencing pipelines that are faster, cheaper, and more inclusive of diverse patient populations.
“Our goal is to make genomic sequencing robust and reliable enough for every cancer patient,” Park said. “By improving the accuracy of mutation detection across all sequencing technologies, we’re laying the foundation for genomics to become a true standard of care.”
Funding for this project was provided by the National Institutes of Health and multiple philanthropic organizations supporting pediatric cancer research.