ML Algorithm Accurately Identifies Cancer-Specific Structural Variants in Long-Read DNA Sequencing Data
Posted on 31 May 2025
Long-read sequencing technologies are designed to analyze long, continuous stretches of DNA, offering significant potential to enhance researchers' abilities to detect complex genetic changes in cancer genomes. However, due to the intricate structure of cancer genomes, standard analysis tools, including those developed specifically for long-read sequencing data, often fall short. This can lead to false positives and unreliable data interpretations, ultimately undermining our understanding of how tumors evolve, respond to treatments, and affect patient diagnosis and management. To overcome this issue, researchers have developed a machine learning algorithm capable of identifying cancer-specific structural variations and copy number changes in long-read DNA sequencing data.
The algorithm, named SAVANA, was developed by researchers at the European Bioinformatics Institute (EMBL-EBI, Hinxton, UK) and Genomics England's R&D laboratory (London, UK), in collaboration with clinical partners. SAVANA uses machine learning to identify structural variants (such as insertions, deletions, duplications, and rearrangements), along with the resulting copy number changes, in cancer genomes using long-read sequencing data. Because cancer genomes are complex, traditional analysis tools often generate false positives, leading to inaccurate clinical interpretations of tumor biology. SAVANA significantly reduces these errors. The algorithm, described in Nature Methods, was tested on 99 human tumor samples, and its rapid processing and strong error-correction capabilities make it well suited to clinical applications. Recently, the method was applied to study osteosarcoma, a rare and aggressive bone cancer, where it helped uncover novel genomic rearrangements, shedding light on the mechanisms behind osteosarcoma's progression.

The team also compared SAVANA’s results from long-read data with data generated using Illumina sequencing, applying the standard whole-genome sequencing pipeline used in clinical settings. The results were highly consistent across both technologies, showing that SAVANA matches current clinical standards while revealing additional cancer-related alterations. SAVANA provides fast and reliable genomic analysis, which enhances the interpretation of clinical samples and improves cancer diagnosis and treatment strategies. This initiative represents the first global effort to incorporate whole-genome sequencing into routine clinical care. By integrating genomics into everyday clinical practice, the goal is to improve diagnostic precision and support personalized treatment plans for cancer patients. However, to achieve the full benefits of clinical genomics, accurate genomic data interpretation is essential, and this depends on specialized analytical tools. As part of its efforts to explore the clinical potential of long-read sequencing technology for faster and earlier cancer diagnosis, Genomics England has incorporated SAVANA into its research.
“Using SAVANA will ensure clinicians receive accurate and reliable genomic data, enabling them to confidently integrate advanced genomic sequencing methods such as long-read sequencing into routine patient care,” said Greg Elgar, Director of Sequencing R&D at Genomics England.