Researchers Use Natural-Language Processing (NLP) Algorithms to Predict SARS-CoV-2 Virus Mutations
By LabMedica International staff writers, posted on 18 Jan 2021

Image: Researchers Use NLP Algorithms to Predict SARS-CoV-2 Virus Mutations (Photo courtesy of Baidu)
Natural-language processing (NLP) algorithms are now able to generate protein sequences and predict virus mutations, including key changes that help the SARS-CoV-2 virus evade the immune system.
The key insight making this possible is that many properties of biological systems can be interpreted in terms of words and sentences. In the last few years, a handful of researchers have shown that protein sequences and genetic codes can be modeled using NLP techniques. Now, computational biologists at the Massachusetts Institute of Technology (MIT; Cambridge, MA, USA) have pulled several of these strands together and used NLP to predict mutations that allow viruses to avoid detection by antibodies in the human immune system, a process known as viral immune escape. The basic idea is that the interpretation of a virus by an immune system is analogous to the interpretation of a sentence by a human.
The team uses two different linguistic concepts: grammar and semantics (or meaning). The genetic or evolutionary fitness of a virus - characteristics such as how good it is at infecting a host - can be interpreted in terms of grammatical correctness. A successful, infectious virus is grammatically correct; an unsuccessful one is not. Similarly, mutations of a virus can be interpreted in terms of semantics. Mutations that make a virus appear different to things in its environment - such as changes in its surface proteins that make it invisible to certain antibodies - have altered its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.
To model these properties, the researchers used an LSTM, a type of neural network that predates the transformer-based ones used by large language models like GPT-3. These older networks can be trained on far less data than transformers and still perform well for many applications. Instead of millions of sentences, they trained the NLP model on thousands of genetic sequences taken from three different viruses: 45,000 unique sequences for a strain of influenza, 60,000 for a strain of HIV, and between 3,000 and 4,000 for a strain of the SARS-CoV-2 virus.
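As a rough illustration of how a sequence can be treated like a sentence, the sketch below turns a protein sequence into integer tokens and builds training examples in which the model must predict one residue from the residues around it. The article does not describe the training setup, so the amino-acid alphabet, the `encode` and `masked_examples` helpers, and the predict-from-context framing are all assumptions made for illustration.

```python
# Hypothetical sketch: treat each amino acid as a "word" in a sentence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
TOKEN_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is left free for padding


def encode(sequence):
    """Map a protein sequence to a list of integer token ids."""
    return [TOKEN_ID[aa] for aa in sequence]


def masked_examples(sequence):
    """Yield (left context, right context, target) for every position.

    A sequence model such as an LSTM could be trained to predict the
    target token from the two contexts (an assumed setup, not confirmed
    by the article).
    """
    ids = encode(sequence)
    for i, target in enumerate(ids):
        yield ids[:i], ids[i + 1:], target


examples = list(masked_examples("MKV"))  # toy three-residue sequence
```

Each example pairs one residue with its neighbours, which is the same shape of problem as predicting a missing word in a sentence.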
NLP models work by encoding words in a mathematical space in such a way that words with similar meanings are closer together than words with different meanings. This is known as an embedding. For viruses, the embedding of the genetic sequences grouped viruses according to how similar their mutations were. The overall aim of the approach is to identify mutations that might let a virus escape an immune system without making it less infectious - that is, mutations that change a virus’s meaning without making it grammatically incorrect.
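The search for mutations that change meaning while staying grammatical can be sketched as a ranking problem. In the toy example below, "semantic change" is the distance between the embeddings of the original and mutated sequences, and "grammaticality" is the probability a language model assigns to the mutation. The rank-sum scoring rule, the `beta` weight, the function names, and all the numbers are assumptions for illustration; the article does not say how the researchers combine the two quantities.

```python
from math import fsum


def l1_distance(a, b):
    """Sum of absolute differences between two embedding vectors."""
    return fsum(abs(x - y) for x, y in zip(a, b))


def rank_ascending(values):
    """Return ranks where 1 is the smallest value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_escape_candidates(wild_type_embedding, mutants, beta=1.0):
    """Order mutants so that high semantic change AND high probability come first."""
    names = list(mutants)
    change = [l1_distance(wild_type_embedding, mutants[n]["embedding"]) for n in names]
    grammar = [mutants[n]["probability"] for n in names]
    change_rank = rank_ascending(change)
    grammar_rank = rank_ascending(grammar)
    score = {n: change_rank[i] + beta * grammar_rank[i] for i, n in enumerate(names)}
    return sorted(names, key=lambda n: score[n], reverse=True)


# Made-up embeddings and probabilities, not real viral data.
WILD_TYPE = [0.0, 0.0, 0.0]
MUTANTS = {
    "mut_A": {"embedding": [0.1, 0.0, 0.0], "probability": 0.30},  # plausible, barely changed
    "mut_B": {"embedding": [0.9, 0.8, 0.7], "probability": 0.01},  # big change, implausible
    "mut_C": {"embedding": [0.6, 0.5, 0.4], "probability": 0.25},  # sizeable change, plausible
    "mut_D": {"embedding": [0.2, 0.0, 0.0], "probability": 0.02},  # small change, implausible
}

ranking = rank_escape_candidates(WILD_TYPE, MUTANTS)
```

In this toy data, `mut_C` ranks first because it is the only candidate that scores well on both measures, and `mut_D` ranks last because it scores poorly on both.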
To test their approach, the team used a common metric for assessing predictions made by machine-learning models that scores accuracy on a scale between 0.5 (no better than chance) and 1 (perfect). In this case, they took the top mutations identified by the tool and, using real viruses in a lab, checked how many of them were actual escape mutations. Their results ranged from 0.69 for HIV to 0.85 for one coronavirus strain. This is better than results from other state-of-the-art models, according to the researchers.
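The article does not name the metric, but a score where 0.5 means chance and 1 means perfect matches the area under the ROC curve (AUC), so the sketch below assumes that is what was used. It computes AUC as the fraction of (escape, non-escape) pairs in which the escape mutation received the higher model score. The `roc_auc` function and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    labels: 1 for a confirmed escape mutation, 0 otherwise.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up model scores and lab outcomes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)  # 8 of the 9 pairs are ordered correctly
```

A model that scored mutations at random would order about half of such pairs correctly, which is why 0.5 is the chance baseline.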
The team has been running models on new variants of the coronavirus, including the so-called UK mutation, the mink mutation from Denmark, and variants taken from South Africa, Singapore and Malaysia. Using NLP accelerates a slow process. The conventional approach is to sequence the genome of the virus taken from a hospitalized COVID-19 patient, then re-create and study its mutations in a lab, which can take weeks. The NLP model, by contrast, predicts potential mutations straight away, which focuses the lab work and speeds it up.
“We’re learning the language of evolution,” said Bonnie Berger, a computational biologist at the Massachusetts Institute of Technology. “Biology has its own language.”
Related Links:
Massachusetts Institute of Technology (MIT)