MIT's Comprehensive Map of the SARS-CoV-2 Genome and Analysis of Nearly 2,000 COVID Mutations

MIT researchers have determined the virus’ protein-coding gene set and analyzed new mutations’ likelihood of helping the virus adapt.

In early 2020, a few months after the COVID-19 pandemic began, scientists were able to sequence the full genome of SARS-CoV-2, the virus that causes COVID-19. While many of its genes were already known at that point, the full complement of protein-coding genes remained unresolved.

Now, after performing an extensive comparative genomics study, MIT researchers have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In their study, which was published on May 11, 2021, in Nature Communications, they confirmed several protein-coding genes and found that a few others that had been suggested as genes do not code for any proteins.

“We were able to use this powerful comparative genomics approach for evolutionary signatures to discover the true functional protein-coding content of this enormously important genome,” says Manolis Kellis, who is the senior author of the study and a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) as well as a member of the Broad Institute of MIT and Harvard.

The research team also analyzed nearly 2,000 mutations that have arisen in different SARS-CoV-2 isolates since the virus began infecting humans, allowing them to rate how important those mutations may be in changing the virus’ ability to evade the immune system or become more infectious.

Comparative Genomics

The SARS-CoV-2 genome consists of nearly 30,000 RNA bases. Scientists have identified several regions known to contain protein-coding genes, based on their similarity to protein-coding genes found in related viruses. A few other regions were suspected to encode proteins, but they had not been definitively classified as protein-coding genes.

To nail down which parts of the SARS-CoV-2 genome actually contain genes, the researchers performed a type of study known as comparative genomics, in which they compare the genomes of similar viruses. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers performed their analysis on SARS-CoV-2, SARS-CoV (which caused the 2003 SARS outbreak), and 42 strains of bat sarbecoviruses.

Kellis has previously developed computational techniques for doing this type of analysis, which his team has also used to compare the human genome with genomes of other mammals. The techniques are based on analyzing whether certain DNA or RNA bases are conserved between species, and comparing their patterns of evolution over time.
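To give a rough feel for what "conserved between species" means, here is a minimal Python sketch that scores each column of a small alignment by how many sequences share the most common base. It is far simpler than the evolutionary-signature methods described in the article and is not the researchers' code; the function name `column_conservation` and the toy sequences are hypothetical, invented only for illustration.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Return, for each alignment column, the fraction of sequences
    that carry the most common base in that column."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    # zip(*aligned_seqs) walks the alignment one column at a time.
    for column in zip(*aligned_seqs):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical toy alignment of four already-aligned RNA fragments.
toy_alignment = [
    "AUGGCU",
    "AUGGCA",
    "AUGACG",
    "AUGGCC",
]
scores = column_conservation(toy_alignment)
print(scores)
```

In this toy case the first three columns are identical in every sequence, while the last column differs in all four, so it scores lowest. Real comparative genomics also weighs how the sequences are related and what kind of changes occur, which this sketch ignores.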

Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. They also determined that the region that encodes a gene called ORF3a encodes an additional gene, which they named ORF3c. The gene has RNA bases that overlap with ORF3a but occur in a different reading frame. This gene-within-a-gene arrangement is rare in large genomes, but common in many viruses, whose genomes are under selective pressure to stay compact. The role of this new gene, as well as several other SARS-CoV-2 genes, is not yet known.
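The idea of one gene sitting inside another in a different reading frame can be illustrated with a short Python sketch. It scans a made-up RNA string in all three forward frames and reports every stretch that runs from a start codon (AUG) to an in-frame stop codon. The sequence and the helper name `find_orfs` are hypothetical; this is not the ORF3a/ORF3c sequence and not the method used in the study, only a toy showing how the same bases can spell two different proteins.

```python
from itertools import product

# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
START = "AUG"


def find_orfs(rna):
    """Return (frame, start, end, protein) for each AUG-to-stop stretch
    found in the three forward reading frames; end is exclusive and
    includes the stop codon."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(rna):
            if rna[i:i + 3] != START:
                i += 3
                continue
            protein = []
            stop_found = False
            j = i
            while j + 3 <= len(rna):
                aa = CODON_TABLE[rna[j:j + 3]]
                if aa == "*":
                    stop_found = True
                    break
                protein.append(aa)
                j += 3
            if not stop_found:
                break  # no in-frame stop codon before the sequence ends
            orfs.append((frame, i, j + 3, "".join(protein)))
            i = j + 3
    return orfs


# Hypothetical toy sequence: a short ORF in frame 1 nested inside a
# longer ORF in frame 0.
toy_rna = "AUGCAUGGCCUUCUGAAAUAA"
orfs = find_orfs(toy_rna)
for frame, start, end, protein in orfs:
    print(f"frame {frame}: bases {start}-{end} -> {protein}")
```

Here the frame-1 stretch lies entirely within the frame-0 stretch, yet it is read in different three-base groups and therefore yields a different protein. Finding such a stretch is only a starting point: the article's point is that evolutionary evidence is needed to tell whether a candidate like this is a real gene.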

The researchers also showed that five other regions that had been proposed as possible genes do not encode functional proteins, and they ruled out the possibility that any more conserved protein-coding genes remain to be discovered.

“We analyzed the entire genome and are very confident that there are no other conserved protein-coding genes,” says Irwin Jungreis, lead author of the study and a CSAIL research scientist. “Experimental studies are needed to figure out the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spend their time on something that doesn’t even get translated into protein.”

The researchers also recognized that many previous papers used not only incorrect gene sets, but sometimes also conflicting gene names. To remedy the situation, they brought together the SARS-CoV-2 community and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate paper published a few weeks ago in Virology.

Fast Evolution

In the new study, the researchers also analyzed more than 1,800 mutations that have arisen in SARS-CoV-2 since it was first identified. For each gene, they compared how rapidly that particular gene has evolved in the past with how much it has evolved since the current pandemic began.
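One simple way to picture this kind of comparison is to ask whether each gene's share of recent mutations matches its share of historical changes. The Python sketch below does that with made-up counts; the gene labels, the numbers, and the function name `enrichment_ratios` are all hypothetical, and the study's actual statistical approach is not described in this article and may differ substantially.

```python
def enrichment_ratios(historical_counts, recent_counts):
    """For each gene, divide its share of recent mutations by its share
    of historical changes. A ratio near 1 means the gene is evolving
    about as expected; a ratio well above 1 flags a possible speed-up."""
    total_historical = sum(historical_counts.values())
    total_recent = sum(recent_counts.values())
    ratios = {}
    for gene, recent in recent_counts.items():
        historical_share = historical_counts[gene] / total_historical
        recent_share = recent / total_recent
        ratios[gene] = recent_share / historical_share
    return ratios


# Hypothetical counts, invented for illustration only.
historical_counts = {"geneA": 500, "geneB": 300, "geneC": 200}
recent_counts = {"geneA": 40, "geneB": 20, "geneC": 40}

ratios = enrichment_ratios(historical_counts, recent_counts)
for gene, ratio in ratios.items():
    print(f"{gene}: {ratio:.2f}")
```

In this toy data, geneC holds a fifth of the historical changes but two fifths of the recent ones, so it stands out as evolving faster than its history would predict. A real analysis would also need to account for gene length, sampling, and statistical significance.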

They found that in most cases, genes that evolved rapidly for long periods of time before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend. However, the researchers also identified exceptions to these patterns, which may shed light on how the virus has evolved as it has adapted to its new human host, Kellis says.

In one example, the researchers identified a region of the nucleocapsid protein, which surrounds the viral genetic material, that had many more mutations than expected from its historical evolution patterns. This protein region is also classified as a target of human B cells. Therefore, mutations in that region may help the virus evade the human immune system, Kellis says.

“The most accelerated region in the entire genome of SARS-CoV-2 is sitting smack in the middle of this nucleocapsid protein,” he says. “We speculate that those variants that don’t mutate that region get recognized by the human immune system and eliminated, whereas those variants that randomly accumulate mutations in that region are in fact better able to evade the human immune system and remain in circulation.”

The researchers also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Many of the mutations that make those variants more dangerous are found in the spike protein, and help the virus spread faster and avoid the immune system. However, each of those variants carries other mutations as well.

“Each of those variants has more than 20 other mutations, and it’s important to know which of those are likely to be doing something and which aren’t,” Jungreis says. “So, we used our comparative genomics evidence to get a first-pass guess at which of these are likely to be important based on which ones were in conserved positions.”
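The first-pass triage Jungreis describes can be sketched in a few lines of Python: given a conservation score for each genome position, mutations at highly conserved positions are set aside as more likely to matter. The positions, scores, the 0.9 cutoff, and the function name `prioritize_mutations` are hypothetical choices for illustration and are not taken from the study.

```python
def prioritize_mutations(mutations, conservation, threshold=0.9):
    """Split mutations into those at conserved positions and the rest.

    mutations: list of (position, change) pairs.
    conservation: dict mapping position to a score between 0 and 1.
    Positions without a score are treated as not conserved.
    """
    conserved, other = [], []
    for position, change in mutations:
        score = conservation.get(position)
        if score is not None and score >= threshold:
            conserved.append((position, change))
        else:
            other.append((position, change))
    return conserved, other


# Hypothetical conservation scores and mutations, for illustration only.
toy_conservation = {10: 1.0, 25: 0.95, 40: 0.5, 77: 0.3}
toy_mutations = [(10, "A>G"), (40, "C>U"), (25, "G>A"), (77, "U>C")]

likely_important, likely_neutral = prioritize_mutations(
    toy_mutations, toy_conservation
)
print("conserved positions:", likely_important)
print("other positions:", likely_neutral)
```

A single cutoff is a deliberate simplification. As the quote notes, this kind of ranking is a first guess that points experimental work toward the most promising mutations rather than a verdict on what each mutation does.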

This data could help other scientists focus their attention on the mutations that appear most likely to have significant effects on the virus’ infectivity, the researchers say. They have made the annotated gene set and their mutation classifications available in the University of California at Santa Cruz Genome Browser for other researchers who wish to use it.

“We can now go and actually study the evolutionary context of these variants and understand how the current pandemic fits in that larger history,” Kellis says. “For strains that have many mutations, we can see which of these mutations are likely to be host-specific adaptations, and which mutations are perhaps nothing to write home about.”

Reference: “SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes” by Irwin Jungreis, Rachel Sealfon and Manolis Kellis, 11 May 2021, Nature Communications.
DOI: 10.1038/s41467-021-22905-7

The research was funded by the National Human Genome Research Institute and the National Institutes of Health. Rachel Sealfon, a research scientist at the Flatiron Institute Center for Computational Biology, is also an author of the paper.


