How a virus evolves in a pandemic

VIRUSES readily mutate. There is nothing surprising about this because it is their nature to do so. This happens due to the imperfect copying mechanism at work as viruses replicate in the cells of infected hosts.

The complete set of genetic information needed to sustain an organism, a virus included, is its genome, which in viruses can be made of either DNA or RNA. DNA and RNA can be thought of as strings of (genetic) letters, and a genome as a long stretch of these letters, different parts of which code for the different proteins the organism needs to exist. Mutations are simply random errors made in copying these letters as the virus multiplies, and such errors accumulate with every replication cycle, which can take hours or even less. RNA viruses mutate faster than DNA viruses because their replication mechanism is intrinsically more error-prone. Likewise, single-stranded viruses mutate more often than double-stranded ones.
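
The way copying errors pile up can be illustrated with a toy simulation. The Python sketch below is not a model of any real virus: the per-letter error rate of 1 in 10,000, the random 30,000-letter genome and the function names are all assumptions chosen for illustration. It copies a genome repeatedly and counts how many letters differ from the original after each cycle.

```python
import random

BASES = "ACGU"  # the four RNA letters


def replicate(genome, error_rate, rng):
    """Copy a genome letter by letter; each letter is miscopied
    with probability error_rate (a hypothetical, illustrative value)."""
    copy = []
    for base in genome:
        if rng.random() < error_rate:
            # A copying error: substitute one of the other three letters.
            copy.append(rng.choice([b for b in BASES if b != base]))
        else:
            copy.append(base)
    return "".join(copy)


def count_differences(a, b):
    """Number of positions at which two equal-length genomes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


rng = random.Random(0)  # fixed seed so that a run can be repeated
ancestor = "".join(rng.choice(BASES) for _ in range(30000))

genome = ancestor
for cycle in range(1, 11):
    genome = replicate(genome, 1e-4, rng)  # assumed error rate, not a measured one
    print(cycle, count_differences(ancestor, genome))
```

With these assumed numbers each copy picks up a few errors on average, so the count of differences from the ancestor tends to climb from one cycle to the next.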

Viruses cannot exist in isolation; they need a host to replicate and survive. Mutations generate a diverse virus population within a single infected host, and this ability to mutate is what drives the evolutionary change of viruses. Most mutations may be inconsequential. But mutations that impair some viral function, and so impede the survival of the virus, get removed by natural selection. If, during an outbreak, a mutated virus with a greater (or lesser) degree of infectivity or virulence appears in a population, it does not immediately follow that the mutation will persist and continue to spread with high frequency. It will do so only if it gives the virus a selective advantage, as instances from the current COVID-19 pandemic considered below illustrate.

The causative virus of COVID-19, the coronavirus SARS-CoV-2, is a single-stranded RNA virus with about 30,000 nucleotides (the basic building blocks of DNA and RNA) coding for 29 proteins. Frequent mutations are therefore only to be expected, and researchers have indeed observed many mutations in SARS-CoV-2 genomes from samples of COVID-19 patients in different parts of the world since the outbreak began in Wuhan in central China in December 2019. Most of these mutations have been substitutions of a single nucleotide, known as single nucleotide variants (SNVs), at different genetic sites in the genome. At the viral protein level, these SNVs translate into replacements of single amino acids in different proteins.

Most genome-based analyses of the dynamics of evolution so far have focussed largely on the early phase of the pandemic, up to early March at best. A Chinese study with 103 genomes that were available in a public database in January found that SARS-CoV-2 had evolved into two major types. A more recent study based on 160 genomes that were available until March 3, which was published on April 8, identified three major types. Because these studies had limited sample sizes and did not cover a sufficiently long period, a clear evolutionary picture had not emerged until now. We know that the geographical spread of the virus was extremely rapid in March, which would have greatly increased the opportunities for mutation.

Nidhan Biswas and Partha Majumder of the National Institute of Biomedical Genomics (NIBG) at Kalyani in West Bengal recently completed a more comprehensive and systematic analysis using a much larger public database of genomes. Their analysis maps the geographical origins of the genomes and examines the emergence of virus groupings, their mutual relationships based on the observed mutations in an evolutionary tree (called a phylogenetic tree) and the frequencies, both spatial and temporal, of their spread. This work is due to appear shortly in the Indian Journal of Medical Research (IJMR).

The two researchers have found that mutations of the virus have led to the emergence of a type that is distinctly and significantly different from the original virus that emerged from Wuhan and, by March end, this mutated version had already substantially replaced the ancestral version in virtually all geographical regions of the world. It has now begun to spread with much higher frequencies than the original version and the other mutated types that emerged during the course of the pandemic, and seems to be establishing itself as the major virus type being transmitted in most countries as infections continue to grow across the world.

Biswas and Majumder analysed 3,636 full genome sequences of SARS-CoV-2 obtained from virus isolates from patients from 55 countries available from the public database www.gisaid.org covering the period from December 2019 to March end. According to the authors, the entire set of mutations observed so far can be classified into 11 virus types, each of which can be characterised by one or a few defining mutations. Of these, Type A2a is emerging as the dominant virus type almost everywhere, sweeping away by selection the original Type O isolated from Wuhan that held sway in the early phase of the pandemic (Figure 1; all figures in enlarged view at the end of the article). This also implies that the other 10 types are derived from Type O.

Fig. 2 suggests that Type A2a began to emerge around the ninth or tenth week after the outbreak started and currently accounts for over half of all genomic sequences across the world. The unique mutation seen in Type A2a appears to be endowing the virus with a selective functional advantage over the other types, Type O in particular. The authors have argued that the increasing frequency of this evolved type in different parts of the world is an indication of positive selection pressure at work, enabling the virus to establish itself in the human population across the world.
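
A simple calculation shows why a selective advantage matters. The Python sketch below uses a standard textbook recursion for selection between two competing types; it is not the method used by Biswas and Majumder, and the starting frequency (1 per cent) and the advantage s (20 per cent per generation) are hypothetical numbers. It also ignores chance effects, which matter a great deal when a variant is still rare.

```python
def next_frequency(p, s):
    """One generation of selection: the variant leaves (1 + s) copies
    for every 1 copy left by the competing types."""
    return p * (1 + s) / (1 + p * s)


def trajectory(p0, s, generations):
    """Variant frequency over time, starting from frequency p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s))
    return freqs


# Compare a neutral variant (s = 0) with an advantaged one (s = 0.2).
# Both values of s are assumptions made purely for illustration.
for s in (0.0, 0.2):
    path = trajectory(0.01, s, 40)
    print(f"s={s}: start {path[0]:.3f}, after 40 generations {path[-1]:.3f}")
```

In this simplified model a variant with no advantage stays at its starting frequency, while one with an advantage climbs steadily and comes to dominate, which is the qualitative pattern the authors read as a sign of positive selection.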

Before we discuss what this selective advantage is, and its enabling mutation, it is instructive to look at what is currently known about the evolutionary history of the SARS-CoV-2 virus itself and also talk about its early evolution revealed by data from the earlier phases of the pandemic as reported in scientific literature and the media.

Binding to ACE2 receptor

Structural and biochemical analyses have now clearly established that the SARS-CoV-2 virus infects humans by binding to the receptor ACE2, which is expressed in many types of human cells, and so gaining entry into those cells. The part of the virus that performs this critical function is the Spike (S) protein, which forms the protrusions on the virus envelope that give the virus the prefix “corona”. The S protein has two sub-regions, S1 and S2. While S1 contains the receptor binding domain (RBD) and enables the virus to attach itself to the target human cell, S2 is responsible for the later stage: fusion of the viral membrane with the human cell and release of viral RNA into the cell, which, in turn, forces the cell machinery to make copies of the virus and disseminate them.

For this to happen efficiently, the two conjoined sub-regions need to be split at the S1/S2 boundary for S2 to initiate fusion and efficient viral replication within the cell after S1 facilitates virus-cell binding. The emergence of an appropriate cleavage site at the S1/S2 boundary through evolution allows this new virus to exploit human cell enzymes such as furin and TMPRSS2 to perform this cleavage. This results in rapid proliferation and spread of the virus to different organs, particularly the lungs, causing the defining atypical pneumonia in COVID-19 positive individuals.

In a March 17 publication in Nature Medicine, a team of scientists from the U.S., the United Kingdom and Australia, led by Kristian Andersen of Scripps Research Institute, presented a reasonably convincing argument about the origin of the virus and its early evolution, based on the genome sequence data then available. According to their analysis, while SARS-CoV-2 has high affinity for the ACE2 receptor in humans, ferrets, cats and other species, a comparison of its RBD with that of SARS-CoV-1 (and other related beta coronaviruses) shows that, of the six amino acids in the RBD known to be critical for binding to ACE2, five have changed to other amino acids in SARS-CoV-2. As a result, they said, though its affinity for ACE2 is high, the binding is not predicted to be ideal or optimal.

On the basis of this, they argued that the evolution of the critical S1/S2 cleavage site, which enables enhanced binding to the cell and virus-cell fusion, is a result of mutation and natural selection. This may have occurred either in humans through multiple chains of silent human-to-human transmission sometime before it was poised for the outbreak in December 2019 or in some intermediary animal host (having originated in bats) with human-like ACE2 receptor before making the jump to humans.

This cleavage site is unique to SARS-CoV-2 and is not present in the other beta coronaviruses of the same lineage, including SARS-CoV-1 (which caused the major SARS outbreak in 2002-03), and this, it was felt, could be key to its high infectivity and rapid transmission. This, they said, was similar to the emergence of a cleavage site in the hemagglutinin (HA) protein of the highly pathogenic strain of avian influenza virus following repeated passage among chickens. The specific features of the RBD and the S1/S2 cleavage site, including the amino acid structure at the cleavage site, were shared by all SARS-CoV-2 genomes available until then, which pointed to a common ancestor virus, the paper said.

A March 25 report in The Washington Post quoted Peter Thielen of Johns Hopkins University, a molecular geneticist involved in SARS-CoV-2 research, as saying: “There are only about 10 genetic differences between the strains that have infected people in the United States and the original virus that spread in Wuhan…. That’s a relatively small number of mutations for having passed through a large number of people. At this point, the mutation rate of the virus would suggest that the vaccine developed for SARS-CoV-2 would be a single vaccine, rather than a new vaccine every year like the flu vaccine.”

This view that the virus had not mutated to any significant extent, and was relatively stable up to that point in time, was reiterated by Stanley Perlman of the University of Iowa and Benjamin Neuman of Texas A&M University in the Post article. “If it’s still around in a year,” Neuman had said, “by that point we might have some diversity.”

Yong Jia and associates from Taiwan and Australia carried out a phylogenetic analysis of 106 genomes, available up to March 24 in the U.S. National Center for Biotechnology Information (NCBI) database, of isolates from patients in 12 countries including China (34), the U.S. (54), India (2) and Nepal (1). Their paper was posted on the preprint repository bioRxiv on April 11. One of its main conclusions concurred with the view that the mutation rate and genetic diversity of the virus (in the data until then) were indeed low compared with SARS-CoV-1.

“Overall,” the paper said with regard to gene sequences relevant to viral protein synthesis, “the gene sequences from different samples are highly homologous, sharing greater than 99.1% identity…” Specifically, the work also noted that the genes encoding the Spike (S) protein on the virus envelope were more conserved than other protein-encoding genes. This notwithstanding, other researchers have noted that the RBD is the most variable part of the genome and that some sites of the S protein may be subject to positive selection. Jia’s group observed a total of 12 mutations in the S protein, all of them single amino acid substitutions, but only one of them pertained to the RBD of the virus, which is the part relevant to the infectivity of the virus.
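
To see what a figure such as 99.1 per cent identity measures, consider the Python sketch below. It assumes two sequences that have already been aligned to the same length and simply counts matching positions; real analyses rely on alignment software, and the two short fragments here are made up for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions carrying the same letter in two
    aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# Two made-up 20-letter fragments that differ at a single position.
reference = "AUGGCUACGUAGCUAGCUAA"
sample = "AUGGCUACGUGGCUAGCUAA"

print(percent_identity(reference, sample))  # 19 of 20 letters match: 95.0
```

The same count applied to gene-length sequences gives the kind of percentage the paper quotes.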

This mutation, the work found, was responsible for disrupting a hydrogen bond at the interface between the RBD and the receptor ACE2 in human cells. They argued that since the bond is important for the exceptionally strong binding of the RBD and ACE2, this mutation would lead to a weakened binding of the virus to human cells. Interestingly, this mutation was seen in one of the Indian isolates obtained on January 27 from a case in Kerala whose origin was linked to Wuhan. From this, the authors inferred that mutations of significance were beginning to occur, notwithstanding the fact that this observation was based on data from one genome. More significantly, they found that all the genomes seemed to group into two clusters, indicating that the virus had spread from two sources. They, of course, added the caveat: “However, these results may be based on limited genomic data in the early stage of virus development. It is critical to study and monitor the mutation dynamics of SARS-COV-2.”

Unique mutation

In an earlier article, we had discussed work by Indian scientists who had found another unique mutation in one of the two early Indian genomes submitted to the public database, which the authors had conjectured could trigger a protective microRNA response. The mutation in the Indian genome discussed above is different from that one. In fact, both mutations had also been noticed by the scientists of the National Institute of Virology (NIV), Pune, who had sequenced the first two complete genomes from Indian samples, both of which could be linked to the Wuhan strain. They had also pointed out that while the mutation with apparently weakened binding was in the S1/RBD region, the mutation that had the potential of eliciting a microRNA response was in the S2 region. But, with multiple passages of the virus as infections increased, these mutations (which may even have been single random events) seem to have been discarded by selection, as neither mutation figures in any of the 11 main genome types in circulation at present, let alone the dominant one, A2a.

Similarly, a recent work by a group of Chinese scientists from Zhejiang University, led by Hangping Yao, posted on the medRxiv preprint server on April 14, had found certain mutations with higher virulence and pathogenicity in the early phase of the outbreak itself, but most of these too do not seem to have occurred with greater frequency in the subsequent spread of the disease. The scientists had examined virus isolates taken between January 22 and February 4 from 11 patients admitted to the hospitals affiliated to the university, whose ages ranged from four months to 71 years. They noted that while data publicly available up to March 24 revealed several mutations, none had been directly linked to changes in viral pathogenicity. To look for such a link, they carried out functional characterisation of the 11 patient-derived isolates. They noticed considerable mutational diversity in general and recorded 33 mutations in all (of which 19 were novel when compared with 1,111 publicly available genomic sequences), including six mutations in the S protein.

Importantly, they found significant variation in the viral loads and cytopathic effects (structural changes to the cells) among these isolates when they infected Vero cells (a cell line derived from African green monkeys), in which the structure of the ACE2 receptor is believed to be similar to that in humans. The difference in viral load between the two extreme isolates was as high as 270-fold. For the isolate with the next highest viral load, the difference was only 19-fold. This was claimed as direct evidence of SARS-CoV-2 having acquired mutations that altered its pathogenicity substantially.

Their other important finding was that, when data from these 11 isolates were compared with 725 high quality and high coverage publicly available genomes, some of the mutations were found to be defining or founding mutations for major clades (genome clusters with a common ancestor) of the virus that are currently known to be in circulation, particularly in the U.S. and Europe. Of the 725 genomes, 231 belonged to the European cluster and 208 belonged to the U.S. cluster. Epidemiologically, this is of significance as it implies that the origins of some of the currently circulating strains can be traced to China. Interestingly, the isolate that produced the 19-fold viral load belonged to the European cluster. The isolate that had the 270-fold viral load did not, however, seem to fit into any known cluster, which suggests that this strain was purged by negative selection.

Positive selection

Let us now return to the main burden of the article: the recent emergence through positive selection of a dominant strain of the virus in different regions of the world. As mentioned earlier, from among the 11 distinct genome types, the evolved Type A2a had emerged as the dominant one during the course of the pandemic, and had replaced the ancestral Type O that had dominated across the affected countries during the early phase.

Examining the evolutionary dynamics of the virus by analysing the publicly available data from 3,636 genomes up to April 6, Majumder and Biswas found that while there was considerable temporal variation in the frequencies with which different virus types were seen among the disease positives, spatial variation (across different geographical regions) was not very significant. They note that there is, however, significant micro-level spatial variation in the type frequencies, say across sub-regions of a country, which could be due to other epidemiological factors.

According to the authors, only five types (O, B, B1, A1a and A2a) have high frequencies in the genome collection, with 51 per cent (1,848 of 3,636) being Type A2a. Fig. 3 shows the remarkable temporal change in the frequencies of different types across geographical regions. From Fig. 4 (which includes Iceland and Congo because the number of genomes from there was large in proportion to the number of cases), one can see that the type diversity initially increased in all affected countries (barring Italy) but by March end it had decreased, leaving A2a as the dominant one. It would also seem that in China, though there is diversity, Type O remained dominant, but this could be a data artifact because China had deposited just one new genome in March, which is of Type A2a.

U.S. pattern

The pattern in the U.S. is interesting. While the diversity had diminished by March, with O losing its dominance, the frequency of Type B1, which emerged strongly in February, remains significantly high even as A2a has become dominant. The biological reason for this co-existence of A2a with other types, like B1 in the U.S. and B in Spain (Fig. 4), is unclear, say the authors. It remains to be seen if competing types existing in the same region persist.

Only the original two Indian genomes, which were discussed earlier in the article, fall within the period analysed by this work. However, since India submitted 33 additional genomes in April alone, the authors have separately looked at the type classification and frequencies of all 35. These 35 include complete genomic data of 21 isolates obtained from Indians returning from China, Iran and Italy as well as Italian tourists in India and their close contacts in India. Analysing these 35 genome sequences, Biswas and Majumder have found that they fall into four types: O (5) and the derivative types A2a (16), A3 (13) and B (1). Types A2a and A3 dominate and, according to the analysis, all the Type A3 isolates are from people with a travel history to Iran, while the A2a isolates are from people who had links to countries other than China and Iran.
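
The type frequencies quoted in this section follow from simple counts. The Python sketch below recomputes them from the numbers given in the text; only the variable and function names are ours.

```python
# Counts of the 35 Indian genomes by type, as reported in the text.
indian_counts = {"O": 5, "A2a": 16, "A3": 13, "B": 1}


def type_shares(counts):
    """Convert raw counts per virus type into percentage shares."""
    total = sum(counts.values())
    return {virus_type: 100.0 * n / total for virus_type, n in counts.items()}


shares = type_shares(indian_counts)
for virus_type, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{virus_type}: {share:.1f}%")

# Worldwide share of Type A2a: 1,848 of the 3,636 genomes analysed.
global_a2a_share = 100.0 * 1848 / 3636
print(f"A2a worldwide: {global_a2a_share:.1f}%")
```

On these counts A2a and A3 together account for 29 of the 35 Indian genomes, and the worldwide A2a share rounds to the 51 per cent cited by the authors.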

What is the significance of the mutations that define Type A2a, which, as the above finding shows, has acquired a foothold in India as well? According to Biswas and Majumder, there are two defining mutations: a primary SNV that replaces the nucleotide adenine (A) with guanine (G) at one genetic site in the viral genome, which translates into the replacement of the amino acid aspartic acid with glycine in the S protein; and a secondary one, which is an amino acid substitution (of proline by leucine) at another site.
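
How a single-letter change in the genome becomes an amino acid change can be shown with the standard genetic code. In the Python sketch below, the specific codons (GAU and CCU) are assumptions chosen for illustration, since the text does not say which codons the virus uses at these sites; the table is a small subset of the standard code, written in RNA letters.

```python
# Subset of the standard genetic code (RNA codons) covering the four
# amino acids mentioned in the text.
CODON_TABLE_SUBSET = {
    "GAU": "Asp", "GAC": "Asp",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "CCU": "Pro", "CCC": "Pro", "CCA": "Pro", "CCG": "Pro",
    "CUU": "Leu", "CUC": "Leu", "CUA": "Leu", "CUG": "Leu",
}


def substitute(codon, position, new_base):
    """Return the codon with the letter at `position` (0-based) replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def describe(codon, position, new_base):
    """Show the codon and amino acid before and after a single-letter change."""
    mutated = substitute(codon, position, new_base)
    return (f"{codon} ({CODON_TABLE_SUBSET[codon]}) -> "
            f"{mutated} ({CODON_TABLE_SUBSET[mutated]})")


# A -> G in the middle of an aspartic acid codon gives a glycine codon.
print(describe("GAU", 1, "G"))
# C -> U in the middle of a proline codon gives a leucine codon.
print(describe("CCU", 1, "U"))
```

In both cases a change of one letter out of three is enough to swap one amino acid for another, which is what makes a single SNV capable of altering a protein.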

Significantly, the S-protein mutation is at the S1/S2 boundary near the site where the cleaving enzyme furin acts. Arguing that this region is known to be subject to strong positive selection pressure, the authors speculate that this mutation, either on its own or in conjunction with the second, may be giving the virus a selective advantage by making its entry into the cell much easier than before, resulting in enhanced transmission and infectivity, and perhaps pathogenicity as well. It may be pointed out here that Yao’s group, in its work discussed above, had noted this S-protein mutation to be the founding mutation for the European cluster. They had also found that the one isolate among the 11 that showed a 19-fold viral load belonged to this cluster. This ties in with Biswas and Majumder’s finding that A2a had already spread widely in Europe in February (Fig. 4) and, going by Yao’s work, the defining mutation of A2a probably had its origin in China.

As Biswas and Majumder have emphasised, the changes in the virulence of Type A2a of SARS-CoV-2 need to be established with more detailed studies as its frequency increases across the world, and in India in particular, where genomic data need to be obtained from a greater number of isolates. This would be important for evolving appropriate pharmaceutical intervention strategies, including vaccine development, here and elsewhere.