We analyzed the nucleocapsid and surface proteins of several viruses in the family Coronaviridae using an alignment-free computer program. We examined three isolates of the novel human coronavirus (SARS-CoV-2, 2019) responsible for the current pandemic, together with older SARS strains of human and animal coronaviruses. The nucleocapsid and surface glycoprotein sequences are identical across the three novel 2019 human isolates and are closely related to the corresponding sequences in six bat and human SARS coronaviruses. This strongly supports a bat origin for the novel pandemic coronavirus. One surface glycoprotein fragment of 111 amino acids is the largest conserved common permutation shared by the examined bat SARS-like and human SARS viruses, including the Covid-19 virus. BLAST analysis confirmed that this fragment is conserved only in the human and bat SARS strains. This fragment is likely involved in infectivity and is of interest for vaccine development. Surface glycoprotein and nucleocapsid protein sequence homologies of 58.9% and 82.5%, respectively, between the novel SARS-CoV-2 strains and the human SARS (2018) virus suggest that existing anti-SARS vaccines may provide some protection against the novel coronavirus.
Viruses belonging to the family Coronaviridae are known to cause severe and acute lung inflammation in humans and other animals (1). Comparative analysis of coronavirus proteins is useful for understanding the origins of these viruses and their relationships to one another, for developing more specific diagnostic tests, and for designing vaccines against the novel coronavirus that is causing widespread morbidity and mortality across the world. We used an alignment-free software program (Compare), developed by one of us (Babu V. Bassa), to compare the surface glycoproteins and nucleocapsid proteins of the coronaviruses. Alignment-free programs have been argued to be superior to alignment-based programs because of known uncertainties associated with sequence alignment (2). Our program extracts the common amino acid sequences (permutations) of five residues or longer from any given pair of proteins. This procedure identifies conserved fragments and provides information on the physical similarities among the primary structures of biological sequences.
The analysis was performed with the software tool “Compare”. The algorithm was implemented in Microsoft Visual Basic for the Windows operating system. An outline of the algorithm is presented in Figure 1. The source code and the raw data will be made available to the Journal for distribution.
This example shows how the common permutation of the two sequences (“mnopqrs”) is identified.
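The idea of a largest common permutation can be illustrated with a short Python sketch. This is not the authors' Visual Basic implementation of Compare; it is a minimal illustration that assumes a "common permutation" means a contiguous run of residues present in both sequences. The function name and the two toy input strings are hypothetical and were chosen only so that the shared fragment is "mnopqrs".

```python
def largest_common_fragment(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b.

    Assumption: this models the 'largest common permutation' as the
    longest common substring, found by dynamic programming.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # lengths of matches ending at the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended one position earlier in both.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical toy sequences, not taken from the manuscript's figure.
seq_a = "abcmnopqrsxyz"
seq_b = "ghmnopqrstuv"
print(largest_common_fragment(seq_a, seq_b))  # prints: mnopqrs
```

The dynamic-programming table needs no alignment or gap scoring: it only records how long the current exact match is at each pair of positions.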
In keeping with the current terminology, the coronavirus strain of the current outbreak is referred to as “novel coronavirus”, and the strains that caused severe acute respiratory syndrome prior to the 2019 outbreak are referred to as “SARS strains” throughout the manuscript. Similarly, bat SARS viruses are referred to as “bat SARS-like strains”.
The coronavirus sequences used in this analysis were obtained from GenBank. The GenBank accession numbers for all comparisons are given in Table 1. Additionally, a large number of animal coronaviruses were screened and were found to be very distant from the novel coronavirus in terms of sequence homology (described later in this section). The severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU01, GenBank accession number MN988668 (11-FEB-2020), was obtained as the complete genome (3). The severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, GenBank accession number NC_045512.2 (13-MAR-2020), was obtained as the complete genome (4). The SARS coronavirus, GenBank accession number NC_004718 (13-AUG-2018), was obtained as the complete genome (5).
Reference strain for all comparisons: SARS-CoV-2, accession MT072688. NCP: nucleocapsid protein; SGP: surface glycoprotein. Largest common permutation is given in residues.

| Coronavirus | NCP % homology | NCP largest common permutation | SGP % homology | SGP largest common permutation | Accession |
|---|---|---|---|---|---|
| SARS-CoV-2 (1) Feb 2020 | 100 | 419 | 100 | 1273 | MN988668 |
| SARS-CoV-2 (2) Mar 2020 | 100 | 419 | 100 | 1273 | NC_045512 |
| Bat.RaTG13 Mar 2020 | 99 | 177 | 99 | 440 | MN996532 |
| Rhinolophus affinis 2014 | 85.4 | 99 | 59.2 | 111 | KF569996 |
| SARS.BAT July 2017 | 83.7 | 66 | 57.9 | 111 | JX993988 |
| SARS.Human Aug 2018 | 82.5 | 43 | 58.9 | 111 | NC_004718 |
| SARS.Bat Dec 2017 | 82.5 | 43 | 59.4 | 111 | KY417152 |
| Kenyan Bats Feb 2020 | 73.3 | 44 | 50.9 | 97 | KY352407 |
| M.East.Res.Syndrome May 2016 | 16.6 | 8 | 5.4 | 14 | KX034100 |
| Avian Aug 2018 | 8.2 | 6 | 3.1 | 8 | NC_001451 |
| Human Dec 2018 | 6.5 | 12 | 4.8 | 7 | NC_003045 |
| Bat.Corona.HKUB Apr 2008 | 1 | 4 | 2.1 | 8 | EU420139 |
Several hundred whole-genome sequences from the family Coronaviridae are available in GenBank; however, many were found to be repeats of the same strains. In the initial phase of this study we screened more than 100 combinations of nucleocapsid proteins and surface glycoproteins from various coronaviruses. The three novel coronavirus isolates (SARS-CoV-2) were identical to one another and similar to SARS and bat SARS-like viruses. Based on this initial screening, pairs of viral strains were chosen for the final analyses. Nucleocapsid protein (NCP) and surface glycoprotein (SGP) were selected for comparative analysis because of their known importance in infection and in the immune response. To calculate sequence homologies, the lengths of all common fragments of five or more amino acid residues were summed, and the sum was expressed as a percentage of the total sequence length. The homology parameter so obtained is a relative index applicable only to this method of calculation.
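The homology index described above can be sketched in Python as follows. The manuscript does not specify how overlapping fragments are handled or which sequence length is used as the denominator, so this sketch makes two assumptions that are not confirmed by the source: fragments are extracted greedily, longest first, with matched residues masked so they are counted only once, and the percentage is taken relative to the length of the first (reference) sequence. All function names and the toy sequences are hypothetical.

```python
def _longest_common(a, b):
    """Return (start_in_a, start_in_b, length) of the longest shared run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    n = curr[j]
                    best = (i - n, j - n, n)
        prev = curr
    return best


def common_fragments(a: str, b: str, min_len: int = 5):
    """Greedily collect shared fragments of at least min_len residues.

    Assumption: each residue is counted at most once, so matched regions
    are masked with sentinels that can never match each other.
    """
    a_chars, b_chars = list(a), list(b)
    fragments = []
    while True:
        i, j, n = _longest_common(a_chars, b_chars)
        if n < min_len:
            break
        fragments.append("".join(a_chars[i:i + n]))
        a_chars[i:i + n] = ["#"] * n  # mask in the reference sequence
        b_chars[j:j + n] = ["$"] * n  # different mask in the other sequence
    return fragments


def percent_homology(a: str, b: str, min_len: int = 5) -> float:
    """Summed fragment length as a percentage of len(a) (assumed denominator)."""
    if not a:
        return 0.0
    total = sum(len(f) for f in common_fragments(a, b, min_len))
    return 100.0 * total / len(a)


# Toy 10-residue sequences: "MKTAY" (5 residues) is shared; "KQR" is too short.
print(percent_homology("MKTAYIAKQR", "MKTAYWWKQR"))  # prints: 50.0
```

Under these assumptions two identical sequences give 100% with a single fragment spanning the full length, which is consistent with the 100% and full-length entries reported for the SARS-CoV-2 isolates in Table 1.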
Protein fragments from the three isolates of novel coronavirus (SARS-CoV-2) are identical, and they have higher degrees of sequence homology with the SARS and bat SARS-like strains reported prior to the current outbreak than with the other coronaviruses examined (Table 1). The latest reported bat strain, Bat.RaTG13-Mar-2020 (6), has 99% sequence homology with the novel coronavirus with respect to both SGP and NCP. The Kenyan bat coronavirus genome that was deposited in 2016 by the Centers for Disease Control laboratory, Atlanta, Georgia (7), has significant homology with the novel coronavirus with respect to both NCP and SGP (Table 1). A largest common permutation of 111 residues is conserved in the SGP of the novel coronavirus, SARS strains, and bat SARS-like coronaviruses. It is part of the 440-residue fragment shared with Bat.RaTG13 (Table 1). Therefore, the 111-residue fragment is present in the novel coronavirus, in at least one human SARS strain of coronavirus, and in at least one bat SARS-like strain of coronavirus (Table 1). Based on this observation, we subjected the 111-residue fragment to a BLAST search and found that it is preserved only in SARS viruses (data not presented). The 111-residue SGP motif is absent in avian, MERS, some human, and some bat strains of coronaviruses. The NCP in coronaviruses is only 419 amino acids long. As shown in Figure 3 and Table 1, several polypeptide motifs originating from this protein are common to the novel coronavirus, SARS, and bat SARS-like strains of coronavirus. A compilation of common polypeptide motifs is presented in Figure 1, Figure 2, and Figure 3. These polypeptide motifs will be useful as detection tools in studying the origins of the novel coronavirus. They will also be helpful in designing vaccine candidates.
For each numbered motif, the size is given in parentheses and the location of the fragment in the molecule is indicated by the underlined residue number.
Unlike alignment-based sequence comparison programs, our software tool compares sequences by identifying and profiling the common permutations between given pairs of biological sequences. The resulting picture is easily understood and interpreted, and it avoids some of the uncertainties associated with alignment-based sequence comparison programs. The size of the largest common permutation is an easily understood measure of the relationship between a pair of sequences. The program and its applications are more fully described in the methods section and in Figure 1. The validity of the program was established by using it to reproduce results obtained by other programs with GenBank data.
The nucleocapsid protein and surface glycoprotein of the SARS coronaviruses (Figure 5) are integral parts of the virus structure. They can be identical or can have varying degrees of similarity (homology) among viruses in this group (Table 1). These complex molecules contribute to differences in host range, infectivity and pathogenicity.
Structural depiction of a coronavirus (Source: Drazen JM). The spike glycoprotein (surface glycoprotein) and nucleocapsid protein sequences were analyzed in the current study. Our analysis shows that one fragment of 111 amino acid residues belonging to the surface glycoprotein is conserved in many lethal strains of human and bat coronaviruses.
The data presented strongly support a very close relationship between some bat coronaviruses and the novel human coronaviruses that are causing much morbidity and mortality across the world. The statistical probability that so many common permutations in two different proteins would occur between any two strains of viruses purely by chance is infinitesimally small. The similarity between bat and human strains is, however, disputed by some scientists (8). With regard to this dispute and the natural selection hypothesis, we strongly disagree with the idea of using biological activities to determine the origins of viruses or any other species. We prefer, and have presented, physical evidence. Regardless of the dispute over the origins of the highly lethal virus strains, the high degree of homology (a physical characteristic) raises the theoretical possibility that vaccines against pre-2019 SARS strains will provide some cross-protection against the novel coronavirus strains. The comparative data between current and past strains of coronaviruses establish an approach for interim vaccine development. Avian, bovine, equine, canine, feline, calf-giraffe, rabbit, water deer, and some strains of human and bat coronaviruses have very low sequence homology with the novel coronaviruses as analyzed by “Compare” (data not shown).
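To give a rough sense of scale for the chance argument, the Python sketch below estimates the expected number of exact shared fragments of a given length between two unrelated random sequences. It rests on a simplifying assumption that is not made in the manuscript: residues are independent and uniformly distributed over the 20 amino acids. The function name is hypothetical, and the use of 1273 residues for both sequences simply reuses the SGP length from Table 1.

```python
import math


def log10_expected_chance_matches(len_a: int, len_b: int, k: int,
                                  alphabet: int = 20) -> float:
    """log10 of the expected number of exact k-residue matches by chance.

    Assumption: independent, uniformly distributed residues, so each pair
    of start positions matches with probability alphabet ** -k.
    """
    if k < 1 or k > min(len_a, len_b):
        raise ValueError("k must be between 1 and the shorter sequence length")
    start_pairs = (len_a - k + 1) * (len_b - k + 1)
    return math.log10(start_pairs) - k * math.log10(alphabet)


# Two 1273-residue proteins sharing a 111-residue fragment by chance:
# the result is about -138, i.e. an expectation near 10 ** -138.
print(log10_expected_chance_matches(1273, 1273, 111))
```

Real protein sequences are neither uniform nor independent, so this figure is only an order-of-magnitude illustration; it is not a substitute for a formal statistical test.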
In conclusion, our data strongly support a close relationship among bat coronaviruses, the human SARS virus (2018 strain), and the novel coronavirus. The identified protein fragments are highly conserved in the lethal and highly contagious SARS strains, including both the older and the most recent ones. These proteins are essential to the virulence, lethality and infectivity of the viruses. They will be useful for designing vaccines and improved diagnostic tests, and for understanding the nature of infection by these viruses and their potential future mutations.
NCP: Nucleocapsid protein, SGP: Surface glycoprotein, MERS: Middle East Respiratory Syndrome.