Aims The recent emergence of novel, pathogenic severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) poses a global health emergency. Coronaviral entry requires the spike (S)-protein for attachment to the host cell surface, and employs human angiotensin-converting enzyme 2 (hACE2) for entry and transmembrane protease serine 2 (TMPRSS2) for S-protein priming. Although coronaviruses evolve through mutation, it is also essential to understand the host genetic factors. Here, we describe the single nucleotide variations (SNVs) in human ACE2 and TMPRSS2.
Methods The genetic variants derived from five population-sequencing projects were classified by variant type, allele frequency (AF), ethnic group and estimated pathogenicity. The SNVs in SARS-CoV-2/hACE2 contact residues were investigated. The genetic variability was normalised using non-linear regression and the total number of SNVs was estimated by the derived formulas.
Results We detected 349 and 551 SNVs in ACE2 and TMPRSS2, respectively, in a total of 156 513 individuals. The vast majority (>97%) of the SNVs were very rare (AF <0.1%) and population-specific, and were computationally estimated to be more frequently deleterious than the SNVs with high AF. These SNVs were distributed throughout the coding regions; some ACE2 variants were located in the SARS-CoV-2/hACE2 contact residues, with a hemizygous state occurring in males. Using regression analysis, the total numbers of genetic variations in ACE2 and TMPRSS2 were estimated to be 1.1×10³ and 1.5×10³, respectively, for a population of one million people.
Conclusion The majority of SNVs in ACE2 and TMPRSS2 are rare, population-specific and deleterious, and a multitude of very rare SNVs may explain different susceptibility to SARS-CoV-2.
Data availability statement
All data relevant to the study are included in the article or uploaded as supplementary information.
This article is made freely available for use in accordance with BMJ’s website terms and conditions for the duration of the covid-19 pandemic or until otherwise determined by BMJ. You may use, download and print the article for any lawful, non-commercial purpose (including text and data mining) provided that all copyright notices and trade marks are retained. https://bmj.com/coronavirus/usage
Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is a novel coronavirus that was first detected in Wuhan, China, and causes COVID-19.1–4 As of 8 July 2020, >10 million cases of COVID-19 had been confirmed worldwide.5 The COVID-19 outbreak has resulted in social disorder and the collapse of medical care systems globally.6–9
Presently, some Western countries have recorded the highest rates of SARS-CoV-2 infection, the highest number of deaths and the highest mortality rates.5 These data may reflect several factors, including: 1) the total number of tests performed, 2) the possible existence of more virulent strains, 3) the population structure, 4) the prevalence of chronic illness or smoking, 5) occupational exposure to the virus and 6) differences in environmental factors (eg, temperature, humidity and air pollution).10–17 However, particular genetic characteristics of the populations may also affect susceptibility to viral infection, disease severity and the number of patients shedding large amounts of the virus. In addition, it is suggested that COVID-19 is more likely to affect males than females, and can result in severe and even fatal respiratory diseases such as acute respiratory distress syndrome12 18–25 and mild-to-moderate gastrointestinal symptoms, such as nausea, diarrhoea or abdominal pain.26–30
The spike (S)-protein of coronaviruses facilitates viral entry into target cells.31 32 Entry depends on S-protein binding to a cellular receptor and S-protein priming by a cellular protease. The SARS-CoV S-protein engages human angiotensin-converting enzyme 2 (hACE2) as entry receptor33–35 and employs the cellular serine protease, transmembrane protease serine 2 (hTMPRSS2), for S-protein priming.36–38 The coronavirus S-protein/hACE2 interface has been elucidated and the efficiency of hACE2 usage has been shown to be a key determinant of SARS-CoV transmissibility. Importantly, a previous study has demonstrated that a number of ACE2 variants could affect the association between ACE2 and S-protein in SARS-CoV or HCoV-NL63.35 Recent reports suggested that SARS-CoV and SARS-CoV-2 share 73% amino acid identity39 and the novel SARS-CoV-2 also uses the ACE2 and TMPRSS2 for entry into target cells.40 Therefore, the genetic variation in these two genes in different populations might be also critical for the susceptibility, symptoms and outcome of SARS-CoV-2 infection. Yet, to date, a comprehensive overview of the genetic diversity of the two virus-entry-related genes is lacking.
Here, we provided the largest data set of ACE2 and TMPRSS2 gene polymorphisms from five extensive population-sequencing projects (total 156 513 individuals). The very rare SNVs we identified could contribute to a better understanding of gender differences and different susceptibilities or responses to SARS-CoV-2 in different human populations under similar conditions.
Materials and methods
Analysis of genetic variants
Data were collected from the genotyping pipelines of the 1000 Genomes (1000G) project (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/),41 the US National Heart, Lung, and Blood Institute (NHLBI) (http://www.nhlbi.nih.gov/),42 gnomAD (https://gnomad.broadinstitute.org/),43 Tohoku Medical Megabank Organization (ToMMo) (https://www.megabank.tohoku.ac.jp/english/)44 45 and UK10K (https://www.uk10k.org/),46 which consisted of high-coverage whole-genome/whole-exome sequence data from various ethnic groups. The data set consisted of 156 513 individuals from various countries (online supplementary table S1; it should be noted that gnomAD includes the 1000G data set, but not other projects). The data set was then filtered using Variant Tools (http://varianttools.sourceforge.net/Annotation/HomePage) by variant type, allele frequency (AF), countries, ethnic/racial groups and pathogenicity. Information on variant types, positions and reference sequences were retrieved from NCBI dbSNP (http://www.nlm.nih.gov/SNP/).
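The AF-based classification step can be illustrated with a minimal Python sketch. It assumes the AF cut-offs quoted elsewhere in this article (AF <1%, AF <0.1% and AF <0.001%); the function name, the "common" label and the allele counts are hypothetical, and this is not the Variant Tools pipeline itself.

```python
def classify_allele_frequency(allele_count, allele_number):
    """Return (AF, label) for one variant.

    AF is allele_count / allele_number. Labels follow the cut-offs quoted
    in the article; "common" is a hypothetical name for everything else.
    """
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    af = allele_count / allele_number
    if af < 0.00001:      # AF < 0.001%
        label = "quite rare"
    elif af < 0.001:      # AF < 0.1%
        label = "very rare"
    elif af < 0.01:       # AF < 1%
        label = "rare"
    else:
        label = "common"
    return af, label


# Hypothetical allele counts, for illustration only.
examples = {
    "variant_a": classify_allele_frequency(1, 200_000),
    "variant_b": classify_allele_frequency(50, 100_000),
    "variant_c": classify_allele_frequency(500, 100_000),
    "variant_d": classify_allele_frequency(5_000, 100_000),
}
for name, (af, label) in examples.items():
    print(f"{name}: AF={af:.6%} -> {label}")
```

Each variant receives only its most specific label, so a variant labelled "quite rare" here also satisfies the broader AF <0.1% and AF <1% definitions.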
Deleteriousness prediction methods
We comprehensively evaluated the predictive performance of 26 current deleteriousness-scoring methods, including 23 function prediction scores (SIFT, SIFT4G, PolyPhen-2-HDIV, PolyPhen-2-HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST4, MetaSVM, MetaLR, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, CADD, DANN, fathmm-MKL and GenoCanyon) and 3 conservation scores (GERP++, SiPhy and PhyloP). The scores were obtained from the dbNSFP database V.4.0.47 Note that the prediction scores obtained from the dbNSFP database were transformed from the original prediction scores according to the threshold value (online supplementary table S2).
Functional domains including transmembrane and signal peptide regions were predicted using InterPro (https://www.ebi.ac.uk/interpro/) with default options.
Non-linear regression model of ACE2 and TMPRSS2 genetic variation
The genetic variability of ACE2 and TMPRSS2 genes was normalised based on non-linear regression according to previous research.48 Normalisation allows the estimation of populations with different accumulated sample sizes. The relationship between AF and genetic variation was determined using a scatter plot. This plot showed a pattern of exponential decay, and thus, a negative exponential model was fitted. The formula was then converted and plotted against the population size as follows:
where NACE-m, NACE-f and NTMPRSS2 correspond to the estimated number of genetic variations, x refers to the population size and R² is the coefficient of determination. The total number of genetic variations in the two genes was estimated using formulas (1), (2) and (3). Since the ACE2 gene is located on the X-chromosome, two different formulas, NACE-m (1) and NACE-f (2), were derived for males (46,XY) and females (46,XX), respectively.
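As an illustration of fitting a negative exponential model to a decay pattern, the following Python sketch fits y = a·exp(−b·x) by ordinary least squares on ln(y). This log-linear shortcut stands in for a true non-linear regression; the functional form, the function name and the data points are assumptions for illustration and do not reproduce formulas (1) to (3).

```python
import math


def fit_negative_exponential(xs, ys):
    """Fit y = a * exp(-b * x) by least squares on ln(y).

    Requires every y to be strictly positive. Returns (a, b, r_squared),
    where r_squared is computed on the log scale.
    """
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log_y = sum(log_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys))
    slope = sxy / sxx
    intercept = mean_log_y - slope * mean_x
    ss_res = sum((ly - (intercept + slope * x)) ** 2
                 for x, ly in zip(xs, log_ys))
    ss_tot = sum((ly - mean_log_y) ** 2 for ly in log_ys)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(intercept), -slope, r_squared


# Hypothetical, noise-free decay data generated from a=200, b=0.8.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [200.0 * math.exp(-0.8 * x) for x in xs]
a, b, r_squared = fit_negative_exponential(xs, ys)
print(f"a={a:.3f}, b={b:.3f}, R^2={r_squared:.4f}")
```

With real, noisy data a dedicated non-linear least-squares routine would usually be preferred, because fitting on the log scale weights small y values more heavily than a direct fit does.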
Statistical analysis was performed using the Mann-Whitney U test. A probability of p<0.05 was considered to be statistically significant. Statistical analyses were performed using JMP software (V.10.0; SAS Institute, Cary, North Carolina, USA).
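For readers unfamiliar with the Mann-Whitney U test, the Python sketch below computes the two U statistics from average ranks. It is a minimal illustration only: the authors used JMP, the scores are hypothetical, and no p value is derived here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical deleteriousness scores for two groups of variants.
rare_scores = [0.9, 0.8, 0.7]
common_scores = [0.2, 0.4, 0.75]
u_rare, u_common = mann_whitney_u(rare_scores, common_scores)
print(u_rare, u_common)
```

The smaller of the two U values is conventionally compared against a critical value or converted to a p value, which a statistics package handles together with the tie correction.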
Genetic variations in human ACE2
In order to obtain a comprehensive overview of the genetic diversity of ACE2 and TMPRSS2, we collected SNV data sets from five extensive population-sequencing projects (1000G, NHLBI, gnomAD, ToMMo and UK10K). After removing the overlapping individuals, the data sets included genome/exome sequences of 156 513 individuals from diverse ethnic origins (online supplementary table S1).
The genetic diversity of ACE2 and TMPRSS2 is summarised in figures 1 and 2, respectively. ACE2 is located on the X-chromosome, which raises the possibility that differences in sex chromosome dosage (46,XY vs 46,XX) could cause the phenotype to be always expressed in males. In ACE2, 349 SNVs were identified in the coding regions and splice sites, and were broadly distributed throughout the coding regions (figure 1A). A complete list of the variants can be found in online supplementary table S3. Most identified SNVs (n=247, 70.8%) were missense, while synonymous variants accounted for 26.9% of the SNVs (n=94) (figure 1B). The remaining 2% of SNVs were stop-gained (n=2), start-loss (n=1), splice site variants (n=2) or indels (n=3) (figure 1B). Per gene, this resulted in a non-synonymous/synonymous variant ratio (dN/dS) >1. The majority of these SNVs were rare (n=347, 99.4%) or quite rare (n=169, 48.4%) with AF <1% or AF <0.001%, respectively (figure 1C).
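The proportions quoted for ACE2 can be checked with a few lines of Python. The counts are those given in the text; the missense/synonymous ratio below is a raw count ratio, which is an assumption, since the text states only that the ratio exceeds 1 and not how it was normalised.

```python
# Variant-type counts for ACE2 as reported in the text.
ace2_counts = {
    "missense": 247,
    "synonymous": 94,
    "stop_gained": 2,
    "start_loss": 1,
    "splice_site": 2,
    "indel": 3,
}


def percent(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


total = sum(ace2_counts.values())
missense_pct = percent(ace2_counts["missense"], total)
synonymous_pct = percent(ace2_counts["synonymous"], total)
other_pct = percent(
    total - ace2_counts["missense"] - ace2_counts["synonymous"], total
)
# Raw count ratio (assumption: no per-site normalisation applied).
missense_to_synonymous = ace2_counts["missense"] / ace2_counts["synonymous"]

print(total, missense_pct, synonymous_pct, other_pct)
print(round(missense_to_synonymous, 2))
```

The counts sum to the 349 SNVs reported, and the rounded percentages agree with the 70.8%, 26.9% and approximately 2% stated above.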
In silico prediction of the functional effects of SNVs in human ACE2 gene
At present, the majority of the ACE2 variants have not been functionally characterised. To obtain insights into the likelihood of functionally deleterious effects of the identified variants, we comprehensively evaluated the predictive performance of 26 current deleteriousness-scoring methods, including 23 function prediction scores (SIFT, SIFT4G, PolyPhen-2-HDIV, PolyPhen-2-HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST4, MetaSVM, MetaLR, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, CADD, DANN, fathmm-MKL and GenoCanyon) and 3 conservation scores (GERP++, SiPhy and PhyloP) (online supplementary table S4). Fraction deleteriousness ranged from 0% to 90.3%, possibly due to the differences in information and algorithms used for this prediction, but the median was 46.8% (calculated by LRT) (figure 1D). Importantly, rare SNVs were estimated to be more frequently deleterious than the SNVs with a high AF by some scoring methods (table 1).
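The idea of a per-method "fraction deleteriousness" and its median across methods can be sketched as follows. The method names and calls are hypothetical placeholders rather than dbNSFP output, and excluding missing calls from the denominator is an assumption.

```python
from statistics import median


def fraction_deleterious(calls):
    """Percentage of scored variants called deleterious.

    `calls` holds True (deleterious), False (tolerated) or None (no score).
    Assumption: variants without a score are excluded from the denominator.
    """
    scored = [call for call in calls if call is not None]
    if not scored:
        raise ValueError("no scored variants")
    return 100 * sum(scored) / len(scored)


# Hypothetical calls from three placeholder scoring methods.
calls_by_method = {
    "method_a": [True, False, True, True],
    "method_b": [False, False, True, None],
    "method_c": [False, False, False, False],
}
fractions = {
    name: fraction_deleterious(calls)
    for name, calls in calls_by_method.items()
}
median_fraction = median(fractions.values())
print(fractions, median_fraction)
```

Because each method applies its own threshold and training data, the spread between the lowest and highest fraction can be wide, which is why the median across methods is a useful summary.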
Estimation of the genetic variability of ACE2 in populations
To predict the number of overall ACE2 genetic variants in populations, we calculated the normalised genetic variability using non-linear regression (figure 1E). The observed relationship between AF and genetic variations exhibited a pattern of exponential decay (figure 1E). A negative exponential model was fitted to these decay curves and then evaluated for unseen ACE2 variations (figure 1E). Since allele count is different between males (46,XY) and females (46,XX), the per-individual genetic variability was estimated to be significantly higher in females than that in males regardless of population size (figure 1F). The total number of genetic variations in ACE2 genes was expected to be approximately 4.2×10³ for male populations and 6.8×10³ for female populations (average, 5.5×10³) for a sample size of 10 million individuals, indicating that variants described to date constitute only a small fraction of genetic variability in ACE2 genes present on a population scale.
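The difference in allele count between the sexes follows from chromosome dosage: males carry one copy of an X-linked gene such as ACE2 and females carry two, whereas both carry two copies of an autosomal gene such as TMPRSS2. A minimal Python sketch, using hypothetical sample sizes:

```python
def total_alleles(n_males, n_females, x_linked):
    """Number of sampled alleles at one locus.

    For an X-linked locus each male contributes one allele and each female
    two; for an autosomal locus everyone contributes two.
    """
    if x_linked:
        return n_males + 2 * n_females
    return 2 * (n_males + n_females)


# Hypothetical cohort, for illustration only.
n_males, n_females = 1_000, 1_000
ace2_alleles = total_alleles(n_males, n_females, x_linked=True)
tmprss2_alleles = total_alleles(n_males, n_females, x_linked=False)
print(ace2_alleles, tmprss2_alleles)
```

For the same number of individuals, a female-only sample therefore contains twice as many ACE2 alleles as a male-only sample, and a variant carried by a male is hemizygous.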
Genetic variants at the interface between the SARS-CoV-2 S-protein and human ACE2
A recent report suggested that while the sequence identity between the S-protein of SARS-CoV-2 and SARS-CoV is 73%, a significantly higher residue substitution rate was observed at the interaction interface with the ACE2 receptor.49 Out of 29 interface residues, only 10 residues (34%) in SARS-CoV-2 are conserved with respect to SARS-CoV. Similarly, only 12 residues (40%) in the SARS-CoV are conserved with respect to SARS-CoV-2. Based on these recent data, we searched for genetic variations in the interface between S-protein and ACE2. A total of seven SNVs in ACE2 were detected in the direct contact residues at the SARS-CoV-2 S-protein/hACE2 and SARS-CoV S-protein/hACE2 (table 2). There were also various SNVs in the neighbouring residues of the direct interface (online supplementary table S3).
Population-specific/population-enriched ACE2 genetic variations
At the sequence level, genetic diversity is generated by de novo transmittable variants in one individual, which then undergo natural selection and may spread through the population. Therefore, SNVs can become population-specific or population-enriched. Thus, we investigated the population-specific genetic variations in hACE2 by analysing four large population sequencing projects (NHLBI EA (US European American, n=4300) and NHLBI AA (US African-American, n=2203), ToMMo (Japan, n=4773) and UK10K (UK, n=3781); total 15 057 individuals). Out of 52 non-synonymous variants, 44 (84.6%) were detected exclusively in any of the four populations (figure 1G) and no commonly shared variants were detected except synonymous variants (online supplementary table S3). The 1000G projects also revealed the country-specific/country-enriched SNVs, although further population size accumulation is necessary (online supplementary table S3). The fraction of very rare SNVs was lower in NHLBI AA and UK10K populations compared with NHLBI EA and ToMMo populations (figure 1H).
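Identifying population-specific variants amounts to counting, for each variant, the number of populations in which it was observed. The Python sketch below uses hypothetical variant identifiers; only the final line uses figures from the text (44 of 52).

```python
from collections import Counter


def population_specificity(variants_by_population):
    """Return (exclusive, shared_by_all) sets of variant identifiers.

    A variant is exclusive when it is observed in exactly one population.
    """
    seen_in = Counter()
    for variants in variants_by_population.values():
        seen_in.update(set(variants))
    n_populations = len(variants_by_population)
    exclusive = {v for v, n in seen_in.items() if n == 1}
    shared_by_all = {v for v, n in seen_in.items() if n == n_populations}
    return exclusive, shared_by_all


# Hypothetical variant identifiers per cohort, for illustration only.
variants_by_population = {
    "NHLBI_EA": {"v1", "v2", "v3"},
    "NHLBI_AA": {"v3", "v4"},
    "ToMMo": {"v5"},
    "UK10K": {"v2", "v6"},
}
exclusive, shared_by_all = population_specificity(variants_by_population)
print(sorted(exclusive), sorted(shared_by_all))

# Check of the proportion reported in the text: 44 of 52 variants.
reported_pct = round(100 * 44 / 52, 1)
print(reported_pct)
```

A variant that appears exclusive in a small cohort may still exist at low frequency elsewhere, so the exclusive fraction depends on cohort size.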
Genetic variations in human TMPRSS2
A total of 551 SNVs were identified in coding regions and splice sites and were broadly distributed throughout the coding regions (figure 2A). A complete list of the variants is summarised in online supplementary table S5. Most identified SNVs (n=247, 70.8%) were missense, while synonymous variants occurred in 26.9% of the SNVs (n=94) (figure 2B). The remaining 2% of SNVs were stop-gained (n=2), start-loss (n=1) or splice site variants (n=2) or indels (n=3) (figure 2B). Per gene, this resulted in a dN/dS >1. The majority of these SNVs were rare (n=347, 99.4%) or quite rare (n=169, 48.4%) with AF <1% or AF <0.001%, respectively (figure 2C). Fraction deleteriousness ranged from 14.4% to 100%, and the median was 43.5% (calculated by fathmm-MKL) (figure 2D, online supplementary table S6). Many scoring models revealed that rare SNVs were estimated to be more frequently deleterious than SNVs with a high AF (table 1).
On the basis of the regression analysis, the total number of genetic variations was expected to be approximately 1.5×10³ and 5.5×10³ at a sample size of 1 and 10 million individuals, respectively (figure 2E and F). Out of 88 non-synonymous variants identified in the four large population-sequencing projects (NHLBI EA, NHLBI AA, ToMMo and UK10K; total 15 057 individuals), 73 (83.0%) were detected exclusively in any of the 4 populations (figure 2G), whereas only 2 variants were shared among the 4 populations (online supplementary table S5). There was no large difference in the fraction of rare and common variants among the four different populations (figure 2H).
In the present study, we systematically investigated genetic variations in two coronavirus-related human genes. This large-scale effort yielded a huge global reservoir of genetic polymorphisms. The growing number of genetic variants in these genes, to which this publication contributes, will increase the potential to predict the different susceptibilities or responses to SARS-CoV-2 in different populations (including genders) under similar conditions. For example, the chemokine receptor protein encoded by the CCR5 gene is best known to be the target for HIV infection, and genetic variations in CCR5 can lead to differences in the rate of HIV infection, with the best known being the CCR5-Δ32 genetic polymorphism (rs333).50 51 A previous study on SARS-CoV also has demonstrated that by altering the amino acid residues, ACE2 can be converted to an efficient/inefficient receptor for SARS-CoV and concluded that a number of ACE2 variants could affect the association between ACE2 and S-protein in SARS-CoV.35 Therefore, the present study could contribute to a better understanding of gender differences and different susceptibility and disease severity to viral infection in different ethnic groups. Furthermore, recent reports suggest that some candidate drugs do not target the coronavirus, but rather the hACE2 (eg, anti-ACE2 antibody40) or TMPRSS2 (eg, Nafamostat52 53). Therefore, the present study is also expected to be an essential step towards the translation of knowledge about genetic polymorphisms into clinical strategies that benefit the patients.
Take home messages
A total of 349 and 551 single nucleotide variations (SNVs) were detected in ACE2 and TMPRSS2, respectively, in 156 513 individuals.
The vast majority (>97%) of these SNVs were rare, population-specific and were computationally estimated to be deleterious.
The SNVs in ACE2 were distributed throughout the protein-coding regions and some were located in the severe acute respiratory syndrome coronavirus 2/human ACE2 contact residues.
The total numbers of genetic variations in ACE2 and TMPRSS2 were estimated to be 1.1×10³ and 1.5×10³, respectively, for a population of 1 million people.
Handling editor Runjan Chetty.
Contributors KF and KU designed the study, performed experiments and wrote the paper.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; internally peer reviewed.