Genetic variations in the human severe acute respiratory syndrome coronavirus receptor ACE2 and serine protease TMPRSS2
  1. Kohei Fujikura1,
  2. Kazuma Uesaka2
  1. 1 Department of Diagnostic Pathology, Kobe University Graduate School of Medicine, Kobe, Japan
  2. 2 Center for Gene Research, Nagoya University, Nagoya, Japan
  1. Correspondence to Dr Kohei Fujikura, Department of Pathology, Kobe City Medical Center General Hospital, Kobe 650-0017, Japan; kfuji{at}med.kobe-u.ac.jp

Abstract

Aims The recent emergence of novel, pathogenic severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) poses a global health emergency. The coronaviral entry requires the spike (S)-protein for attachment to the host cell surface, and employs human angiotensin-converting enzyme 2 (hACE2) for entry and transmembrane protease serine 2 (TMPRSS2) for S-protein priming. Although coronaviruses undergo evolution by mutating themselves, it is also essential to know the host genetic factors. Here, we describe the single nucleotide variations (SNVs) in human ACE2 and TMPRSS2.

Methods The genetic variants derived from five population-sequencing projects were classified by variant type, allele frequency (AF), ethnic group and estimated pathogenicity. The SNVs in SARS-CoV-2/hACE2 contact residues were investigated. The genetic variability was normalised using non-linear regression and the total number of SNVs was estimated by the derived formulas.

Results We detected 349 and 551 SNVs in ACE2 and TMPRSS2, respectively, in a total of 156 513 individuals. The vast majority (>97%) of the SNVs were very rare (AF <0.1%) and population-specific, and were computationally estimated to be more frequently deleterious than the SNVs with high AF. These SNVs were distributed throughout the coding regions; some ACE2 variants were located in the SARS-CoV-2/hACE2 contact residues, with a hemizygous state occurring in males. Using regression analysis, the total numbers of genetic variations in ACE2 and TMPRSS2 were estimated to be 1.1×10³ and 1.5×10³, respectively, for a population of one million people.

Conclusion The majority of SNVs in ACE2 and TMPRSS2 are rare, population-specific and deleterious, and a multitude of very rare SNVs may explain different susceptibility to SARS-CoV-2.

  • infections
  • genetics
  • viruses

This article is made freely available for use in accordance with BMJ’s website terms and conditions for the duration of the covid-19 pandemic or until otherwise determined by BMJ. You may use, download and print the article for any lawful, non-commercial purpose (including text and data mining) provided that all copyright notices and trade marks are retained.

https://bmj.com/coronavirus/usage

Introduction

Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is a novel coronavirus that was first detected in Wuhan, China, and causes COVID-19.1–4 Since the initial detection of the virus, >10 million cases of COVID-19 have been confirmed worldwide as of 8 July 2020.5 The COVID-19 outbreak has resulted in a social disorder and the collapse of the medical care system globally.6–9

Presently, some Western countries have recorded the highest rates of SARS-CoV-2 infection, the highest number of deaths and the highest mortality rates.5 These data may be due to different factors, including: 1) the total number of tests performed, 2) the possible existence of more virulent strains, 3) the structure of population, 4) the percentage of chronic illness or smokers, 5) the occupational exposure to the virus as well as 6) the differences in environmental factors (eg, temperature, humidity and air pollution).10–17 However, there may also exist some peculiar genetic characteristics of the populations that may affect susceptibility to viral infection, disease severity and the number of patients shedding huge amounts of the virus. In addition, it is suggested that COVID-19 is more likely to affect males than females, and can result in severe and even fatal respiratory diseases such as acute respiratory distress syndrome12 18–25 and mild-to-moderate gastrointestinal symptoms, such as nausea, diarrhoea or abdominal pain.26–30

The spike (S)-protein of coronaviruses facilitates viral entry into target cells.31 32 Entry depends on S-protein binding to a cellular receptor and S-protein priming by a cellular protease. The SARS-CoV S-protein engages human angiotensin-converting enzyme 2 (hACE2) as entry receptor33–35 and employs the cellular serine protease, transmembrane protease serine 2 (hTMPRSS2), for S-protein priming.36–38 The coronavirus S-protein/hACE2 interface has been elucidated and the efficiency of hACE2 usage has been shown to be a key determinant of SARS-CoV transmissibility. Importantly, a previous study has demonstrated that a number of ACE2 variants could affect the association between ACE2 and S-protein in SARS-CoV or HCoV-NL63.35 Recent reports suggested that SARS-CoV and SARS-CoV-2 share 73% amino acid identity39 and the novel SARS-CoV-2 also uses the ACE2 and TMPRSS2 for entry into target cells.40 Therefore, the genetic variation in these two genes in different populations might be also critical for the susceptibility, symptoms and outcome of SARS-CoV-2 infection. Yet, to date, a comprehensive overview of the genetic diversity of the two virus-entry-related genes is lacking.

Here, we provide the largest data set of ACE2 and TMPRSS2 gene polymorphisms, drawn from five extensive population-sequencing projects (total 156 513 individuals). The very rare SNVs we identified could contribute to a better understanding of gender differences and different susceptibilities or responses to SARS-CoV-2 in different human populations under similar conditions.

Materials and methods

Analysis of genetic variants

Data were collected from the genotyping pipelines of the 1000 Genomes (1000G) project (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/),41 the US National Heart, Lung, and Blood Institute (NHLBI) (http://www.nhlbi.nih.gov/),42 gnomAD (https://gnomad.broadinstitute.org/),43 Tohoku Medical Megabank Organization (ToMMo) (https://www.megabank.tohoku.ac.jp/english/)44 45 and UK10K (https://www.uk10k.org/),46 which consisted of high-coverage whole-genome/whole-exome sequence data from various ethnic groups. The data set consisted of 156 513 individuals from various countries (online supplementary table S1; it should be noted that gnomAD includes the 1000G data set, but not the other projects). The data set was then filtered using Variant Tools (http://varianttools.sourceforge.net/Annotation/HomePage) by variant type, allele frequency (AF), country, ethnic/racial group and pathogenicity. Information on variant types, positions and reference sequences was retrieved from NCBI dbSNP (http://www.nlm.nih.gov/SNP/).

Deleteriousness prediction methods

We comprehensively evaluated the predictive performance of 26 current deleteriousness-scoring methods, including 23 function prediction scores (SIFT, SIFT4G, PolyPhen-2-HDIV, PolyPhen-2-HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST4, MetaSVM, MetaLR, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, CADD, DANN, fathmm-MKL and GenoCanyon) and 3 conservation scores (GERP++, SiPhy and PhyloP). The scores were obtained from the dbNSFP database V.4.0.47 Note that the prediction scores obtained from the dbNSFP database had been transformed from the original prediction scores according to the threshold value (online supplementary table S2).

Domain prediction

Functional domains including transmembrane and signal peptide regions were predicted using InterPro (https://www.ebi.ac.uk/interpro/) with default options.

Non-linear regression model of ACE2 and TMPRSS2 genetic variation

The genetic variability of ACE2 and TMPRSS2 genes was normalised based on non-linear regression according to previous research.48 Normalisation allows the estimation of populations with different accumulated sample sizes. The relationship between AF and genetic variation was determined using a scatter plot. This plot showed a pattern of exponential decay, and thus, a negative exponential model was fitted. The formula was then converted and plotted against the population size as follows:

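[Equation (1), giving NACE-m: embedded image, not available in this text version]

[Equation (2), giving NACE-f: embedded image, not available in this text version]

[Equation (3), giving NTMPRSS2: embedded image, not available in this text version]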

where NACE-m, NACE-f and NTMPRSS2 correspond to the estimated number of genetic variations, x refers to the population size and R² is the coefficient of determination. The total number of genetic variations in the two genes was estimated using formulas (1), (2) and (3). Since the ACE2 gene is located on the X-chromosome, two different formulas, NACE-m (1) and NACE-f (2), were derived for males (46,XY) and females (46,XX), respectively.
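As an illustration of the fitting step described above, the following is a minimal sketch of fitting a negative exponential model y = a·exp(−b·x). It is not the authors' procedure: the function name, the synthetic data points and the log-linearised least-squares shortcut (used here in place of a true non-linear optimiser) are all assumptions made for the example, and the fitted formulas (1) to (3) themselves are not reproduced.

```python
import math


def fit_negative_exponential(xs, ys):
    """Fit y = a * exp(-b * x) by ordinary least squares on ln(y).

    Hypothetical helper: a log-linearised approximation, not a full
    non-linear regression. Requires all y values to be positive.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_x
    # Coefficient of determination, computed on the log scale.
    ss_res = sum((l - (intercept + slope * x)) ** 2 for x, l in zip(xs, logs))
    ss_tot = sum((l - mean_l) ** 2 for l in logs)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(intercept), -slope, r_squared


# Synthetic, noise-free data generated from y = 5 * exp(-0.8 * x).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [5.0 * math.exp(-0.8 * x) for x in xs]
a, b, r_squared = fit_negative_exponential(xs, ys)
```

With real, noisy data the log-scale fit weights small values more heavily than a direct non-linear fit would, so the two approaches can give somewhat different parameters.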

Statistical analysis

Statistical analysis was performed using the Mann-Whitney U test. A probability of p<0.05 was considered to be statistically significant. Statistical analyses were performed using JMP software (V.10.0; SAS Institute, Cary, North Carolina, USA).
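For readers unfamiliar with the test, the sketch below computes the Mann-Whitney U statistic directly from its pairwise definition. The analysis in this study was run in JMP; the function name and the toy allele frequencies here are hypothetical, and the sketch stops at the U statistic without deriving a p value.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two samples.

    U_a counts the pairs in which a value from sample_a exceeds a value
    from sample_b; tied pairs contribute 0.5. U_a + U_b = len(a) * len(b).
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical allele frequencies (%) for two groups of variants.
deleterious_af = [0.001, 0.002, 0.004]
tolerated_af = [0.003, 0.05, 0.2, 0.9]
u_del, u_tol = mann_whitney_u(deleterious_af, tolerated_af)
```

A small U for one group means its values tend to rank below those of the other group, which is the pattern expected if putatively deleterious variants have lower AFs.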

Results

Genetic variations in human ACE2

In order to obtain a comprehensive overview of the genetic diversity of ACE2 and TMPRSS2, we collected SNV data sets from five extensive population-sequencing projects (1000G, NHLBI, gnomAD, ToMMo and UK10K). After removing the overlapping individuals, the data sets included genome/exome sequences of 156 513 individuals from diverse ethnic origins (online supplementary table S1).

The genetic diversity of ACE2 and TMPRSS2 is summarised in figures 1 and 2, respectively. ACE2 is located on the X-chromosome, which raises the possibility that differences in sex chromosome dosage (46,XY vs 46,XX) could cause the phenotype to be always expressed in males. In ACE2, 349 SNVs were identified in the coding regions and splice sites, and were broadly distributed throughout the coding regions (figure 1A). A complete list of the variants can be found in online supplementary table S3. Most of the identified SNVs (n=247, 70.8%) were missense, while synonymous variants accounted for 26.9% of the SNVs (n=94) (figure 1B). The remaining 2% of SNVs were stop-gained (n=2), start-loss (n=1), splice site variants (n=2) or indels (n=3) (figure 1B). Per gene, this resulted in a non-synonymous/synonymous variant ratio (dN/dS) >1. The majority of these SNVs were rare (n=347, 99.4%) or quite rare (n=169, 48.4%) with AF <1% or AF <0.001%, respectively (figure 1C).
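The percentages quoted for ACE2 can be checked from the reported counts. The short sketch below only re-derives the shares from the numbers given in the text; the dictionary layout and variable names are illustrative.

```python
# Variant counts for ACE2 as reported in the text.
ace2_counts = {
    "missense": 247,
    "synonymous": 94,
    "stop-gained": 2,
    "start-loss": 1,
    "splice site": 2,
    "indel": 3,
}

total = sum(ace2_counts.values())

# Share of each variant type, in per cent, rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in ace2_counts.items()}

# Everything that is neither missense nor synonymous.
remaining = total - ace2_counts["missense"] - ace2_counts["synonymous"]
remaining_share = round(100 * remaining / total, 1)
```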

Figure 1

Genetic variation in ACE2. (A) The distribution of nucleotide polymorphisms along the full-length ACE2 gene. The vertical bar indicates allele frequency (AF) (%). Single nucleotide variations (SNVs) are grouped by type: missense (blue), stop-gained (orange), start-lost (grey), indel (yellow) and splice site (green). The putative functional domains are depicted by coloured boxes. (B) Pie chart of 349 SNVs in ACE2. Each colour code corresponds to a different SNV type. (C) Relative abundance of SNVs plotted over their AFs. SNVs are grouped by type: missense (blue), synonymous (orange), other non-synonymous (grey) and indel (yellow). The inset demonstrates that the majority of SNVs in coding regions were quite rare (48.4%, AF <0.001%), while 90.3% were classified as rare (0.001% < AF <0.01%) and only 9.7% had low frequency or were common. (D) Fraction of variants predicted to be deleterious (%). Twenty-three software tools were employed to predict the pathogenicity of missense variants. (E) The non-linear regression fitting was performed based on the scatter plot showing the relationship between AF and variation of SNVs. (F) Relationship between total number of SNVs and population size. The ACE2 variations are expected to rise as the population size increases. A higher number of rare SNVs could be detected in females compared with males. An enlarged view of the graph is shown in the upper panel. (G) Chow-Ruskey diagrams showing the number of shared and unique genetic variants for ACE2 genes across four large-scale population studies. (H) Comparison of AFs across four large-scale population studies. ACE2, angiotensin-converting enzyme 2; HEMGH, metalloprotease zinc-binding site; NHLBI, National Heart, Lung, and Blood Institute; TM, transmembrane domain; ToMMo, Tohoku Medical Megabank Organization.

Figure 2

Genetic variation in TMPRSS2. (A) The distribution of nucleotide polymorphisms along the full-length TMPRSS2 gene. The vertical bar indicates allele frequency (AF) (%). Single nucleotide variations (SNVs) are grouped by type: missense (blue), stop-gained (yellow), stop-lost (orange), splice site (green) and indel (grey). The putative functional domains are depicted by coloured boxes. (B) Pie chart of 551 SNVs in TMPRSS2. Each colour code corresponds to a different SNV type. (C) Relative abundance of SNVs plotted over their AFs. SNVs are grouped by type: missense (blue), synonymous (orange), other non-synonymous (grey) and indel (yellow). The inset demonstrates that the majority of SNVs in coding regions were quite rare (49.9%, AF <0.001%), while 87.3% were classified as rare (0.001% < AF <0.01%) and only 12.7% had low frequency or were common. (D) Fraction of variants predicted to be deleterious (%). Twenty-three software tools were employed to predict the pathogenicity of missense variants. (E) The non-linear regression fitting was performed based on the scatter plot showing the relationship between AF and variation of SNVs. (F) Relationship between total number of SNVs and population size. The TMPRSS2 variations are expected to rise as the population size increases. An enlarged view of the graph is shown in the upper panel. (G) Chow-Ruskey diagrams showing the number of shared and unique genetic variants for TMPRSS2 genes across four large-scale population studies. (H) Comparison of AFs across four large-scale population studies. ACE2, angiotensin-converting enzyme 2; LDLRA, low-density lipoprotein receptor A domain; NHLBI, National Heart, Lung, and Blood Institute; SRCR, scavenger receptor cysteine-rich domain; TM, transmembrane domain; TMPRSS2, transmembrane protease serine 2; ToMMo, Tohoku Medical Megabank Organization.

In silico prediction of the functional effects of SNVs in human ACE2 gene

At present, the majority of the ACE2 variants have not been functionally characterised. To obtain insights into the likelihood of functionally deleterious effects of the identified variants, we comprehensively evaluated the predictive performance of 26 current deleteriousness-scoring methods, including 23 function prediction scores (SIFT, SIFT4G, PolyPhen-2-HDIV, PolyPhen-2-HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST4, MetaSVM, MetaLR, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, CADD, DANN, fathmm-MKL and GenoCanyon) and 3 conservation scores (GERP++, SiPhy and PhyloP) (online supplementary table S4). The fraction of variants predicted to be deleterious ranged from 0% to 90.3%, possibly due to the differences in information and algorithms used for this prediction, but the median was 46.8% (calculated by LRT) (figure 1D). Importantly, rare SNVs were estimated to be more frequently deleterious than the SNVs with a high AF by some scoring methods (table 1).
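The summary statistic used above (the fraction of variants each method calls deleterious, and the median of those fractions across methods) can be sketched as follows. The tool names and the per-variant calls are hypothetical placeholders, not values from dbNSFP or from the supplementary tables.

```python
from statistics import median

# Hypothetical per-tool calls for four missense variants:
# True means the tool predicts the variant to be deleterious.
calls = {
    "tool_A": [True, False, True, True],
    "tool_B": [False, False, True, False],
    "tool_C": [True, True, True, True],
}

# Fraction of variants called deleterious by each tool, in per cent.
fraction_deleterious = {
    tool: 100 * sum(preds) / len(preds) for tool, preds in calls.items()
}

# Median of the per-tool fractions.
median_fraction = median(fraction_deleterious.values())
```

In practice each tool has its own score scale and threshold, so the True/False calls would first have to be derived from the tool-specific cut-offs.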

Table 1

Comparison of allele frequency between putative deleterious variants and putative tolerated variants

Estimation of the genetic variability of ACE2 in populations

To predict the number of overall ACE2 genetic variants in populations, we calculated the normalised genetic variability using non-linear regression (figure 1E). The observed relationship between AF and genetic variations exhibited a pattern of exponential decay (figure 1E). A negative exponential model was fitted to these decay curves and then evaluated for unseen ACE2 variations (figure 1E). Since allele count is different between males (46,XY) and females (46,XX), the per-individual genetic variability was estimated to be significantly higher in females than that in males regardless of population size (figure 1F). The total number of genetic variations in ACE2 genes was expected to be approximately 4.2×10³ for male populations and 6.8×10³ for female populations (average, 5.5×10³) for a sample size of 10 million individuals, indicating that variants described to date constitute only a small fraction of genetic variability in ACE2 genes present on a population scale.
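The effect of X-chromosome dosage on variant discovery can be illustrated with a simple sampling argument: a sample of N females carries 2N copies of ACE2, whereas N males carry only N. The sketch below is not the regression model used in this study; it assumes independently sampled chromosomes, and the function name, the allele frequency and the sample size are hypothetical.

```python
def prob_variant_observed(allele_frequency, n_individuals, x_copies):
    """Probability that a variant is seen at least once in a sample.

    Assumes each X chromosome is an independent draw carrying the
    variant with probability `allele_frequency` (an illustrative
    simplification).
    """
    n_chromosomes = n_individuals * x_copies
    return 1.0 - (1.0 - allele_frequency) ** n_chromosomes


af = 1e-5      # hypothetical very rare variant
n = 10_000     # hypothetical number of individuals per sex

p_male = prob_variant_observed(af, n, x_copies=1)    # 46,XY: one X
p_female = prob_variant_observed(af, n, x_copies=2)  # 46,XX: two X
```

Under these assumptions the same rare variant is nearly twice as likely to be observed in a female sample as in a male sample of equal size.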

Genetic variants at the interface between the SARS-CoV-2 S-protein and human ACE2

A recent report suggested that while the sequence identity between the S-protein of SARS-CoV-2 and SARS-CoV is 73%, a significantly higher residue substitution rate was observed at the interaction interface with the ACE2 receptor.49 Out of 29 interface residues, only 10 residues (34%) in SARS-CoV-2 are conserved with respect to SARS-CoV. Similarly, only 12 residues (40%) in the SARS-CoV are conserved with respect to SARS-CoV-2. Based on these recent data, we searched for genetic variations in the interface between S-protein and ACE2. A total of seven SNVs in ACE2 were detected in the direct contact residues at the SARS-CoV-2 S-protein/hACE2 and SARS-CoV S-protein/hACE2 (table 2). There were also various SNVs in the neighbouring residues of the direct interface (online supplementary table S3).

Table 2

Genetic variations in hACE2 detected in the interface between SARS-CoV/SARS-CoV-2 S-protein and hACE2

Population-specific/population-enriched ACE2 genetic variations

At the sequence level, genetic diversity is generated by de novo transmittable variants in one individual, which then undergo natural selection and may spread through the population. Therefore, SNVs can become population-specific or population-enriched. Thus, we investigated the population-specific genetic variations in hACE2 by analysing four large population sequencing projects (NHLBI EA (US European American, n=4300) and NHLBI AA (US African-American, n=2203), ToMMo (Japan, n=4773) and UK10K (UK, n=3781); total 15 057 individuals). Out of 52 non-synonymous variants, 44 (84.6%) were detected exclusively in any of the four populations (figure 1G) and no commonly shared variants were detected except synonymous variants (online supplementary table S3). The 1000G projects also revealed the country-specific/country-enriched SNVs, although further population size accumulation is necessary (online supplementary table S3). The fraction of very rare SNVs was lower in NHLBI AA and UK10K populations compared with NHLBI EA and ToMMo populations (figure 1H).
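Counting population-exclusive variants of this kind amounts to set arithmetic over the variant lists of each cohort. The following minimal sketch uses made-up variant identifiers (v1 to v6) and a hypothetical helper name; only the cohort labels come from the text.

```python
def exclusive_variants(populations):
    """Map each population to the variants found in no other population."""
    exclusive = {}
    for name, variants in populations.items():
        others = set().union(
            *(v for other, v in populations.items() if other != name)
        )
        exclusive[name] = variants - others
    return exclusive


# Hypothetical variant identifiers per cohort.
populations = {
    "NHLBI_EA": {"v1", "v2", "v5"},
    "NHLBI_AA": {"v2", "v3"},
    "ToMMo": {"v4"},
    "UK10K": {"v2", "v6"},
}

exclusive = exclusive_variants(populations)
all_variants = set().union(*populations.values())
n_exclusive = sum(len(v) for v in exclusive.values())
shared_by_all = set.intersection(*populations.values())
```

Here five of the six toy variants are exclusive to a single cohort and none is shared by all four, mirroring the kind of summary reported for figure 1G.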

Genetic variations in human TMPRSS2

A total of 551 SNVs were identified in coding regions and splice sites and were broadly distributed throughout the coding regions (figure 2A). A complete list of the variants is summarised in online supplementary table S5. Most identified SNVs (n=247, 70.8%) were missense, while synonymous variants occurred in 26.9% of the SNVs (n=94) (figure 2B). The remaining 2% of SNVs were stop-gained (n=2), start-loss (n=1) or splice site variants (n=2) or indels (n=3) (figure 2B). Per gene, this resulted in a dN/dS >1. The majority of these SNVs were rare (n=347, 99.4%) or quite rare (n=169, 48.4%) with AF <1% or AF <0.001%, respectively (figure 2C). Fraction deleteriousness ranged from 14.4% to 100%, and the median was 43.5% (calculated by fathmm-MKL) (figure 2D, online supplementary table S6). Many scoring models revealed that rare SNVs were estimated to be more frequently deleterious than SNVs with a high AF (table 1).

On the basis of the regression analysis, the total number of genetic variations was expected to be approximately 1.5×10³ and 5.5×10³ at a sample size of 1 and 10 million individuals, respectively (figure 2E and F). Out of 88 non-synonymous variants identified in the four large population-sequencing projects (NHLBI EA, NHLBI AA, ToMMo and UK10K; total 15 057 individuals), 73 (83.0%) were detected exclusively in any of the 4 populations (figure 2G), whereas only 2 variants were shared among the 4 populations (online supplementary table S5). There was no large difference in the fraction of rare and common variants among the four different populations (figure 2H).

Discussion

In the present study, we systematically investigated genetic variations in two coronavirus-related human genes. This large-scale effort yielded a huge global reservoir of genetic polymorphisms. The growing number of genetic variants in these genes, to which this publication contributes, will increase the potential to predict the different susceptibilities or responses to SARS-CoV-2 in different populations (including genders) under similar conditions. For example, the chemokine receptor protein encoded by the CCR5 gene is best known to be the target for HIV infection, and genetic variations in CCR5 can lead to differences in the rate of HIV infection, with the best known being the CCR5-Δ32 genetic polymorphism (rs333).50 51 A previous study on SARS-CoV also has demonstrated that by altering the amino acid residues, ACE2 can be converted to an efficient/inefficient receptor for SARS-CoV and concluded that a number of ACE2 variants could affect the association between ACE2 and S-protein in SARS-CoV.35 Therefore, the present study could contribute to a better understanding of gender differences and different susceptibility and disease severity to viral infection in different ethnic groups. Furthermore, recent reports suggest that some candidate drugs do not target the coronavirus, but rather the hACE2 (eg, anti-ACE2 antibody40) or TMPRSS2 (eg, Nafamostat52 53). Therefore, the present study is also expected to be an essential step towards the translation of knowledge about genetic polymorphisms into clinical strategies that benefit the patients.

Take home messages

  • A total of 349 and 551 single nucleotide variations (SNVs) were detected in ACE2 and TMPRSS2, respectively, in 156 513 individuals.

  • The vast majority (>97%) of these SNVs were rare, population-specific and were computationally estimated to be deleterious.

  • The SNVs in ACE2 were distributed throughout the protein-coding regions and some were located in the severe acute respiratory syndrome coronavirus 2/human ACE2 contact residues.

  • The total numbers of genetic variations in ACE2 and TMPRSS2 were estimated to be 1.1×10³ and 1.5×10³, respectively, for a population of 1 million people.


Footnotes

  • Handling editor Runjan Chetty.

  • Contributors KF and KU designed the study, performed experiments and wrote the paper.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; internally peer reviewed.

  • Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information.
