John and wayne have the exact same genes. this indicates that they must be

Common single-nucleotide polymorphisms [SNPs] are predicted to collectively explain 40–50% of phenotypic variation in human height, but identifying the specific variants and associated regions requires huge sample sizes1. Here, using data from a genome-wide association study of 5.4 million individuals of diverse ancestries, we show that 12,111 independent SNPs that are significantly associated with height account for nearly all of the common SNP-based heritability. These SNPs are clustered within 7,209 non-overlapping genomic segments with a mean size of around 90 kb, covering about 21% of the genome. The density of independent associations varies across the genome and the regions of increased density are enriched for biologically relevant genes. In out-of-sample estimation and prediction, the 12,111 SNPs [or all SNPs in the HapMap 3 panel2] account for 40% [45%] of phenotypic variance in populations of European ancestry but only around 10–20% [14–24%] in populations of other ancestries. Effect sizes, associated regions and gene prioritization are similar across ancestries, indicating that reduced prediction accuracy is likely to be explained by linkage disequilibrium and differences in allele frequency within associated regions. Finally, we show that the relevant biological pathways are detectable with smaller sample sizes than are needed to implicate causal genes and variants. Overall, this study provides a comprehensive map of specific genomic regions that contain the vast majority of common height-associated variants. Although this map is saturated for populations of European ancestry, further research is needed to achieve equivalent saturation in other ancestries.

Main

Since 2007, genome-wide association studies [GWASs] have identified thousands of associations between common SNPs and height, mainly using studies with participants of European ancestry. The largest GWAS published so far for adult height focused on common variation and reported up to 3,290 independent associations in 712 loci using a sample size of up to 700,000 individuals3. Adult height, which is highly heritable and easily measured, has provided a larger number of common genetic associations than any other human phenotype. In addition, a large collection of genes has been implicated in disorders of skeletal growth, and these are enriched in loci mapped by GWASs of height in the normal range. These features make height an attractive model trait for assessing the role of common genetic variation in defining the genetic and biological architecture of polygenic human phenotypes.

As available sample sizes continue to increase for GWASs of common variants, it becomes important to consider whether these larger samples can ‘saturate’ or nearly completely catalogue the information that can be derived from GWASs. This question of completeness can take several forms, including prediction accuracy compared with heritability attributable to common variation, the mapping of associated genomic regions that account for this heritability, and whether increasing sample sizes continue to provide additional information about the identity of prioritized genes and gene sets. Furthermore, because most GWASs continue to be performed largely in populations of European ancestry, it is necessary to address these questions of completeness in the context of multiple ancestries. Finally, some have proposed that, when sample sizes become sufficiently large, effectively every gene and genomic region will be implicated by GWASs, rather than certain subsets of genes and biological pathways being specified4.

Here, using data from 5.4 million individuals, we set out to map common genetic associations with adult height, using variants catalogued in the HapMap 3 project [HM3], and to assess the saturation of this map with respect to variants, genomic regions and likely causal genes and gene sets. We identify significant variants, examine signal density across the genome, perform out-of-sample estimation and prediction analyses within studies of individuals of European ancestry and other ancestries and prioritize genes and gene sets as likely mediators of the effects on height. We show that this set of common variants reaches predicted limits for prediction accuracy within populations of European ancestry and largely saturates both the genomic regions associated with height and broad categories of gene sets that are likely to be relevant; future work will be required to extend prediction accuracy to populations of other ancestries, to account for rarer genetic variation and to more definitively connect associated regions with individual probable causal genes and variants.

An overview of our study design and analysis strategy is provided in Extended Data Fig. 1.

Meta-analysis identifies 12,111 height-associated SNPs

We performed genetic analysis of up to 5,380,080 individuals from 281 studies from the GIANT consortium and 23andMe. Supplementary Fig. 1 represents projections of these 281 studies onto principal components reflecting differences in allele frequencies across ancestry groups in the 1000 Genomes Project [1KGP]5. Altogether, our discovery sample includes 4,080,687 participants of predominantly European ancestries [75.8% of total sample]; 472,730 participants with predominantly East Asian ancestries [8.8%]; 455,180 participants of Hispanic ethnicity with typically admixed ancestries [8.5%]; 293,593 participants of predominantly African ancestries—mostly African American individuals with admixed African and European ancestries [5.5%]; and 77,890 participants of predominantly South Asian ancestries [1.4%]. We refer to these five groups of participants or cohorts as EUR, EAS, HIS, AFR and SAS, respectively, while recognizing that these commonly used groupings oversimplify the actual genetic diversity among participants. Cohort-specific information is provided in Supplementary Tables 1–3. We tested the association between standing height and 1,385,132 autosomal bi-allelic SNPs from the HM3 tagging panel2, which contains more than 1,095,888 SNPs with a minor allele frequency [MAF] greater than 1% in each of the five ancestral groups included in our meta-analysis. Supplementary Fig. 2 shows the frequency and imputation quality distribution of HM3 SNPs across all five groups of cohorts.

We first performed separate meta-analyses in each of the five groups of cohorts. We identified 9,863, 1,511, 918, 453 and 69 quasi-independent genome-wide significant [GWS; P  0.8] with at least one variant reaching marginal genome-wide significance [PGWAS  0.5 across all comparisons [Supplementary Fig. 13]. Thus, the observed GWS height associations are substantially shared across major ancestral groups, consistent with previous studies based on smaller sample sizes13,14.

To find signals that are specific to certain groups, we tested whether any individual SNPs detected in non-EUR GWASs are conditionally independent of signals detected in EUR GWASs. We fitted an approximate joint model that includes GWS SNPs identified in EUR and non-EUR, using LD reference panels specific to each ancestry group. After excluding SNPs in strong LD [\[{r}_{{\rm{LD}}}^{2}\] > 0.8 in either ancestry group], we found that 2, 17, 49 and 63 of the GWS SNPs detected in SAS, AFR, EAS and HIS GWASs, respectively, are conditionally independent of GWS SNPs identified in EUR GWASs [Supplementary Table 9]. On average, these conditionally independent SNPs have a larger MAF and effect size in non-EUR than in EUR cohorts, which may have contributed to an increased statistical power of detection. The largest frequency difference relative to EUR was observed for rs2463169 [height-increasing G allele frequency: 23% in AFR versus 84% in EUR] within the intron of PAWR, which encodes the prostate apoptosis response-4 protein. Of note, rs2463169 is located within the 12q21.2 locus, where a strong signal of positive selection in West African Yoruba populations was previously reported15. The estimated effect at rs2463169 is β ≈ 0.034 s.d. per G allele in AFR versus β ≈ −0.002 s.d. per G allele in EUR, and the P value of marginal association in EUR is PEUR = 0.08, suggesting either a true difference in effect size or nearby causal variant[s] with differing LD to rs2463169.

Given that our results show a strong genetic overlap of GWAS signals across ancestries, we performed a fixed-effect meta-analysis of all five ancestry groups to maximize statistical power for discovering associations due to shared causal variants. The mean Cochran’s heterogeneity Q-statistic is around 34% across SNPs, which indicates moderate heterogeneity of SNP effects between ancestries. The mean chi-square association statistic in our fixed-effect meta-analysis [hereafter referred to as METAFE] is around 36, and around 18% of all HM3 SNPs are marginally GWS. Moreover, we found that allele frequencies in our METAFE were very similar to that of EUR [mean fixation index of genetic differentiation [FST] across SNPs between EUR and METAFE is around 0.001], as expected because our METAFE consists of more than 75% EUR participants and around 14% participants with admixed European and non-European ancestries that is, HIS and AFR]. To further assess whether LD in our METAFE could be reasonably approximated by the LD from EUR, we performed an LD score regression16 analysis of our METAFE using LD scores estimated in EUR. In this analysis, we focused on the attenuation ratio statistic [RLDSC-EUR], for which large values can also indicate strong LD inconsistencies between a given reference and GWAS summary statistics. A threshold of RLDSC > 20% was recommended by the authors of the LDSC software as a rule-of-thumb to detect such inconsistencies. Using EUR LD scores in the GWAS of HIS, which is the non-EUR group that is genetically closest to EUR [FST ≈ 0.02], yields an estimated RLDSC-EUR of around 25% [standard error [s.e.] 1.8%], consistent with strong LD differences between HIS and EUR. By contrast, in our METAFE, we found an estimated RLDSC-EUR of around 4.5% [s.e. 0.8%], which is significantly lower than 20% and not statistically different from 3.8% [s.e. 0.8%] in our EUR meta-analysis. Furthermore, we show in Supplementary Note 1 that using a composite LD reference containing samples from various ancestries [with proportions matching that in our METAFE] does not improve signal detection over using an EUR LD reference. Altogether, these analyses suggest that LD in our METAFE can be reasonably approximated by LD from EUR.

We therefore proceeded to identify quasi-independent GWS SNPs from the multi-ancestry meta-analysis by performing a COJO analysis of our METAFE, using genotypes from around 350,000 unrelated EUR participants in the UK Biobank [UKB] as an LD reference. We identified 12,111 quasi-independent GWS SNPs, including 9,920 [82%] primary signals with a GWS marginal effect and 2,191 secondary signals that only reached GWS in a joint regression model [Supplementary Table 10]. Figure 1 represents the relationship between frequency and joint effect sizes of minor alleles at these 12,111 associations. Of the GWS SNPs obtained from the non-EUR meta-analyses above that were conditionally independent of the EUR GWS SNPs, 0/2 in SAS, 5/17 in AFR, 27/49 in EAS and 27/63 in HIS were marginally significant in our METAFE [Supplementary Table 9], and 24 of those [highlighted in Fig. 2] overlapped with our list of 12,111 quasi-independent GWS SNPs.

Fig. 1: Relationship between frequency and estimated effect sizes of minor alleles.

Each dot represents one of the 12,111 quasi-independent GWS SNPs that were identified in our cross-ancestry GWAS meta-analysis. Data underlying this figure are available in Supplementary Table 10. SNP effect estimates [y axis] are expressed in height standard deviation [s.d.] per minor allele as defined in our cross-ancestry GWAS meta-analysis. SNPs were stratified in five classes according to their P value [P] of association. We show two curves representing the theoretical relationship between frequency and expected magnitude of SNP effect detectable at P  5] from 348,501 unrelated EUR participants in the UKB as our LD reference. For EAS, we used genotypes at 1,034,263 quality-controlled [MAF > 1%, SNP missingness  5; SNP missingness  1%, SNP missingness  1%] from n = 4,883 unrelated samples from the Hispanic Community Health Study/Study of Latinos [HCHS/SOL; dbGaP accession number: phs001395.v2.p1] cohorts. Finally, we performed a COJO analysis of the combined meta-analysis of all ancestries [referred to as METAFE in the main text] using 348,501 unrelated EUR participants in the UKB as the reference panel.

To assess whether SNPs detected in non-EUR were independent of signals detected in EUR, we performed another COJO analysis of ancestry groups GWAS by fitting jointly SNPs detected in EUR with those detected in each of the non-EUR GWAS meta-analyses. For each non-EUR GWAS, we performed a single-step COJO analysis only including SNPs identified in that non-EUR GWAS and for which the LD squared correlation [\[{r}_{{\rm{LD}}}^{2}\]] with any of the EUR signals [marginally or conditionally GWS] is lower than 0.8 in both EUR and corresponding non-EUR data. Single-step COJO analyses were performed using the --cojo-joint option of GCTA, which does not involve model selection and simply approximates a multivariate regression model in which all selected SNPs on a chromosome are fitted jointly. LD correlations used in these filters were estimated in ancestry-matched samples of the 1000 Genomes Project [1KGP; release 3]. More specifically, LD was estimated in 661 AFR, 347 HIS [referred to with the AMR label in 1KGP], 504 EAS, 503 EUR and 489 SAS 1KGP participants. We used the same LD reference samples in these analyses as for our main discovery analysis described at the beginning of the section.

F ST calculation and [stratified] LD score regression

We used two statistics to evaluate whether an EUR LD reference could approximate well enough the LD structure in our trans-ancestry GWAS meta-analysis. The first statistic that we used is the Wright fixation index57, which measures allele frequency divergence between two populations. We used the Hudson’s estimator of FST58 as previously recommended59 to compare allele frequencies from our METAFE with that from our EUR GWAS meta-analysis and an independent replication sample from the EBB. The other statistic that we used is the attenuation ratio statistic from the LD score regression methodology. These LD score regression analyses were performed using version 1.0 of the LDSC software and using LD scores calculated from EUR participants in the 1KGP [see ‘URLs’ section]. Moreover, we performed a stratified LD score regression analysis to quantify the enrichment of height heritability in 97 genomic annotations curated and described previously40. as the baseline-LD model. Annotation-weighted LD scores used for those analyses were also calculated using data from 1KGP [see ‘URLs’ section].

Density of GWS signal and enrichment near OMIM genes

We defined the density of independent signals around each GWS SNP as the number of other independent associations identified with COJO within a 100-kb window on both sides. Therefore, a SNP with no other associations within 100 kb has a density of 0, whereas a SNP colocalizing with 20 other GWS associations within 100 kb will have a density of 20. We quantified the standard error of the mean signal density across the genome using a leave-one-chromosome-out jackknife procedure. We then quantified the enrichment of 462 curated OMIM18 genes near GWS SNPs with a large signal density, by counting the number of OMIM genes within 100 kb of a GWS SNP, then comparing that number for SNPs with a density of 0 and those with a density of at least 1. The strength of the enrichment was measured using an odds ratio calculated from a 2×2 contingency table: 'presence/absence of an OMIM gene' versus 'density of 0 or larger than 0'. To assess the significance of the enrichment, we simulated the distribution of enrichment statistics for a random set of 462 length-matched genes. We used 22 length classes [2 Mb] to match OMIM genes with random genes. OMIM genes within a given length class were matched with the same number of non-OMIM genes present in the class. We sampled 1,000 random sets of genes and calculated for each them an enrichment statistic. Enrichment P value was calculated as the number of times enrichment statistics of random genes exceeded that of OMIM genes. The list of OMIM genes is provided in Supplementary Table 11.

Genomic colocalization of GWS SNPs identified across ancestries

We assessed the genomic colocalization between 2,747 GWS SNPs identified in non-EUR [Supplementary Tables 5–8] and 9,863 GWS SNPs identified in EUR [Supplementary Table 4] by quantifying the proportion of EUR GWS SNPs identified within 100 kb of any non-EUR GWS SNP. We tested the statistical significance of this proportion by comparing it with the proportion of EUR GWS SNPs identified within 100 kb of random HM3 SNPs matched with non-EUR GWS SNPs on 24 binary functional annotations39.

These 24 annotations [for example, coding or conserved] are thoroughly described in a previous study39 and were downloaded from //alkesgroup.broadinstitute.org/LDSCORE/baselineLD_v2.1_annots/.

Our matching strategy consists of three steps. First, we calibrated a statistical model to predict the probability for a given HM3 SNP to be GWS in any of our non-EUR GWAS meta-analyses as a function of their annotation. For that, we used a logistic regression of the non-EUR GWS status [1 = if the SNP is GWS in any of the non-EUR GWAS; 0 = otherwise] onto the 24 annotations as regressors. Second, we used that model to predict the probability to be GWS in non-EUR. Thirdly, we used the predicted probability to sample [with replacement] 1,000 random sets of 2,747 SNPs. Finally, we estimated the proportion of EUR GWS SNPs within 100 kb of SNPs in each sampled SNP set. We report in the main text the mean and s.d. over these 1,000 proportions.

To validate our matching strategy, we compared the mean value of each of these 24 annotations [for example, proportion of coding SNPs] between non-EUR GWS SNPs and each of the 1,000 random sets of SNPs, using a Fisher’s exact test. For each of the 24 annotations, both the mean and median P value were greater than 0.6 and the proportion of P values 

Chủ Đề