Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations

Amit V. Khera (1,2,3,4,5), Mark Chaffin (4,5), Krishna G. Aragam (1,2,3,4), Mary E. Haas (4), Carolina Roselli, Seung Hoan Choi (4), Pradeep Natarajan (2,3,4), Eric S. Lander (4), Steven A. Lubitz (2,3,4), Patrick T. Ellinor (2,3,4) and Sekar Kathiresan (1,2,3,4)*

A key public health need is to identify individuals at high risk for a given disease to enable enhanced screening or preventive therapies. Because most common diseases have a genetic component, one important approach is to stratify individuals based on inherited DNA variation [1]. Proposed clinical applications have largely focused on finding carriers of rare monogenic mutations at several-fold increased risk. Although most disease risk is polygenic in nature [2–5], it has not yet been possible to use polygenic predictors to identify individuals at risk comparable to monogenic mutations. Here, we develop and validate genome-wide polygenic scores for five common diseases. The approach identifies 8.0, 6.1, 3.5, 3.2, and 1.5% of the population at greater than threefold increased risk for coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer, respectively. For coronary artery disease, this prevalence is 20-fold higher than the carrier frequency of rare monogenic mutations conferring comparable risk [6]. We propose that it is time to contemplate the inclusion of polygenic risk prediction in clinical care, and discuss relevant issues.

For various common diseases, genes have been identified in which rare mutations confer several-fold increased risk in heterozygous carriers. An important example is the presence of a familial hypercholesterolemia mutation in 0.4% of the population, which confers an up to threefold increased risk for coronary artery disease (CAD) [6]. Aggressive treatment to lower circulating cholesterol levels among such carriers can significantly reduce risk [7].
Another example is the p.Glu508Lys missense mutation in HNF1A, with a carrier frequency of 0.1% of the general population and 0.7% of Latinos [8], which confers up to fivefold increased risk for type 2 diabetes [9]. Although the ascertainment of monogenic mutations can be highly relevant for carriers and their families, the vast majority of disease occurs in those without such mutations.

For most common diseases, polygenic inheritance, involving many common genetic variants of small effect, plays a greater role than rare monogenic mutations [2–5]. However, it has been unclear whether it is possible to create a genome-wide polygenic score (GPS) to identify individuals at clinically significantly increased risk (for example, comparable to levels conferred by rare monogenic mutations) [10,11].

Previous studies to create GPSs had only limited success, providing insufficient risk stratification for clinical utility (for example, identifying 20% of a population at 1.4-fold increased risk relative to the rest of the population) [12]. These initial efforts were hampered by three challenges: (1) the small size of initial genome-wide association studies (GWASs), which affected the precision of the estimated impact of individual variants on disease risk; (2) limited computational methods for creating GPSs; and (3) a lack of large datasets needed to validate and test GPSs.

Using much larger studies and improved algorithms, we set out to revisit the question of whether a GPS can identify subgroups of the population with risk approaching or exceeding that of a monogenic mutation. We studied five common diseases with major public health impact: CAD, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer.

For each of the diseases, we created several candidate GPSs based on summary statistics and imputation from recent large GWASs in participants of primarily European ancestry (Table 1).
Specifically, we derived 24 predictors based on a pruning and thresholding method, and 7 additional predictors using the recently described LDPred algorithm [13] (Methods, Fig. 1 and Supplementary Tables 1–6). These scores were validated and tested within the UK Biobank, which has aggregated genotype data and extensive phenotypic information on 409,258 participants of British ancestry (average age: 57 years; 55% female) [14,15].

We used an initial validation dataset of the 120,280 participants in the UK Biobank phase 1 genotype data release to select the GPSs with the best performance, defined as the maximum area under the receiver-operator curve (AUC). We then assessed the performance in an independent testing dataset comprising the 288,978 participants in the UK Biobank phase 2 genotype data release. For each disease, the discriminative capacity within the testing dataset was nearly identical to that observed in the validation dataset.

Taking CAD as an example, our polygenic predictors were derived from a GWAS involving 184,305 participants [16] and evaluated based on their ability to detect the participants in the UK Biobank validation dataset diagnosed with CAD (Table 1). The predictors had AUCs ranging from 0.79 to 0.81 in the validation set, with the best predictor (GPS_CAD) involving 6,630,150 variants (Supplementary Table 1). This predictor performed equivalently well in the testing dataset, with an AUC of 0.81.

Affiliations: 1, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA. 2, Cardiology Division of the Department of Medicine, Massachusetts General Hospital, Boston, MA, USA. 3, Harvard Medical School, Boston, MA, USA. 4, Cardiovascular Disease Initiative of the Broad Institute of Harvard and MIT, Cambridge, MA, USA. 5, These authors contributed equally: Amit V. Khera, Mark Chaffin. *e-mail: [email protected]
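The selection step described above (score every candidate GPS in the validation dataset, keep the one with the highest AUC) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's pipeline: the AUC below is the plain rank-based statistic on the raw score, whereas the study derived AUC from a logistic regression model adjusted for age, sex, genotyping array, and ancestry components. The candidate names and the toy values are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: probability that a random case outscores a random
    control, counting ties as one half. Unadjusted for covariates."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def select_best(candidates: Dict[str, Sequence[float]],
                labels: Sequence[int]) -> Tuple[str, float]:
    """Return the name and AUC of the candidate score with maximal AUC."""
    aucs = {name: auc(scores, labels) for name, scores in candidates.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs[best]


# Hypothetical validation data: 1 = diagnosed with the disease, 0 = not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = {
    "pruning_thresholding_example": [0.1, 0.4, 0.35, 0.8, 0.3, 0.6, 0.7, 0.9],
    "ldpred_example": [0.2, 0.1, 0.4, 0.3, 0.5, 0.35, 0.9, 0.8],
}

best_name, best_auc = select_best(candidates, labels)
print(best_name, best_auc)
```

In the study, the score chosen this way in the phase 1 validation dataset was then evaluated once in the independent phase 2 testing dataset, which is what guards against an optimistic AUC from choosing among 31 candidates.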

Letters (Nature Genetics)

Table 1 GPS derivation and testing for five common, complex diseases

| Disease | Discovery GWAS (n) | Prevalence in validation dataset | Prevalence in testing dataset | Polymorphisms in GPS | Tuning parameter | AUC (95% CI) in validation dataset | AUC (95% CI) in testing dataset |
|---|---|---|---|---|---|---|---|
| CAD | [...] | [...]3/120,280 (3.4%) | 8,676/288,978 (3.0%) | 6,630,150 | LDPred (ρ = 0.001) | 0.81 (0.80–0.81) | 0.81 (0.81–0.81) |
| Atrial fibrillation | 17,931 cases; 115,142 controls [30] | 2,024/120,280 (1.7%) | 4,576/288,978 (1.6%) | 6,730,541 | LDPred (ρ = 0.003) | 0.77 (0.76–0.78) | 0.77 (0.76–0.77) |
| Type 2 diabetes | 26,676 cases; 132,532 controls [31] | 2,785/120,280 (2.4%) | 5,853/288,978 (2.0%) | 6,917,436 | LDPred (ρ = 0.01) | 0.72 (0.72–0.73) | 0.73 (0.72–0.73) |
| Inflammatory bowel disease | 12,882 cases; 21,770 controls [32] | 1,360/120,280 (1.1%) | 3,102/288,978 (1.1%) | 6,907,112 | LDPred (ρ = 0.1) | 0.63 (0.62–0.65) | 0.63 (0.62–0.64) |
| Breast cancer | 122,977 cases; 105,974 controls [33] | 2,576/63,347 (4.1%) | 6,586/157,895 (4.2%) | 5,218 | Pruning and thresholding (r² < 0.2; P < 5 × 10^-4) | 0.68 (0.67–0.69) | 0.69 (0.68–0.69) |

AUC was determined using a logistic regression model adjusted for age, sex, genotyping array, and the first four principal components of ancestry. The breast cancer analysis was restricted to female participants. For the LDPred algorithm, the tuning parameter ρ reflects the proportion of polymorphisms assumed to be causal for the disease. For the pruning and thresholding strategy, r² reflects the degree of independence from other variants in the linkage disequilibrium reference panel, and P reflects the P value noted for a given variant in the discovery GWAS. CI, confidence interval.

Fig. 1 Study design and workflow. A GPS for each disease was derived by combining summary association statistics from a recent large GWAS and a linkage disequilibrium reference panel of 503 Europeans [34]. Then, 31 candidate GPSs were derived using two strategies: (1) ‘pruning and thresholding’ (that is, the aggregation of independent polymorphisms that exceeded a specified level of significance in the discovery GWAS); and (2) the LDPred computational algorithm [13], a Bayesian approach to calculate a posterior mean effect for all variants based on a prior (effect size in the previous GWAS) and subsequent shrinkage based on linkage disequilibrium. The seven candidate LDPred scores vary with respect to the tuning parameter ρ (that is, the proportion of variants assumed to be causal), as previously recommended [13]. The optimal GPS for each disease was chosen based on the AUC in the UK Biobank phase 1 validation dataset (n = 120,280 Europeans) and subsequently calculated in an independent UK Biobank phase 2 testing dataset (n = 288,978 Europeans).

We then investigated whether our polygenic predictor, GPS_CAD, could identify individuals at similar risk to the threefold increased risk conferred by a familial hypercholesterolemia mutation [6]. Across the population, GPS_CAD is normally distributed with the empirical risk of CAD rising sharply in the right tail of the distribution, from 0.8% in the lowest percentile to 11.1% in the highest percentile (Fig. 2).
The median GPS_CAD percentile score was 69 for individuals with CAD versus 49 for individuals without CAD. By analogy to the traditional analytic strategy for monogenic mutations, we defined ‘carriers’ as individuals with GPS_CAD above a given threshold and ‘non-carriers’ as all others.

Fig. 2 Risk for CAD according to GPS. a, Distribution of GPS_CAD in the UK Biobank testing dataset (n = 288,978). The x axis represents GPS_CAD, with values scaled to a mean of 0 and a standard deviation of 1 to facilitate interpretation. Shading reflects the proportion of the population with three-, four-, and fivefold increased risk versus the remainder of the population. The odds ratio was assessed in a logistic regression model adjusted for age, sex, genotyping array, and the first four principal components of ancestry. b, GPS_CAD percentile among CAD cases versus controls in the UK Biobank testing dataset. Within each boxplot, the horizontal lines reflect the median, the top and bottom of each box reflect the interquartile range, and the whiskers reflect the maximum and minimum values within each grouping. c, Prevalence of CAD according to 100 groups of the testing dataset binned according to the percentile of the GPS_CAD.

We found that 8% of the population had inherited a genetic predisposition that conferred threefold increased risk for CAD (Table 2). Strikingly, the polygenic score identified 20-fold more people at comparable or greater risk than were found by familial hypercholesterolemia mutations in previous studies [6,7]. Moreover, 2.3% of the population (‘carriers’) had inherited fourfold increased risk for CAD and 0.5% (‘carriers’) had inherited fivefold increased risk.
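The ‘carrier’ comparison lends itself to a short worked example. The Python sketch below labels the top fraction of a score distribution as carriers and computes the odds ratio for disease against everyone else. It makes a simplifying assumption that should be kept in mind: the odds ratio comes straight from a 2×2 table, with none of the covariate adjustment (age, sex, genotyping array, ancestry components) used in the study's logistic regression. The function names and toy data are hypothetical.

```python
from typing import Sequence


def tail_odds_ratio(scores: Sequence[float],
                    has_disease: Sequence[int],
                    top_fraction: float) -> float:
    """Unadjusted odds ratio for disease: 'carriers' (top `top_fraction`
    of the score distribution) versus the remainder of the population."""
    n = len(scores)
    n_carriers = max(1, round(n * top_fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    carriers = set(order[:n_carriers])
    a = sum(has_disease[i] for i in carriers)                        # carriers, diseased
    b = n_carriers - a                                               # carriers, healthy
    c = sum(has_disease[i] for i in range(n) if i not in carriers)   # rest, diseased
    d = (n - n_carriers) - c                                         # rest, healthy
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: empty cell in the 2x2 table")
    return (a * d) / (b * c)


def odds_ratio_from_prevalence(p_carriers: float, p_rest: float) -> float:
    """Odds ratio implied by disease prevalence in two groups."""
    return (p_carriers / (1.0 - p_carriers)) / (p_rest / (1.0 - p_rest))


# Hypothetical cohort of 20 people; the top 20% (4 people) are 'carriers'.
toy_scores = [float(i) for i in range(1, 21)]
toy_disease = [0] * 20
for idx in (3, 10, 18, 19):   # two cases in the tail, two in the remainder
    toy_disease[idx] = 1

toy_or = tail_odds_ratio(toy_scores, toy_disease, top_fraction=0.2)

# Prevalences reported in this article for breast cancer: 19.0% in the
# top 0.1% of the score distribution versus 4.2% in the remaining 99.9%.
breast_cancer_or = odds_ratio_from_prevalence(0.190, 0.042)
print(toy_or, round(breast_cancer_or, 2))
```

The second call shows how a prevalence contrast maps onto an odds ratio: the reported breast cancer prevalences imply an unadjusted odds ratio of about 5.35, in line with the fivefold risk category.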
GPS_CAD performed substantially better than two previously published polygenic scores for CAD that included 50 and 49,310 variants, respectively (Supplementary Table 7 and Supplementary Fig. 1) [17,18].

GPS_CAD has the advantage that it can be assessed from the time of birth, well before the discriminative capacity emerges for the risk factors (for example, hypertension or type 2 diabetes) used in clinical practice to predict CAD. Moreover, even for our middle-aged study population, practising clinicians could not identify the 8% of individuals at threefold risk based on GPS_CAD using conventional risk factors in the absence of genotype information (Supplementary Table 8). For example, conventional risk factors such as hypercholesterolemia were present in 20% of those with threefold risk based on GPS_CAD versus 13% of those in the remainder of the distribution. Hypertension was present in 32% versus 28%, and a family history of heart disease was present in 44% versus 35%, respectively. Making high GPS_CAD individuals aware of their inherited susceptibility may facilitate intensive prevention efforts. For example, we previously showed that a high polygenic risk for CAD may be offset by one of two interventions: adherence to a healthy lifestyle or cholesterol-lowering therapy with statin medications [19–21].

Our results for CAD generalized to the four other diseases: risk increased sharply in the right tail of the GPS distribution (Fig. 3). For each disease, the shape of the observed risk gradient was consistent with predicted risk based only on the GPS (Supplementary Figs. 2 and 3).

Atrial fibrillation is an underdiagnosed and often asymptomatic disorder in which an irregular heart rhythm predisposes to blood clots and is a leading cause of ischemic stroke [22].
The polygenic predictor identified 6.1% of the population at threefold risk and the top 1% had 4.63-fold risk (Tables 2 and 3). Screening for atrial fibrillation has become increasingly feasible owing to the development of ‘wearable’ device technology; these efforts to increase detection may have maximal utility in those with high GPS_AF.

Table 2 Proportion of the population at three-, four- and fivefold increased risk for each of the five common diseases

| High GPS definition | Individuals in testing dataset (n) | % of individuals |
|---|---|---|
| Odds ratio ≥ 3.0 | | |
| CAD | 23,119/288,978 | 8.0 |
| Atrial fibrillation | 17,627/288,978 | 6.1 |
| Type 2 diabetes | 10,099/288,978 | 3.5 |
| Inflammatory bowel disease | 9,209/288,978 | 3.2 |
| Breast cancer | 2,369/157,895 | 1.5 |
| Any of the five diseases | 57,115/288,978 | 19.8 |
| Odds ratio ≥ 4.0 | | |
| CAD | 6,631/288,978 | 2.3 |
| Atrial fibrillation | 4,335/288,978 | 1.5 |
| Type 2 diabetes | 578/288,978 | 0.2 |
| Inflammatory bowel disease | 2,297/288,978 | 0.8 |
| Breast cancer | 474/157,895 | 0.3 |
| Any of the five diseases | 14,029/288,978 | 4.9 |
| Odds ratio ≥ 5.0 | | |
| CAD | 1,443/288,978 | 0.5 |
| Atrial fibrillation | 2,020/288,978 | 0.7 |
| Type 2 diabetes | 144/288,978 | 0.05 |
| Inflammatory bowel disease | 571/288,978 | 0.2 |
| Breast cancer | 158/157,895 | 0.1 |
| Any of the five diseases | 4,305/288,978 | 1.5 |

For each disease, progressively more extreme tails of the GPS distribution were compared with the remainder of the population in a logistic regression model with disease status as the outcome, and age, sex, the first four principal components of ancestry, and genotyping array as predictors. The breast cancer analysis was restricted to female participants.

Fig. 3 Risk gradient for disease according to the GPS percentile. 100 groups of the testing dataset were derived according to the percentile of the disease-specific GPS. a–d, Prevalence of disease displayed for the risk of atrial fibrillation (a), type 2 diabetes (b), inflammatory bowel disease (c), and breast cancer (d) according to the GPS percentile.

Type 2 diabetes is a key driver of cardiovascular and renal disease, with rapidly increasing global prevalence [23]. The polygenic predictor identified 3.5% of the population at threefold risk and the top 1% had 3.30-fold risk (Tables 2 and 3). Both medications and an intensive lifestyle intervention have been proven to prevent progression to type 2 diabetes [24], but widespread implementation has been limited by side effects and cost, respectively. Ascertainment of those with high GPS_T2D may provide an opportunity to target such interventions with increased precision.

Inflammatory bowel disease involves chronic intestinal inflammation and often requires lifelong anti-inflammatory medications or surgery to remove afflicted segments of the intestines [25]. The polygenic predictor identified 3.2% of the population at threefold risk and the top 1% had 3.87-fold risk (Tables 2 and 3).
Although no therapies to prevent inflammatory bowel disease are currently available, ascertainment of those with increased GPS_IBD may enable enrichment of a clinical trial population to assess a novel preventive therapy.

Breast cancer is the leading cause of malignancy-related death in women. The polygenic predictor identified 1.5% of the population at threefold risk (Tables 2 and 3). Moreover, 0.1% of women had fivefold risk of breast cancer, corresponding to a breast cancer prevalence of 19.0% in this group versus 4.2% in the remaining 99.9% of the distribution. The role of screening mammograms for asymptomatic middle-aged women has remained controversial owing to a low incidence of breast cancer in this age group and a high false positive rate. Knowledge of GPS_BC may inform clinical decision making about the appropriate age to recommend screening [26].

These results show that, for a number of common diseases, polygenic risk scores can now identify a substantially larger fraction of the population than is found by rare monogenic mutations, at comparable or greater disease risk. Our validation and testing were performed in the UK Biobank population. Individuals who volunteered for the UK Biobank tended to be healthier than the general population [27]; although this non-random ascertainment is likely to deflate disease prevalence, we expect the relative impact of genetic risk strata to be generalizable across study populations. Additional studies are warranted to develop polygenic risk scores for many other common diseases with large GWAS data and validate risk estimates within population biobanks and clinical health systems.

Polygenic risk scores differ in important ways from the identification of rare monogenic risk factors. Whereas identifying carriers of rare monogenic mutations requires sequencing of specific genes and careful interpretation of the functional effects of the mutations found, polygenic scores can be readily calculated for many diseases simultaneously, based on data from a single genotyping array.
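To make concrete why one genotyping array suffices for many diseases, the following Python sketch computes a polygenic score as a weighted sum of risk-allele dosages, reusing the same genotypes with a different weight set per disease, and then rescales the scores to a mean of 0 and a standard deviation of 1 as in Fig. 2. It is a minimal sketch under assumptions: the variant identifiers, weights, and genotypes are hypothetical, variants missing from a genotype are treated as dosage 0, and a real GPS of the kind described here sums over thousands to millions of variants with weights produced by pruning and thresholding or LDPred.

```python
from statistics import mean, pstdev
from typing import Dict, List


def polygenic_score(dosages: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of allele dosages (0, 1 or 2 copies per variant).
    Variants absent from `dosages` contribute 0 (an assumption)."""
    return sum(w * dosages.get(variant, 0.0) for variant, w in weights.items())


def standardize(values: List[float]) -> List[float]:
    """Rescale to mean 0 and standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant score")
    return [(v - mu) / sd for v in values]


# Hypothetical per-disease weights (for example, log odds per allele).
weights_by_disease = {
    "CAD": {"variant_A": 0.10, "variant_B": -0.05},
    "T2D": {"variant_B": 0.20, "variant_C": 0.30},
}

# Hypothetical genotypes from a single array, shared across all diseases.
genotypes = {
    "person_1": {"variant_A": 2, "variant_B": 1, "variant_C": 0},
    "person_2": {"variant_A": 0, "variant_B": 2, "variant_C": 1},
    "person_3": {"variant_A": 1, "variant_B": 0, "variant_C": 2},
}

raw_scores = {
    disease: {person: polygenic_score(g, w) for person, g in genotypes.items()}
    for disease, w in weights_by_disease.items()
}
z_cad = standardize(list(raw_scores["CAD"].values()))
print(raw_scores)
print(z_cad)
```

Adding a further disease only requires another entry in `weights_by_disease`; the genotype data are untouched, which is the practical contrast with gene-by-gene sequencing and interpretation for monogenic mutations.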

Table 3 Prevalence and clinical impact of a high GPS

| High GPS definition | Reference group | Odds ratio | 95% CI | P value |
|---|---|---|---|---|
| CAD | | | | |
| Top 20% of distribution | Remaining 80% | 2.55 | 2.43–2.67 | < 1 × 10^-300 |
| Top 10% of distribution | Remaining 90% | 2.89 | 2.74–3.05 | < 1 × 10^-300 |
| Top 5% of distribution | Remaining 95% | 3.34 | 3.12–3.58 | 6.5 × 10^-264 |
| Top 1% of distribution | Remaining 99% | 4.83 | 4.25–5.46 | 1.0 × 10^-132 |
| Top 0.5% of distribution | Remaining 99.5% | 5.17 | 4.34–6.12 | 7.9 × 10^-78 |
| Atrial fibrillation | | | | |
| Top 20% of distribution | Remaining 80% | 2.43 | 2.29–2.59 | 2.1 × 10^-177 |
| Top 10% of distribution | Remaining 90% | 2.74 | 2.55–2.94 | 7.0 × 10^-169 |
| Top 5% of distribution | Remaining 95% | 3.22 | 2.95–3.51 | 1.1 × 10^-152 |
| Top 1% of distribution | Remaining 99% | 4.63 | 3.96–5.39 | 2.9 × 10^-84 |
| Top 0.5% of distribution | Remaining 99.5% | 5.23 | 4.24–6.39 | 3.5 × 10^-56 |
| Type 2 diabetes | | | | |
| Top 20% of distribution | Remaining 80% | 2.33 | 2.20–2.46 | 3.1 × 10^-201 |
| Top 10% of distribution | Remaining 90% | 2.49 | 2.34–2.66 | 1.2 × 10^-167 |
| Top 5% of distribution | Remaining 95% | 2.75 | 2.53–2.98 | 1.7 × 10^-130 |
| Top 1% of distribution | Remaining 99% | 3.30 | 2.81–3.85 | 1.4 × 10^-49 |
| Top 0.5% of distribution | Remaining 99.5% | 3.48 | 2.79–4.29 | 4.3 × 10^-30 |
| Inflammatory bowel disease | | | | |
| Top 20% of distribution | Remaining 80% | 2.19 | 2.03–2.36 | 7.7 × 10^-95 |
| Top 10% of distribution | Remaining 90% | 2.43 | 2.22–2.65 | 8.8 × 10^-88 |
| Top 5% of distribution | Remaining 95% | 2.66 | 2.38–2.96 | 3.0 × 10^-68 |
| Top 1% of distribution | Remaining 99% | 3.87 | 3.18–4.66 | 1.4 × 10^-43 |
| Top 0.5% of distribution | Remaining 99.5% | 4.81 | 3.74–6.08 | 9.0 × 10^-37 |
| Breast cancer | | | | |
| Top 20% of distribution | Remaining 80% | 2.07 | 1.97–2.19 | 3.4 × 10^-159 |
| Top 10% of distribution | Remaining 90% | 2.32 | 2.18–2.48 | 2.3 × 10^-148 |
| Top 5% of distribution | Remaining 95% | 2.55 | 2.35–2.76 | 2.1 × 10^-112 |
| Top 1% of distribution | Remaining 99% | 3.36 | 2.88–3.91 | 1.3 × 10^-54 |
| Top 0.5% of distribution | Remaining 99.5% | 3.83 | 3.11–4.68 | 8.2 × 10^-38 |

Odds ratios were calculated by comparing those with high GPS with the remainder of the population in a logistic regression model adjusted for age, sex, genotyping array, and the first four principal components of ancestry. The breast cancer analysis was restricted to female participants. CI, confidence interval.

In our testing dataset, 19.8% of participants were at threefold increased risk for at least 1 of the 5 diseases studied (Table 2). The potential to identify individuals at significantly higher genetic risk, across a wide range of common diseases and at any age, poses a number of opportunities and challenges for clinical medicine.

Where effective prevention or early detection strategies are available, key issues will include the allocation of attention and resources across individuals with different levels of genetic risk and the integration of genetic risk stratification with other risk factors, including rare monogenic mutations and clinical and environmental factors. Where such strategies do not exist or are suboptimal, the identification of individuals at high risk should facilitate the design of efficient natural-history studies to discover early markers of disease onset and clinical trials to test prevention strategies. In both cases, it is important to recognize that the risk associated with a high polygenic score may not reflect a single underlying mechanism, but rather the combined influence of multiple pathways [28]. Nonetheless, prevention and detection strategies may have utility regardless of the underlying mechanism, as is the case for statin therapy for CAD, blood-thinning medications to prevent stroke in those with atrial fibrillation, or intensified mammography screening for breast cancer.

[...] communication will require serious consideration. While polygenic risk scores can be simultaneously calculated at birth for all common diseases, the usefulness of the knowledge and the potential harms to the individual may vary with the disease and stage of life, from juvenile diabetes to Alzheimer’s disease. Yet, it may not be feasible or appropriate to withhold information that can be readily calculated from genetic data.
Moreover, it will be important to consider how to assess both absolute and relative risks and how to communicate these risks to best serve each patient; for example, to encourage the adoption of lifestyle modifications or disease screening.

Finally, we highlight a crucial equity issue. The polygenic risk scores described here were derived and tested in individuals of primarily European ancestry, the group in which most genetic studies have been undertaken to date. Because allele frequencies, linkage disequilibrium patterns, and effect sizes of common polymorphisms vary with ancestry, the specific GPSs here will not have optimal predictive power for other ethnic groups [29]. It will be important for the biomedical community to ensure that all ethnic groups have access to genetic risk prediction of comparable quality, which will require undertaking or expanding GWAS in non-European ethnic groups.

URLs. 1000 Genomes Phase 3, [...]3/; UK Biobank, [...]; R statistical software, [...]; PLINK 2.0, [...]; Hail, [...].

Methods, including statements of data availability and any associated accession codes and references, are available at [...].

Received: 15 February 2018; Accepted: 21 June 2018; Published: xx xx xxxx

References
1. Green, E. D. & Guyer, M. S., National Human Genome Research Institute. Charting a course for genomic medicine from base pairs to bedside. Nature 470, 204–213 (2011).
2. Fisher, R. A. The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb. 52, 399–433 (1918).
3. Gibson, G. Rare and common variants: twenty arguments. Nat. Rev. Genet. 13, 135–145 (2012).
4. Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl Acad. Sci. USA 111, E5272–E5281 (2014).
5. Fuchsberger, C. et al. The genetic architecture of type 2 diabetes. Nature 536, 41–47 (2016).
6. Abul-Husn, N. S. et al. Genetic identification of familial hypercholesterolemia within a single U.S. health care system. Science 354, pii: aaf7000 (2016).
7. Nordestgaard, B. G. et al. Familial hypercholesterolaemia is underdiagnosed and undertreated in the general population: guidance for clinicians to prevent coronary heart disease: consensus statement of the European Atherosclerosis Society. Eur. Heart J. 34, 3478–3490a (2013).
8. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
9. Estrada, K. et al. Association of a low-frequency variant in HNF1A with type 2 diabetes in a Latino population. JAMA 311, 2305–2314 (2014).
10. Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat. Genet. 45, 400–405 (2013).
11. Zhang, Y. et al. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits and implications for the future. Preprint at 75406 (2017).
12. Ripatti, S. et al. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. Lancet 376, 1393–1400 (2010).
13. Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic scores. Am. J. Hum. Genet. 97, 576–592 (2015).
14. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
15. Bycroft, C. et al. Genome-wide genetic data on 500,000 UK Biobank participants. Preprint at 66298 (2017).
16. Nikpay, M. et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).
17. Tada, H. et al. Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur. Heart J. 37, 561–567 (2016).
18. Abraham, G. et al. Genomic prediction of coronary heart disease. Eur. Heart J. 37, 3267–3278 (2016).
19. Khera, A. V. et al. Genetic risk, adherence to a healthy lifestyle, and coronary disease. N. Engl. J. Med. 375, 2349–2358 (2016).
20. Mega, J. L. et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy: an analysis of primary and secondary prevention trials. Lancet 385, 2264–2271 (2015).
21. Natarajan, P. et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation 135, 2091–2101 (2017).
22. January, C. T. et al. 2014 AHA/ACC/HRS guideline for the management of patients with atrial fibrillation: a report of the American College of Cardiology/American Heart Association Task Force on practice guidelines and the Heart Rhythm Society. Circulation 130, e199–e267 (2014).
23. GBD 2015 Disease and Injury Incidence and Prevalence Collaborators. Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet 388, 1545–1602 (2016).
24. Knowler, W. C. et al. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N. Engl. J. Med. 346, 393–403 (2002).
25. Abraham, C. & Cho, J. H. Inflammatory bowel disease. N. Engl. J. Med. 361, 2066–2078 (2009).
26. Pharoah, P. D., Antoniou, A. C., Easton, D. F. & Ponder, B. A. Polygenes, risk prediction, and targeted prevention of breast cancer. N. Engl. J. Med. 358, 2796–2803 (2008).
27. Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
28. Khera, A. V. & Kathiresan, S. Is coronary atherosclerosis one disease or many? Setting realistic expectations for precision medicine. Circulation 135, 1005–1007 (2017).
29. Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).
30. Christophersen, I. E. et al. Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat. Genet. 49, 946–952 (2017).
31. Scott, R. A. et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes 66, 2888–2902 (2017).
32. Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015).
33. Michailidou, K. et al. Association analysis identifies 65 new breast cancer risk loci. Nature 551, 92–94 (2017).
34. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

Acknowledgements
UK Biobank analyses were conducted via application 7089 using a protocol approved by the Partners HealthCare Institutional Review Board. The analysis was supported by a KL2/Catalyst Medical Research Investigator Training award from Harvard Catalyst funded by the National Institutes of Health (TR001100 to A.V.K.), a Junior Faculty Research Award from the National Lipid Association (to A.V.K.), the National Heart, Lung, and Blood Institute of the US National Institutes of Health under award numbers T32 HL007208 (to K.G.A.), K23HL114724 (to S.A.L.), R0
