All

2017

Day F, Thompson D, Helgason H, Chasman D, Finucane H, Sulem P, Ruth K, Whalen S, Sarkar A, Albrecht E, Altmaier E, Amini M, Barbieri C, Boutin T, Campbell A, Demerath E, Giri A, He C, Hottenga J, Karlsson R, Kolcic I, Loh PR, . , Murray A, Murabito J, Stefansson K, Ong K, Perry J. Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk. Nat Genet. 2017;49(6):834–841.
The timing of puberty is a highly polygenic childhood trait that is epidemiologically associated with various adult diseases. Using 1000 Genomes Project-imputed genotype data in up to ∼370,000 women, we identify 389 independent signals (P < 5 × 10(-8)) for age at menarche, a milestone in female pubertal development. In Icelandic data, these signals explain ∼7.4% of the population variance in age at menarche, corresponding to ∼25% of the estimated heritability. We implicate ∼250 genes via coding variation or associated expression, demonstrating significant enrichment in neural tissues. Rare variants near the imprinted genes MKRN3 and DLK1 were identified, exhibiting large effects when paternally inherited. Mendelian randomization analyses suggest causal inverse associations, independent of body mass index (BMI), between puberty timing and risks for breast and endometrial cancers in women and prostate cancer in men. In aggregate, our findings highlight the complexity of the genetic regulation of puberty timing and support causal links with cancer susceptibility.
Hill A, Loh PR, Bharadwaj R, Pons P, Shang J, Guinan E, Lakhani K, Kilty I, Jelinsky S. Stepwise Distributed Open Innovation Contests for Software Development: Acceleration of Genome-Wide Association Analysis. Gigascience. 2017;6(5):1–10.
Background: The association of differing genotypes with disease-related phenotypic traits offers great potential to both help identify new therapeutic targets and support stratification of patients who would gain the greatest benefit from specific drug classes. Development of low-cost genotyping and sequencing has made collecting large-scale genotyping data routine in population and therapeutic intervention studies. In addition, a range of new technologies is being used to capture numerous new and complex phenotypic descriptors. As a result, genotype and phenotype datasets have grown exponentially. Genome-wide association studies associate genotypes and phenotypes using methods such as logistic regression. As existing tools for association analysis limit the efficiency by which value can be extracted from increasing volumes of data, there is a pressing need for new software tools that can accelerate association analyses on large genotype-phenotype datasets. Results: Using open innovation (OI) and contest-based crowdsourcing, the logistic regression analysis in a leading, community-standard genetics software package (PLINK 1.07) was substantially accelerated. OI allowed us to do this in <6 months by providing rapid access to highly skilled programmers with specialized, difficult-to-find skill sets. Through a crowd-based contest a combination of computational, numeric, and algorithmic approaches was identified that accelerated the logistic regression in PLINK 1.07 by 18- to 45-fold. Combining contest-derived logistic regression code with coarse-grained parallelization, multithreading, and associated changes to data initialization code further developed through distributed innovation, we achieved an end-to-end speedup of 591-fold for a data set size of 6678 subjects by 645 863 variants, compared to PLINK 1.07's logistic regression. This represents a reduction in run time from 4.8 hours to 29 seconds. Accelerated logistic regression code developed in this project has been incorporated into the PLINK2 project. Conclusions: Using iterative competition-based OI, we have developed a new, faster implementation of logistic regression for genome-wide association studies analysis. We present lessons learned and recommendations on running a successful OI process for bioinformatics.
Hayeck T, Loh PR, Pollack S, Gusev A, Patterson N, Zaitlen N, Price A. Mixed Model Association with Family-Biased Case-Control Ascertainment. Am J Hum Genet. 2017;100(1):31–39.
Mixed models have become the tool of choice for genetic association studies; however, standard mixed model methods may be poorly calibrated or underpowered under family sampling bias and/or case-control ascertainment. Previously, we introduced a liability threshold-based mixed model association statistic (LTMLM) to address case-control ascertainment in unrelated samples. Here, we consider family-biased case-control ascertainment, where case and control subjects are ascertained non-randomly with respect to family relatedness. Previous work has shown that this type of ascertainment can severely bias heritability estimates; we show here that it also impacts mixed model association statistics. We introduce a family-based association statistic (LT-Fam) that is robust to this problem. Similar to LTMLM, LT-Fam is computed from posterior mean liabilities (PML) under a liability threshold model; however, LT-Fam uses published narrow-sense heritability estimates to avoid the problem of biased heritability estimation, enabling correct calibration. In simulations with family-biased case-control ascertainment, LT-Fam was correctly calibrated (average χ(2) = 1.00-1.02 for null SNPs), whereas the Armitage trend test (ATT), standard mixed model association (MLM), and case-control retrospective association test (CARAT) were mis-calibrated (e.g., average χ(2) = 0.50-1.22 for MLM, 0.89-2.65 for CARAT). LT-Fam also attained higher power than other methods in some settings. In 1,259 type 2 diabetes-affected case subjects and 5,765 control subjects from the CARe cohort, downsampled to induce family-biased ascertainment, LT-Fam was correctly calibrated whereas ATT, MLM, and CARAT were again mis-calibrated. Our results highlight the importance of modeling family sampling bias in case-control datasets with related samples.

2016

Galinsky K, Loh PR, Mallick S, Patterson N, Price A. Population Structure of UK Biobank and Ancient Eurasians Reveals Adaptation at Genes Influencing Blood Pressure. Am J Hum Genet. 2016;99(5):1130–1139.
Analyzing genetic differences between closely related populations can be a powerful way to detect recent adaptation. The very large sample size of the UK Biobank is ideal for using population differentiation to detect selection and enables an analysis of the UK population structure at fine resolution. In this study, analyses of 113,851 UK Biobank samples showed that population structure in the UK is dominated by five principal components (PCs) spanning six clusters: Northern Ireland, Scotland, northern England, southern England, and two Welsh clusters. Analyses of ancient Eurasians revealed that populations in the northern UK have higher levels of Steppe ancestry and that UK population structure cannot be explained as a simple mixture of Celts and Saxons. A scan for unusual population differentiation along the top PCs identified a genome-wide-significant signal of selection at the coding variant rs601338 in FUT2 (p = 9.16 × 10(-9)). In addition, by combining evidence of unusual differentiation within the UK with evidence from ancient Eurasians, we identified genome-wide-significant (p = 5 × 10(-8)) signals of recent selection at two additional loci: CYP1A2-CSK and F12. We detected strong associations between diastolic blood pressure in the UK Biobank and both the variants with selection signals at CYP1A2-CSK (p = 1.10 × 10(-19)) and the variants with ancient Eurasian selection signals at the ATXN2-SH2B3 locus (p = 8.00 × 10(-33)), implicating recent adaptation related to blood pressure.
Loh PR, Danecek P, Palamara PF, Fuchsberger C, A Reshef Y, Finucane H, Schoenherr S, Forer L, McCarthy S, Abecasis G, Durbin R, Price A. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 2016;48(11):1443–1448.
Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
Horikoshi M, Beaumont R, Day F, Warrington N, Kooijman M, Fernandez-Tajes J, Feenstra B, Zuydam N, Gaulton K, Grarup N, Bradfield J, Strachan D, Li-Gao R, Ahluwalia T, Kreiner E, Rueedi R, Lyytikäinen LP, Cousminer D, Wu Y, Thiering E, Wang C, Have C, Hottenga JJ, Vilor-Tejedor N, Joshi P, Boh ETH, Ntalla I, Pitkänen N, Mahajan A, Leeuwen E, Joro R, Lagou V, Nodzenski M, Diver L, Zondervan K, Bustamante M, Marques-Vidal P, Mercader J, Bennett A, Rahmioglu N, Nyholt D, Ma RCW, Tam CHT, Tam WH, CHARGE Consortium Hematology Working Group, Ganesh S, Rooij F, Jones S, Loh PR, . , Timpson N, Perry J, Evans D, McCarthy M, Freathy R. Genome-wide associations for birth weight and correlations with adult disease. Nature. 2016;538(7624):248–252.
Birth weight (BW) has been shown to be influenced by both fetal and maternal factors and in observational studies is reproducibly associated with future risk of adult metabolic diseases including type 2 diabetes (T2D) and cardiovascular disease. These life-course associations have often been attributed to the impact of an adverse early life environment. Here, we performed a multi-ancestry genome-wide association study (GWAS) meta-analysis of BW in 153,781 individuals, identifying 60 loci where fetal genotype was associated with BW (P < 5 × 10(-8)). Overall, approximately 15% of variance in BW was captured by assays of fetal genetic variation. Using genetic association alone, we found strong inverse genetic correlations between BW and systolic blood pressure (Rg = -0.22, P = 5.5 × 10(-13)), T2D (Rg = -0.27, P = 1.1 × 10(-6)) and coronary artery disease (Rg = -0.30, P = 6.5 × 10(-9)). In addition, using large -cohort datasets, we demonstrated that genetic factors were the major contributor to the negative covariance between BW and future cardiometabolic risk. Pathway analyses indicated that the protein products of genes within BW-associated regions were enriched for diverse processes including insulin signalling, glucose homeostasis, glycogen biosynthesis and chromatin remodelling. There was also enrichment of associations with BW in known imprinted regions (P = 1.9 × 10(-4)). We demonstrate that life-course associations between early growth phenotypes and adult cardiometabolic disease are in part the result of shared genetic effects and identify some of the pathways through which these causal genetic effects are mediated.
Das S, Forer L, Schönherr S, Sidore C, Locke A, Kwong A, Vrieze S, Chew E, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono W, Swaroop A, Scott L, Cucca F, Kronenberg F, Boehnke M, Abecasis G, Fuchsberger C. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–1287.
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
Loh PR, Palamara PF, Price A. Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet. 2016;48(7):811–6.
Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
Day F, Helgason H, Chasman D, Rose L, Loh PR, Scott R, Helgason A, Kong A, Masson G, Magnusson OT, Gudbjartsson D, Thorsteinsdottir U, Buring J, Ridker P, Sulem P, Stefansson K, Ong K, Perry J. Physical and neurobehavioral determinants of reproductive onset and success. Nat Genet. 2016;48(6):617–623.
The ages of puberty, first sexual intercourse and first birth signify the onset of reproductive ability, behavior and success, respectively. In a genome-wide association study of 125,667 UK Biobank participants, we identify 38 loci associated (P < 5 × 10(-8)) with age at first sexual intercourse. These findings were taken forward in 241,910 men and women from Iceland and 20,187 women from the Women's Genome Health Study. Several of the identified loci also exhibit associations (P < 5 × 10(-8)) with other reproductive and behavioral traits, including age at first birth (variants in or near ESR1 and RBM6-SEMA3F), number of children (CADM2 and ESR1), irritable temperament (MSRA) and risk-taking propensity (CADM2). Mendelian randomization analyses infer causal influences of earlier puberty timing on earlier first sexual intercourse, earlier first birth and lower educational attainment. In turn, likely causal consequences of earlier first sexual intercourse include reproductive, educational, psychiatric and cardiometabolic outcomes.
Galinsky K, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson N, Price A. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016;98(3):456–72.
Searching for genetic variants with unusual differentiation between subpopulations is an established approach for identifying signals of natural selection. However, existing methods generally require discrete subpopulations. We introduce a method that infers selection using principal components (PCs) by identifying variants whose differentiation along top PCs is significantly greater than the null distribution of genetic drift. To enable the application of this method to large datasets, we developed the FastPCA software, which employs recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using the PC-based test for natural selection, we replicate previously known selected loci and identify three new genome-wide significant signals of selection, including selection in Europeans at ADH1B. The coding variant rs1229984(∗)T has previously been associated to a decreased risk of alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents. We also detect selection signals at IGFBP3 and IGH, which have also previously been associated to human disease.