Allele frequency estimates in admixed populations, such as Hispanics and Latinos, depend on the sample's specific admixture composition and thus may differ between two seemingly similar populations. However, ancestry-specific allele frequencies, i.e., those pertaining to the ancestral populations of an admixed group, may be particularly useful for prioritizing genetic variants for genetic discovery and personalized genomic health. We developed a method, ancestry-specific allele frequency estimation in admixed populations (AFA), to estimate the frequencies of biallelic variants in admixed populations with an unlimited number of ancestries. AFA uses maximum-likelihood estimation by modeling the conditional probability of having an allele given proportions of genetic ancestries. It can be applied using either local ancestry interval proportions encompassing the variant (local-ancestry-specific allele frequency estimations in admixed populations [LAFAs]) or global proportions of genetic ancestries (global-ancestry-specific allele frequency estimations in admixed populations [GAFAs]), which are easier to compute and are more widely available. Simulations and comparisons to existing software demonstrated the high accuracy of the method. We implemented AFA on high-quality imputed data of ∼9,000 Hispanics and Latinos from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), an understudied, admixed population with three predominant continental ancestries: Amerindian, European, and African. Comparison of the European and African estimated frequencies to the respective gnomAD frequencies demonstrated high correlations (Pearson R² = 0.97-0.99). We provide a genome-wide dataset of the estimated ancestry-specific allele frequencies for available variants with allele frequency between 5% and 95% in at least one of the three ancestral populations. Association analysis of Amerindian-enriched variants with cardiometabolic traits identified five loci associated with lipid traits in Hispanics and Latinos, demonstrating the utility of ancestry-specific allele frequencies in admixed populations.
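The maximum-likelihood idea described above can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that each of an individual's two allele copies independently carries the alternate allele with probability equal to the ancestry-weighted mixture of the ancestry-specific frequencies, and it maximizes that likelihood with a simple EM iteration; the published AFA software may use a different likelihood parameterization or optimizer. The function name and arguments are hypothetical.

```python
import numpy as np

def estimate_ancestry_specific_freqs(genotypes, ancestry, n_iter=500, tol=1e-10):
    """EM sketch for ancestry-specific allele frequencies (hypothetical helper).

    genotypes: length-n array of alternate-allele counts (0, 1, or 2).
    ancestry:  (n, K) array of ancestry proportions; each row sums to 1.
    Assumed model: each allele copy is alternate with probability sum_k q_ik * f_k.
    """
    g = np.asarray(genotypes, dtype=float)
    q = np.asarray(ancestry, dtype=float)
    f = np.full(q.shape[1], 0.5)  # starting value for every ancestry
    for _ in range(n_iter):
        # E-step: posterior ancestry of an alternate / reference allele copy
        alt = q * f
        ref = q * (1.0 - f)
        w_alt = alt / alt.sum(axis=1, keepdims=True)
        w_ref = ref / ref.sum(axis=1, keepdims=True)
        # M-step: expected alternate and reference copies attributed to each ancestry
        alt_counts = (g[:, None] * w_alt).sum(axis=0)
        ref_counts = ((2.0 - g)[:, None] * w_ref).sum(axis=0)
        f_new = np.clip(alt_counts / (alt_counts + ref_counts), 1e-9, 1 - 1e-9)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f
```

With global ancestry proportions as `ancestry`, this corresponds loosely to the GAFA setting; supplying local ancestry proportions at the variant would correspond loosely to the LAFA setting.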
Statistical Methods
Mendelian randomization has been widely used to assess the causal effect of a heritable exposure variable on an outcome of interest, using genetic variants as instrumental variables. In practice, data on the exposure variable can be incomplete due to high cost of measurement and technical limits of detection. In this paper, we propose a valid and efficient method to handle both unmeasured and undetectable values of the exposure variable in one-sample Mendelian randomization analysis with individual-level data. We estimate the causal effect of the exposure variable on the outcome using maximum likelihood estimation and develop an expectation maximization algorithm for the computation of the estimator. Simulation studies show that the proposed method performs well in making inference on the causal effect. We apply our method to the Hispanic Community Health Study/Study of Latinos, a community-based prospective cohort study, and estimate the causal effect of several metabolites on phenotypes of interest.
Linear mixed models (LMMs) are widely used in genome-wide association studies (GWASs) to account for population structure and relatedness, for both continuous and binary traits. Motivated by the failure of LMMs to control type I errors in a GWAS of asthma, a binary trait, we show that LMMs are generally inappropriate for analyzing binary traits when population stratification leads to violation of the LMM's constant-residual variance assumption. To overcome this problem, we develop a computationally efficient logistic mixed model approach for genome-wide analysis of binary traits, the generalized linear mixed model association test (GMMAT). This approach fits a logistic mixed model once per GWAS and performs score tests under the null hypothesis of no association between a binary trait and individual genetic variants. We show in simulation studies and real data analysis that GMMAT effectively controls for population structure and relatedness when analyzing binary traits in a wide variety of study designs.
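The "fit the null model once, then run a score test per variant" structure can be sketched as follows. This is a deliberate simplification: it fits an ordinary logistic regression under the null, with no random effects, so it does not account for relatedness the way GMMAT does. The function names are hypothetical, and the statistic is the standard one-degree-of-freedom score statistic for adding one variant to a logistic model.

```python
import numpy as np

def fit_null_logistic(X, y, n_iter=25):
    """Fit logistic regression of y on covariates X (with intercept column) by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-X @ beta))  # fitted null probabilities

def score_tests(G, X, y):
    """Score statistic for each column of G; approximately chi-square (1 df) under the null."""
    mu = fit_null_logistic(X, y)          # null model is fitted only once
    w = mu * (1.0 - mu)
    U = G.T @ (y - mu)                    # score for each variant
    XtWX = X.T @ (X * w[:, None])
    XtWG = X.T @ (G * w[:, None])
    GtWG = np.einsum("ij,ij->j", G, G * w[:, None])
    # variance of the score after adjusting for the covariates
    var = GtWG - np.einsum("ij,ij->j", XtWG, np.linalg.solve(XtWX, XtWG))
    return U ** 2 / var
```

Because the null fit is shared across variants, the per-variant cost is a handful of matrix-vector products, which is what makes this design attractive at genome-wide scale.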
Investigators often meta-analyze multiple genome-wide association studies (GWASs) to increase the power to detect associations of single nucleotide polymorphisms (SNPs) with a trait. Meta-analysis is also performed within a single cohort that is stratified by, e.g., sex or ancestry group. Having correlated individuals among the strata may complicate meta-analyses, limit power, and inflate type I error. For example, in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), sources of correlation include genetic relatedness, shared household, and shared community. We propose a novel mixed-effect model for meta-analysis, "MetaCor," which accounts for correlation between stratum-specific effect estimates. Simulations show that MetaCor controls inflation better than alternatives such as ignoring the correlation between the strata or analyzing all strata together in a "pooled" GWAS, especially with different minor allele frequencies (MAFs) between strata. We illustrate the benefits of MetaCor on two GWASs in the HCHS/SOL. Analysis of dental caries (tooth decay) stratified by ancestry group detected a genome-wide significant SNP (rs7791001, P-value = 3.66×10⁻⁸, compared to 4.67×10⁻⁷ in pooled), with different MAFs between strata. Stratified analysis of body mass index (BMI) by ancestry group and sex reduced overall inflation from λGC=1.050 (pooled) to λGC=1.028 (MetaCor). Furthermore, even after removing close relatives to obtain nearly uncorrelated strata, a naïve stratified analysis resulted in λGC=1.058 compared to λGC=1.027 for MetaCor.
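The λGC values quoted above are genomic-control inflation factors. A minimal sketch of the conventional definition follows, assuming that definition is the one used here: the median of the observed one-degree-of-freedom chi-square statistics divided by the median of that distribution under the null (about 0.455). The function name is hypothetical, and the input is taken to be per-SNP z-scores.

```python
import numpy as np

# Median of the chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(z_scores):
    """Genomic-control inflation factor from per-SNP z-scores (hypothetical helper)."""
    chi2 = np.asarray(z_scores, dtype=float) ** 2
    return float(np.median(chi2) / CHI2_1DF_MEDIAN)
```

A value near 1 suggests well-calibrated test statistics, while values above 1 point to inflation, for example from unmodeled correlation between individuals.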
The difference-in-differences (DID) approach is a well-known strategy for estimating the effect of an exposure in the presence of unobserved confounding. The approach is most commonly used when pre- and post-exposure outcome measurements are available, and one can assume that the association of the unobserved confounder with the outcome is equal in the two exposure groups and constant over time. Then, one recovers the treatment effect by regressing the change in outcome over time on the exposure. In this paper, we interpret the difference-in-differences as a negative outcome control (NOC) approach. We show that the pre-exposure outcome is a negative control outcome, as it cannot be influenced by the subsequent exposure, and it is affected by both observed and unobserved confounders of the exposure-outcome association of interest. The relation between DID and NOC provides simple conditions under which negative control outcomes can be used to detect and correct for confounding bias. However, for general negative control outcomes, the DID-like assumption may be overly restrictive and rarely credible, because it requires that both the outcome of interest and the control outcome are measured on the same scale. Thus, we present a scale-invariant generalization of the DID that may be used in broader NOC contexts. The proposed approach is demonstrated in simulations and on a Normative Aging Study data set, in which Body Mass Index is used for NOC of the relationship between air pollution and inflammatory outcomes.
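The classical DID step described above, regressing the change in outcome on the exposure, can be sketched in a few lines. This covers only the standard estimator, not the scale-invariant generalization proposed in the paper; the function name is hypothetical, and the sketch assumes no additional observed covariates.

```python
import numpy as np

def did_estimate(y_pre, y_post, exposure):
    """Classical DID: OLS slope of (post - pre) outcome change on exposure (hypothetical helper)."""
    change = np.asarray(y_post, dtype=float) - np.asarray(y_pre, dtype=float)
    a = np.asarray(exposure, dtype=float)
    design = np.column_stack([np.ones_like(a), a])  # intercept absorbs the common time trend
    coef, *_ = np.linalg.lstsq(design, change, rcond=None)
    return float(coef[1])
```

If an unobserved confounder shifts the pre- and post-exposure outcomes by the same amount, it cancels in the difference, so the slope recovers the exposure effect even though a regression of the post-exposure outcome alone would be biased.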
We consider variable selection for high-dimensional multivariate regression using penalized likelihoods when the number of outcomes and the number of covariates might be large. To account for within-subject correlation, we consider variable selection when a working precision matrix is used and when the precision matrix is jointly estimated using a two-stage procedure. We show that under suitable regularity conditions, penalized regression coefficient estimators are consistent for model selection for an arbitrary working precision matrix, and have the oracle properties and are efficient when the true precision matrix is used or when it is consistently estimated using sparse regression. We develop an efficient computation procedure for estimating regression coefficients using the coordinate descent algorithm in conjunction with sparse precision matrix estimation using the graphical LASSO (GLASSO) algorithm. We develop the Bayesian Information Criterion (BIC) for estimating the tuning parameter and show that BIC is consistent for model selection. We evaluate finite sample performance for the proposed method using simulation studies and illustrate its application using the type II diabetes gene expression pathway data.