# Statistical issues in Mendelian randomization: use of genetic instrumental variables for assessing causal associations

Thesis

Mendelian randomization is an epidemiological method that uses genetic variation to estimate, from observational data, the causal effect of a change in a modifiable phenotype on an outcome. A genetic variant satisfying the assumptions of an instrumental variable for the phenotype of interest can be used to divide a population into subgroups which differ systematically only in the phenotype. This gives a causal estimate which is asymptotically free of bias from confounding and reverse causation. However, the variance of the causal estimate is large compared to traditional regression methods, requiring large amounts of data and necessitating methods for efficient data synthesis. Additionally, if the association between the genetic variant and the phenotype is not strong, then in finite samples the causal estimate is biased in the direction of the observational association, a problem known as “weak instrument” bias. This bias may mislead a researcher into believing that an observed association is causal. If the causal parameter estimated is an odds ratio, then its value differs depending on whether it is viewed as a population-averaged causal effect or as a personal causal effect conditional on covariates. We introduce a Bayesian framework for instrumental variable analysis, which is less susceptible to weak instrument bias than traditional two-stage methods, has correct coverage with weak instruments, and is able to efficiently combine gene–phenotype–outcome data from multiple heterogeneous sources. Methods for imputing missing genetic data are developed, allowing multiple genetic variants to be used without reduction in sample size.
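As a rough illustration of the instrumental variable idea described above, the following Python sketch simulates a phenotype and an outcome that share an unmeasured confounder. It then compares the ordinary regression slope with the ratio (Wald) estimate, which is the gene-outcome covariance divided by the gene-phenotype covariance. The ratio estimator is a standard simple instrumental variable estimator, not the Bayesian method developed in the thesis, and all names and parameter values (`simulate`, `alpha`, `beta`, the allele frequency of 0.3) are assumptions chosen for illustration only.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=0.3, alpha=0.5, seed=1):
    """Simulate variant G, phenotype X and outcome Y (hypothetical model).

    beta  : assumed true causal effect of X on Y
    alpha : assumed per-allele effect of G on X (instrument strength)
    """
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        # Number of risk alleles (0, 1 or 2) with allele frequency 0.3.
        gi = int(rng.random() < 0.3) + int(rng.random() < 0.3)
        u = rng.gauss(0, 1)  # unmeasured confounder of X and Y
        xi = alpha * gi + u + rng.gauss(0, 1)
        yi = beta * xi + u + rng.gauss(0, 1)
        g.append(gi)
        x.append(xi)
        y.append(yi)
    return g, x, y


g, x, y = simulate()

# Observational estimate: slope of Y on X, distorted by the confounder U.
ols_estimate = cov(x, y) / cov(x, x)

# Ratio (Wald) estimate: gene-outcome association over gene-phenotype association.
iv_estimate = cov(g, y) / cov(g, x)

print(f"observational slope: {ols_estimate:.3f}")
print(f"ratio IV estimate:   {iv_estimate:.3f}")
```

With a strong instrument and a large sample, the ratio estimate should sit near the assumed causal effect while the regression slope is pulled away from it by the confounder. Reducing `alpha` or `n` makes the ratio estimate much noisier, which corresponds to the variance and weak instrument problems noted above.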
We focus on the case of a binary outcome, illustrating how the collapsing of the odds ratio over heterogeneous strata in the population means that the two-stage and the Bayesian methods estimate a population-averaged marginal causal effect. This effect is similar to that estimated by a randomized trial, but typically differs from the conditional effect estimated by standard regression methods. We show how these methods can be adjusted to give an estimate closer to the conditional effect. We apply the methods and techniques discussed to data on the causal effect of C-reactive protein on fibrinogen and coronary heart disease, concluding with an overall estimate of causal association based on the totality of available data from 42 studies.
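The difference between the conditional and the population-averaged odds ratio can be seen in a small worked example. The sketch below assumes two equally sized strata with different baseline risks and the same conditional odds ratio in each, with exposure unrelated to stratum (as in a randomized trial), so no confounding is present. The strata, intercepts and effect size are hypothetical values chosen for illustration and are not taken from the thesis.

```python
import math


def expit(z):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    return p / (1.0 - p)


# Hypothetical setup: the same conditional log odds ratio in every stratum,
# but very different baseline log-odds between the two strata.
conditional_log_or = 1.0
stratum_intercepts = [-2.0, 2.0]  # two equally sized strata


def marginal_risk(exposed):
    """Population-averaged risk when everyone has the given exposure (0 or 1)."""
    risks = [expit(a + conditional_log_or * exposed) for a in stratum_intercepts]
    return sum(risks) / len(risks)


conditional_or = math.exp(conditional_log_or)
marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))

print(f"conditional odds ratio (within each stratum): {conditional_or:.3f}")
print(f"marginal odds ratio (population-averaged):    {marginal_or:.3f}")
```

Although the odds ratio is identical within each stratum, the odds ratio computed from the averaged risks is closer to 1. This is the sense in which a marginal effect differs from a conditional one even without confounding.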