Strategies for imputing missing covariate values in observational data

Carroll, O (2022) Strategies for imputing missing covariate values in observational data. PhD thesis, London School of Hygiene & Tropical Medicine. DOI: 10.17037/PUBS.04665779

Multivariable model-building is an important aspect of statistical analysis and should be given careful consideration. A common issue when conducting an analysis is the presence of partially observed covariates. Missing data in covariates can result in biased estimates of associations with the outcome and in loss of power to detect associations. The impact of missing data in the prediction context has been less studied. When a dataset is used to train a prediction model, it is essential to evaluate the model's performance. Two popular internal validation methods for evaluating a prediction model are K-fold cross-validation and using the bootstrap to correct for optimism. Methods for handling missing data in this process are not well established and will be the primary focus of this thesis. Multiple imputation, a commonly used method that replaces each missing value with a plausible value in each of multiple copies of the original dataset, will be used here to handle the various challenges that missing data pose. This thesis will assess how to combine multiple imputation with internal validation techniques in an 'ideal' and a 'pragmatic' setting. The use of two imputation models is proposed: one to impute the dataset used to estimate the coefficients of the prediction model, and the other to evaluate the prediction model. Consideration is given to data leakage, which can occur during the imputation process. The presence of missing data presents further challenges when selecting covariates and when modelling covariates flexibly. The internal validation methods will therefore be extended to include covariate selection and assessment of the functional form of continuous covariates using fractional polynomials. Finally, the methods will be demonstrated using the publicly available Rotterdam breast cancer study data. The final part of the thesis turns to the handling of missing data in studies of associations.
While methods for handling missing data in this context are well established for simple settings, extensions to deal with considerations such as functional forms, covariate selection and time-varying effects are more challenging, and it is not clear to what extent they have been used in practice. This thesis presents findings from a systematic review investigating how researchers commonly handle missing data in observational time-to-event studies. A particular focus is given to the methods researchers used to deal with unobserved values, assess the functional forms of continuous covariates and select covariates for the model of interest. Recommendations for dealing with missing values in practice while handling these complicated aspects of an analysis are given.
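
As an illustration of the data leakage point, the sketch below shows one way imputation might be nested inside K-fold cross-validation so that imputed values never draw on the held-out fold. It is a minimal sketch under stated assumptions and not the procedure developed in the thesis: the function names are hypothetical, a random draw from observed training-fold values stands in for a proper imputation model, the prediction model is a one-covariate linear regression, and averaging predictions over the imputed datasets is only one possible way to pool.

```python
import random
import statistics


def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for a single covariate."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def impute_from_donors(xs, donors, rng):
    """Replace each None with a random draw from the donor pool.

    Assumption: this simple hot-deck draw is a stand-in for a real
    imputation model; observed values are left unchanged.
    """
    return [x if x is not None else rng.choice(donors) for x in xs]


def cv_mse_with_imputation(x, y, k=5, m=10, seed=0):
    """K-fold cross-validated mean squared error with m imputations per fold."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # The donor pool uses training-fold values only, so nothing from
        # the held-out fold informs the imputations (no leakage).
        donors = [x[i] for i in train if x[i] is not None]
        preds = [0.0] * len(fold)
        for _ in range(m):
            x_train = impute_from_donors([x[i] for i in train], donors, rng)
            a, b = fit_simple_ols(x_train, [y[i] for i in train])
            x_test = impute_from_donors([x[i] for i in fold], donors, rng)
            # Assumption: pool by averaging predictions over the m imputations.
            for j, xv in enumerate(x_test):
                preds[j] += (a + b * xv) / m
        sq_errors.extend((y[i] - p) ** 2 for i, p in zip(fold, preds))
    return statistics.fmean(sq_errors)


# Simulated data (purely illustrative): about 20% of covariate values missing.
data_rng = random.Random(1)
x_full = [data_rng.gauss(0, 1) for _ in range(200)]
y = [1.0 + 2.0 * xv + data_rng.gauss(0, 0.5) for xv in x_full]
x = [None if data_rng.random() < 0.2 else xv for xv in x_full]

print(cv_mse_with_imputation(x, y))
```

Building the donor pool from the full dataset before splitting into folds would let held-out values influence the training imputations, which is the kind of leakage the abstract refers to.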

Item Type	Thesis (Doctoral)
Thesis Type	Doctoral
Thesis Name	PhD
Contributors	Keogh, R; Morris, T
Grant number	ES/P000592/1
Copyright Holders	Orlagh Carroll

Explore Further

Carroll, Orlagh

The Economic and Social Research Council

Dept of Medical Statistics

File	2022_EPH_PhD_Carroll_O.pdf
Version	Accepted Version
Licence	Available under Creative Commons: 4.0
