Strategies for imputing missing covariate values in observational data

Carroll, O; (2022) Strategies for imputing missing covariate values in observational data. PhD thesis, London School of Hygiene & Tropical Medicine. DOI: 10.17037/PUBS.04665779

Multivariable model-building is an important part of statistical analysis and deserves careful consideration. A common difficulty is the presence of partially observed covariates. Missing covariate data can lead to biased estimates of associations with the outcome and to a loss of power to detect those associations. The impact of missing data in the prediction context has been less studied. When a dataset is used to train a prediction model, it is essential to evaluate the model's performance. Two popular internal validation methods are K-fold cross-validation and bootstrap optimism correction. Methods for handling missing data within these procedures are not well established and are the primary focus of this thesis. Multiple imputation, a commonly used approach that replaces each missing value with a plausible value in each of several copies of the original dataset, is used here to address the challenges that missing data pose. The thesis assesses how to combine multiple imputation with internal validation techniques in an 'ideal' and a 'pragmatic' setting. It proposes the use of two imputation models: one to impute the dataset used to estimate the coefficients of the prediction model, and the other for evaluating the prediction model. Consideration is given to data leakage, which can occur during the imputation process. Missing data present further challenges when selecting covariates and modelling them flexibly. The internal validation methods are therefore extended to include covariate selection and assessment of the functional form of continuous covariates using fractional polynomials. The methods are demonstrated using the publicly available Rotterdam breast cancer study data.
The final part of the thesis turns to the handling of missing data in studies of associations. While methods for handling missing data in this context are well established for simple settings, extensions that accommodate functional forms, covariate selection and time-varying effects are more challenging, and it is unclear to what extent they have been used in practice. The thesis presents findings from a systematic review of how researchers commonly handle missing data in observational time-to-event studies. Particular attention is given to the methods researchers used to deal with unobserved values, assess the functional forms of continuous covariates and select covariates for the model of interest. Recommendations are given for dealing with missing values in practice while handling these more complicated aspects of an analysis.
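As a rough illustration of the data-leakage point raised in the abstract, the sketch below repeats the imputation inside every fold of a K-fold cross-validation, so that values imputed for the held-out fold are informed only by the training fold. It is not the procedure developed in the thesis. Everything in it is an assumption made for illustration: a single covariate `x` that may be missing, a fully observed outcome `y`, a simple linear model, and a random draw from the observed training values as a crude stand-in for a proper imputation model. The function names `fit_line`, `impute` and `cv_mse_with_imputation` are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b


def impute(xs, donors, rng):
    """Replace each None with a random draw from the donor values.

    Assumption: this crude draw stands in for a real imputation model.
    """
    return [rng.choice(donors) if x is None else x for x in xs]


def cv_mse_with_imputation(rows, k=5, m=10, seed=0):
    """K-fold CV in which imputation is redone inside every fold.

    rows: list of (x, y) pairs; x may be None (missing), y is observed.
    Assumes len(rows) >= k and at least one observed x per training fold.
    Returns the mean squared prediction error averaged over the k folds.
    """
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_mse = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [rows[i] for i in idx if i not in held_out]
        test = [rows[i] for i in test_idx]
        # Donor values come from the training fold only, so nothing from
        # the held-out fold leaks into the imputations.
        donors = [x for x, _ in train if x is not None]
        train_x = [x for x, _ in train]
        train_y = [y for _, y in train]
        test_x = [x for x, _ in test]
        test_y = [y for _, y in test]
        preds = [0.0] * len(test)
        for _ in range(m):
            # Fit on one imputed copy of the training fold, then predict
            # on one imputed copy of the held-out fold.
            a, b = fit_line(impute(train_x, donors, rng), train_y)
            for j, x in enumerate(impute(test_x, donors, rng)):
                preds[j] += (a + b * x) / m  # average over m imputations
        fold_mse.append(mean((p - y) ** 2 for p, y in zip(preds, test_y)))
    return mean(fold_mse)


# Simulated example data (hypothetical): about 20% of x values set missing.
data_rng = random.Random(1)
rows = []
for _ in range(200):
    x = data_rng.gauss(0, 1)
    y = 2.0 + 3.0 * x + data_rng.gauss(0, 1)
    rows.append((None if data_rng.random() < 0.2 else x, y))

print(cv_mse_with_imputation(rows))
```

Averaging predictions over the `m` imputed copies is one of several possible ways to combine results. Pooling the coefficients first and predicting once would be another, and the sketch takes no position on which is preferable.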



2022_EPH_PhD_Carroll_O.pdf
Accepted Version
Available under Creative Commons: 4.0
