Bayesian Feature Selection via Variational Inference in Omics Data

DAV Scott (2022) Bayesian Feature Selection via Variational Inference in Omics Data. PhD thesis, London School of Hygiene & Tropical Medicine. DOI: 10.17037/PUBS.04668862


The advent of genome sequencing has dramatically changed the scale and breadth of information within biology. Omics technologies enable a single experiment to generate a very large amount of raw data on increasingly complex phenomena. These data are often high-dimensional: the number of attributes often exceeds the number of observations, and the size of the data raises questions about the efficiency of the computational approach used to estimate the model. The focus of this thesis is Bayesian feature selection in high-dimensional omics data via variational inference. Our objective is to develop and implement reliable inferential tools that scale efficiently with dimensionality. Our first algorithm uses auxiliary indicator variables to identify the compositional covariates associated with a response of interest, together with their effect sizes. This is particularly useful for data sets generated by genome sequencing technology, such as human microbiome data, as these contain information only on the relative magnitudes of the compositional components. Novel priors account for model constraints, and a Monte Carlo step, guided by the data, is introduced to estimate intractable marginal expectations. We extend the methodology to a multidimensional response, where different compositional covariates are free to be associated with different responses. This allows the relationship between the microbiome and complex phenotypes such as lipids or metabolites to be explored in one model, facilitating a systems genetics approach to understanding the flow of biological information. By reparameterising the likelihood, we are able to perform fast covariance and covariate selection despite the vast model space. A hierarchical Bayesian model is then developed for clusters of individuals who exhibit different causal pathways to the same multidimensional endpoint. Again, we reparameterise the likelihood to incorporate fast predictor and covariance selection within a large model space. We capture the different latent structures across the clusters to aid model fitting and understanding. Sparse feature selection is performed both within each expert and in the unsupervised learning of cluster detection. Our hope is that the software accompanying the methods we have outlined will be used by practitioners to develop biological understanding and insight.
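
As a minimal illustration of why compositional covariates carry only relative information, the sketch below evaluates the linear predictor of a standard log-contrast regression. It is not the thesis's algorithm or software: the sum-to-zero constraint on the coefficients is assumed here as one common form of "model constraint" for compositional data, and the function name, the indicator vector `gamma`, and all numerical values are hypothetical.

```python
import numpy as np

def log_contrast_predictor(counts, beta):
    """Linear predictor log(proportions) @ beta for a log-contrast model.

    Assumes strictly positive counts and coefficients that sum to zero
    (an assumed constraint, chosen for illustration).
    """
    counts = np.asarray(counts, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if not np.isclose(beta.sum(), 0.0):
        raise ValueError("beta must sum to zero")
    # Convert each sample (row) to proportions, discarding the total count.
    proportions = counts / counts.sum(axis=1, keepdims=True)
    return np.log(proportions) @ beta

rng = np.random.default_rng(0)
counts = rng.integers(1, 100, size=(5, 4)).astype(float)  # hypothetical positive counts

# Hypothetical auxiliary indicators: component 2 is excluded from the model.
gamma = np.array([1, 0, 1, 1])
beta = gamma * np.array([0.8, 0.0, -0.5, -0.3])  # selected effects sum to zero

# Rescale every sample by a different (hypothetical) sequencing depth.
depth = np.array([1.0, 10.0, 0.5, 3.0, 7.0])
rescaled = counts * depth[:, None]

eta_original = log_contrast_predictor(counts, beta)
eta_rescaled = log_contrast_predictor(rescaled, beta)
print(np.allclose(eta_original, eta_rescaled))
```

Because the coefficients sum to zero, multiplying a sample by any positive constant leaves the predictor unchanged, so only the relative magnitudes of the components matter. A component whose indicator is zero contributes nothing to the predictor.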