Genetic sequence data are well described by hidden Markov models (HMMs) in which latent states correspond to clusters of similar mutation patterns. Theory from statistical genetics suggests that these HMMs are nonhomogeneous (their transition probabilities vary along the chromosome) and have large support for self transitions. We develop a new nonparametric model of genetic sequence data, based on the hierarchical Dirichlet process, which supports these self transitions and nonhomogeneity. Our model provides a parameterization of the genetic process that is more parsimonious than the more general nonparametric models previously applied to population genetics. We provide truncation-free MCMC inference for our model using a new auxiliary sampling scheme for Bayesian nonparametric HMMs. In a series of experiments on male X chromosome data from the Thousand Genomes Project, and also on data simulated from a population bottleneck, we show the benefits of our model over the popular finite model fastPHASE, which can itself be seen as a parametric truncation of our model. We find that the number of HMM states found by our model is correlated with the time to the most recent common ancestor in population bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics applied to large and complex genetic data.

@article{EllTeh2016a,
author = {Elliott, L. T. and Teh, Y. W.},
journal = {Electronic Journal of Statistics},
title = {A Nonparametric {HMM} for Genetic Imputation and Coalescent Inference},
year = {2016}
}

2015

M. De Iorio,
L. Elliott,
S. Favaro,
Y. W. Teh,
Bayesian Nonparametric Inference of Population Admixtures, 2015.

We propose a Bayesian nonparametric model to infer population admixture, extending the hierarchical Dirichlet process to allow for correlation between loci due to linkage disequilibrium. Given multilocus genotype data from a sample of individuals, the model allows classifying individuals as unadmixed or admixed, and inferring the number of subpopulations ancestral to an admixed population and the population of origin of chromosomal regions. Our model does not assume any specific mutation process and can be applied to most of the commonly used genetic markers. We present an MCMC algorithm to perform posterior inference from the model and discuss methods to summarise the MCMC output for the analysis of population admixture. We demonstrate the performance of the proposed model in simulations and in a real application, using genetic data from the EDAR gene, which is considered to be ancestry-informative due to well-known variations in allele frequency as well as phenotypic effects across ancestry. The structure analysis of this dataset leads to the identification of a rare haplotype in Europeans.

@unpublished{De-EllFav2015a,
author = {{De Iorio}, M. and Elliott, L. and Favaro, S. and Teh, Y. W.},
note = {ArXiv e-prints: 1503.08278},
title = {{B}ayesian Nonparametric Inference of Population Admixtures},
year = {2015},
bdsk-url-1 = {https://arxiv.org/pdf/1503.08278v1.pdf}
}

2012

L. Elliott,
Y. W. Teh,
Scalable Imputation of Genetic Data with a Discrete Fragmentation-Coagulation Process, in Advances in Neural Information Processing Systems (NIPS), 2012.

We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by their clusters first splitting and then merging. Our model can be thought of as a discrete-time analogue of continuous-time fragmentation-coagulation processes [Teh et al. 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining the same accuracies as in [Teh et al. 2011].

@inproceedings{EllTeh2012a,
author = {Elliott, L. and Teh, Y. W.},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
title = {Scalable Imputation of Genetic Data with a Discrete Fragmentation-Coagulation Process},
year = {2012},
bdsk-url-1 = {http://papers.nips.cc/paper/4782-scalable-imputation-of-genetic-data-with-a-discrete-fragmentation-coagulation-process},
bdsk-url-2 = {http://papers.nips.cc/paper/4782-scalable-imputation-of-genetic-data-with-a-discrete-fragmentation-coagulation-process.pdf}
}

2011

Y. W. Teh,
C. Blundell,
L. T. Elliott,
Modelling Genetic Variations using Fragmentation-Coagulation Processes, in Advances in Neural Information Processing Systems (NIPS), 2011.

We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data.

@inproceedings{TehBluEll2011a,
author = {Teh, Y. W. and Blundell, C. and Elliott, L. T.},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
title = {Modelling Genetic Variations using Fragmentation-Coagulation Processes},
year = {2011},
bdsk-url-1 = {https://papers.nips.cc/paper/4211-modelling-genetic-variations-using-fragmentation-coagulation-processes},
bdsk-url-2 = {https://papers.nips.cc/paper/4211-modelling-genetic-variations-using-fragmentation-coagulation-processes.pdf}
}

@software{De-EllFav2015b,
author = {{De Iorio}, M. and Elliott, L. T. and Favaro, S. and Teh, Y. W.},
title = {HDPStructure},
year = {2015},
bdsk-url-1 = {https://github.com/BigBayes/HDPStructure}
}