Bayesian nonparametrics, probabilistic learning, deep learning
I am a Professor of Statistical Machine Learning at the Department of Statistics, University of Oxford and a Research Scientist at Google DeepMind. I am a European Research Council Consolidator Fellow and an Alan Turing Institute Faculty Fellow. I am interested in developing foundational methodologies for statistical machine learning.
Publications
2017
J. Mitrovic,
D. Sejdinovic,
Y. W. Teh,
Deep Kernel Machines via the Kernel Reparametrization Trick, in International Conference on Learning Representations (ICLR) - Workshop Track, 2017.
While deep neural networks have achieved state-of-the-art performance on many tasks across varied domains, they still remain black boxes whose inner workings are hard to interpret and understand. In this paper, we develop a novel method for efficiently capturing the behaviour of deep neural networks using kernels. In particular, we construct a hierarchy of increasingly complex kernels that encode individual hidden layers of the network. Furthermore, we discuss how our framework motivates a novel supervised weight initialization method that discovers highly discriminative features already at initialization.
@inproceedings{MitSejTeh2017,
author = {Mitrovic, J. and Sejdinovic, D. and Teh, Y. W.},
title = {{Deep Kernel Machines via the Kernel Reparametrization Trick}},
booktitle = {International Conference on Learning Representations (ICLR) - Workshop Track},
year = {2017}
}
S. Flaxman,
Y. W. Teh,
D. Sejdinovic,
Poisson Intensity Estimation with Reproducing Kernels, in Artificial Intelligence and Statistics (AISTATS), 2017, to appear.
Despite the fundamental nature of the inhomogeneous Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. Cox process), few tractable nonparametric modeling approaches of intensity functions exist, especially in high dimensional settings. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. The modeling challenge is that the usual representer theorem arguments no longer apply due to the form of the inhomogeneous Poisson process likelihood. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.
@inproceedings{FlaTehSej2017,
author = {Flaxman, S. and Teh, Y. W. and Sejdinovic, D.},
title = {{Poisson Intensity Estimation with Reproducing Kernels}},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2017},
pages = {to appear}
}
M. Battiston,
S. Favaro,
Y. W. Teh,
Multi-armed bandit for species discovery: A Bayesian nonparametric approach, Journal of the American Statistical Association, to appear, 2017.
Let (P1,...,PJ) denote J populations of animals from distinct regions. A priori, it is
unknown which species are present in each region and what are their corresponding frequencies.
Species are shared among populations and each species can be present in more than one region with
its frequency varying across populations. In this paper we consider the problem of sequentially
sampling these populations in order to observe the greatest number of different species. We adopt a
Bayesian nonparametric approach and endow (P1,...,PJ) with a Hierarchical Pitman-Yor process
prior. As a consequence of the hierarchical structure, the J unknown discrete probability measures
share the same support, that of their common random base measure. Given this prior choice,
we propose a sequential rule that, at every time step, given the information available up to that
point, selects the population from which to collect the next observation. Rather than picking the
population with the highest posterior estimate of producing a new value, the proposed rule includes
a Thompson sampling step to better balance the exploration-exploitation trade-off. We also propose
an extension of the algorithm to deal with incidence data, where multiple observations are collected
in a time period. The performance of the proposed algorithms is assessed through a simulation
study and compared to three other strategies. Finally, we compare these algorithms using a dataset
of species of trees, collected from different plots in South America.
@article{BatFavTeh2017a,
author = {Battiston, M. and Favaro, S. and Teh, Y. W.},
title = {Multi-armed bandit for species discovery: A Bayesian nonparametric approach},
journal = {Journal of the American Statistical Association},
year = {2017},
pages = {to appear}
}
2016
J. Mitrovic,
D. Sejdinovic,
Y. W. Teh,
DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression, in International Conference on Machine Learning (ICML), 2016, 1482–1491.
Performing exact posterior inference in complex generative models is often difficult or impossible due to an expensive to evaluate or intractable likelihood function. Approximate Bayesian computation (ABC) is an inference framework that constructs an approximation to the true likelihood based on the similarity between the observed and simulated data as measured by a predefined set of summary statistics. Although the choice of appropriate problem-specific summary statistics crucially influences the quality of the likelihood approximation and hence also the quality of the posterior sample in ABC, there are only few principled general-purpose approaches to the selection or construction of such summary statistics. In this paper, we develop a novel framework for this task using kernel-based distribution regression. We model the functional relationship between data distributions and the optimal choice (with respect to a loss function) of summary statistics using kernel-based distribution regression. We show that our approach can be implemented in a computationally and statistically efficient way using the random Fourier features framework for large-scale kernel learning. In addition to that, our framework shows superior performance when compared to related methods on toy and real-world problems.
@inproceedings{MitSejTeh2016,
author = {Mitrovic, J. and Sejdinovic, D. and Teh, Y. W.},
title = {{DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression}},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2016},
pages = {1482--1491}
}
T. Fernandez,
Y. W. Teh,
Posterior Consistency for a Non-parametric Survival Model under a Gaussian Process Prior, 2016.
@unpublished{FerTeh2016a,
author = {Fernandez, T. and Teh, Y. W.},
note = {ArXiv e-prints: 1611.02335},
title = {Posterior Consistency for a Non-parametric Survival Model under a {G}aussian Process Prior},
year = {2016}
}
V. Perrone,
P. A. Jenkins,
D. Spano,
Y. W. Teh,
Poisson Random Fields for Dynamic Feature Models, 2016.
@unpublished{PerJenSpa2016a,
author = {Perrone, V. and Jenkins, P. A. and Spano, D. and Teh, Y. W.},
note = {ArXiv e-prints: 1611.07460},
title = {{P}oisson Random Fields for Dynamic Feature Models},
year = {2016}
}
T. Fernandez,
N. Rivera,
Y. W. Teh,
Gaussian Processes for Survival Analysis, in Advances in Neural Information Processing Systems (NIPS), 2016.
We introduce a semi-parametric Bayesian model for survival analysis. The model is centred on a parametric baseline hazard, and uses a Gaussian process to model variations away from it nonparametrically, as well as dependence on covariates. As opposed to many other methods in survival analysis, our framework does not impose unnecessary constraints in the hazard rate or in the survival function. Furthermore, our model handles left, right and interval censoring mechanisms common in survival analysis. We propose a MCMC algorithm to perform inference and an approximation scheme based on random Fourier features to make computations faster. We report experimental results on synthetic and real data, showing that our model performs better than competing models such as Cox proportional hazards, ANOVA-DDP and random survival forests.
@inproceedings{FerRivTeh2016a,
author = {Fernandez, T. and Rivera, N. and Teh, Y. W.},
title = {Gaussian Processes for Survival Analysis},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2016}
}
H. Kim,
Y. W. Teh,
Scalable Structure Discovery in Regression using Gaussian Processes, in Proceedings of the 2016 Workshop on Automatic Machine Learning, 2016.
Automatic Bayesian Covariance Discovery(ABCD) in Lloyd et. al (2014) provides a framework for automating statistical modelling as well as exploratory data analysis for regression problems. However ABCD does not scale due to its O(N^3) running time. This is undesirable not only because the average size of data sets is growing fast, but also because there is potentially more information in bigger data, implying a greater need for more expressive models that can discover sophisticated structure. We propose a scalable version of ABCD, to encompass big data within the boundaries of automated statistical modelling.
@inproceedings{KimTeh2016a,
author = {Kim, H. and Teh, Y. W.},
title = {Scalable Structure Discovery in Regression using Gaussian Processes},
booktitle = {Proceedings of the 2016 Workshop on Automatic Machine Learning},
year = {2016}
}
L. T. Elliott,
Y. W. Teh,
A Nonparametric HMM for Genetic Imputation and Coalescent Inference, Electronic Journal of Statistics, 2016.
Genetic sequence data are well described by hidden Markov models (HMMs) in which latent states correspond to clusters of similar mutation patterns. Theory from statistical genetics suggests that these HMMs are nonhomogeneous (their transition probabilities vary along the chromosome) and have large support for self transitions. We develop a new nonparametric model of genetic sequence data, based on the hierarchical Dirichlet process, which supports these self transitions and nonhomogeneity. Our model provides a parameterization of the genetic process that is more parsimonious than other more general nonparametric models which have previously been applied to population genetics. We provide truncation-free MCMC inference for our model using a new auxiliary sampling scheme for Bayesian nonparametric HMMs. In a series of experiments on male X chromosome data from the Thousand Genomes Project and also on data simulated from a population bottleneck we show the benefits of our model over the popular finite model fastPHASE, which can itself be seen as a parametric truncation of our model. We find that the number of HMM states found by our model is correlated with the time to the most recent common ancestor in population bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics applied to large and complex genetic data.
@article{EllTeh2016a,
author = {Elliott, L. T. and Teh, Y. W.},
title = {A Nonparametric {HMM} for Genetic Imputation and Coalescent Inference},
journal = {Electronic Journal of Statistics},
year = {2016}
}
S. Favaro,
A. Lijoi,
C. Nava,
B. Nipoti,
I. Prüenster,
Y. W. Teh,
On the Stick-Breaking Representation for Homogeneous NRMIs, Bayesian Analysis, vol. 11, 697–724, 2016.
In this paper, we consider homogeneous normalized random measures with independent increments (hNRMI), a class of nonparametric priors recently introduced in the literature. Many of their distributional properties are known by now but their stick-breaking representation is missing. Here we display such a representation, which will feature dependent stick-breaking weights, and then derive explicit versions for noteworthy special cases of hNRMI. Posterior characterizations are also discussed. Finally, we devise an algorithm for slice sampling mixture models based on hNRMIs, which relies on the representation we have obtained, and implement it to analyze real data.
@article{FavLijNav2016a,
author = {Favaro, S. and Lijoi, A. and Nava, C. and Nipoti, B. and Pr\"uenster, I. and Teh, Y. W.},
title = {On the Stick-Breaking Representation for Homogeneous {NRMIs}},
journal = {Bayesian Analysis},
year = {2016},
volume = {11},
pages = {697-724}
}
Y. W. Teh,
Bayesian Nonparametric Modelling and the Ubiquitous Ewens Sampling Formula, Statistical Science, vol. 31, no. 1, 34–36, 2016.
We introduce the Mondrian kernel, a fast random feature approximation to the Laplace kernel. It is suitable for both batch and online learning, and admits a fast kernel-width-selection procedure as the random features can be re-used efficiently for all kernel widths. The features are constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and we highlight the connection to Mondrian forests [Lakshminarayanan et al., 2014], where trees are also sampled via a Mondrian process, but fit independently. This link provides a new insight into the relationship between kernel methods and random forests.
@inproceedings{BalLakGha2016a,
author = {Balog, M. and Lakshminarayanan, B. and Ghahramani, Z. and Roy, D. M. and Teh, Y. W.},
title = {The {M}ondrian Kernel},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2016}
}
Y. W. Teh,
A. H. Thiéry,
S. J. Vollmer,
Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics, Journal of Machine Learning Research, 2016.
Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally expensive. Both the calculation of the acceptance probability and the creation of informed proposals usually require an iteration through the whole data set. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem by generating proposals which are only based on a subset of the data, by skipping the accept-reject step and by using decreasing step-sizes sequence (δ_m)_m≥0. We provide in this article a rigorous mathematical framework for analysing this algorithm. We prove that, under verifiable assumptions, the algorithm is consistent, satisfies a central limit theorem (CLT) and its asymptotic bias-variance decomposition can be characterized by an explicit functional of the step-sizes sequence (δ_m)_m≥0. We leverage this analysis to give practical recommendations for the notoriously difficult tuning of this algorithm: it is asymptotically optimal to use a step-size sequence of the type δ_m≍m^−1/3, leading to an algorithm whose mean squared error (MSE) decreases at rate O(m^−1/3).
@article{TehThiVol2016a,
author = {Teh, Y. W. and Thi\'ery, A. H. and Vollmer, S. J.},
title = {Consistency and Fluctuations for Stochastic Gradient {L}angevin Dynamics},
journal = {Journal of Machine Learning Research},
year = {2016}
}
S. J. Vollmer,
K. C. Zygalakis,
Y. W. Teh,
Exploration of the (Non-)asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics, Journal of Machine Learning Research (JMLR), 2016.
Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally infeasible. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem in three ways: it generates proposed moves using only a subset of the data, it skips the Metropolis- Hastings accept-reject step, and it uses sequences of decreasing step sizes. In Teh et al. (2014), we provided the mathematical foundations for the decreasing step size SGLD, including consistency and a central limit theorem. However, in practice the SGLD is run for a relatively small number of iterations, and its step size is not decreased to zero. The present article investigates the behaviour of the SGLD with fixed step size. In particular we characterise the asymptotic bias explicitly, along with its dependence on the step size and the variance of the stochastic gradient. On that basis a modified SGLD which removes the asymptotic bias due to the variance of the stochastic gradients up to first order in the step size is derived. Moreover, we are able to obtain bounds on the finite-time bias, variance and mean squared error (MSE). The theory is illustrated with a Gaussian toy model for which the bias and the MSE for the estimation of moments can be obtained explicitly. For this toy model we study the gain of the SGLD over the standard Euler method in the limit of large data sets.
@article{VolZygTeh2016a,
author = {Vollmer, S. J. and Zygalakis, K. C. and Teh, Y. W.},
title = {Exploration of the (Non-)asymptotic Bias and Variance of Stochastic Gradient {L}angevin Dynamics},
journal = {Journal of Machine Learning Research (JMLR)},
year = {2016}
}
J. Arbel,
S. Favaro,
B. Nipoti,
Y. W. Teh,
Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics, Statistica Sinica, 2016.
Given a sample of size n from a population of individuals belonging to different species with unknown proportions, a popular problem of practical interest consists in making inference on the probability D_n(l) that the (n+1)-th draw coincides with a species with frequency l in the sample, for any l = 0,1,…,n. This paper contributes to the methodology of Bayesian nonparametric inference for D_n(l). Specifically, under the general framework of Gibbs-type priors we show how to derive credible intervals for a Bayesian nonparametric estimation of D_n(l), and we investigate the large n asymptotic behaviour of such an estimator. Of particular interest are special cases of our results obtained under the specification of the two parameter Poisson–Dirichlet prior and the normalized generalized Gamma prior, which are two of the most commonly used Gibbs-type priors. With respect to these two prior specifications, the proposed results are illustrated through a simulation study and a benchmark Expressed Sequence Tags dataset. To the best our knowledge, this illustration provides the first comparative study between the two parameter Poisson–Dirichlet prior and the normalized generalized Gamma prior in the context of Bayesian nonparemetric inference for D_n(l).
@article{ArbFavNip2016a,
author = {Arbel, J. and Favaro, S. and Nipoti, B. and Teh, Y. W.},
title = {{B}ayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics},
journal = {Statistica Sinica},
year = {2016}
}
B. Lakshminarayanan,
D. M. Roy,
Y. W. Teh,
Mondrian Forests for Large-Scale Regression when Uncertainty Matters, in Artificial Intelligence and Statistics (AISTATS), 2016.
Many real-world regression problems demand a measure of the uncertainty associated with each prediction. Standard decision forests deliver efficient state-of-the-art predictive performance, but high-quality uncertainty estimates are lacking. Gaussian processes (GPs) deliver uncertainty estimates, but scaling GPs to large-scale data sets comes at the cost of approximating the uncertainty estimates. We extend Mondrian forests, first proposed by Lakshminarayanan et al. (2014) for classification problems, to the large-scale nonparametric regression setting. Using a novel hierarchical Gaussian prior that dovetails with the Mondrian forest framework, we obtain principled uncertainty estimates, while still retaining the computational advantages of decision forests. Through a combination of illustrative examples, real-world large-scale datasets and Bayesian optimization benchmarks, we demonstrate that Mondrian forests outperform approximate GPs on large-scale regression tasks and deliver better-calibrated uncertainty assessments than decision-forest-based methods.
@inproceedings{LakRoyTeh2016a,
author = {Lakshminarayanan, B. and Roy, D. M. and Teh, Y. W.},
title = {{M}ondrian Forests for Large-Scale Regression when Uncertainty Matters},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2016}
}
Hamiltonian Monte Carlo (HMC) is a popular Markov chain Monte Carlo (MCMC) algorithm that generates proposals for a Metropolis-Hastings algorithm by simulating the dynamics of a Hamiltonian system. However, HMC is sensitive to large time discretizations and performs poorly if there is a mismatch between the spatial geometry of the target distribution and the scales of the momentum distribution. In particular the mass matrix of HMC is hard to tune well. In order to alleviate these problems we propose relativistic Hamiltonian Monte Carlo, a version of HMC based on relativistic dynamics that introduce a maximum velocity on particles. We also derive stochastic gradient versions of the algorithm and show that the resulting algorithms bear interesting relationships to gradient clipping, RMSprop, Adagrad and Adam, popular optimisation methods in deep learning. Based on this, we develop relativistic stochastic gradient descent by taking the zero-temperature limit of relativistic stochastic gradient Hamiltonian Monte Carlo. In experiments we show that the relativistic algorithms perform better than classical Newtonian variants and Adam.
@unpublished{LuPerHas2016a,
author = {Lu, X. and Perrone, V. and Hasenclever, L. and Teh, Y. W. and Vollmer, S. J.},
note = {ArXiv e-prints: 1609.04388},
title = {Relativistic {M}onte {C}arlo},
year = {2016}
}
D. Glowacka,
Y. W. Teh,
J. Shawe-Taylor,
Image Retrieval with a Bayesian Model of Relevance Feedback, 2016.
A content-based image retrieval system based on multinomial relevance feedback is proposed. The system relies on an interactive search paradigm where at each round a user is presented with k images and selects the one closest to their ideal target. Two approaches, one based on the Dirichlet distribution and one based the Beta distribution, are used to model the problem motivating an algorithm that trades exploration and exploitation in presenting the images in each round. Experimental results show that the new approach compares favourably with previous work.
@unpublished{GloTehSha2016a,
author = {Glowacka, D. and Teh, Y. W. and Shawe-Taylor, J.},
note = {ArXiv e-prints: 1603.09522},
title = {Image Retrieval with a {B}ayesian Model of Relevance Feedback},
year = {2016}
}
We introduce the Tucker Gaussian Process (TGP), a model for regression that regularises a Gaussian Process (GP) towards simpler regression functions for enhanced generalisation performance. We derive it using a novel approach to scalable GP learning, and show that our model is particularly well-suited to grid-structured data and problems where the dependence on covariates is close to being separable. A prime example is collaborative filtering, for which our model provides an effective GP based method that has a low-rank matrix factorisation at its core. We show that TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information.
@unpublished{KimLuFla2016a,
author = {Kim, H. and Lu, X. and Flaxman, S. and Teh, Y. W.},
note = {ArXiv e-prints: 1605.07025},
title = {{T}ucker {G}aussian Process for Regression and Collaborative Filtering},
year = {2016}
}
M. Battiston,
S. Favaro,
D. M. Roy,
Y. W. Teh,
A Characterization of Product-Form Exchangeable Feature Probability Functions, 2016.
We characterize the class of exchangeable feature allocations assigning probability V_n,k ∏^k_l=1 W_m_lU_n−m_l to a feature allocation of n individuals, displaying k features with counts (m_1,…,m_k) for these features. Each element of this class is parametrized by a countable matrix V and two sequences U and W of non-negative weights. Moreover, a consistency condition is imposed to guarantee that the distribution for feature allocations of n−1 individuals is recovered from that of n individuals, when the last individual is integrated out. In Theorem 1.1, we prove that the only members of this class satisfying the consistency condition are mixtures of the Indian Buffet Process over its mass parameter γ and mixtures of the Beta–Bernoulli model over its dimensionality parameter N. Hence, we provide a characterization of these two models as the only, up to randomization of the parameters, consistent exchangeable feature allocations having the required product form.
@unpublished{BatFavRoy2016a,
author = {Battiston, M. and Favaro, S. and Roy, D. M. and Teh, Y. W.},
note = {ArXiv e-prints: 1607.02066},
title = {A Characterization of Product-Form Exchangeable Feature Probability Functions},
year = {2016}
}
We propose a Bayesian nonparametric prior for time-varying networks. To each node of the network is associated a positive parameter, modeling the sociability of that node. Sociabilities are assumed to evolve over time, and are modeled via a dynamic point process model. The model is able to (a) capture smooth evolution of the interaction between nodes, allowing edges to appear/disappear over time (b) capture long term evolution of the sociabilities of the nodes (c) and yield sparse graphs, where the number of edges grows subquadratically with the number of nodes. The evolution of the sociabilities is described by a tractable time-varying gamma process. We provide some theoretical insights into the model and apply it to three real world datasets.
@unpublished{PalCarTeh2016a,
author = {Palla, K. and Caron, F. and Teh, Y. W.},
note = {ArXiv e-prints: 1607.01624},
title = {Bayesian Nonparametrics for Sparse Dynamic Networks},
year = {2016}
}
2015
A. G. Deshwar,
L. Boyles,
J. Wintersinger,
P. C. Boutros,
Y. W. Teh,
Q. Morris,
Abstract B2-59: PhyloSpan: using multimutation reads to resolve subclonal architectures from heterogeneous tumor samples, Cancer Research, vol. 75, 2015.
@article{BouBoyDes2015a,
author = {Deshwar, A. G. and Boyles, L. and Wintersinger, J. and Boutros, P. C. and Teh, Y. W. and Morris, Q.},
title = {Abstract {B2-59}: {PhyloSpan}: using multimutation reads to resolve subclonal architectures from heterogeneous tumor samples},
journal = {Cancer Research},
year = {2015},
volume = {75},
doi = {10.1158/1538-7445.AM2015-4865},
booktitle = {AACR Special Conference on Computational and Systems Biology of Cancer}
}
S. Favaro,
B. Nipoti,
Y. W. Teh,
Rediscovery of Good-Turing Estimators via Bayesian Nonparametrics, Biometrics, 2015.
The problem of estimating discovery probabilities originated in the context of statistical ecology, and in recent years it has become popular due to its frequent appearance in challenging applications arising in genetics, bioinformatics, linguistics, designs of experiments, machine learning, etc. A full range of statistical approaches, parametric and nonparametric as well as frequentist and Bayesian, has been proposed for estimating discovery probabilities. In this article, we investigate the relationships between the celebrated Good–Turing approach, which is a frequentist nonparametric approach developed in the 1940s, and a Bayesian nonparametric approach recently introduced in the literature. Specifically, under the assumption of a two parameter Poisson-Dirichlet prior, we show that Bayesian nonparametric estimators of discovery probabilities are asymptotically equivalent, for a large sample size, to suitably smoothed Good–Turing estimators. As a by-product of this result, we introduce and investigate a methodology for deriving exact and asymptotic credible intervals to be associated with the Bayesian nonparametric estimators of discovery probabilities. The proposed methodology is illustrated through a comprehensive simulation study and the analysis of Expressed Sequence Tags data generated by sequencing a benchmark complementary DNA library.
@article{FavNipTeh2015b,
author = {Favaro, S. and Nipoti, B. and Teh, Y. W.},
title = {Rediscovery of {Good-Turing} Estimators via {B}ayesian Nonparametrics},
journal = {Biometrics},
year = {2015},
doi = {10.1111/biom.12366}
}
P. G. Moreno,
A. Artés-Rodríguez,
Y. W. Teh,
F. Perez-Cruz,
Bayesian Nonparametric Crowdsourcing, Journal of Machine Learning Research (JMLR), 2015.
Crowdsourcing has been proven to be an effective and efficient tool to annotate large data-sets. User annotations are often noisy, so methods to combine the annotations to produce reliable estimates of the ground truth are necessary. We claim that considering the existence of clusters of users in this combination step can improve the performance. This is especially important in early stages of crowdsourcing implementations, where the number of annotations is low. At this stage there is not enough information to accurately estimate the bias introduced by each annotator separately, so we have to resort to models that consider the statistical links among them. In addition, finding these clusters is interesting in itself as knowing the behavior of the pool of annotators allows implementing efficient active learning strategies. Based on this, we propose in this paper two new fully unsupervised models based on a Chinese restaurant process (CRP) prior and a hierarchical structure that allows inferring these groups jointly with the ground truth and the properties of the users. Efficient inference algorithms based on Gibbs sampling with auxiliary variables are proposed. Finally, we perform experiments, both on synthetic and real databases, to show the advantages of our models over state-of-the-art algorithms.
@article{MorArtTeh2015a,
author = {Moreno, P. G. and Art\'es-Rodr\'iguez, A. and Teh, Y. W. and Perez-Cruz, F.},
title = {{B}ayesian Nonparametric Crowdsourcing},
journal = {Journal of Machine Learning Research (JMLR)},
year = {2015}
}
R. P. Adams,
E. B. Fox,
E. B. Sudderth,
Y. W. Teh,
Guest Editors’ Introduction to the Special Issue on Bayesian Nonparametrics, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
@article{AdaFoxSud2015a,
author = {Adams, R. P. and Fox, E. B. and Sudderth, E. B. and Teh, Y. W.},
title = {Guest Editors' Introduction to the Special Issue on Bayesian Nonparametrics},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2015}
}
M. Lomeli,
S. Favaro,
Y. W. Teh,
A hybrid sampler for Poisson-Kingman mixture models, in Advances in Neural Information Processing Systems (NIPS), 2015.
This paper concerns the introduction of a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel and compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous MCMC schemes. We describe comparative simulation results demonstrating the efficacy of the proposed MCMC algorithm against existing marginal and conditional MCMC samplers.
@inproceedings{LomFavTeh2015a,
author = {Lomeli, M. and Favaro, S. and Teh, Y. W.},
title = {A hybrid sampler for {Poisson-Kingman} mixture models},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2015}
}
M. De Iorio,
S. Favaro,
Y. W. Teh,
Bayesian Inference on Population Structure: From Parametric to Nonparametric Modeling, in Nonparametric Bayesian Inference in Biostatistics, Springer, 2015.
Making inference on population structure from genotype data requires to identify the actual subpopulations and assign individuals to these populations. The source populations are assumed to be in Hardy-Weinberg equilibrium, but the allelic frequencies of these populations and even the number of populations present in a sample are unknown. In this chapter we present a review of some Bayesian parametric and nonparametric models for making inference on population structure, with emphasis on model-based clustering methods. Our aim is to show how recent developments in Bayesian nonparametrics have been usefully exploited in order to introduce natural nonparametric counterparts of some of the most celebrated parametric approaches for inferring population structure. We use data from the 1000 Genomes project (http://www.1000genomes.org/) to provide a brief illustration of some of these nonparametric approaches.
@incollection{De-FavTeh2015a,
author = {{De Iorio}, M. and Favaro, S. and Teh, Y. W.},
title = {{B}ayesian Inference on Population Structure: From Parametric to Nonparametric Modeling},
booktitle = {Nonparametric {B}ayesian Inference in Biostatistics},
publisher = {Springer},
year = {2015},
doi = {10.1007/978-3-319-19518-6_7}
}
T. Lienart,
Y. W. Teh,
A. Doucet,
Expectation Particle Belief Propagation, in Advances in Neural Information Processing Systems (NIPS), 2015.
We propose an original particle-based implementation of the Loopy Belief Propagation (LPB) algorithm for pairwise Markov Random Fields (MRF) on a continuous state space. The algorithm constructs adaptively efficient proposal distributions approximating the local beliefs at each note of the MRF. This is achieved by considering proposal distributions in the exponential family whose parameters are updated iterately in an Expectation Propagation (EP) framework. The proposed particle scheme provides consistent estimation of the LBP marginals as the number of particles increases. We demonstrate that it provides more accurate results than the Particle Belief Propagation (PBP) algorithm of Ihler and McAllester (2009) at a fraction of the computational cost and is additionally more robust empirically. The computational complexity of our algorithm at each iteration is quadratic in the number of particles. We also propose an accelerated implementation with sub-quadratic computational complexity which still provides consistent estimates of the loopy BP marginal distributions and performs almost as well as the original procedure.
@inproceedings{LieTehDou2015a,
author = {Lienart, T. and Teh, Y. W. and Doucet, A.},
title = {Expectation Particle Belief Propagation},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2015}
}
S. Favaro,
B. Nipoti,
Y. W. Teh,
Random variate generation for Laguerre-type exponentially tilted α-stable distributions, Electronic Journal of Statistics, vol. 9, 1230–1242, 2015.
Exact sampling methods have been recently developed for generating random variates for exponentially tilted α-stable distributions. In this paper we show how to generate, exactly, random variates for a more general class of tilted α-stable distributions, which is referred to as the class of Laguerre-type exponentially tilted α-stable distributions. Beside the exponentially tilted α-stable distribution, such a class includes also the Erlang tilted α-stable distribution. This is a special case of the so-called gamma tilted α-stable distribution, for which an efficient exact random variate generator is currently not available in the literature. Our result fills this gap.
@article{FavNipTeh2015a,
author = {Favaro, S. and Nipoti, B. and Teh, Y. W.},
title = {Random variate generation for {L}aguerre-type exponentially tilted {$\alpha$}-stable distributions},
journal = {Electronic Journal of Statistics},
year = {2015},
volume = {9},
pages = {1230-1242},
doi = {10.1214/15-EJS1033}
}
M. Lomeli,
S. Favaro,
Y. W. Teh,
A Marginal Sampler for σ-Stable Poisson-Kingman Mixture Models, jcgs, 2015.
We investigate the class of σ-stable Poisson-Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs which encompasses most of the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman-Yor process, the normalized inverse Gaussian process and the normalized generalized Gamma process. We show how certain sampling properties and marginal characterisations of σ-stable Poisson-Kingman RPMs can be usefully exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for performing posterior inference with a Bayesian nonparametric mixture model. Specifically, we introduce a novel and efficient MCMC sampling scheme in an augmented space that has a fixed number of auxiliary variables per iteration. We apply our sampling scheme to a density estimation and clustering tasks with unidimensional and multidimensional datasets, and compare it against competing MCMC sampling schemes.
@article{LomFavTeh2015b,
author = {Lomeli, M. and Favaro, S. and Teh, Y. W.},
title = {A Marginal Sampler for {$\sigma$}-Stable {P}oisson-{K}ingman Mixture Models},
journal = jcgs,
year = {2015},
doi = {10.1080/10618600.2015.1110526}
}
M. Balog,
Y. W. Teh,
The Mondrian Process for Machine Learning, 2015.
This report is concerned with the Mondrian process and its applications in machine learning. The Mondrian process is a guillotine-partition-valued stochastic process that possesses an elegant self-consistency property. The first part of the report uses simple concepts from applied probability to define the Mondrian process and explore its properties.
The Mondrian process has been used as the main building block of a clever online random forest classification algorithm that turns out to be equivalent to its batch counterpart. We outline a slight adaptation of this algorithm to regression, as the remainder of the report uses regression as a case study of how Mondrian processes can be utilized in machine learning. In particular, the Mondrian process will be used to construct a fast approximation to the computationally expensive kernel ridge regression problem with a Laplace kernel.
The complexity of random guillotine partitions generated by a Mondrian process and hence the complexity of the resulting regression models is controlled by a lifetime hyperparameter. It turns out that these models can be efficiently trained and evaluated for all lifetimes in a given range at once, without needing to retrain them from scratch for each lifetime value. This leads to an efficient procedure for determining the right model complexity for a dataset at hand.
The limitation of having a single lifetime hyperparameter will motivate the final Mondrian grid model, in which each input dimension is endowed with its own lifetime parameter. In this model we preserve the property that its hyperparameters can be tweaked without needing to retrain the modified model from scratch.
@unpublished{BalTeh2015a,
author = {Balog, M. and Teh, Y. W.},
note = {ArXiv e-prints: 1507.05181},
title = {The {M}ondrian Process for Machine Learning},
year = {2015}
}
L. Hasenclever,
S. Webb,
T. Lienart,
S. Vollmer,
B. Lakshminarayanan,
C. Blundell,
Y. W. Teh,
Distributed Bayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server, 2015.
This paper makes two contributions to Bayesian machine learning algorithms. Firstly, we propose stochastic natural gradient expectation propagation (SNEP), a novel alternative to expectation propagation (EP), a popular variational inference algorithm. SNEP is a black box variational algorithm, in that it does not require any simplifying assumptions on the distribution of interest, beyond the existence of some Monte Carlo sampler for estimating the moments of the EP tilted distributions. Further, as opposed to EP which has no guarantee of convergence, SNEP can be shown to be convergent, even when using Monte Carlo moment estimates. Secondly, we propose a novel architecture for distributed Bayesian learning which we call the posterior server. The posterior server allows scalable and robust Bayesian learning in cases where a dataset is stored in a distributed manner across a cluster, with each compute node containing a disjoint subset of data. An independent Monte Carlo sampler is run on each compute node, with direct access only to the local data subset, but which targets an approximation to the global posterior distribution given all data across the whole cluster. This is achieved by using a distributed asynchronous implementation of SNEP to pass messages across the cluster. We demonstrate SNEP and the posterior server on distributed Bayesian learning of logistic regression and neural networks.
@unpublished{HasWebLie2015a,
author = {Hasenclever, L. and Webb, S. and Lienart, T. and Vollmer, S. and Lakshminarayanan, B. and Blundell, C. and Teh, Y. W.},
note = {ArXiv e-prints: 1512.09327},
title = {Distributed {B}ayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server},
year = {2015}
}
P. Orbanz,
L. James,
Y. W. Teh,
Scaled subordinators and generalizations of the Indian buffet process, 2015.
We study random families of subsets of ℕ that are similar to exchangeable random partitions, but do not require constituent sets to be disjoint: Each element of ℕ may be contained in multiple subsets. One class of such objects, known as Indian buffet processes, has become a popular tool in machine learning. Based on an equivalence between Indian buffet and scale-invariant Poisson processes, we identify a random scaling variable whose role is similar to that played in exchangeable partition models by the total mass of a random measure. Analogous to the construction of exchangeable partitions from normalized subordinators, random families of sets can be constructed from randomly scaled subordinators. Coupling to a heavy-tailed scaling variable induces a power law on the number of sets containing the first n elements. Several examples, with properties desirable in applications, are derived explicitly. A relationship to exchangeable partitions is made precise as a correspondence between scaled subordinators and Poisson-Kingman measures, generalizing a result of Arratia, Barbour and Tavare on scale-invariant processes.
@unpublished{OrbJamTeh2015a,
author = {Orbanz, P. and James, L. and Teh, Y. W.},
note = {ArXiv e-prints: 1510.07309},
title = {Scaled subordinators and generalizations of the {I}ndian buffet process},
year = {2015}
}
M. De Iorio,
L. Elliott,
S. Favaro,
Y. W. Teh,
Bayesian Nonparametric Inference of Population Admixtures, 2015.
We propose a Bayesian nonparametric model to infer population admixture, extending the Hierarchical Dirichlet Process to allow for correlation between loci due to Linkage Disequilibrium. Given multilocus genotype data from a sample of individuals, the model allows inferring classifying individuals as unadmixed or admixed, inferring the number of subpopulations ancestral to an admixed population and the population of origin of chromosomal regions. Our model does not assume any specific mutation process and can be applied to most of the commonly used genetic markers. We present a MCMC algorithm to perform posterior inference from the model and discuss methods to summarise the MCMC output for the analysis of population admixture. We demonstrate the performance of the proposed model in simulations and in a real application, using genetic data from the EDAR gene, which is considered to be ancestry-informative due to well-known variations in allele frequency as well as phenotypic effects across ancestry. The structure analysis of this dataset leads to the identification of a rare haplotype in Europeans.
@unpublished{De-EllFav2015a,
author = {{De Iorio}, M. and Elliott, L. and Favaro, S. and Teh, Y. W.},
note = {ArXiv e-prints: 1503.08278},
title = {{B}ayesian Nonparametric Inference of Population Admixtures},
year = {2015}
}
B. Lakshminarayanan,
D. M. Roy,
Y. W. Teh,
Particle Gibbs for Bayesian Additive Regression Trees, in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2015.
Additive regression trees are flexible non-parametric models and popular off-the-shelf tools for real-world non-linear regression. In application domains, such as bioinformatics, where there is also demand for probabilistic predictions with measures of uncertainty, the Bayesian additive regression trees (BART) model, introduced by Chipman et al. (2010), is increasingly popular. As data sets have grown in size, however, the standard Metropolis–Hastings algorithms used to per- form inference in BART are proving inadequate. In particular, these Markov chains make local changes to the trees and suffer from slow mixing when the data are high- dimensional or the best-fitting trees are more than a few layers deep. We present a novel sampler for BART based on the Particle Gibbs (PG) algorithm (Andrieu et al., 2010) and a top-down particle filtering algorithm for Bayesian decision trees (Lakshminarayanan et al., 2013). Rather than making local changes to individual trees, the PG sampler proposes a complete tree to fit the residual. Experiments show that the PG sampler outperforms existing samplers in many settings.
@inproceedings{LakRoyTeh2015b,
author = {Lakshminarayanan, B. and Roy, D. M. and Teh, Y. W.},
title = {Particle {G}ibbs for {B}ayesian Additive Regression Trees},
booktitle = {Proceedings of the International Conference on Artificial Intelligence and Statistics},
year = {2015}
}
2014
M. Welling,
Y. W. Teh,
C. Andrieu,
J. Kominiarczuk,
T. Meeds,
B. Shahbaba,
S. Vollmer,
Bayesian Inference and Big Data: A Snapshot from a Workshop, ISBA Bulletin, 2014.
@article{WelTehAnd2014a,
author = {Welling, M. and Teh, Y. W. and Andrieu, C. and Kominiarczuk, J. and Meeds, T. and Shahbaba, B. and Vollmer, S.},
title = {Bayesian Inference and Big Data: A Snapshot from a Workshop},
journal = {ISBA Bulletin},
year = {2014}
}
M. Xu,
B. Lakshminarayanan,
Y. W. Teh,
J. Zhu,
B. Zhang,
Distributed Bayesian Posterior Sampling via Moment Sharing, in Advances in Neural Information Processing Systems, 2014.
We propose a distributed Markov chain Monte Carlo (MCMC) inference algorithm for large scale Bayesian posterior simulation. We assume that the dataset is partitioned and stored across nodes of a cluster. Our procedure involves an independent MCMC posterior sampler at each node based on its local partition of the data. Moment statistics of the local posteriors are collected from each sampler and propagated across the cluster using expectation propagation message passing with low communication costs. The moment sharing scheme improves posterior estimation quality by enforcing agreement among the samplers. We demonstrate the speed and inference quality of our method with empirical studies on Bayesian logistic regression and sparse linear regression with a spike-and-slab prior.
@inproceedings{XuLakTeh2014a,
author = {Xu, M. and Lakshminarayanan, B. and Teh, Y. W. and Zhu, J. and Zhang, B.},
title = {Distributed {B}ayesian Posterior Sampling via Moment Sharing},
booktitle = {Advances in Neural Information Processing Systems},
year = {2014}
}
S. Favaro,
M. Lomeli,
Y. W. Teh,
On a Class of σ-stable Poisson-Kingman Models and an Effective Marginalized Sampler, Statistics and Computing, 2014.
We investigate the use of a large class of discrete random probability measures, which is referred to as the class Q, in the context of Bayesian nonparametric mixture modeling. The class Q encompasses both the the two-parameter Poisson–Dirichlet process and the normalized generalized Gamma process, thus allowing us to comparatively study the inferential advantages of these two well-known nonparametric priors. Apart from a highly flexible parameterization, the distinguishing feature of the class Q is the availability of a tractable posterior distribution. This feature, in turn, leads to derive an efficient marginal MCMC algorithm for posterior sampling within the framework of mixture models. We demonstrate the efficacy of our modeling framework on both one-dimensional and multi-dimensional datasets.
@article{FavLomTeh2014a,
author = {Favaro, S. and Lomeli, M. and Teh, Y. W.},
title = {On a Class of {$\sigma$}-stable {P}oisson-{K}ingman Models and an Effective Marginalized Sampler},
journal = {Statistics and Computing},
year = {2014},
doi = {10.1007/s11222-014-9499-4}
}
T. Herlau,
M. Mörup,
Y. W. Teh,
M. N. Schmidt,
Adaptive Reconfiguration Moves for Dirichlet Mixtures, submitted, 2014.
Bayesian mixture models are widely applied for unsupervised learning and exploratory data analysis. Markov chain Monte Carlo based on Gibbs sampling and split-merge moves are widely used for inference in these models. However, both methods are restricted to limited types of transitions and suffer from torpid mixing and low accept rates even for problems of modest size. We propose a method that considers a broader range of transitions that are close to equilibrium by exploiting multiple chains in parallel and using the past states adaptively to inform the proposal distribution. The method significantly improves on Gibbs and split-merge sampling as quantified using convergence diagnostics and acceptance rates. Adaptive MCMC methods which use past states to inform the proposal distribution has given rise to many ingenious sampling schemes for continuous problems and the present work can be seen as an important first step in bringing these benefits to partition-based problems
@article{HerMorTeh2014a,
author = {Herlau, T. and M\"orup, M. and Teh, Y. W. and Schmidt, M. N.},
title = {Adaptive Reconfiguration Moves for Dirichlet Mixtures},
journal = {submitted},
year = {2014}
}
S. Favaro,
M. Lomeli,
B. Nipoti,
Y. W. Teh,
On the Stick-Breaking Representation of σ-stable Poisson-Kingman Models, Electronic Journal of Statistics, vol. 8, 1063–1085, 2014.
In this paper we investigate the stick-breaking representation for the class of σ-stable Poisson-Kingman models, also known as Gibbs-type random probability measures. This class includes as special cases most of the discrete priors commonly used in Bayesian nonparametrics, such as the two parameter Poisson-Dirichlet process and the normalized generalized Gamma process. Under the assumption σ=u/v, for any coprime integers 1≤u<v such that u/v≤1/2, we show that a σ-stable Poisson-Kingman model admits an explicit stick-breaking representation in terms of random variables which are obtained by suitably transforming Gamma random variables and products of independent Beta and Gamma random variables.
@article{FavLomNip2014a,
author = {Favaro, S. and Lomeli, M. and Nipoti, B. and Teh, Y. W.},
title = {On the Stick-Breaking Representation of {$\sigma$}-stable {P}oisson-{K}ingman Models},
journal = {Electronic Journal of Statistics},
year = {2014},
volume = {8},
pages = {1063-1085},
doi = {10.1214/14-EJS921}
}
B. Lakshminarayanan,
D. Roy,
Y. W. Teh,
Mondrian Forests: Efficient Online Random Forests, in Advances in Neural Information Processing Systems (NIPS), 2014.
@inproceedings{LakRoyTeh2014a,
author = {Lakshminarayanan, B. and Roy, D. and Teh, Y. W.},
title = {{M}ondrian Forests: Efficient Online Random Forests},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2014}
}
B. Paige,
F. Wood,
A. Doucet,
Y. W. Teh,
Asynchronous Anytime Sequential Monte Carlo, in Advances in Neural Information Processing Systems (NIPS), 2014.
We introduce a new sequential Monte Carlo algorithm we call the particle cascade. The particle cascade is an asynchronous, anytime alternative to traditional sequential Monte Carlo algorithms that is amenable to parallel and distributed implementations. It uses no barrier synchronizations which leads to improved particle throughput and memory efficiency. It is an anytime algorithm in the sense that it can be run forever to emit an unbounded number of particles while keeping within a fixed memory budget. We prove that the particle cascade provides an unbiased marginal likelihood estimator which can be straightforwardly plugged into existing pseudo-marginal methods.
@inproceedings{PaiWooDou2014a,
author = {Paige, B. and Wood, F. and Doucet, A. and Teh, Y. W.},
title = {Asynchronous Anytime Sequential {M}onte {C}arlo},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2014}
}
F. Caron,
Y. W. Teh,
B. T. Murphy,
Bayesian Nonparametric Plackett-Luce Models for the Analysis of Preferences for College Degree Programmes, Annals of Applied Statistics, vol. 8, no. 2, 1145–1181, 2014.
In this paper we propose a Bayesian nonparametric model for clustering partial ranking data. We start by developing a Bayesian nonparametric extension of the popular Plackett–Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a completely random measure. We characterise the posterior distribution given data, and derive a simple and effective Gibbs sampler for posterior simulation. We then develop a Dirichlet process mixture extension of our model and apply it to investigate the clustering of preferences for college degree programmes amongst Irish secondary school graduates. The existence of clusters of applicants who have similar preferences for degree programmes is established and we determine that subject matter and geographical location of the third level institution characterise these clusters.
@article{CarTehMur2014a,
author = {Caron, F. and Teh, Y. W. and Murphy, B. T.},
title = {{B}ayesian Nonparametric {P}lackett-{L}uce Models for the Analysis of Preferences for College Degree Programmes},
journal = {Annals of Applied Statistics},
year = {2014},
volume = {8},
number = {2},
pages = {1145-1181},
doi = {10.1214/14-AOAS717}
}
2013
S. Favaro,
Y. W. Teh,
MCMC for Normalized Random Measure Mixture Models, Statistical Science, vol. 28, no. 3, 335–359, 2013.
This paper concerns the use of Markov chain Monte Carlo methods for posterior sampling in Bayesian nonparametric mixture models with normalized random measure priors. Making use of some recent posterior characterizations for the class of normalized random measures, we propose novel Markov chain Monte Carlo methods of both marginal type and conditional type. The proposed marginal samplers are generalizations of Neal’s well-regarded Algorithm 8 for Dirichlet process mixture models, whereas the conditional sampler is a variation of those recently introduced in the literature. For both the marginal and conditional methods, we consider as a running example a mixture model with an underlying normalized generalized Gamma process prior, and describe comparative simulation results demonstrating the efficacies of the proposed methods.
@article{FavTeh2013a,
author = {Favaro, S. and Teh, Y. W.},
title = {{MCMC} for Normalized Random Measure Mixture Models},
journal = {Statistical Science},
year = {2013},
volume = {28},
number = {3},
pages = {335-359}
}
B. Lakshminarayanan,
D. Roy,
Y. W. Teh,
Top-down Particle Filtering for Bayesian Decision Trees, in International Conference on Machine Learning (ICML), 2013.
Decision tree learning is a popular approach for classification and regression in machine learning and statistics, and Bayesian formulations—which introduce a prior distribution over decision trees, and formulate learning as posterior inference given data—have been shown to produce competitive performance. Unlike classic decision tree learning algorithms like ID3, C4.5 and CART, which work in a top-down manner, existing Bayesian algorithms produce an approximation to the posterior distribution by evolving a complete tree (or collection thereof) iteratively via local Monte Carlo modifications to the structure of the tree, e.g., using Markov chain Monte Carlo (MCMC). We present a sequential Monte Carlo (SMC) algorithm that instead works in a top-down manner, mimicking the behavior and speed of classic algorithms. We demonstrate empirically that our approach delivers accuracy comparable to the most popular MCMC method, but operates more than an order of magnitude faster, and thus represents a better computation-accuracy tradeoff.
@inproceedings{LakRoyTeh2013a,
author = {Lakshminarayanan, B. and Roy, D. and Teh, Y. W.},
title = {Top-down Particle Filtering for {B}ayesian Decision Trees},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2013}
}
C. Blundell,
Y. W. Teh,
Bayesian Hierarchical Community Discovery, in Advances in Neural Information Processing Systems (NIPS), 2013.
We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms whose worst case scales quadratically in the number of vertices of the network, but independent of the number of communities. Our algorithms are two orders of magnitude faster than the infinite relational model, achieving comparable or better accuracy.
@inproceedings{BluTeh2013a,
author = {Blundell, C. and Teh, Y. W.},
title = {{B}ayesian Hierarchical Community Discovery},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2013}
}
V. Rao,
Y. W. Teh,
Fast MCMC sampling for Markov jump processes and extensions, Journal of Machine Learning Research (JMLR), vol. 14, 3295–3320, 2013.
Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this paper, we tackle the problem of simulating from the posterior distribution over paths in these models, given partial and noisy observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path, and then sampling a new path given the set of extant and virtual jump times. The first step involves simulating a piecewise-constant inhomogeneous Poisson process, while for the second, we use a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time- discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks, and show significant computational benefits over state-of-the-art MCMC samplers for these models.
@article{RaoTeh2013a,
author = {Rao, V. and Teh, Y. W.},
title = {Fast {MCMC} sampling for {M}arkov jump processes and extensions},
journal = {Journal of Machine Learning Research (JMLR)},
year = {2013},
volume = {14},
pages = {3295-3320}
}
S. Patterson,
Y. W. Teh,
Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex, in Advances in Neural Information Processing Systems (NIPS), 2013.
In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied online. We apply this method to latent Dirichlet allocation in an online setting, and demonstrate that it achieves substantial performance improvements to the state of the art online variational Bayesian methods.
@inproceedings{PatTeh2013a,
author = {Patterson, S. and Teh, Y. W.},
title = {Stochastic Gradient {R}iemannian {L}angevin Dynamics on the Probability Simplex},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2013}
}
X. Zhang,
W. S. Lee,
Y. W. Teh,
Learning with Invariances via Linear Functionals on Reproducing Kernel Hilbert Space, in Advances in Neural Information Processing Systems, 2013.
Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are \emphcompactly encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art.
@inproceedings{ZhaLeeTeh2013a,
author = {Zhang, X. and Lee, W. S. and Teh, Y. W.},
title = {Learning with Invariances via Linear Functionals on Reproducing Kernel {H}ilbert Space},
booktitle = {Advances in Neural Information Processing Systems},
year = {2013}
}
C. Chen,
V. A. Rao,
W. Buntine,
Y. W. Teh,
Dependent Normalized Random Measures, in International Conference on Machine Learning (ICML), 2013.
@inproceedings{CheRaoBun2013a,
author = {Chen, C. and Rao, V. A. and Buntine, W. and Teh, Y. W.},
title = {Dependent Normalized Random Measures},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2013}
}
B. Lakshminarayanan,
Y. W. Teh,
Inferring Ground Truth from Multi-annotator Ordinal Data: A Probabilistic Approach, 2013.
A popular approach for large scale data annotation tasks is crowdsourcing, wherein each data point is labeled by multiple noisy annotators. We consider the problem of inferring ground truth from noisy ordinal labels obtained from multiple annotators of varying and unknown expertise levels. Annotation models for ordinal data have been proposed mostly as extensions of their binary/categorical counterparts and have received little attention in the crowdsourcing literature. We propose a new model for crowdsourced ordinal data that accounts for instance difficulty as well as annotator expertise, and derive a variational Bayesian inference algorithm for parameter estimation. We analyze the ordinal extensions of several state-of-the-art annotator models for binary/categorical labels and evaluate the performance of all the models on two real world datasets containing ordinal query-URL relevance scores, collected through Amazon’s Mechanical Turk. Our results indicate that the proposed model performs better or as well as existing state-of-the-art methods and is more resistant to ‘spammy’ annotators (i.e., annotators who assign labels randomly without actually looking at the instance) than popular baselines such as mean, median, and majority vote which do not account for annotator expertise.
@unpublished{LakTeh2013a,
author = {Lakshminarayanan, B. and Teh, Y. W.},
note = {ArXiv e-prints: 1305.0015},
title = {Inferring Ground Truth from Multi-annotator Ordinal Data: A Probabilistic Approach},
year = {2013}
}
2012
F. Caron,
Y. W. Teh,
Bayesian Nonparametric Models for Ranked Data, in Advances in Neural Information Processing Systems (NIPS), 2012.
We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We then develop a time-varying extension of our model, and apply our model to the New York Times lists of weekly bestselling books.
@inproceedings{CarTeh2012a,
author = {Caron, F. and Teh, Y. W.},
title = {Bayesian Nonparametric Models for Ranked Data},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2012}
}
V. Rao,
Y. W. Teh,
MCMC for Continuous-Time Discrete-State Systems, in Advances in Neural Information Processing Systems (NIPS), 2012.
We propose a simple and novel framework for MCMC inference in continuous-time discrete-state systems with pure jump trajectories. We construct an exact MCMC sampler for such systems by alternately sampling a random discretization of time given a trajectory of the system, and then a new trajectory given the discretization. The first step can be performed efficiently using properties of the Poisson process, while the second step can avail of discrete-time MCMC techniques based on the forward-backward algorithm. We compare our approach to particle MCMC and a uniformization-based sampler, and show its advantages.
@inproceedings{RaoTeh2012a,
author = {Rao, V. and Teh, Y. W.},
title = {{MCMC} for Continuous-Time Discrete-State Systems},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2012}
}
A. Mnih,
Y. W. Teh,
A Fast and Simple Algorithm for Training Neural Probabilistic Language Models, in International Conference on Machine Learning (ICML), 2012.
Neural probabilistic language models (NPLMs) have recently superseded smoothed n-gram models as the best-performing model class for language modelling. Unfortunately, the adoption of NPLMs is held back by their notoriously long training times, which can be measured in weeks even for moderately-sized datasets. These are a consequence of the models being explicitly normalized, which leads to having to consider all words in the vocabulary when computing the log-likelihood gradients. We propose a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. We investigate the behaviour of the algorithm on the Penn Treebank corpus and show that it reduces the training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well. We demonstrate the scalability of the proposed approach by training several neural language models on a 47M-word corpus with a 80K-word vocabulary, obtaining state-of-the-art results in the Microsoft Research Sentence Completion Challenge.
@inproceedings{MniTeh2012a,
author = {Mnih, A. and Teh, Y. W.},
title = {A Fast and Simple Algorithm for Training Neural Probabilistic Language Models},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2012}
}
N. Heess,
D. Silver,
Y. W. Teh,
Actor-Critic Reinforcement Learning with Energy-Based Policies, in JMLR Workshop and Conference Proceedings: EWRL 2012, 2012.
We consider reinforcement learning in Markov decision processes with high dimensional state and action spaces. We parametrize policies using energy-based models (particularly restricted Boltzmann machines), and train them using policy gradient learning. Our approach builds upon Sallans and Hinton (2004), who parameterized value functions using energy-based models, trained using a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high dimensional domains.
@inproceedings{HeeSilTeh2012a,
author = {Heess, N. and Silver, D. and Teh, Y. W.},
title = {Actor-Critic Reinforcement Learning with Energy-Based Policies},
booktitle = {JMLR Workshop and Conference Proceedings: EWRL 2012},
year = {2012}
}
A. Mnih,
Y. W. Teh,
Learning Label Trees for Probabilistic Modelling of Implicit Feedback, in Advances in Neural Information Processing Systems (NIPS), 2012.
@inproceedings{MniTeh2012b,
author = {Mnih, A. and Teh, Y. W.},
title = {Learning Label Trees for Probabilistic Modelling of Implicit Feedback},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2012}
}
L. Elliott,
Y. W. Teh,
Scalable Imputation of Genetic Data with a Discrete Fragmentation-Coagulation Process, in Advances in Neural Information Processing Systems (NIPS), 2012.
We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by their clusters first splitting and then merging. Our model can be thought of as a discrete time analogue of continuous time fragmentation-coagulation processes [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining the same accuracies as in [Teh et al 2011].
@inproceedings{EllTeh2012a,
author = {Elliott, L. and Teh, Y. W.},
title = {Scalable Imputation of Genetic Data with a Discrete Fragmentation-Coagulation Process},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2012}
}
2011
V. Rao,
Y. W. Teh,
Gaussian Process Modulated Renewal Processes, in Advances in Neural Information Processing Systems (NIPS), 2011.
Renewal processes are generalizations of the Poisson process on the real line, whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these distributions to vary with time, allowing the introduction nonstationarity. In this work, we take a nonparametric Bayesian approach, modeling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, allowing us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets.
@inproceedings{RaoTeh2011b,
author = {Rao, V. and Teh, Y. W.},
title = {Gaussian Process Modulated Renewal Processes},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2011}
}
V. Rao,
Y. W. Teh,
Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks, in Uncertainty in Artificial Intelligence (UAI), 2011.
Markov jump processes and continuous time Bayesian networks are important classes of continuous time dynamical systems. In this paper, we tackle the problem of inferring unobserved paths in these models by introducing a fast auxiliary variable Gibbs sampler. Our approach is based on the idea of uniformization, and sets up a Markov chain over paths by sampling a finite set of virtual jump times and then running a standard hidden Markov model forward filtering-backward sampling algorithm over states at the set of extant and virtual jump times. We demonstrate significant computational benefits over a state-of-the-art Gibbs sampler on a number of continuous time Bayesian networks.
@inproceedings{RaoTeh2011a,
author = {Rao, V. and Teh, Y. W.},
title = {Fast {MCMC} sampling for {M}arkov jump processes and continuous time {B}ayesian networks},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2011}
}
D. Görür,
Y. W. Teh,
Concave-Convex Adaptive Rejection Sampling, jcgs, 2011.
We describe a method for generating independent samples from univariate density functions using adaptive rejection sampling without the log-concavity requirement. The method makes use of the fact that many functions can be expressed as a sum of concave and convex functions. Using a concave-convex decomposition, we bound the log-density by separately bounding the concave and convex parts using piecewise linear functions. The upper bound can then be used as the proposal distribution in rejection sampling. We demonstrate the applicability of the concave-convex approach on a number of standard distributions and describe an application to the efficient construction of sequential Monte Carlo proposal distributions for inference over genealogical trees. Computer code for the proposed algorithms is available online.
@article{GorTeh2011a,
author = {{G\"or\"ur}, D. and Teh, Y. W.},
title = {Concave-Convex Adaptive Rejection Sampling},
journal = jcgs,
year = {2011},
doi = {10.1198/jcgs.2011.09058}
}
C. Blundell,
Y. W. Teh,
K. A. Heller,
Discovering Non-binary Hierarchical Structures with Bayesian Rose Trees, in Mixture Estimation and Applications, C. P. Robert, K. Mengersen, and M. Titterington, Eds. John Wiley & Sons, 2011.
@incollection{BluTehHel2011a,
author = {Blundell, C. and Teh, Y. W. and Heller, K. A.},
title = {Discovering Non-binary Hierarchical Structures with {B}ayesian Rose Trees},
booktitle = {Mixture Estimation and Applications},
publisher = {John Wiley \& Sons},
year = {2011},
editor = {Robert, C. P. and Mengersen, K. and Titterington, M.}
}
R. Silva,
C. Blundell,
Y. W. Teh,
Mixed Cumulative Distribution Networks, in Artificial Intelligence and Statistics (AISTATS), 2011.
Directed acyclic graphs (DAGs) are a popular framework to express multivariate probability distributions. Acyclic directed mixed graphs (ADMGs) are generalizations of DAGs that can succinctly capture much richer sets of conditional independencies, and are especially useful in modeling the effects of latent variables implicitly. Unfortunately, there are currently no parameterizations of general ADMGs. In this paper, we apply recent work on cumulative distribution networks and copulas to propose one general construction for ADMG models. We consider a simple parameter estimation approach, and report some encouraging experimental results.
@inproceedings{SilBluTeh2011a,
author = {Silva, R. and Blundell, C. and Teh, Y. W.},
title = {Mixed Cumulative Distribution Networks},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2011}
}
Y. W. Teh,
C. Blundell,
L. T. Elliott,
Modelling Genetic Variations with Fragmentation-Coagulation Processes, in Advances in Neural Information Processing Systems (NIPS), 2011.
We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data.
@inproceedings{TehBluEll2011a,
author = {Teh, Y. W. and Blundell, C. and Elliott, L. T.},
title = {Modelling Genetic Variations with Fragmentation-Coagulation Processes},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2011}
}
F. Wood,
J. Gasthaus,
C. Archambeau,
L. James,
Y. W. Teh,
The Sequence Memoizer, Communications of the Association for Computing Machines, vol. 54, no. 2, 91–98, 2011.
Probabilistic models of sequences play a central role in most machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications to name but a few. Unfortunately, real-world sequence data often exhibit long range dependencies which can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibits power-law properties, yet common sequence models do not capture such properties. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long range dependencies and power-law characteristics, while remaining computationally attractive. Its utility as a language model and general purpose lossless compressor is demonstrated.
@article{WooGasArc2011a,
author = {Wood, F. and Gasthaus, J. and Archambeau, C. and James, L. and Teh, Y. W.},
title = {The Sequence Memoizer},
journal = {Communications of the Association for Computing Machines},
year = {2011},
volume = {54},
number = {2},
pages = {91-98}
}
M. Welling,
Y. W. Teh,
Bayesian Learning via Stochastic Gradient Langevin Dynamics, in International Conference on Machine Learning (ICML), 2011.
In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small mini-batches. By adding the right amount of noise to a standard stochastic gradient optimization al- gorithm we show that the iterates will con- verge to samples from the true posterior dis- tribution as we anneal the stepsize. This seamless transition between optimization and Bayesian posterior sampling provides an in- built protection against overfitting. We also propose a practical method for Monte Carlo estimates of posterior statistics which moni- tors a “sampling threshold” and collects sam- ples after it has been surpassed. We apply the method to three models: a mixture of Gaussians, logistic regression and ICA with natural gradients.
@inproceedings{WelTeh2011a,
author = {Welling, M. and Teh, Y. W.},
title = {{B}ayesian Learning via Stochastic Gradient {L}angevin Dynamics},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2011}
}
2010
C. Blundell,
Y. W. Teh,
K. A. Heller,
Bayesian Rose Trees, in Uncertainty in Artificial Intelligence (UAI), 2010.
Hierarchical structure is ubiquitous in data across many domains. There are many hierarchical clustering methods, frequently used by domain experts, which strive to discover this structure. However, most of these methods limit discoverable hierarchies to those with binary branching structure. This limitation, while computationally convenient, is often undesirable. In this paper we explore a Bayesian hierarchical clustering algorithm that can produce trees with arbitrary branching structure at each node, known as rose trees. We interpret these trees as mixtures over partitions of a data set, and use a computationally efficient, greedy agglomerative algorithm to find the rose trees which have high marginal likelihood given the data. Lastly, we perform experiments which demonstrate that rose trees are better models of data than the typical binary trees returned by other hierarchical clustering algorithms.
@inproceedings{BluTehHel2010a,
author = {Blundell, C. and Teh, Y. W. and Heller, K. A.},
title = {{B}ayesian Rose Trees},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2010}
}
J. Gasthaus,
Y. W. Teh,
Improvements to the Sequence Memoizer, in Advances in Neural Information Processing Systems (NIPS), 2010.
The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the mysterious coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements.
@inproceedings{GasTeh2010a,
author = {Gasthaus, J. and Teh, Y. W.},
title = {Improvements to the Sequence Memoizer},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2010}
}
Y. W. Teh,
M. I. Jordan,
Hierarchical Bayesian Nonparametric Models with Applications, in Bayesian Nonparametrics, N. Hjort, C. Holmes, P. Müller, and S. Walker, Eds. Cambridge University Press, 2010.
Hierarchical modeling is a fundamental concept in Bayesian statistics. The basic idea is that parameters are endowed with distributions which may themselves introduce new parameters, and this construction recurses. In this review we discuss the role of hierarchical modeling in Bayesian nonparametrics, focusing on models in which the infinite-dimensional parameters are treated hierarchically. For example, we consider a model in which the base measure for a Dirichlet process is itself treated as a draw from another Dirichlet process. This yields a natural recursion that we refer to as a hierarchical Dirichlet process. We also discuss hierarchies based on the Pitman-Yor process and on completely random processes. We demonstrate the value of these hierarchical constructions in a wide range of practical applications, in problems in computational biology, computer vision and natural language processing.
@incollection{TehJor2010a,
author = {Teh, Y. W. and Jordan, M. I.},
title = {Hierarchical {B}ayesian Nonparametric Models with Applications},
booktitle = {Bayesian Nonparametrics},
publisher = {Cambridge University Press},
year = {2010},
editor = {Hjort, N. and Holmes, C. and Müller, P. and Walker, S.},
annote = {Technical Report 770. Department of Statistics, University of California
at Berkeley.
}
}
Y. W. Teh,
Dirichlet Processes, in Encyclopedia of Machine Learning, Springer, 2010.
@incollection{Teh2010a,
author = {Teh, Y. W.},
title = {{D}irichlet Processes},
booktitle = {Encyclopedia of Machine Learning},
publisher = {Springer},
year = {2010}
}
P. Orbanz,
Y. W. Teh,
Bayesian Nonparametric Models, in Encyclopedia of Machine Learning, Springer, 2010.
@incollection{OrbTeh2010a,
author = {Orbanz, P. and Teh, Y. W.},
title = {{B}ayesian Nonparametric Models},
booktitle = {Encyclopedia of Machine Learning},
publisher = {Springer},
year = {2010}
}
J. Gasthaus,
F. Wood,
Y. W. Teh,
Lossless compression based on the Sequence Memoizer, in Data Compression Conference, 2010.
In this work we describe a sequence compression method based on combining a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al. [2009] in the context of language modelling, allows modelling of long-range dependencies by allowing conditioning contexts of unbounded length. We show that incremental approximate inference can be performed in this model, thereby allowing it to be used in a text compression setting. The resulting compressor reliably outperforms several PPM variants on many types of data, but is particularly effective in compressing data that exhibits power law properties.
@inproceedings{GasWooTeh2010a,
author = {Gasthaus, J. and Wood, F. and Teh, Y. W.},
title = {Lossless compression based on the Sequence Memoizer},
booktitle = {Data Compression Conference},
year = {2010}
}
2009
V. Rao,
Y. W. Teh,
Spatial Normalized Gamma Processes, in Advances in Neural Information Processing Systems (NIPS), 2009, vol. 22.
Dependent Dirichlet processes (DPs) are dependent sets of random measures, each being marginally Dirichlet process distributed. They are used in Bayesian nonparametric models when the usual exchangebility assumption does not hold. We propose a simple and general framework to construct dependent DPs by marginalizing and normalizing a single gamma process over an extended space. The result is a set of DPs, each located at a point in a space such that neighboring DPs are more dependent. We describe Markov chain Monte Carlo inference, involving the typical Gibbs sampling and three different Metropolis-Hastings proposals to speed up convergence. We report an empirical study of convergence speeds on a synthetic dataset and demonstrate an application of the model to topic modeling through time.
@inproceedings{RaoTeh2009a,
author = {Rao, V. and Teh, Y. W.},
title = {Spatial Normalized Gamma Processes},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2009},
volume = {22}
}
F. Wood,
Y. W. Teh,
A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation, in Artificial Intelligence and Statistics (AISTATS), 2009.
In this paper we present a doubly hierarchical Pitman-Yor process language model. Its bottom layer of hierarchy consists of multiple hierarchical Pitman-Yor process language models, one each for some number of domains. The novel top layer of hierarchy consists of a mechanism to couple together multiple language models such that they share statistical strength. Intuitively this sharing results in the ?adaptation? of a latent shared language model to each domain. We introduce a general formalism capable of describing the overall model which we call the graphical Pitman-Yor process and explain how to perform Bayesian inference in it. We present encouraging language model domain adaptation results that both illustrate the potential benefits of our new model and suggest new avenues of inquiry.
@inproceedings{WooTeh2009a,
author = {Wood, F. and Teh, Y. W.},
title = {A Hierarchical Nonparametric {B}ayesian Approach to Statistical Language Model Domain Adaptation},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2009}
}
D. M. Roy,
Y. W. Teh,
The Mondrian Process, in Advances in Neural Information Processing Systems (NIPS), 2009, vol. 21.
We describe a novel stochastic process that can be used to construct a multidimensional generalization of the stick-breaking process and which is related to the classic stick breaking process described by [Sethuraman 1994] in one dimension. We describe how the process can be applied to relational data modeling using the de Finetti representation for infinitely and partially exchangeable arrays.
@inproceedings{RoyTeh2009a,
author = {Roy, D. M. and Teh, Y. W.},
title = {The {M}ondrian Process},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2009},
volume = {21}
}
D. Görür,
Y. W. Teh,
An Efficient Sequential Monte-Carlo Algorithm for Coalescent Clustering, in Advances in Neural Information Processing Systems (NIPS), 2009, vol. 21.
We propose an efficient sequential Monte Carlo inference scheme for the recently proposed coalescent clustering model (Teh et al, 2008). Our algorithm has a quadratic runtime while those in (Teh et al, 2008) is cubic. In experiments, we were surprised to find that in addition to being more efficient, it is also a better sequential Monte Carlo sampler than the best in (Teh et al, 2008), when measured in terms of variance of estimated likelihood and effective sample size.
@inproceedings{GorTeh2009a,
author = {{G\"or\"ur}, D. and Teh, Y. W.},
title = {An Efficient Sequential {M}onte-{C}arlo Algorithm for Coalescent Clustering},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2009},
volume = {21}
}
Y. W. Teh,
D. Görür,
Indian Buffet Processes with Power-law Behavior, in Advances in Neural Information Processing Systems (NIPS), 2009, vol. 22.
The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models. In this paper we propose a three-parameter generalization of the IBP exhibiting power-law behavior. We achieve this by generalizing the beta process (the de Finetti measure of the IBP) to the stable-beta process and deriving the IBP corresponding to it. We find interesting relationships between the stable-beta process and the Pitman-Yor process (another stochastic process used in Bayesian nonparametric models with interesting power-law properties). We show that our power-law IBP is a good model for word occurrences in documents with improved performance over the normal IBP.
@inproceedings{TehGor2009a,
author = {Teh, Y. W. and {G\"or\"ur}, D.},
title = {{I}ndian Buffet Processes with Power-law Behavior},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2009},
volume = {22}
}
K. A. Heller,
Y. W. Teh,
D. Görür,
Infinite Hierarchical Hidden Markov Models, in Artificial Intelligence and Statistics (AISTATS), 2009, vol. 5.
In this paper we present the Infinite Hierarchical Hidden Markov Model (IHHMM), a nonparametric generalization of Hierarchical Hidden Markov Models (HHMMs). HHMMs have been used for modeling sequential data in applications such as speech recognition, detecting topic transitions in video and extracting information from text. The IHHMM provides more flexible modeling of sequential data by allowing a potentially unbounded number of levels in the hierarchy, instead of requiring the specification of a fixed hierarchy depth. Inference and learning are performed efficiently using Gibbs sampling and a modified forward-backtrack algorithm. We show encouraging demonstrations of the workings of the IHHMM.
@inproceedings{HelTehGor2009a,
author = {Heller, K. A. and Teh, Y. W. and {G\"or\"ur}, D.},
title = {Infinite Hierarchical Hidden {M}arkov Models},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2009},
volume = {5}
}
F. Doshi,
K. T. Miller,
J. Van Gael,
Y. W. Teh,
Variational Inference for the Indian Buffet Process, in Artificial Intelligence and Statistics (AISTATS), 2009, vol. 5.
The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of several hidden features. For example, images may be composed of several objects or sounds may consist of several notes. Latent feature models seek to infer what these latent features from a set of observations. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, in practice, samplers for the IBP tend to mix slow. We develop a deterministic variational method for the IBP. We provide theoretical guarantees on its truncation bounds and demonstrate its superior performance for high dimensional data sets.
@inproceedings{DosMilVan2009a,
author = {Doshi, F. and Miller, K. T. and {Van Gael}, J. and Teh, Y. W.},
title = {Variational Inference for the {I}ndian Buffet Process},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2009},
volume = {5}
}
J. Van Gael,
Y. W. Teh,
Z. Ghahramani,
The Infinite Factorial Hidden Markov Model, in Advances in Neural Information Processing Systems (NIPS), 2009, vol. 21.
We introduces a new probability distribution over a potentially infinite number of binary Markov chains which we call the Markov Indian buffet process. This process extends the IBP to allow temporal dependencies in the hidden variables. We use this stochastic process to build a nonparametric extension of the factorial hidden Markov model. After working out an inference scheme which combines slice sampling and dynamic programming we demonstrate how the infinite factorial hidden Markov model can be used for blind source separation.
@inproceedings{VanTehGha2009a,
author = {{Van Gael}, J. and Teh, Y. W. and Ghahramani, Z.},
title = {The Infinite Factorial Hidden {M}arkov Model},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2009},
volume = {21}
}
J. Gasthaus,
F. Wood,
D. Görür,
Y. W. Teh,
Dependent Dirichlet Process Spike Sorting, in Advances in Neural Information Processing Systems (NIPS), 2009, vol. 21, 497–504.
In this paper we propose a new incremental spike sorting model that automatically eliminates refractory period violations, accounts for action potential waveform drift, and can handle “appearance” and “disappearance” of neurons. Our approach is to augment a known time-varying Dirichlet process that ties together a sequence of infinite Gaussian mixture models, one per action potential waveform observation, with an interspike-interval-dependent likelihood that prohibits refractory period violations. We demonstrate this model by showing results from sorting two publicly available neural data recordings for which the a partial ground truth labeling is known.
@inproceedings{GasWooGor2009a,
author = {Gasthaus, J. and Wood, F. and {G\"or\"ur}, D. and Teh, Y. W.},
title = {Dependent {D}irichlet Process Spike Sorting},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2009},
volume = {21},
pages = {497-504}
}
F. Wood,
C. Archambeau,
J. Gasthaus,
L. F. James,
Y. W. Teh,
A Stochastic Memoizer for Sequence Data, in International Conference on Machine Learning (ICML), 2009, vol. 26, 1129–1136.
We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.
@inproceedings{WooArcGas2009a,
author = {Wood, F. and Archambeau, C. and Gasthaus, J. and James, L. F. and Teh, Y. W.},
title = {A Stochastic Memoizer for Sequence Data},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2009},
volume = {26},
pages = {1129-1136}
}
G. R. Haffari,
Y. W. Teh,
Hierarchical Dirichlet Trees for Information Retrieval, in Proceedings of the Annual Meeting of the North American Association for Computational Linguistics and the Human Language Technology Conference, 2009.
We propose a principled probabilisitc framework which uses trees over the vocabulary to capture similarities among terms in an information retrieval setting. This allows the retrieval of documents based not just on occurrences of specific query terms, but also on similarities between terms (an effect similar to query expansion). Additionally our principled generative model exhibits an effect similar to inverse document frequency. We give encouraging experimental evidence of the superiority of the hierarchical Dirichlet tree compared to standard baselines.
@inproceedings{HafTeh2009a,
author = {Haffari, G. R. and Teh, Y. W.},
title = {Hierarchical {D}irichlet Trees for Information Retrieval},
booktitle = {Proceedings of the Annual Meeting of the North American Association for Computational Linguistics and the Human Language Technology Conference},
year = {2009}
}
G. Quon,
Y. W. Teh,
E. Chan,
T. Hughes,
M. Brudno,
Q. Morris,
A Mixture Model for the Evolution of Gene Expression in Non-homogeneous Datasets, in Advances in Neural Information Processing Systems (NIPS), 2009, vol. 21.
We address the challenge of assessing conservation of gene expression in complex, non-homogeneous datasets. Recent studies have demonstrated the success of probabilistic models in studying the evolution of gene expression in simple eukaryotic organisms such as yeast, for which measurements are typically scalar and independent. Models capable of studying expression evolution in much more complex organisms such as vertebrates are particularly important given the medical and scientific interest in species such as human and mouse. We present a statistical model that makes a number of significant extensions to previous models to enable characterization of changes in expression among highly complex organisms. We demonstrate the efficacy of our method on a microarray dataset containing diverse tissues from multiple vertebrate species. We anticipate that the model will be invaluable in the study of gene expression patterns in other diverse organisms as well, such as worms and insects.
@inproceedings{QuoTehCha2009a,
author = {Quon, G. and Teh, Y. W. and Chan, E. and Hughes, T. and Brudno, M. and Morris, Q.},
title = {A Mixture Model for the Evolution of Gene Expression in Non-homogeneous Datasets},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2009},
volume = {21}
}
A. Asuncion,
M. Welling,
P. Smyth,
Y. W. Teh,
On Smoothing and Inference for Topic Models, in Uncertainty in Artificial Intelligence (UAI), 2009.
Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.
@inproceedings{AsuWelSmy2009a,
author = {Asuncion, A. and Welling, M. and Smyth, P. and Teh, Y. W.},
title = {On Smoothing and Inference for Topic Models},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2009}
}
2008
H. L. Chieu,
W. S. Lee,
Y. W. Teh,
Cooled and Relaxed Survey Propagation for MRFs, in Advances in Neural Information Processing Systems (NIPS), 2008, vol. 20.
We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem.
@inproceedings{ChiLeeTeh2008a,
author = {Chieu, H. L. and Lee, W. S. and Teh, Y. W.},
title = {Cooled and Relaxed Survey Propagation for {MRFs}},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2008},
volume = {20}
}
Y. W. Teh,
H. Daume III,
D. M. Roy,
Bayesian Agglomerative Clustering with Coalescents, in Advances in Neural Information Processing Systems (NIPS), 2008, vol. 20.
We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman’s coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over the state-of-the-art, and demonstrate our approach in document clustering and phylolinguistics.
@inproceedings{TehDauRoy2008a,
author = {Teh, Y. W. and {Daume III}, H. and Roy, D. M.},
title = {{B}ayesian Agglomerative Clustering with Coalescents},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2008},
volume = {20}
}
J. Van Gael,
Y. Saatci,
Y. W. Teh,
Z. Ghahramani,
Beam Sampling for the Infinite Hidden Markov Model, in International Conference on Machine Learning (ICML), 2008, vol. 25.
The infinite hidden Markov model is a nonparametric extension of the widely used hidden Markov model. Our paper introduces a new inference algorithm for the infinite Hidden Markov model called beam sampling. Beam sampling combines slice sampling, which limits the number of states considered at each time step to a finite number, with dynamic programming, which samples whole state trajectories efficiently. Our algorithm typically outperforms the Gibbs sampler and is more robust. We present applications of iHMM inference using the beam sampler on changepoint detection and text prediction problems.
@inproceedings{VanSaaTeh2008a,
author = {{Van Gael}, J. and Saatci, Y. and Teh, Y. W. and Ghahramani, Z.},
title = {Beam Sampling for the Infinite Hidden {M}arkov Model},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2008},
volume = {25}
}
M. Welling,
Y. W. Teh,
H. J. Kappen,
Hybrid Variational/Gibbs Collapsed Inference in Topic Models, in Uncertainty in Artificial Intelligence (UAI), 2008, vol. 24.
Variational Bayesian inference and (collapsed) Gibbs sampling are the two important classes of inference algorithms for Bayesian networks. Both have their advantages and disadvantages: collapsed Gibbs sampling is unbiased but is also inefficient for large count values and requires averaging over many samples to reduce variance. On the other hand, variational Bayesian inference is efficient and accurate for large count values but suffers from bias for small counts. We propose a hybrid algorithm that combines the best of both worlds: it samples very small counts and applies variational updates to large counts. This hybridization is shown to significantly improve testset perplexity relative to variational inference at no computational cost.
@inproceedings{WelTehKap2008a,
author = {Welling, M. and Teh, Y. W. and Kappen, H. J.},
title = {Hybrid {Variational/Gibbs} Collapsed Inference in Topic Models},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2008},
volume = {24}
}
Y. W. Teh,
K. Kurihara,
M. Welling,
Collapsed Variational Inference for HDP, in Advances in Neural Information Processing Systems (NIPS), 2008, vol. 20.
A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the
hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy.
@inproceedings{TehKurWel2008a,
author = {Teh, Y. W. and Kurihara, K. and Welling, M.},
title = {Collapsed Variational Inference for {HDP}},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2008},
volume = {20}
}
2007
K. Kurihara,
M. Welling,
Y. W. Teh,
Collapsed Variational Dirichlet Process Mixture Models, in Proceedings of the International Joint Conference on Artificial Intelligence, 2007, vol. 20.
Nonparametric Bayesian mixture models, in particular Dirichlet process (DP) mixture models, have shown great promise for density estimation and data clustering. Given the size of today’s datasets, computational efficiency becomes an essential ingredient in the applicability of these techniques to real world data. We study and experimentally compare a number of variational Bayesian (VB) approximations to the DP mixture model. In particular we consider the standard VB approximation where parameters are assumed to be independent from cluster assignment variables, and a novel collapsed VB approximation where mixture weights are marginalized out. For both VB approximations we consider two different ways to approximate the DP, by truncating the stick-breaking construction, and by using a finite mixture model with a symmetric
Dirichlet prior.
@inproceedings{KurWelTeh2007a,
author = {Kurihara, K. and Welling, M. and Teh, Y. W.},
title = {Collapsed Variational {D}irichlet Process Mixture Models},
booktitle = {Proceedings of the International Joint Conference on Artificial Intelligence},
year = {2007},
volume = {20}
}
Y. W. Teh,
D. Görür,
Z. Ghahramani,
Stick-breaking Construction for the Indian Buffet Process, in Artificial Intelligence and Statistics (AISTATS), 2007, vol. 11.
The Indian buffet process (IBP) is a Bayesian nonparametric distribution whereby objects are modelled using an unbounded number of latent features. In this paper we derive a stick-breaking representation for the IBP. Based on this new representation, we develop slice samplers for the IBP that are efficient, easy to implement and are more generally applicable than the currently available Gibbs sampler. This representation, along with the work of Thibaux and Jordan, also illuminates interesting theoretical connections between the IBP, Chinese restaurant processes, Beta processes and Dirichlet processes.
@inproceedings{TehGorGha2007a,
author = {Teh, Y. W. and {G\"or\"ur}, D. and Ghahramani, Z.},
title = {Stick-breaking Construction for the {I}ndian Buffet Process},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2007},
volume = {11}
}
J. F. Cai,
W. S. Lee,
Y. W. Teh,
NUS-ML: Improving Word Sense Disambiguation Using Topic Features, in Proceedings of the International Workshop on Semantic Evaluations, 2007, vol. 4.
We participated in SemEval-1 English coarse-grained all-words task (task 7), English fine-grained all-words task (task 17, subtask 3) and English coarse-grained lexical sample task (task 17, subtask 1). The same method with different labeled data is used for the tasks; SemCor is the labeled corpus used to train our system for the all words tasks while the labeled corpus that is provided is used for the lexical sample task. The knowledge sources include part-of-speech of neighboring words, single words in the surrounding context, local collocations, and syntactic patterns. In addition, we constructed a topic feature, targeted to capture the global context information, using the latent dirichlet allocation (LDA) algorithm with unlabeled corpus. A modified naive Bayes classifier is constructed to incorporate all the features. We achieved 81.6%, 57.6%, 88.7% for coarse-grained all words task, fine-grained all-words task and coarse-grained lexical sample task respectively.
@inproceedings{CaiLeeTeh2007a,
author = {Cai, J. F. and Lee, W. S. and Teh, Y. W.},
title = {{NUS-ML}: Improving Word Sense Disambiguation Using Topic Features},
booktitle = {Proceedings of the International Workshop on Semantic Evaluations},
year = {2007},
volume = {4}
}
Y. J. Lim,
Y. W. Teh,
Variational Bayesian Approach to Movie Rating Prediction, in Proceedings of KDD Cup and Workshop, 2007.
Singular value decomposition (SVD) is a matrix decomposition algorithm that returns the optimal (in the sense of squared error) low-rank decomposition of a matrix. SVD has found widespread use across a variety of machine learning applications, where its output is interpreted as compact and informative representations of data. The Netflix Prize challenge, and collaborative filtering in general, is an ideal application for SVD, since the data is a matrix of ratings given by users to movies. It is thus not surprising to observe that most currently successful teams use SVD, either with an extension, or to interpolate with results returned by other algorithms. Unfortunately SVD can easily overfit due to the extreme data sparsity of the matrix in the Netflix Prize challenge, and care must be taken to regularize properly.
In this paper, we propose a Bayesian approach to alleviate overfitting in SVD, where priors are introduced and all parameters are integrated out using variational inference. We show experimentally that this gives significantly improved results over vanilla SVD. For truncated SVDs of rank 5, 10, 20, and 30, our proposed Bayesian approach achieves 2.2% improvement over a na¨ıve approach, 1.6% improvement over a gradient descent approach dealing with unobserved entries properly, and 0.9% improvement over a maximum a posteriori (MAP) approach.
@inproceedings{LimTeh2007a,
author = {Lim, Y. J. and Teh, Y. W.},
title = {Variational {B}ayesian Approach to Movie Rating Prediction},
booktitle = {Proceedings of KDD Cup and Workshop},
year = {2007}
}
J. F. Cai,
W. S. Lee,
Y. W. Teh,
Improving Word Sense Disambiguation Using Topic Features, in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-coNLL), 2007.
This paper presents a novel approach for exploiting the global context for the task of word sense disambiguation (WSD). This is done by using topic features constructed using the latent dirichlet allocation (LDA) algorithm on unlabeled data. The features are incorporated into a modified naive Bayes network alongside other features such as part-of-speech of neighboring words, single words in the surrounding context, local collocations, and syntactic patterns. In both the English all-words task and the English lexical sample task, the method achieved significant improvement over the simple naive Bayes classifier and higher accuracy than the best official scores on Senseval-3 for both task.
@inproceedings{CaiLeeTeh2007b,
author = {Cai, J. F. and Lee, W. S. and Teh, Y. W.},
title = {Improving Word Sense Disambiguation Using Topic Features},
booktitle = {Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-coNLL)},
year = {2007}
}
Y. W. Teh,
D. Newman,
M. Welling,
A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation, in Advances in Neural Information Processing Systems (NIPS), 2007, vol. 19, 1353–1360.
Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
@inproceedings{TehNewWel2007a,
author = {Teh, Y. W. and Newman, D. and Welling, M.},
title = {A Collapsed Variational {B}ayesian Inference Algorithm for Latent {D}irichlet Allocation},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2007},
volume = {19},
pages = {1353-1360}
}
2006
Y. W. Teh,
A Hierarchical Bayesian Language Model based on Pitman-Yor Processes, in Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 2006, 985–992.
@inproceedings{Teh2006b,
author = {Teh, Y. W.},
title = {A Hierarchical {B}ayesian Language Model based on {P}itman-{Y}or Processes},
booktitle = {Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics},
year = {2006},
pages = {985-992}
}
Y. W. Teh,
A Bayesian Interpretation of Interpolated Kneser-Ney, School of Computing, National University of Singapore, TRA2/06, 2006.
Interpolated Kneser-Ney is one of the best smoothing methods for n-gram language models. Previous explanations for its superiority have been based on intuitive and empirical justifications of specific properties of the method. We propose a novel interpretation of interpolated Kneser-Ney as approximate inference in a hierarchical Bayesian model consisting of Pitman-Yor processes. As opposed to past explanations, our interpretation can recover exactly the formulation of interpolated Kneser-Ney, and performs better than interpolated Kneser-Ney when a better inference procedure is used.
@techreport{Teh2006a,
author = {Teh, Y. W.},
title = {A {B}ayesian Interpretation of Interpolated {K}neser-{N}ey},
institution = {School of Computing, National University of Singapore},
year = {2006},
number = {TRA2/06}
}
E. P. Xing,
K. Sohn,
M. I. Jordan,
Y. W. Teh,
Bayesian Multi-population Haplotype Inference via a Hierarchical Dirichlet process mixture, in International Conference on Machine Learning (ICML), 2006, vol. 23.
Uncovering the haplotypes of single nucleotide polymorphisms and their population demography is essential for many biological and medical applications. Methods for haplotype inference developed thus far—including methods based on coalescence, finite and infinite mixtures, and maximal parsimony— ignore the underlying population structure in the genotype data. As noted by Pritchard (2001), different populations can share certain portion of their genetic ancestors, as well as have their own genetic components through migration and diversification. In this paper, we address the problem of multipopulation haplotype inference. We capture cross-population structure using a nonparametric Bayesian prior known as the hierarchical Dirichlet process (HDP) (Teh et al., 2006), conjoining this prior with a recently developed Bayesian methodology for haplotype phasing known as DP-Haplotyper (Xing et al., 2004). We also develop an efficient sampling algorithm for the HDP based on a two-level nested Polya urn scheme. We show that our model outperforms extant algorithms on both simulated and real biological data.
@inproceedings{XinSohJor2006a,
author = {Xing, E. P. and Sohn, K. and Jordan, M. I. and Teh, Y. W.},
title = {{B}ayesian Multi-population Haplotype Inference via a Hierarchical {D}irichlet process mixture},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2006},
volume = {23}
}
Y. W. Teh,
M. I. Jordan,
M. J. Beal,
D. M. Blei,
Hierarchical Dirichlet Processes, Journal of the American Statistical Association, vol. 101, no. 476, 1566–1581, 2006.
We consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stick-breaking process, and a generalization of the Chinese restaurant process that we refer to as the “Chinese restaurant franchise.” We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures, and describe applications to problems in information retrieval and text modelling.
@article{TehJorBea2006a,
author = {Teh, Y. W. and Jordan, M. I. and Beal, M. J. and Blei, D. M.},
title = {Hierarchical {D}irichlet Processes},
journal = {Journal of the American Statistical Association},
year = {2006},
volume = {101},
number = {476},
pages = {1566-1581}
}
G. E. Hinton,
S. Osindero,
Y. W. Teh,
A Fast Learning Algorithm for Deep Belief Networks, Neural Computation, vol. 18, no. 7, 1527–1554, 2006.
We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
@article{HinOsiTeh2006a,
author = {Hinton, G. E. and Osindero, S. and Teh, Y. W.},
title = {A Fast Learning Algorithm for Deep Belief Networks},
journal = {Neural Computation},
year = {2006},
volume = {18},
number = {7},
pages = {1527-1554}
}
G. E. Hinton,
S. Osindero,
M. Welling,
Y. W. Teh,
Unsupervised Discovery of Non-linear Structure Using Contrastive Backpropagation, Cognitive Science, vol. 30, no. 4, 725–731, 2006.
We describe a way of modeling high-dimensional data vectors by using an unsupervised, nonlinear, multilayer neural network in which the activity of each neuron-like unit makes an additive contribution to a global energy score that indicates how surprised the network is by the data vector. The connection weights that determine how the activity of each unit depends on the activities in earlier layers are learned by minimizing the energy assigned to data vectors that are actually observed and maximizing the energy assigned to “confabulations” that are generated by perturbing an observed data vector in a direction that decreases its energy under the current model.
@article{HinOsiWel2006a,
author = {Hinton, G. E. and Osindero, S. and Welling, M. and Teh, Y. W.},
title = {Unsupervised Discovery of Non-linear Structure Using Contrastive Backpropagation},
journal = {Cognitive Science},
year = {2006},
volume = {30},
number = {4},
pages = {725-731}
}
W. S. Lee,
X. Zhang,
Y. W. Teh,
Semi-supervised Learning in Reproducing Kernel Hilbert Spaces Using Local Invariances, School of Computing, National University of Singapore, TRB3/06, 2006.
We propose a framework for semi-supervised learning in reproducing kernel Hilbert spaces using local invariances that explicitly characterize the behavior of the target function around both labeled and unlabeled data instances. Such invariances include: invariance to small changes to the data instances, invariance to averaging across a small neighbourhood around data instances, and invariance to local transformations such as translation and rotation. These invariances are approximated by minimizing loss functions on derivatives and local averages of the functions. We use a regularized cost function, consisting of the sum of loss functions penalized with the squared norm of the function, and give a representer theorem showing that an optimal function can be represented as a linear combination of a finite number of basis functions. For the representer theorem to hold, the derivatives and local averages are required to be bounded linear functionals in the reproducing kernel Hilbert space. We show that this is true in the reproducing kernel Hilbert spaces defined by Gaussian and polynomial kernels.
@techreport{LeeZhaTeh2006a,
author = {Lee, W. S. and Zhang, X. and Teh, Y. W.},
title = {Semi-supervised Learning in Reproducing Kernel {H}ilbert Spaces Using Local Invariances},
institution = {School of Computing, National University of Singapore},
year = {2006},
number = {TRB3/06}
}
2005
Y. W. Teh,
M. I. Jordan,
M. J. Beal,
D. M. Blei,
Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes, in Advances in Neural Information Processing Systems (NIPS), 2005, vol. 17.
We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in
practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
@inproceedings{TehJorBea2005a,
author = {Teh, Y. W. and Jordan, M. I. and Beal, M. J. and Blei, D. M.},
title = {Sharing Clusters among Related Groups: Hierarchical {D}irichlet Processes},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2005},
volume = {17}
}
J. Edwards,
Y. W. Teh,
D. A. Forsyth,
R. Bock,
M. Maire,
G. Vesom,
Making Latin Manuscripts Searchable using gHMM’s, in Advances in Neural Information Processing Systems (NIPS), 2005, vol. 17.
We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. A generalized HMM is fitted, using transcribed latin to obtain a transition model and one example each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. Furthermore, our model produces fair searches on a document from which we obtained no training data.
@inproceedings{EdwTehFor2005a,
author = {Edwards, J. and Teh, Y. W. and Forsyth, D. A. and Bock, R. and Maire, M. and Vesom, G.},
title = {Making Latin Manuscripts Searchable using {gHMM}'s},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2005},
volume = {17}
}
Y. W. Teh,
M. Seeger,
M. I. Jordan,
Semiparametric Latent Factor Models, in Artificial Intelligence and Statistics (AISTATS), 2005, vol. 10.
We propose a semiparametric model for regression problems involving multiple response variables. The model makes use of a set of Gaussian processes that are linearly mixed to capture dependencies that may exist among the response variables. We propose an efficient approximate inference scheme for this semiparametric model whose complexity is linear in the number of training data points. We present experimental results in the domain of multi-joint robot arm dynamics.
@inproceedings{TehSeeJor2005a,
author = {Teh, Y. W. and Seeger, M. and Jordan, M. I.},
title = {Semiparametric Latent Factor Models},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2005},
volume = {10}
}
M. Seeger,
Y. W. Teh,
M. I. Jordan,
Semiparametric Latent Factor Models, Division of Computer Science, University of California at Berkeley, 2005.
We propose a semiparametric model for regression and classification problems involving multiple response variables. The model makes use of a set of Gaussian processes to model the relationship to the inputs in a nonparametric fashion. Conditional dependencies between the responses can be captured through a linear mixture of the driving processes. This feature becomes important if some of the responses of predictive interest are less densely supplied by observed data than related auxiliary ones. We propose an efficient approximate inference scheme for this semiparametric model whose complexity is linear in the number of training data points.
@techreport{SeeTehJor2005a,
author = {Seeger, M. and Teh, Y. W. and Jordan, M. I.},
title = {Semiparametric Latent Factor Models},
institution = {Division of Computer Science, University of California at Berkeley},
year = {2005}
}
M. Welling,
T. Minka,
Y. W. Teh,
Structured Region Graphs: Morphing EP into GBP, in Uncertainty in Artificial Intelligence (UAI), 2005, vol. 21.
GBP and EP are two successful algorithms for approximate probabilistic inference, which are based on different approximation strategies. An open problem in both algorithms has been how to choose an appropriate approximation structure. We introduce ’structured region graphs’, a formalism which marries these two strategies, reveals a deep connection between them, and suggests how to choose good approximation structures. In this formalism, each region has an internal structure which defines an exponential family, whose sufficient statistics must be matched by the parent region. Reduction operators on these structures allow conversion between EP and GBP free energies. Thus it is revealed that all EP approximations on discrete variables are special cases of GBP, and conversely that some wellknown GBP approximations, such as overlapping squares, are special cases of EP. Furthermore, region graphs derived from EP have a number of good structural properties, including maxent-normality and overall counting number of one. The result is a convenient framework for producing high-quality approximations with a user-adjustable level of complexity.
@inproceedings{WelMinTeh2005a,
author = {Welling, M. and Minka, T. and Teh, Y. W.},
title = {Structured Region Graphs: Morphing {EP} into {GBP}},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2005},
volume = {21}
}
2004
M. Welling,
M. Rosen-Zvi,
Y. W. Teh,
Approximate Inference by Markov Chains on Union Spaces, in International Conference on Machine Learning (ICML), 2004, vol. 21.
@inproceedings{WelRosTeh2004a,
author = {Welling, M. and Rosen-Zvi, M. and Teh, Y. W.},
title = {Approximate Inference by Markov Chains on Union Spaces},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2004},
volume = {21}
}
M. Welling,
Y. W. Teh,
Linear Response Algorithms for Approximate Inference in Graphical Models, Neural Computation, vol. 16, 197–221, 2004.
Belief propagation (BP) on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. However, it does not prescribe a way to compute joint distributions over pairs of distant nodes in the graph. In this article, we propose two new algorithms for approximating these pairwise probabilities, based on the linear response theorem. The first is a propagation algorithm that is shown to converge if BP converges to a stable fixed point. The second algorithm is based on matrix inversion. Applying these ideas to gaussian random fields, we derive a propagation algorithm for computing the inverse of a matrix.
@article{WelTeh2004a,
author = {Welling, M. and Teh, Y. W.},
title = {Linear Response Algorithms for Approximate Inference in Graphical Models},
journal = {Neural Computation},
year = {2004},
volume = {16},
pages = {197-221}
}
T. Miller,
A. C. Berg,
J. Edwards,
M. Maire,
R. White,
Y. W. Teh,
E. Learned-Miller,
D. A. Forsyth,
Faces and Names in the News, in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2004.
We show quite good face clustering is possible for a dataset of inaccurately and ambiguously labelled face images. Our dataset is 44,773 face images, obtained by applying a face finder to approximately half a million captioned news images. This dataset is more realistic than usual face recognition datasets, because it contains faces captured “in the wild” in a variety of configurations with respect to the camera, taking a variety of expressions, and under illumination of widely varying color. Each face image is associated with a set of names, automatically extracted from the associated caption. Many, but not all such sets contain the
correct name.
We cluster face images in appropriate discriminant coordinates. We use a clustering procedure to break ambiguities in labelling and identify incorrectly labelled faces. A merging procedure then identifies variants of names that refer to the same individual. The resulting representation can be used to label faces in news images or to organize news pictures by individuals present.
An alternative view of our procedure is as a process that cleans up noisy supervised data. We demonstrate how to use entropy measures to evaluate such procedures.
@inproceedings{MilBerEdw2004a,
author = {Miller, T. and Berg, A. C. and Edwards, J. and Maire, M. and White, R. and Teh, Y. W. and Learned-Miller, E. and Forsyth, D. A.},
title = {Faces and Names in the News},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition},
year = {2004}
}
Y. W. Teh,
M. I. Jordan,
M. J. Beal,
D. M. Blei,
Hierarchical Dirichlet Processes, Department of Statistics, University of California at Berkeley, 653, 2004.
We consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stick-breaking process, and a generalization of the Chinese restaurant process that we refer to as the “Chinese restaurant franchise.” We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures, and describe applications to problems in information retrieval and text modelling.
@techreport{TehJorBea2004a,
author = {Teh, Y. W. and Jordan, M. I. and Beal, M. J. and Blei, D. M.},
title = {Hierarchical {D}irichlet Processes},
institution = {Department of Statistics, University of California at Berkeley},
year = {2004},
number = {653}
}
2003
Y. W. Teh,
M. Welling,
S. Osindero,
G. E. Hinton,
Energy-Based Models for Sparse Overcomplete Representations, Journal of Machine Learning Research (JMLR), vol. 4, 1235–1260, 2003.
@article{TehWelOsi2003a,
author = {Teh, Y. W. and Welling, M. and Osindero, S. and Hinton, G. E.},
title = {Energy-Based Models for Sparse Overcomplete Representations},
journal = {Journal of Machine Learning Research (JMLR)},
year = {2003},
volume = {4},
pages = {1235-1260}
}
Y. W. Teh,
Bethe Free Energy and Contrastive Divergence Approximations for Undirected Graphical Models, PhD thesis, Department of Computer Science, University of Toronto, 2003.
As the machine learning community tackles more complex and harder problems, the graphical models needed to solve such problems become larger and more complicated. As a result performing inference and learning exactly for such graphical models become ever more expensive, and approximate inference and learning techniques become ever more prominent.
There are a variety of techniques for approximate inference and learning in the literature. This thesis contributes some new ideas in the products of experts (PoEs) class of models (Hinton, 2002), and the Bethe free energy approximations (Yedidia et al., 2001).
For PoEs, our contribution is in developing new PoE models for continuous-valued domains. We developed RBMrate, a model for discretized continuous-valued data. We applied it to face recognition to demonstrate its abilities. We also developed energy-based models (EBMs) – flexible probabilistic models where the building blocks consist of energy terms computed using a feed-forward network. We show that standard square noiseless independent components analysis (ICA) (Bell and Sejnowski, 1995) can be viewed as a restricted form of EBMs. Extending this relationship with ICA, we describe sparse and over-complete representations of data where the inference process is trivial since it is simply an EBM.
For Bethe free energy approximations, our contribution is a theory relating belief propagation and iterative scaling. We show that both belief propagation and iterative scaling updates can be derived as fixed point equations for constrained minimization of the Bethe free energy. This allows usto develop a new algorithm to directly minimize the Bethe free energy, and to apply
the Bethe free energy to learning in addition to inference. We also describe improvements to the efficiency of standard learning algorithms for undirected graphical models (Jirousek and Preucil, 1995).
@phdthesis{Teh2003a,
author = {Teh, Y. W.},
title = {Bethe Free Energy and Contrastive Divergence Approximations for Undirected Graphical Models},
school = {Department of Computer Science, University of Toronto},
year = {2003}
}
Y. W. Teh,
S. Roweis,
Automatic Alignment of Local Representations, in Advances in Neural Information Processing Systems (NIPS), 2003, vol. 15.
We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a highdimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4].
@inproceedings{TehRow2003a,
author = {Teh, Y. W. and Roweis, S.},
title = {Automatic Alignment of Local Representations},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2003},
volume = {15}
}
Y. W. Teh,
M. Welling,
On Improving the Efficiency of the Iterative Proportional Fitting Procedure, in Artificial Intelligence and Statistics (AISTATS), 2003, vol. 9.
Iterative proportional fitting (IPF) on junction trees is an important tool for learning in graphical models. We identify the propagation and IPF updates on the junction tree as fixed point equations of a single constrained entropy maximization problem. This allows a more efficient message updating protocol than the well known effective IPF of Jirousek and Preucil (1995). When the junction tree has an intractably large maximum clique size we propose to maximize an approximate constrained entropy based on region graphs (Yedidia et al., 2002). To maximize the new objective we propose a “loopy” version of IPF. We show that this yields accurate estimates of the weights of undirected graphical models in a simple experiment.
@inproceedings{TehWel2003a,
author = {Teh, Y. W. and Welling, M.},
title = {On Improving the Efficiency of the Iterative Proportional Fitting Procedure},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2003},
volume = {9}
}
M. Welling,
Y. W. Teh,
Approximate Inference in Boltzmann Machines, Artificial Intelligence, vol. 143, no. 1, 19–50, 2003.
Inference in Boltzmann machines is NP-hard in general. As a result approximations are often necessary. We discuss first order mean field and second order Onsager truncations of the Plefka expansion of the Gibbs free energy. The Bethe free energy is introduced and rewritten as a Gibbs free energy. From there a convergent belief optimization algorithm is derived to minimize the Bethe free energy. An analytic expression for the linear response estimate of the covariances is found which is exact on Boltzmann trees. Finally, a number of theorems is proven concerning the Plefka expansion, relating the first order mean field and the second order Onsager approximation to the Bethe approximation. Experiments compare mean field approximation, Onsager approximation, belief propagation and belief optimization.
@article{WelTeh2003a,
author = {Welling, M. and Teh, Y. W.},
title = {Approximate Inference in {B}oltzmann Machines},
journal = {Artificial Intelligence},
year = {2003},
volume = {143},
number = {1},
pages = {19-50}
}
2002
Y. W. Teh,
M. Welling,
The Unified Propagation and Scaling Algorithm, in Advances in Neural Information Processing Systems (NIPS), 2002, vol. 14.
In this paper we will show that a restricted class of constrained minimum divergence problems, named generalized inference problems, can be solved by approximating the KL divergence with a Bethe free energy. The algorithm we derive is closely related to both loopy belief propagation and iterative scaling. This unified propagation and scaling algorithm reduces to a convergent alternative to loopy belief propagation when no constraints are present. Experiments show the viability of our algorithm.
@inproceedings{TehWel2002a,
author = {Teh, Y. W. and Welling, M.},
title = {The Unified Propagation and Scaling Algorithm},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2002},
volume = {14}
}
S. Kakade,
Y. W. Teh,
S. Roweis,
An Alternate Objective Function for Markovian Fields, in International Conference on Machine Learning (ICML), 2002, vol. 19.
In labelling or prediction tasks, a trained model’s test performance is often based on the quality of its single-time marginal distributions over labels rather than its joint distribution over label sequences. We propose using a new cost function for discriminative learning that more accurately reflects such test time conditions. We present an efficient method to compute the gradient of this cost for Maximum Entropy Markov Models, Conditional Random Fields, and for an extension of these models involving hidden states. Our experimental results show that the new cost can give significant improvements and that it provides a novel and effective way of dealing with the ’label-bias’ problem.
@inproceedings{KakTehRow2002a,
author = {Kakade, S. and Teh, Y. W. and Roweis, S.},
title = {An Alternate Objective Function for {M}arkovian Fields},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2002},
volume = {19}
}
2001
M. Welling,
Y. W. Teh,
Belief Optimization for Binary Networks : A Stable Alternative to Loopy Belief Propagation, in Uncertainty in Artificial Intelligence (UAI), 2001, vol. 17.
We present a novel inference algorithm for arbitrary, binary, undirected graphs. Unlike loopy belief propagation, which iterates fixed point equations, we directly descend on the Bethe free energy. The algorithm consists of two phases, first we update the pairwise probabilities, given the marginal probabilities at each unit,using an analytic expression. Next, we update the marginal probabilities, given the pairwise probabilities by following the negative gradient of the Bethe free energy. Both steps are guaranteed to decrease the Bethe free energy, and since it is lower bounded, the algorithm is guaranteed to converge to a local minimum. We also show that the Bethe free energy is equal to the TAP free energy up to second order in the weights. In experiments we confirm that when belief propagation converges it usually finds identical solutions as our belief optimization method. However, in cases where belief propagation fails to converge, belief optimization continues to converge to reasonable beliefs. The stable nature of belief optimization makes it ideally suited for learning graphical models from data.
@inproceedings{WelTeh2001a,
author = {Welling, M. and Teh, Y. W.},
title = {Belief Optimization for Binary Networks : A Stable Alternative to Loopy Belief Propagation},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2001},
volume = {17}
}
G. E. Hinton,
Y. W. Teh,
Discovering multiple constraints that are frequently Approximately Satisfied, in Uncertainty in Artificial Intelligence (UAI), 2001, vol. 17, 227–234.
Some high-dimensional data.sets can be modelled by assuming that there are many different linear constraints, each of which is Frequently Approximately Satisfied (FAS) by the data. The probability of a data vector under the model is then proportional to the product of the probabilities of its constraint violations. We describe three methods of learning products of constraints using a heavy-tailed probability distribution for the violations.
@inproceedings{HinTeh2001a,
author = {Hinton, G. E. and Teh, Y. W.},
title = {Discovering multiple constraints that are frequently Approximately Satisfied},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2001},
volume = {17},
pages = {227-234}
}
G. E. Hinton,
M. Welling,
Y. W. Teh,
S. Osindero,
A New View of ICA, in Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation, 2001, vol. 3.
We present a new way of interpreting ICA as a probability density model and a new way of fitting this model to data. The advantage of our approach is that it suggests simple, novel extensions to overcomplete, undercomplete and multilayer non-linear versions of ICA.
@inproceedings{HinWelTeh2001a,
author = {Hinton, G. E. and Welling, M. and Teh, Y. W. and Osindero, S.},
title = {A New View of {ICA}},
booktitle = {Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation},
year = {2001},
volume = {3}
}
Y. W. Teh,
G. E. Hinton,
Rate-Coded Restricted Boltzmann Machines for Face Recognition, in Advances in Neural Information Processing Systems (NIPS), 2001, vol. 13.
We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.
@inproceedings{TehHin2001a,
author = {Teh, Y. W. and Hinton, G. E.},
title = {Rate-Coded Restricted {Boltzmann} Machines for Face Recognition},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2001},
volume = {13}
}
Y. W. Teh,
M. Welling,
Passing and Bouncing Messages for Generalized Inference, Gatsby Computational Neuroscience Unit, University College London, GCNU TR 2001-01, 2001.
@techreport{TehWel2001a,
author = {Teh, Y. W. and Welling, M.},
title = {Passing and Bouncing Messages for Generalized Inference},
institution = {Gatsby Computational Neuroscience Unit, University College London},
year = {2001},
number = {GCNU TR 2001-01}
}
2000
G. E. Hinton,
Z. Ghahramani,
Y. W. Teh,
Learning to Parse Images, in Advances in Neural Information Processing Systems (NIPS), 2000, vol. 12.
We describe a class of probabilistic models that we call credibility networks. Using parse trees as internal representations of images, credibility networks are able to perform segmentation and recognition simultaneously, removing the need for ad hoc segmentation heuristics. Promising results in the problem of segmenting handwritten digits were obtained.
@inproceedings{HinGhaTeh2000a,
author = {Hinton, G. E. and Ghahramani, Z. and Teh, Y. W.},
title = {Learning to Parse Images},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2000},
volume = {12}
}
1998
F. Bacchus,
Y. W. Teh,
Making Forward Chaining Relevant, in Proceedings of the International Conference on Artificial Intelligence Planning Systems, 1998.
Planning by forward chaining through the world space has long been dismissed as being “obviously” infeasible. Nevertheless,
this approach to planning has many advantages. Most importantly forward chaining planners maintain complete descriptions of the intermediate states that arise during the course of the plan’s execution. These states can be utilized to provide highly effective search control. Another advantage is that such planners can support richer planning representations that can model, e.g., resources and resource consumption. Forward chaining planners are still plagued however by their traditional weaknesses: a lack of goal direction, and the fact that they search totally ordered action sequences. In this paper we address the issue of goal direction. We present two algorithms that provide a forward chaining planner with more information about the goal, and allow it to avoid certain types of irrelevant state information and actions.
@inproceedings{BacTeh1998a,
author = {Bacchus, F. and Teh, Y. W.},
title = {Making Forward Chaining Relevant},
booktitle = {Proceedings of the International Conference on Artificial Intelligence Planning Systems},
year = {1998}
}
Software
2016
B. Lakshminarayanan,
D. M. Roy,
Y. W. Teh,
Mondrian Forest. 2016.
This paper makes two contributions to Bayesian machine learning algorithms. Firstly, we propose stochastic natural gradient expectation propagation (SNEP), a novel alternative to expectation propagation (EP), a popular variational inference algorithm. SNEP is a black box variational algorithm, in that it does not require any simplifying assumptions on the distribution of interest, beyond the existence of some Monte Carlo sampler for estimating the moments of the EP tilted distributions. Further, as opposed to EP which has no guarantee of convergence, SNEP can be shown to be convergent, even when using Monte Carlo moment estimates. Secondly, we propose a novel architecture for distributed Bayesian learning which we call the posterior server. The posterior server allows scalable and robust Bayesian learning in cases where a dataset is stored in a distributed manner across a cluster, with each compute node containing a disjoint subset of data. An independent Monte Carlo sampler is run on each compute node, with direct access only to the local data subset, but which targets an approximation to the global posterior distribution given all data across the whole cluster. This is achieved by using a distributed asynchronous implementation of SNEP to pass messages across the cluster. We demonstrate SNEP and the posterior server on distributed Bayesian learning of logistic regression and neural networks.
@software{HasWebLie2016a,
author = {Hasenclever, L. and Webb, S. and Lienart, T. and Vollmer, S. and Lakshminarayanan, B. and Blundell, C. and Teh, Y. W.},
title = {Posterior Server},
year = {2016}
}
@software{BoyTeh2015a,
author = {Boyles, L. and Teh, Y. W.},
title = {{CPABS}: Cancer Phylogenetic Reconstruction with {A}ldous' Beta Splitting},
year = {2015}
}
2014
M. Xu,
B. Lakshminarayanan,
Y. W. Teh,
J. Zhu,
B. Zhang,
SMS: Sampling via Moment Sharing, Advances in Neural Information Processing Systems. 2014.
We propose a distributed Markov chain Monte Carlo (MCMC) inference algorithm for large scale Bayesian posterior simulation. We assume that the dataset is partitioned and stored across nodes of a cluster. Our procedure involves an independent MCMC posterior sampler at each node based on its local partition of the data. Moment statistics of the local posteriors are collected from each sampler and propagated across the cluster using expectation propagation message passing with low communication costs. The moment sharing scheme improves posterior estimation quality by enforcing agreement among the samplers. We demonstrate the speed and inference quality of our method with empirical studies on Bayesian logistic regression and sparse linear regression with a spike-and-slab prior.
@software{XuLakTeh2014b,
author = {Xu, M. and Lakshminarayanan, B. and Teh, Y. W. and Zhu, J. and Zhang, B.},
title = {{SMS}: Sampling via Moment Sharing},
year = {2014},
booktitle = {Advances in Neural Information Processing Systems}
}