Scalable spatiotemporal statistics and Bayesian machine learning for public policy and social science
Seth is a postdoc working on scalable methods for spatiotemporal statistics and Bayesian machine learning, applied to public policy and social science areas including crime and public health. He completed his PhD at Carnegie Mellon University in August 2015 in a joint program in public policy and machine learning.
Publications
2018
H. Law,
D. Sutherland,
D. Sejdinovic,
S. Flaxman,
Bayesian Approaches to Distribution Regression, in Artificial Intelligence and Statistics (AISTATS), 2018.
Distribution regression has recently attracted much interest as a generic solution to the problem of supervised learning where labels are available at the group level, rather than at the individual level. Current approaches, however, do not propagate the uncertainty in observations due to sampling variability in the groups. This effectively assumes that small and large groups are estimated equally well, and should have equal weight in the final regression. We account for this uncertainty with a Bayesian distribution regression formalism, improving the robustness and performance of the model when group sizes vary. We frame our models in a neural network style, allowing for simple MAP inference using backpropagation to learn the parameters, as well as MCMC-based inference which can fully propagate uncertainty. We demonstrate our approach on illustrative toy datasets, as well as on a challenging problem of predicting age from images.
@inproceedings{LawSutSejFla2018,
author = {Law, H.C.L. and Sutherland, D.J. and Sejdinovic, D. and Flaxman, S.},
title = {{Bayesian Approaches to Distribution Regression}},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2018}
}
2017
S. Flaxman,
Y. Teh,
D. Sejdinovic,
Poisson Intensity Estimation with Reproducing Kernels, Electronic Journal of Statistics, vol. 11, no. 2, 5081–5104, 2017.
Despite the fundamental nature of the inhomogeneous Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. Cox process), few tractable nonparametric modeling approaches of intensity functions exist, especially when observed points lie in a high-dimensional space. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. Whereas RKHS models used in supervised learning rely on the so-called representer theorem, the form of the inhomogeneous Poisson process likelihood means that the representer theorem does not apply. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.
@article{FlaTehSej2017ejs,
author = {Flaxman, S. and Teh, Y.W. and Sejdinovic, D.},
title = {{Poisson Intensity Estimation with Reproducing Kernels}},
journal = {Electronic Journal of Statistics},
year = {2017},
volume = {11},
number = {2},
pages = {5081--5104}
}
Q. Zhang,
S. Filippi,
S. Flaxman,
D. Sejdinovic,
Feature-to-Feature Regression for a Two-Step Conditional Independence Test, in Uncertainty in Artificial Intelligence (UAI), 2017.
The algorithms for causal discovery and more broadly for learning the structure of graphical models require well calibrated and consistent conditional independence (CI) tests. We revisit the CI tests which are based on two-step procedures and involve regression with subsequent (unconditional) independence test (RESIT) on regression residuals and investigate the assumptions under which these tests operate. In particular, we demonstrate that when going beyond simple functional relationships with additive noise, such tests can lead to an inflated number of false discoveries. We study the relationship of these tests with those based on dependence measures using reproducing kernel Hilbert spaces (RKHS) and propose an extension of RESIT which uses RKHS-valued regression. The resulting test inherits the simple two-step testing procedure of RESIT, while giving correct Type I control and competitive power. When used as a component of the PC algorithm, the proposed test is more robust to the case where hidden variables induce a switching behaviour in the associations present in the data.
@inproceedings{ZhaFilFlaSej2017,
author = {Zhang, Q. and Filippi, S. and Flaxman, S. and Sejdinovic, D.},
title = {{Feature-to-Feature Regression for a Two-Step Conditional Independence Test}},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2017}
}
J. Runge,
D. Sejdinovic,
S. Flaxman,
Detecting causal associations in large nonlinear time series datasets, ArXiv e-prints: 1702.07007, 2017.
Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in high-dimensional dynamical systems often involve time-delays, nonlinearity, and strong autocorrelations. These present major challenges for causal discovery techniques such as Granger causality leading to low detection power, biases, and unreliable hypothesis tests. Here we introduce a reliable and fast method that outperforms current approaches in detection power and scales up to high-dimensional datasets. It overcomes detection biases, especially when strong autocorrelations are present, and allows ranking associations in large-scale analyses by their causal strength. We provide mathematical proofs, evaluate our method in extensive numerical experiments, and illustrate its capabilities in a large-scale analysis of the global surface-pressure system where we unravel spurious associations and find several potentially causal links that are difficult to detect with standard methods. The broadly applicable method promises to discover novel causal insights also in many other fields of science.
@unpublished{RunSejFla2017,
author = {Runge, J. and Sejdinovic, D. and Flaxman, S.},
note = {ArXiv e-prints: 1702.07007},
title = {{Detecting causal associations in large nonlinear time series datasets}},
year = {2017}
}
S. Flaxman,
Y. W. Teh,
D. Sejdinovic,
Poisson Intensity Estimation with Reproducing Kernels, in Artificial Intelligence and Statistics (AISTATS), 2017.
Despite the fundamental nature of the inhomogeneous Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. Cox process), few tractable nonparametric modeling approaches of intensity functions exist, especially in high dimensional settings. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. The modeling challenge is that the usual representer theorem arguments no longer apply due to the form of the inhomogeneous Poisson process likelihood. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.
@inproceedings{FlaTehSej2017,
author = {Flaxman, S. and Teh, Y. W. and Sejdinovic, D.},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
title = {{Poisson Intensity Estimation with Reproducing Kernels}},
year = {2017}
}
2016
W. Herlands,
A. Wilson,
H. Nickisch,
S. Flaxman,
D. Neill,
W. Van Panhuis,
E. Xing,
Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces, in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, 1013–1021.
We present a scalable Gaussian process model for identifying and characterizing smooth multidimensional changepoints, and automatically learning changes in expressive covariance structure. We use Random Kitchen Sink features to flexibly define a change surface in combination with expressive spectral mixture kernels to capture the complex statistical structure. Finally, through the use of novel methods for additive non-separable kernels, we can scale the model to large datasets. We demonstrate the model on numerical and real world data, including a large spatio-temporal disease dataset where we identify previously unknown heterogeneous changes in space and time.
@inproceedings{HerWilNicketal2016,
author = {Herlands, William and Wilson, Andrew and Nickisch, Hannes and Flaxman, Seth and Neill, Daniel and Van Panhuis, Wilbert and Xing, Eric},
title = {{Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces}},
booktitle = {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics},
pages = {1013--1021},
year = {2016}
}
C. Loeffler,
S. Flaxman,
Is Gun Violence Contagious?, 2016.
Existing theories of gun violence predict stable spatial concentrations and contagious diffusion of gun violence into surrounding areas. Recent empirical studies have reported confirmatory evidence of such spatiotemporal diffusion of gun violence. However, existing tests cannot readily distinguish spatiotemporal clustering from spatiotemporal diffusion. This leaves as an open question whether gun violence actually is contagious or merely clusters in space and time. Compounding this problem, gun violence is subject to considerable measurement error with many nonfatal shootings going unreported to police. Using point process data from an acoustical gunshot locator system and a combination of Bayesian spatiotemporal point process modeling and space/time interaction tests, this paper demonstrates that contemporary urban gun violence does diffuse, but only slightly, suggesting that a disease model for infectious spread of gun violence is a poor fit for the geographically stable and temporally stochastic process observed.
@unpublished{LoeFla2016,
author = {Loeffler, Charles and Flaxman, Seth},
title = {{Is Gun Violence Contagious?}},
note = {ArXiv e-prints: 1611.06713},
year = {2016}
}
B. Goodman,
S. Flaxman,
European Union regulations on algorithmic decision-making and a “right to explanation,” Jun-2016.
We summarize the potential impact that the European Union’s new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on user-level predictors) which "significantly affect" users. The law will also effectively create a "right to explanation," whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.
@unpublished{2016arXiv160608813G,
author = {Goodman, Bryce and Flaxman, Seth},
title = {{European Union regulations on algorithmic decision-making and a ``right to explanation''}},
note = {ArXiv e-prints: 1606.08813},
archiveprefix = {arXiv},
year = {2016},
month = jun
}
S. Flaxman,
D. Sutherland,
Y. Wang,
Y. W. Teh,
Understanding the 2016 US Presidential Election using ecological inference and distribution regression with census microdata, ArXiv e-prints, Nov-2016.
We combine fine-grained spatially referenced census data with the vote outcomes from the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al., KDD 2015) with a multinomial-logit regression so as to model the vote outcome (Trump, Clinton, Other / Didn’t vote) as a function of demographic and socioeconomic features. Ecological inference allows us to estimate "exit poll" style results, such as Trump’s support among white women, but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in Python and R and are available online for replication.
@unpublished{flaxsuthetal2016,
author = {Flaxman, Seth and Sutherland, Dougal and Wang, Yu-Xiang and Teh, Yee Whye},
title = {Understanding the 2016 US Presidential Election using ecological inference and distribution regression with census microdata},
note = {ArXiv e-prints: 1611.03787},
year = {2016},
month = nov
}
S. Bhatt,
E. Cameron,
S. Flaxman,
D. J. Weiss,
D. L. Smith,
P. W. Gething,
Improved prediction accuracy for disease risk mapping using Gaussian Process stacked generalisation, Dec-2016.
Maps of infectious disease—charting spatial variations in the force of infection, degree of endemicity, and the burden on human health—provide an essential evidence base to support planning towards global health targets. Contemporary disease mapping efforts have embraced statistical modelling approaches to properly acknowledge uncertainties in both the available measurements and their spatial interpolation. The most common such approach is that of Gaussian process regression, a mathematical framework comprised of two components: a mean function harnessing the predictive power of multiple independent variables, and a covariance function yielding spatio-temporal shrinkage against residual variation from the mean. Though many techniques have been developed to improve the flexibility and fitting of the covariance function, models for the mean function have typically been restricted to simple linear terms. For infectious diseases, known to be driven by complex interactions between environmental and socio-economic factors, improved modelling of the mean function can greatly boost predictive power. Here we present an ensemble approach based on stacked generalisation that allows for multiple, non-linear algorithmic mean functions to be jointly embedded within the Gaussian process framework. We apply this method to mapping Plasmodium falciparum prevalence data in Sub-Saharan Africa and show that the generalised ensemble approach markedly out-performs any individual method.
@unpublished{bhattetal2016,
author = {Bhatt, S. and Cameron, E. and Flaxman, Seth and Weiss, D. J. and Smith, D. L. and Gething, P. W.},
title = {Improved prediction accuracy for disease risk mapping using Gaussian Process stacked generalisation},
note = {ArXiv e-prints: 1612.03278},
year = {2016},
month = dec
}
S. Flaxman,
D. Sejdinovic,
J. Cunningham,
S. Filippi,
Bayesian Learning of Kernel Embeddings, in Uncertainty in Artificial Intelligence (UAI), 2016, 182–191.
Kernel methods are one of the mainstays of machine learning, but the problem of kernel learning remains challenging, with only a few heuristics and very little theory. This is of particular importance in methods based on estimation of kernel mean embeddings of probability measures. For characteristic kernels, which include most commonly used ones, the kernel mean embedding uniquely determines its probability measure, so it can be used to design a powerful statistical testing framework, which includes nonparametric two-sample and independence tests. In practice, however, the performance of these tests can be very sensitive to the choice of kernel and its lengthscale parameters. To address this central issue, we propose a new probabilistic model for kernel mean embeddings, the Bayesian Kernel Embedding model, combining a Gaussian process prior over the Reproducing Kernel Hilbert Space containing the mean embedding with a conjugate likelihood function, thus yielding a closed form posterior over the mean embedding. The posterior mean of our model is closely related to recently proposed shrinkage estimators for kernel mean embeddings, while the posterior uncertainty is a new, interesting feature with various possible applications. Critically for the purposes of kernel learning, our model gives a simple, closed form marginal pseudolikelihood of the observed data given the kernel hyperparameters. This marginal pseudolikelihood can either be optimized to inform the hyperparameter choice or fully Bayesian inference can be used.
@inproceedings{FlaSejCunFil2016,
author = {Flaxman, S. and Sejdinovic, D. and Cunningham, J.P. and Filippi, S.},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
pages = {182--191},
title = {{Bayesian Learning of Kernel Embeddings}},
year = {2016},
bdsk-url-1 = {http://www.auai.org/uai2016/proceedings/papers/145.pdf},
bdsk-url-2 = {http://www.auai.org/uai2016/proceedings/supp/145_supp.pdf}
}
H. Kim,
X. Lu,
S. Flaxman,
Y. W. Teh,
Tucker Gaussian Process for Regression and Collaborative Filtering, 2016.
We introduce the Tucker Gaussian Process (TGP), a model for regression that regularises a Gaussian Process (GP) towards simpler regression functions for enhanced generalisation performance. We derive it using a novel approach to scalable GP learning, and show that our model is particularly well-suited to grid-structured data and problems where the dependence on covariates is close to being separable. A prime example is collaborative filtering, for which our model provides an effective GP based method that has a low-rank matrix factorisation at its core. We show that TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information.
@unpublished{KimLuFla2016a,
author = {Kim, H. and Lu, X. and Flaxman, S. and Teh, Y. W.},
note = {ArXiv e-prints: 1605.07025},
title = {{T}ucker {G}aussian Process for Regression and Collaborative Filtering},
year = {2016},
bdsk-url-1 = {https://arxiv.org/pdf/1605.07025.pdf}
}