Scalable spatiotemporal statistics and Bayesian machine learning for public policy and social science
Seth was a postdoc working on scalable methods for spatiotemporal statistics and Bayesian machine learning, applied to public policy / social science areas including crime and public health. He completed his PhD at Carnegie Mellon University in August 2015 in a program that is joint between public policy and machine learning.
Seth is now a lecturer in the statistics section of the Department of Mathematics at Imperial College London, joint with the Data Science Institute.
Publications
2019
S. Flaxman, M. Chirico, P. Pereira, C. Loeffler, Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ “Real-Time Crime Forecasting Challenge,” Revise and resubmit at Annals of Applied Statistics, 2019.
We propose a generic spatiotemporal event forecasting method, which we developed for the National Institute of Justice’s (NIJ) Real-Time Crime Forecasting Challenge. Our solution to the challenge is a spatiotemporal forecasting model combining scalable randomized Reproducing Kernel Hilbert Space (RKHS) methods for approximating Gaussian processes with autoregressive smoothing kernels in a regularized supervised learning framework. While the smoothing kernels capture the two main approaches in current use in the field of crime forecasting, kernel density estimation (KDE) and self-exciting point process (SEPP) models, the RKHS component of the model can be understood as an approximation to the popular log-Gaussian Cox Process model. For inference, we discretize the spatiotemporal point pattern and learn a log intensity function using the Poisson likelihood and highly efficient gradient-based optimization methods. Model hyperparameters including quality of RKHS approximation, spatial and temporal kernel lengthscales, number of autoregressive lags, bandwidths for smoothing kernels, as well as cell shape, size, and rotation, were learned using crossvalidation. Resulting predictions significantly exceeded baseline KDE estimates and SEPP models for sparse events.
@article{flaxman2019scalable,
title = {Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ ``Real-Time Crime Forecasting Challenge''},
author = {Flaxman, Seth and Chirico, Michael and Pereira, Pau and Loeffler, Charles},
journal = {Revise and resubmit at Annals of Applied Statistics},
year = {2019}
}
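As a rough illustration of the inference step described in the abstract above (discretize the point pattern, then learn a log intensity under a Poisson likelihood with gradient-based optimization), here is a minimal one-dimensional sketch. It is not the authors' implementation: the function names, the 1-D grid, the plain gradient ascent loop, and the ridge penalty are all assumptions made for illustration, and the paper's RKHS features and autoregressive smoothing kernels are replaced by a caller-supplied feature list.

```python
import math


def bin_events(events, cell_size, n_cells):
    """Count events per cell on a 1-D grid covering [0, cell_size * n_cells)."""
    counts = [0] * n_cells
    for x in events:
        idx = min(int(x // cell_size), n_cells - 1)  # clamp the right edge
        counts[idx] += 1
    return counts


def fit_log_intensity(counts, features, lr=0.05, steps=2000, l2=1e-3):
    """Gradient ascent on the mean Poisson log-likelihood minus 0.5 * l2 * ||w||^2.

    Model (an assumption of this sketch): log lambda_i = w . features[i],
    where counts[i] ~ Poisson(lambda_i).
    """
    d = len(features[0])
    n = len(counts)
    w = [0.0] * d
    for _ in range(steps):
        grad = [-l2 * wj for wj in w]  # gradient of the ridge penalty
        for y, phi in zip(counts, features):
            lam = math.exp(sum(wj * pj for wj, pj in zip(w, phi)))
            for j in range(d):
                # d/dw_j of (y * log lam - lam) is (y - lam) * phi_j
                grad[j] += (y - lam) * phi[j] / n
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w


if __name__ == "__main__":
    counts = bin_events([0.1, 0.2, 1.5, 4.9], cell_size=1.0, n_cells=5)
    # Intercept-only features: the fitted intensity approaches the mean count.
    w = fit_log_intensity(counts, [[1.0]] * len(counts), l2=0.0)
    print(counts, math.exp(w[0]))
```

With intercept-only features and no penalty, the fitted intensity is the mean count per cell. Richer feature vectors (for example, lagged counts or kernel features of the cell location) would take the place of `[1.0]`.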
2018
C. Loeffler, S. Flaxman, Is gun violence contagious? A spatiotemporal test, Journal of Quantitative Criminology, vol. 34, no. 4, 999–1017, 2018.
Existing theories of gun violence predict stable spatial concentrations and contagious diffusion of gun violence into surrounding areas. Recent empirical studies have reported confirmatory evidence of such spatiotemporal diffusion of gun violence. However, existing tests cannot readily distinguish spatiotemporal clustering from spatiotemporal diffusion. This leaves as an open question whether gun violence actually is contagious or merely clusters in space and time. Compounding this problem, gun violence is subject to considerable measurement error with many nonfatal shootings going unreported to police. Using point process data from an acoustical gunshot locator system and a combination of Bayesian spatiotemporal point process modeling and space/time interaction tests, this paper demonstrates that contemporary urban gun violence does diffuse, but only slightly, suggesting that a disease model for infectious spread of gun violence is a poor fit for the geographically stable and temporally stochastic process observed.
@article{loeffler2018gun,
title = {Is gun violence contagious? A spatiotemporal test},
author = {Loeffler, Charles and Flaxman, Seth},
journal = {Journal of Quantitative Criminology},
volume = {34},
number = {4},
pages = {999--1017},
year = {2018},
publisher = {Springer}
}
H. Law, D. Sejdinovic, E. Cameron, T. Lucas, S. Flaxman, K. Battle, K. Fukumizu, Variational Learning on Aggregate Outputs with Gaussian Processes, in Advances in Neural Information Processing Systems (NeurIPS), 2018, to appear.
While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global mapping of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the fine-scale spatial modelling of malaria incidence, with over 1 million observations.
@inproceedings{LawSejCamLucFlaBatFuk2018,
author = {Law, H.C.L. and Sejdinovic, D. and Cameron, E. and Lucas, T.C.D. and Flaxman, S. and Battle, K. and Fukumizu, K.},
title = {{Variational Learning on Aggregate Outputs with Gaussian Processes}},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
note = {to appear},
year = {2018}
}
J. Ton, S. Flaxman, D. Sejdinovic, S. Bhatt, Spatial Mapping with Gaussian Processes and Nonstationary Fourier Features, Spatial Statistics, vol. 28, 59–78, 2018.
The use of covariance kernels is ubiquitous in the field of spatial statistics. Kernels allow data to be mapped into high-dimensional feature spaces and can thus extend simple linear additive methods to nonlinear methods with higher order interactions. However, until recently, there has been a strong reliance on a limited class of stationary kernels such as the Matern or squared exponential, limiting the expressiveness of these modelling approaches. Recent machine learning research has focused on spectral representations to model arbitrary stationary kernels and introduced more general representations that include classes of nonstationary kernels. In this paper, we exploit the connections between Fourier feature representations, Gaussian processes and neural networks to generalise previous approaches and develop a simple and efficient framework to learn arbitrarily complex nonstationary kernel functions directly from the data, while taking care to avoid overfitting using state-of-the-art methods from deep learning. We highlight the very broad array of kernel classes that could be created within this framework. We apply this to a time series dataset and a remote sensing problem involving land surface temperature in Eastern Africa. We show that without increasing the computational or storage complexity, nonstationary kernels can be used to improve generalisation performance and provide more interpretable results.
@article{TonFlaSejBha2018,
author = {Ton, J.-F. and Flaxman, S. and Sejdinovic, D. and Bhatt, S.},
title = {{Spatial Mapping with Gaussian Processes and Nonstationary Fourier Features}},
journal = {Spatial Statistics},
year = {2018},
volume = {28},
pages = {59--78}
}
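As background for the entry above, the following is a minimal sketch of standard (stationary) random Fourier features approximating a squared exponential kernel in one dimension. It illustrates only the spectral representation that the abstract builds on, not the nonstationary construction proposed in the paper. All function names and parameter values are assumptions chosen for the example.

```python
import math
import random


def sq_exp_kernel(x, y, lengthscale):
    """Exact squared exponential kernel k(x, y) = exp(-(x - y)^2 / (2 l^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def sample_frequencies(m, lengthscale, seed=0):
    """Draw m frequencies from the kernel's spectral density, N(0, 1 / l^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(m)]


def fourier_features(x, freqs):
    """Map x to 2m features: scaled cosines followed by scaled sines."""
    scale = 1.0 / math.sqrt(len(freqs))
    cos_part = [scale * math.cos(w * x) for w in freqs]
    sin_part = [scale * math.sin(w * x) for w in freqs]
    return cos_part + sin_part


def approx_kernel(x, y, freqs):
    """Inner product of feature maps; equals the average of cos(w * (x - y))."""
    fx = fourier_features(x, freqs)
    fy = fourier_features(y, freqs)
    return sum(a * b for a, b in zip(fx, fy))


if __name__ == "__main__":
    freqs = sample_frequencies(m=5000, lengthscale=1.5, seed=1)
    print(approx_kernel(0.3, 1.0, freqs), sq_exp_kernel(0.3, 1.0, 1.5))
```

The approximation error shrinks at roughly the rate 1/sqrt(m), so the two printed values should agree to about two decimal places with 5000 frequencies.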
H. Law, D. Sutherland, D. Sejdinovic, S. Flaxman, Bayesian Approaches to Distribution Regression, in Artificial Intelligence and Statistics (AISTATS), 2018.
Distribution regression has recently attracted much interest as a generic solution to the problem of supervised learning where labels are available at the group level, rather than at the individual level. Current approaches, however, do not propagate the uncertainty in observations due to sampling variability in the groups. This effectively assumes that small and large groups are estimated equally well, and should have equal weight in the final regression. We account for this uncertainty with a Bayesian distribution regression formalism, improving the robustness and performance of the model when group sizes vary. We frame our models in a neural network style, allowing for simple MAP inference using backpropagation to learn the parameters, as well as MCMC-based inference which can fully propagate uncertainty. We demonstrate our approach on illustrative toy datasets, as well as on a challenging problem of predicting age from images.
@inproceedings{LawSutSejFla2018,
author = {Law, H.C.L. and Sutherland, D.J. and Sejdinovic, D. and Flaxman, S.},
title = {{Bayesian Approaches to Distribution Regression}},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
year = {2018}
}
2017
B. Goodman, S. Flaxman, European Union Regulations on Algorithmic Decision Making and a “Right to Explanation,” AI Magazine, vol. 38, no. 3, 50–58, 2017.
@article{goodman2017european,
author = {Goodman, Bryce and Flaxman, Seth},
journal = {AI Magazine},
number = {3},
pages = {50--58},
publisher = {American Association for Artificial Intelligence},
title = {{European Union} Regulations on Algorithmic Decision Making and a ``Right to Explanation''},
volume = {38},
year = {2017}
}
S. Flaxman, Y. Teh, D. Sejdinovic, Poisson Intensity Estimation with Reproducing Kernels, Electronic Journal of Statistics, vol. 11, no. 2, 5081–5104, 2017.
Despite the fundamental nature of the inhomogeneous Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. Cox process), few tractable nonparametric modeling approaches of intensity functions exist, especially when observed points lie in a high-dimensional space. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. Whereas RKHS models used in supervised learning rely on the so-called representer theorem, the form of the inhomogeneous Poisson process likelihood means that the representer theorem does not apply. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.
@article{FlaTehSej2017ejs,
author = {Flaxman, S. and Teh, Y.W. and Sejdinovic, D.},
title = {{Poisson Intensity Estimation with Reproducing Kernels}},
journal = {Electronic Journal of Statistics},
year = {2017},
volume = {11},
number = {2},
pages = {5081--5104}
}
Q. Zhang, S. Filippi, S. Flaxman, D. Sejdinovic, Feature-to-Feature Regression for a Two-Step Conditional Independence Test, in Uncertainty in Artificial Intelligence (UAI), 2017.
The algorithms for causal discovery and more broadly for learning the structure of graphical models require well calibrated and consistent conditional independence (CI) tests. We revisit the CI tests which are based on two-step procedures and involve regression with subsequent (unconditional) independence test (RESIT) on regression residuals and investigate the assumptions under which these tests operate. In particular, we demonstrate that when going beyond simple functional relationships with additive noise, such tests can lead to an inflated number of false discoveries. We study the relationship of these tests with those based on dependence measures using reproducing kernel Hilbert spaces (RKHS) and propose an extension of RESIT which uses RKHS-valued regression. The resulting test inherits the simple two-step testing procedure of RESIT, while giving correct Type I control and competitive power. When used as a component of the PC algorithm, the proposed test is more robust to the case where hidden variables induce a switching behaviour in the associations present in the data.
@inproceedings{ZhaFilFlaSej2017,
author = {Zhang, Q. and Filippi, S. and Flaxman, S. and Sejdinovic, D.},
title = {{Feature-to-Feature Regression for a Two-Step Conditional Independence Test}},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
year = {2017}
}
J. Runge, P. Nowack, M. Kretschmer, S. Flaxman, D. Sejdinovic, Detecting Causal Associations in Large Nonlinear Time Series Datasets, ArXiv e-prints: 1702.07007, 2017.
Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in high-dimensional dynamical systems often involve time-delays, nonlinearity, and strong autocorrelations. These present major challenges for causal discovery techniques such as Granger causality leading to low detection power, biases, and unreliable hypothesis tests. Here we introduce a reliable and fast method that outperforms current approaches in detection power and scales up to high-dimensional datasets. It overcomes detection biases, especially when strong autocorrelations are present, and allows ranking associations in large-scale analyses by their causal strength. We provide mathematical proofs, evaluate our method in extensive numerical experiments, and illustrate its capabilities in a large-scale analysis of the global surface-pressure system where we unravel spurious associations and find several potentially causal links that are difficult to detect with standard methods. The broadly applicable method promises to discover novel causal insights also in many other fields of science.
@unpublished{RunSejFla2017,
author = {Runge, J. and Nowack, P. and Kretschmer, M. and Flaxman, S. and Sejdinovic, D.},
title = {{Detecting Causal Associations in Large Nonlinear Time Series Datasets}},
note = {ArXiv e-prints: 1702.07007},
year = {2017}
}
S. Flaxman, Y. W. Teh, D. Sejdinovic, Poisson Intensity Estimation with Reproducing Kernels, in Artificial Intelligence and Statistics (AISTATS), 2017.
Despite the fundamental nature of the inhomogeneous Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. Cox process), few tractable nonparametric modeling approaches of intensity functions exist, especially in high dimensional settings. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. The modeling challenge is that the usual representer theorem arguments no longer apply due to the form of the inhomogeneous Poisson process likelihood. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.
@inproceedings{FlaTehSej2017,
author = {Flaxman, S. and Teh, Y. W. and Sejdinovic, D.},
booktitle = {Artificial Intelligence and Statistics (AISTATS)},
title = {{Poisson Intensity Estimation with Reproducing Kernels}},
year = {2017}
}
2016
S. Bhatt, E. Cameron, S. Flaxman, D. J. Weiss, D. L. Smith, P. W. Gething, Improved prediction accuracy for disease risk mapping using Gaussian Process stacked generalisation, Dec-2016.
Maps of infectious disease—charting spatial variations in the force of infection, degree of endemicity, and the burden on human health—provide an essential evidence base to support planning towards global health targets. Contemporary disease mapping efforts have embraced statistical modelling approaches to properly acknowledge uncertainties in both the available measurements and their spatial interpolation. The most common such approach is that of Gaussian process regression, a mathematical framework comprised of two components: a mean function harnessing the predictive power of multiple independent variables, and a covariance function yielding spatio-temporal shrinkage against residual variation from the mean. Though many techniques have been developed to improve the flexibility and fitting of the covariance function, models for the mean function have typically been restricted to simple linear terms. For infectious diseases, known to be driven by complex interactions between environmental and socio-economic factors, improved modelling of the mean function can greatly boost predictive power. Here we present an ensemble approach based on stacked generalisation that allows for multiple, non-linear algorithmic mean functions to be jointly embedded within the Gaussian process framework. We apply this method to mapping Plasmodium falciparum prevalence data in Sub-Saharan Africa and show that the generalised ensemble approach markedly out-performs any individual method.
@unpublished{bhattetal2016,
author = {Bhatt, S. and Cameron, E. and Flaxman, Seth and Weiss, D. J. and Smith, D. L. and Gething, P. W.},
title = {Improved prediction accuracy for disease risk mapping using Gaussian Process stacked generalisation},
note = {ArXiv e-prints: 1612.03278},
year = {2016},
month = dec
}
S. Flaxman, D. Sutherland, Y. Wang, Y. W. Teh, Understanding the 2016 US Presidential Election using ecological inference and distribution regression with census microdata, ArXiv e-prints, Nov-2016.
We combine fine-grained spatially referenced census data with the vote outcomes from the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al., KDD 2015) with a multinomial-logit regression so as to model the vote outcome (Trump, Clinton, Other / Didn’t vote) as a function of demographic and socioeconomic features. Ecological inference allows us to estimate "exit poll" style results, such as Trump’s support among white women, but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in Python and R and are available online for replication.
@unpublished{flaxsuthetal2016,
author = {Flaxman, Seth and Sutherland, Dougal and Wang, Yu-Xiang and Teh, Yee Whye},
title = {Understanding the 2016 US Presidential Election using ecological inference and distribution regression with census microdata},
journal = {Arxiv e-prints},
note = {ArXiv e-prints: 1611.03787},
year = {2016},
month = nov
}
W. Herlands, A. Wilson, H. Nickisch, S. Flaxman, D. Neill, W. Van Panhuis, E. Xing, Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces, in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, 1013–1021.
We present a scalable Gaussian process model for identifying and characterizing smooth multidimensional changepoints, and automatically learning changes in expressive covariance structure. We use Random Kitchen Sink features to flexibly define a change surface in combination with expressive spectral mixture kernels to capture the complex statistical structure. Finally, through the use of novel methods for additive non-separable kernels, we can scale the model to large datasets. We demonstrate the model on numerical and real world data, including a large spatio-temporal disease dataset where we identify previously unknown heterogeneous changes in space and time.
@inproceedings{HerWilNicketal2016,
author = {Herlands, William and Wilson, Andrew and Nickisch, Hannes and Flaxman, Seth and Neill, Daniel and Van Panhuis, Wilbert and Xing, Eric},
title = {{Scalable Gaussian Processes} for Characterizing Multidimensional Change Surfaces},
booktitle = {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics},
pages = {1013--1021},
year = {2016}
}
H. Kim, X. Lu, S. Flaxman, Y. W. Teh, Collaborative Filtering with Side Information: a Gaussian Process Perspective, ArXiv e-prints: 1605.07025, 2016.
We tackle the problem of collaborative filtering (CF) with side information, through the lens of Gaussian Process (GP) regression. Driven by the idea of using the kernel to explicitly model user-item similarities, we formulate the GP in a way that allows the incorporation of low-rank matrix factorisation, arriving at our model, the Tucker Gaussian Process (TGP). Consequently, TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information, giving enhanced predictive performance for CF problems. Moreover we show that it is a novel model for regression, especially well-suited to grid-structured data and problems where the dependence on covariates is close to being separable.
@unpublished{kimluflateh16,
title = {Collaborative Filtering with Side Information: a Gaussian Process Perspective},
author = {Kim, H. and Lu, X. and Flaxman, S. and Teh, Y. W.},
note = {ArXiv e-prints: 1605.07025},
year = {2016}
}
S. Flaxman, D. Sejdinovic, J. Cunningham, S. Filippi, Bayesian Learning of Kernel Embeddings, in Uncertainty in Artificial Intelligence (UAI), 2016, 182–191.
Kernel methods are one of the mainstays of machine learning, but the problem of kernel learning remains challenging, with only a few heuristics and very little theory. This is of particular importance in methods based on estimation of kernel mean embeddings of probability measures. For characteristic kernels, which include most commonly used ones, the kernel mean embedding uniquely determines its probability measure, so it can be used to design a powerful statistical testing framework, which includes nonparametric two-sample and independence tests. In practice, however, the performance of these tests can be very sensitive to the choice of kernel and its lengthscale parameters. To address this central issue, we propose a new probabilistic model for kernel mean embeddings, the Bayesian Kernel Embedding model, combining a Gaussian process prior over the Reproducing Kernel Hilbert Space containing the mean embedding with a conjugate likelihood function, thus yielding a closed form posterior over the mean embedding. The posterior mean of our model is closely related to recently proposed shrinkage estimators for kernel mean embeddings, while the posterior uncertainty is a new, interesting feature with various possible applications. Critically for the purposes of kernel learning, our model gives a simple, closed form marginal pseudolikelihood of the observed data given the kernel hyperparameters. This marginal pseudolikelihood can either be optimized to inform the hyperparameter choice or fully Bayesian inference can be used.
@inproceedings{FlaSejCunFil2016,
author = {Flaxman, S. and Sejdinovic, D. and Cunningham, J.P. and Filippi, S.},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
pages = {182--191},
title = {{Bayesian Learning of Kernel Embeddings}},
year = {2016},
bdsk-url-1 = {http://www.auai.org/uai2016/proceedings/papers/145.pdf},
bdsk-url-2 = {http://www.auai.org/uai2016/proceedings/supp/145_supp.pdf}
}
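For readers unfamiliar with the kernel mean embeddings discussed in the entry above, here is a minimal sketch of the classical empirical embedding and the biased squared maximum mean discrepancy (MMD) used in two-sample testing. It shows the frequentist baseline whose kernel and lengthscale choice the abstract calls sensitive. It does not implement the Bayesian Kernel Embedding model, and the function names and sample values are assumptions made for illustration.

```python
import math


def rbf(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel, a characteristic kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def embedding_at(xs, t, lengthscale=1.0):
    """Empirical kernel mean embedding evaluated at t: mean of k(x_i, t)."""
    return sum(rbf(x, t, lengthscale) for x in xs) / len(xs)


def mean_kernel(xs, ys, lengthscale=1.0):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf(x, y, lengthscale) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, lengthscale=1.0):
    """Biased estimate of squared MMD: squared RKHS distance between embeddings."""
    return (mean_kernel(xs, xs, lengthscale)
            + mean_kernel(ys, ys, lengthscale)
            - 2.0 * mean_kernel(xs, ys, lengthscale))


if __name__ == "__main__":
    a = [0.0, 0.1, 0.2]
    b = [5.0, 5.1, 5.2]
    # The statistic depends strongly on the lengthscale, which is the
    # sensitivity that motivates learning kernel hyperparameters.
    for ls in (0.1, 1.0, 100.0):
        print(ls, mmd2_biased(a, b, lengthscale=ls))
```

Running the loop shows the same two samples producing very different statistics as the lengthscale changes. With a very large lengthscale the two samples become nearly indistinguishable.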
Software
2017
S. Flaxman, M. Chirico, P. Pereira, C. Loeffler, Forecasting Crime in Portland. 2017.
@software{FlaScal2017,
author = {Flaxman, Seth and Chirico, Michael and Pereira, Pau and Loeffler, Charles},
title = {Forecasting Crime in Portland},
year = {2017}
}