Professor
MIT, Mathematics
I work at the intersection of statistics, machine learning, and optimization, focusing primarily on the design and analysis of statistical methods for high-dimensional problems. My recent research focuses on statistical optimal transport and its applications to geometric data analysis and sampling.
MIT, Mathematics
MIT, Mathematics
MIT, Mathematics
Princeton University, ORFE
Georgia Tech, Mathematics
Ph.D. in Mathematics
University of Paris VI
B. Sc. in Applied Mathematics
University of Paris VI
B. Sc. in Statistics
Institute for Statistics (ISUP)
Rigollet's current research is focused on the interplay between statistics, optimization and geometry, primarily in the context of optimal transport.
If you are a prospective Ph.D. student interested in working with me, please apply through MIT Department of Mathematics and indicate Statistics as your primary field of interest. Unfortunately, I cannot respond to individual requests during the application process.
If you are a prospective Postdoc interested in working with me, you may apply to the following positions at MIT. I encourage you to apply to all of these postions.
Paxton Turner
Postdoc, Statistics department, Harvard
Julien Clancy
Quant, Citadel
Jan-Christian Huetter
Scientist, Genentech
Jonathan Niles-Weed
Assistant Professor, Courant Institute and Center for Data Science, NYU
Cheng Mao
Assistant Professor, School of Math, Georgia Tech
Lucy Xia (co-advised by J. Fan)
Assistant Professor, Business School, HKUST
Quentin Berthet
Research Scientist, Google Brain Paris
Xin Tong (co-advised by J. Fan)
Associate Professor, Business School, USC
Thibaut Le Gouic
Assistant Professor, Ecole Central Marseille
Tyler Maunu
Assistant Professor, Math Department, Brandeis
Jingbo Liu
Assistant Professor, Statistics Department, UIUC
Andrej Risteski
Assistant Professor, Machine Learning Department, CMU
Elina Robeva
Assistant Professor, Mathematics Department, UBC
Geoffrey Schiebinger
Assistant Professor, Mathematics Department, UBC
Aden Forrow
Research Fellow, Mathematical Institute, Oxford
Victor-Emmanuel Brunel
Associate Professor, ENSAE-CREST, Paris, France
Afonso Bandeira
Professor, Mathematics Department, ETH Zurich
Irene Waldspurger
CNRS Researcher, Universite Paris Dauphine
We revisit the problem of recovering a low-rank positive semidefinite matrix from rank-one projections using tools from optimal transport. More specifically, we show that a variational formulation of this problem is equivalent to computing a Wasserstein barycenter. In turn, this new perspective enables the development of new geometric first-order methods with strong convergence guarantees in Bures-Wasserstein distance. Experiments on simulated data demonstrate the advantages of our new methodology over existing methods.
Comparing the representations learned by different neural networks has recently emerged as a key tool to understand various architectures and ultimately optimize them. In this work, we introduce GULP, a family of distance measures between representations that is explicitly motivated by downstream predictive tasks. By construction, GULP provides uniform control over the difference in prediction performance between two representations, with respect to regularized linear prediction tasks. Moreover, it satisfies several desirable structural properties, such as the triangle inequality and invariance under orthogonal transformations, and thus lends itself to data embedding and visualization. We extensively evaluate GULP relative to other methods, and demonstrate that it correctly differentiates between architecture families, converges over the course of training, and captures generalization performance on downstream linear tasks.
We study the sample complexity of entropic optimal transport in high dimensions using computationally efficient plug-in estimators. We significantly advance the state of the art by establishing dimension-free, parametric rates for estimating various quantities of interest, including the entropic regression function which is a natural analog to the optimal transport map. As an application, we propose a practical model for transfer learning based on entropic optimal transport and establish parametric rates of convergence for nonparametric regression and classification.
Along with Markov chain Monte Carlo (MCMC) methods, variational inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior \(\pi\), VI aims at producing a simple but effective approximation \(\hat \pi\) to \(\pi\) for which summary statistics are easy to compute. However, unlike the well-studied MCMC methodology, VI is still poorly understood and dominated by heuristics. In this work, we propose principled methods for VI, in which \(\hat \pi\) is taken to be a Gaussian or a mixture of Gaussians, which rest upon the theory of gradient flows on the Bures-Wasserstein space of Gaussian measures. Akin to MCMC, it comes with strong theoretical guarantees when π is log-concave.
We describe an efficient algorithm to compute solutions for the general two-player Blotto game on n battlefields with heterogeneous values. While explicit constructions for such solutions have been limited to specific, largely symmetric or homogeneous, setups, this algorithmic resolution covers the most general situation to date: value-asymmetric game with asymmetric budget. The proposed algorithm rests on recent theoretical advances regarding Sinkhorn iterations for matrix and tensor scaling. An important case which had been out of reach of previous attempts is that of heterogeneous but symmetric battlefield values with asymmetric budget. In this case, the Blotto game is constant-sum so optimal solutions exist, and our algorithm samples from an \(\varepsilon\)-optimal solution in time \(O(n^2 + \varepsilon^{-4})\), independently of budgets and battlefield values. In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an \(\varepsilon\)-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.
We introduce a novel relaxation of combinatorial discrepancy called Gaussian discrepancy, whereby binary signings are replaced with correlated standard Gaussian random variables. This relaxation effectively reformulates an optimization problem over the Boolean hypercube into one over the space of correlation matrices. We show that Gaussian discrepancy is a tighter relaxation than the previously studied vector and spherical discrepancy problems, and we construct a fast online algorithm that achieves a version of the Banaszczyk bound for Gaussian discrepancy. This work also raises new questions such as the Komlós conjecture for Gaussian discrepancy, which may shed light on classical discrepancy problems.
We propose a method based on optimal transport theory for causal inference in classical treatment and control study designs. Our approach sheds a new light on existing approaches and generalizes them to settings with high-dimensional data. The implementation of our method leverages recent advances in computational optimal transport to produce an estimate of high-dimensional counterfactual outcomes. The benefits of this extension are demonstrated both on synthetic and real data that are beyond the reach of existing methods. In particular, we revisit the classical Card & Krueger dataset on the effect of a minimum wage increase on employment in fast food restaurants and obtain new insights about the impact of raising the minimum wage on employment of full- and part-time workers in the fast food industry.
Motivated by cutting-edge applications like cryo-electron microscopy (cryo-EM), the Multi-Reference Alignment (MRA) model entails the learning of an unknown signal from repeated measurements of its images under the latent action of a group of isometries and additive noise of magnitude \(\sigma\). Despite significant interest, a clear picture for understanding rates of estimation in this model has emerged only recently, particularly in the high-noise regime \(\sigma \gg 1\) that is highly relevant in applications. Recent investigations have revealed a remarkable asymptotic sample complexity of order \(\sigma^6\) for certain signals whose Fourier transforms have full support, in stark contrast to the traditional \(\sigma^2\) that arise in regular models. Often prohibitively large in practice, these results have prompted the investigation of variations around the MRA model where better sample complexity may be achieved. In this paper, we show that sparse signals exhibit an intermediate \(\sigma^4\) sample complexity even in the classical MRA model. Our results explore and exploit connections of the MRA estimation problem with two classical topics in applied mathematics: the beltway problem from combinatorial optimization, and uniform uncertainty principles from harmonic analysis.
We establish the first tight lower bound of \(\Omega(\log\log\kappa)\) on the query complexity of sampling from the class of strongly log-concave and log-smooth distributions with condition number κ in one dimension. Whereas existing guarantees for MCMC-based algorithms scale polynomially in \(\kappa\), we introduce a novel algorithm based on rejection sampling that closes this doubly exponential gap.
We consider the task of generating exact samples from a target distribution, known up to normalization, over a finite alphabet. The classical algorithm for this task is rejection sampling, and although it has been used in practice for decades, there is surprisingly little study of its fundamental limitations. In this work, we study the query complexity of rejection sampling in a minimax framework for various classes of discrete distributions. Our results provide new algorithms for sampling whose complexity scales sublinearly with the alphabet size. When applied to adversarial bandits, we show that a slight modification of the Exp3 algorithm reduces the per-iteration complexity from \(O(K)\) to \(O(\log^2 K)\), where \(K\) is the number of arms.
This work establishes fast rates of convergence for empirical barycenters over a large class of geodesic spaces with curvature bounds in the sense of Alexandrov. More specifically, we show that parametric rates of convergence are achievable under natural conditions that characterize the bi-extendibility of geodesics emanating from a barycenter. These results largely advance the state-of-the-art on the subject both in terms of rates of convergence and the variety of spaces covered. In particular, our results apply to infinite-dimensional spaces such as the 2-Wasserstein space, where bi-extendibility of geodesics translates into regularity of Kantorovich potentials.
Brenier's theorem is a cornerstone of optimal transport that guarantees the existence of an optimal transport map \(T\) between two probability distributions \(P\) and \(Q\) over \(\mathbb{R}^d\) under certain regularity conditions. The main goal of this work is to establish the minimax rates estimation rates for such a transport map from data sampled from \(P\) and \(Q\) under additional smoothness assumptions on \(T\). To achieve this goal, we develop an estimator based on the minimization of an empirical version of the semi-dual optimal transport problem, restricted to truncated wavelet expansions. This estimator is shown to achieve near minimax optimality using new stability arguments for the semi-dual and a complementary minimax lower bound. These are the first minimax estimation rates for transport maps in general dimension.
Conventional wisdom in the sampling literature, backed by a popular diffusion scaling limit, suggests that the mixing time of the Metropolis-Adjusted Langevin Algorithm (MALA) scales as \(O(d^{1/3})\), where \(d\) is the dimension. However, the diffusion scaling limit requires stringent assumptions on the target distribution and is asymptotic in nature. In contrast, the best known non-asymptotic mixing time bound for MALA on the class of log-smooth and strongly log-concave distributions is \(O(d)\). In this work, we establish that the mixing time of MALA on this class of target distributions is \(\widetilde\Theta(d^{1/2})\) under a warm start. Our upper bound proof introduces a new technique based on a projection characterization of the Metropolis adjustment which reduces the study of MALA to the well-studied discretization analysis of the Langevin SDE and bypasses direct computation of the acceptance probability.
Coresets have emerged as a powerful tool to summarize data by selecting a small subset of the original observations while retaining most of its information. This approach has led to significant computational speedups but the performance of statistical procedures run on coresets is largely unexplored. In this work, we develop a statistical framework to study coresets and focus on the canonical task of nonparameteric density estimation. Our contributions are twofold. First, we establish the minimax rate of estimation achievable by coreset-based estimators. Second, we show that the practical coreset kernel density estimators are near-minimax optimal over a large class of Hölder-smooth densities.
We study the problem of space and time efficient evaluation of a nonparametric estimator that approximates an unknown density. In the regime where consistent estimation is possible, we use a piecewise multivariate polynomial interpolation scheme to give a computationally efficient construction that converts the original estimator to a new estimator that can be queried efficiently and has low space requirements, all without adversely deteriorating the original approximation quality. Our result gives a new statistical perspective on the problem of fast evaluation of kernel density estimators in the presence of underlying smoothness. As a corollary, we give a succinct derivation of a classical result of Kolmogorov---Tikhomirov on the metric entropy of H\"{o}lder classes of smooth functions.
We propose a new method for smoothly interpolating probability measures using the geometry of optimal transport. To that end, we reduce this problem to the classical Euclidean setting, allowing us to directly leverage the extensive toolbox of spline interpolation. Unlike previous approaches to measure-valued splines, our interpolated curves (i) have a clear interpretation as governing particle flows, which is natural for applications, and (ii) come with the first approximation guarantees on Wasserstein space. Finally, we demonstrate the broad applicability of our interpolation methodology by fitting surfaces of measures using thin-plate splines.
Stein Variational Gradient Descent (SVGD), a popular sampling algorithm, is often described as the kernelized gradient flow for the Kullback-Leibler divergence in the geometry of optimal transport. We introduce a new perspective on SVGD that instead views SVGD as the (kernelized) gradient flow of the chi-squared divergence which, we show, exhibits a strong form of uniform exponential ergodicity under conditions as weak as a Poincaré inequality. This perspective leads us to propose an alternative to SVGD, called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), that can be implemented from the spectral decomposition of the Laplacian operator associated with the target density. We show that LAWGD exhibits strong convergence guarantees and good practical performance.
2D mixture of two Gaussians. Left: LAWGD converges. Right: SVGD (particules still move according to a divergence-free velocity field)
In the context of regression, we consider the fundamental question of making an estimator fair while preserving its prediction accuracy as much as possible. To that end, we define its projection to fairness as its closest fair estimator in a sense that reflects prediction accuracy. Our methodology leverages tools from optimal transport to construct efficiently the projection to fairness of any given estimator as a simple post-processing step. Moreover, our approach precisely quantifies the cost of fairness, measured in terms of prediction accuracy.
Motivated by the problem of sampling from ill-conditioned log-concave distributions, we give a clean non-asymptotic convergence analysis of mirror-Langevin diffusions as introduced in Zhang et al. (2020). As a special case of this framework, we propose a class of diffusions called Newton-Langevin diffusions and prove that they converge to stationarity exponentially fast with a rate which not only is dimension-free, but also has no dependence on the target distribution. We give an application of this result to the problem of sampling from the uniform distribution on a convex body using a strategy inspired by interior-point methods. Our general approach follows the recent trend of linking sampling and optimization, and in particular, it yields new results on the convergence of the vanilla Langevin diffusion in Wasserstein distance.
We study minimax estimation of two-dimensional totally positive distributions. Such distributions pertain to pairs of strongly positively dependent random variables and appear frequently in statistics and probability. In particular, for distributions with \(\beta\)-Hölder smooth densities where \(\beta \in (0,2)\), we observe polynomially faster minimax rates of estimation when, additionally, the total positivity condition is imposed. Moreover, we demonstrate fast algorithms to compute the proposed estimators and corroborate the theoretical rates of estimation by simulation studies.
The increasingly complex nature of data has led statisticians to rethinking even the most basic of modeling assumptions. In this context, a determinantal point process (DPP) modeling paradigm promotes diversity in the sample at hand. In this work, we introduce a simple and flexible Gaussian DPP model to capture directionality in the data. Using the Gaussian DPP as an ansatz, we obtain an approach for dimensionality reduction that produces a better and more readable representation of the original data than standard principal component analysis (PCA). These findings are supported by a finite sample analysis of the performance of our estimator, in particular in a spiked model similar to the one employed to analyze PCA.
We study first order methods to compute the barycenter of a probability distribution over the Bures-Wasserstein manifold. We derive global rates of convergence for both gradient descent and stochastic gradient descent despite the fact that the barycenter functional is not geodesically convex. Our analysis overcomes this technical hurdle by developing a Polyak-Lojasiewicz (PL) inequality, which is built using tools from optimal transport and metric geometry.
Motivated by problems in controlled experiments, we study the discrepancy of random matrices with continuous entries where the number of columns \(n\) is much larger than the number of rows \(m\). Our first result shows that if \(\omega(1) \leq m \leq o(n)\), a matrix with i.i.d. standard Gaussian entries has discrepancy \(\Theta(\sqrt{n} \, 2^{-n/m})\) with high probability. This provides sharp guarantees for Gaussian discrepancy in a regime that had not been considered before in the existing literature. Our results also apply to a more general family of random matrices with continuous i.i.d. entries, assuming that \(m \leq O(n/\log{n})\). The proof is non-constructive and is an application of the second moment method. Our second result is algorithmic and applies to random matrices whose entries are i.i.d. and have a Lipschitz density. We present a randomized polynomial-time algorithm that achieves discrepancy \(e^{-\Omega(\log^2(n)/m)}\) with high probability, provided that \(m \leq O(\sqrt{\log{n}})\). In the one-dimensional case, this matches the best known algorithmic guarantees due to Karmarkar--Karp. For higher dimensions \(2 \leq m \leq O(\sqrt{\log{n}})\), this establishes the first efficient algorithm achieving discrepancy smaller than \(O( \sqrt{m} )\).
Causal models are important tools to understand complex phenomena and predict the outcome of controlled experiments, also known as interventions. In this work, we present statistical rates of estimation for linear cyclic causal models under the assumption of homoscedastic Gaussian noise by analyzing both the LLC estimator introduced by Hyttinen, Eberhardt and Hoyer and a novel two-step penalized maximum likelihood estimator. We establish asymptotic near minimax optimality for the maximum likelihood estimator over a class of sparse causal graphs in the case of near-optimally chosen interventions. Moreover, we find evidence for practical advantages of this estimator compared to LLC in synthetic numerical experiments.
The knockoff filter introduced by Barber and Candes 2016 is an elegant framework for controlling the false discovery rate in variable selection. While empirical results indicate that this methodology is not too conservative, there is no conclusive theoretical result on its power. When the predictors are i.i.d. Gaussian, it is known that as the signal to noise ratio tend to infinity, the knockoff filter is consistent in the sense that one can make FDR go to 0 and power go to 1 simultaneously. In this work we study the case where the predictors have a general covariance matrix \(\Sigma\). We introduce a simple functional called effective signal deficiency (ESD) of the covariance matrix \(\Sigma\) that predicts consistency of various variable selection methods. In particular, ESD reveals that the structure of the precision matrix \(\Sigma^{-1}\) plays a central role in consistency and therefore, so does the conditional independence structure of the predictors. To leverage this connection, we introduce Conditional Independence knockoff, a simple procedure that is able to compete with the more sophisticated knockoff filters and that is defined when the predictors obey a Gaussian tree graphical models (or when the graph is sufficiently sparse). Our theoretical results are supported by numerical evidence on synthetic data.
We propose a new statistical model, the spiked transport model, which formalizes the assumption that two probability distributions differ only on a low-dimensional subspace. We study the minimax rate of estimation for the Wasserstein distance under this model and show that this low-dimensional structure can be exploited to avoid the curse of dimensionality. As a byproduct of our minimax analysis, we establish a lower bound showing that, in the absence of such structure, the plug-in estimator is nearly rate-optimal for estimating the Wasserstein distance in high dimension. We also give evidence for a statistical-computational gap and conjecture that any computationally efficient estimator is bound to suffer from the curse of dimensionality.
The growing role of data-driven approaches to scientific discovery has unveiled a large class of models that involve latent transformations with a rigid algebraic constraint. Among them, multi-reference alignment (MRA) is a simple model that captures fundamental aspects of the statistical and algorithmic challenges arising from this new paradigm. In this model, an unknown signal is subject to two types of corruption: a latent cyclic shift and the more traditional additive white noise. The goal is to recover the signal at a certain precision from independent samples. While at high signal-to-noise ratio (SNR), the number of observations needed to recover a generic signal is proportional to 1/SNR, we show that it rises to 1/SNR^3 in the more realistic low SNR regime. We propose an algorithm that achieves this optimal dependence on the SNR. Furthermore, we extend our results to cover a heterogeneous MRA model where the samples come from a mixture of signals, as is often the case in applications such as Cryo-Electron Microscopy, where molecules may have different conformations. We provide the first known procedure that provably achieves signal recovery in the low SNR regime for heterogeneous MRA.
Monge matrices and their permuted versions known as pre-Monge matrices naturally appear in many domains across science and engineering. While the rich structural properties of such matrices have long been leveraged for algorithmic purposes, little is known about their impact on statistical estimation. In this work, we propose to view this structure as a shape constraint and study the problem of estimating a Monge matrix subject to additive random noise. More specifically, we establish the minimax rates of estimation of Monge and pre-Monge matrices. In the case of pre-Monge matrices, the minimax-optimal least-squares estimator is not efficiently computable, and we propose two efficient estimators and establish their rates of convergence. Our theoretical findings are supported by numerical experiments.
Richard Mansfield Dudley (Dick Dudley) was born in 1938. He received the A.B. from Harvard in 1952 and the Ph.D. from Princeton in 1962 (under the supervision of Gilbert Hunt and Edward Nelson). Following an appointment at UC Berkeley as an assistant professor, he joined the Department of Mathematics at MIT in 1967. Dick Dudley has made fundamental contributions to the theory of Gaussian processes and Probability in Banach Spaces. Among his major achievements is the development of a general framework for empirical processes theory, in particular, for uniform central limit theorems. These results have had and continue having tremendous impact in contemporary statistics and in mathematical foundations of machine learning. A more extensive biographical sketch is contained in the preface to the Selected works of R. M. Dudley (editors: E. Giné, V. Koltchinskii and R. Norvaisa) published in 2010. This conversation took place (mostly, via email) in the fall of 2017.
Isotonic regression is a standard problem in shape-constrained estimation where the goal is to estimate an unknown nondecreasing regression function \(f\) from independent pairs \((x_i, y_i)\) where \(\mathbb{E}[y_i]=f(x_i), i=1, \ldots n\). While this problem is well understood both statistically and computationally, much less is known about its uncoupled counterpart where one is given only the unordered sets \(\{x_1, \ldots, x_n\}\) and \(\{y_1, \ldots, y_n\}\). In this work, we leverage tools from optimal transport theory to derive minimax rates under weak moments conditions on \(y_i\) and to give an efficient algorithm achieving optimal rates. Both upper and lower bounds employ moment-matching arguments that are also pertinent to learning mixtures of distributions and deconvolution.
Understanding the molecular programs that guide cellular differentiation during development is a major goal of modern biology. Here, we introduce an approach, WADDINGTON-OT, based on the mathematics of optimal transport, for inferring developmental landscapes, probabilistic cellular fates and dynamic trajectories from large-scale single-cell RNA-seq (scRNA-seq) data collected along a time course. We demonstrate the power of WADDINGTON-OT by applying the approach to study 65,781 scRNA-seq profiles collected at 10 time points over 16 days during reprogramming of fibroblasts to iPSCs. We construct a high-resolution map of reprogramming that rediscovers known features; uncovers new alternative cell fates including neural- and placental-like cells; predicts the origin and fate of any cell class; highlights senescent-like cells that may support reprogramming through paracrine signaling; and implicates regulatory models in particular trajectories. Of these findings, we highlight Obox6, which we experimentally show enhances reprogramming efficiency. Our approach provides a general framework for investigating cellular differentiation.
We propose a new method to estimate Wasserstein distances and optimal transport plans between two probability distributions from samples in high dimension. Unlike plug-in rules that simply replace the true distributions by their empirical counterparts, our method promotes couplings with low transport rank, a new structural assumption that is similar to the nonnegative rank of a matrix. Regularizing based on this assumption leads to drastic improvements on high-dimensional data for various tasks, including domain adaptation in single-cell RNA sequencing data. These findings are supported by a theoretical analysis that indicates that the transport rank is key in overcoming the curse of dimensionality inherent to data-driven optimal transport.
We give a statistical interpretation of entropic optimal transport by showing that performing maximum-likelihood estimation for Gaussian deconvolution corresponds to calculating a projection with respect to the entropic optimal transport distance. This structural result gives theoretical support for the wide adoption of these tools in the machine learning community.
We call a learner super-teachable if a teacher can trim down an iid training set while making the learner learn even better. We provide sharp super-teaching guarantees on two learners: the maximum likelihood estimator for the mean of a Gaussian, and the large margin classifier in 1D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems.
There has been a recent surge of interest in studying permutation-based models for ranking from pairwise comparison data. Despite being structurally richer and more robust than parametric ranking models, permutation-based models are less well understood statistically and generally lack efficient learning algorithms. In this work, we study a prototype of permutation-based ranking models, namely, the noisy sorting model. We establish the optimal rates of learning the model under two sampling procedures. Furthermore, we provide a fast algorithm to achieve near-optimal rates if the observations are sampled independently. Along the way, we discover properties of the symmetric group which are of theoretical interest.
Independent component analysis (ICA) is a cornerstone of modern data analysis. Its goal is to recover a latent random vector S with independent components from samples of X=AS where A is an unknown mixing matrix. Critically, all existing methods for ICA rely on and exploit strongly the assumption that S is not Gaussian as otherwise A becomes unidentifiable. In this paper, we show that in fact one can handle the case of Gaussian components by imposing structure on the matrix A. Specifically, we assume that A is sparse and generic in the sense that it is generated from a sparse Bernoulli-Gaussian ensemble. Under this condition, we give an efficient algorithm to recover the columns of A given only the covariance matrix of X as input even when S has several Gaussian components.
This paper describes optimal rates of adaptive estimation of a vector in the multi-reference alignment model, a problem with important applications in fields such as signal processing, image processing, and computer vision, among others. We describe how this model can be viewed as a multivariate Gaussian mixture model under the constraint that the centers belong to the orbit of a group. This enables us to derive matching upper and lower bounds that feature an interesting dependence on the signal-to-noise ratio of the model. Both upper and lower bounds are articulated around a tight local control of Kullback-Leibler divergences that showcases the central role of moment tensors in this problem.
These lecture notes were written for the course 18.657, High Dimensional Statistics at MIT. They build on a set of notes that was prepared at Princeton University in 2013-14 that was modified (and hopefully improved) over the years. Over the past decade, statistics have undergone drastic changes with the development of high-dimensional statistical inference. Indeed, on each individual, more and more features are measured to a point that their number usually far exceeds the number of observations. This is the case in biology and specifically genetics where millions of (combinations of) genes are measured for a single individual. High resolution imaging, finance, online advertising, climate studies ...the list of intensive data producing fields is too long to be established exhaustively. Clearly not all measured features are relevant for a given task and most of them are simply noise. But which ones? What can be done with so little data and so much noise? Surprisingly, the situation is not that bad and on some simple models we can assess to which extent meaningful statistical methods can be applied. Regression is one such simple model. Regression analysis can be traced back to 1632 when Galileo Galilei used a procedure to infer a linear relationship from noisy data. It was not until the early 19th century that Gauss and Legendre developed a systematic procedure: the least-squares method. Since then, regression has been studied in so many forms that much insight has been gained and recent advances on high-dimensional statistics would not have been possible without standing on the shoulders of giants. In these notes, we will explore one, obviously subjective giant on whose shoulders high-dimensional statistics stand: nonparametric statistics. The works of Ibragimov and Has’minskii in the seventies followed by many researchers from the Russian school have contributed to developing a large toolkit to understand regression with an infinite number of parameters. Much insight from this work can be gained to understand high-dimensional or sparse regression and it comes as no surprise that Donoho and Johnstone have made the first contributions on this topic in the early nineties.
Computing optimal transport distances such as the earth mover's distance is a fundamental problem in machine learning, statistics, and computer vision. Despite the recent introduction of several algorithms with good empirical performance, it is unknown whether general optimal transport distances can be approximated in near-linear time. This paper demonstrates that this ambitious goal is in fact achieved by Cuturi's Sinkhorn Distances, and provides guidance towards parameter tuning for this algorithm. This result relies on a new analysis of Sinkhorn iterations that also directly suggests a new algorithm Greenkhorn with the same theoretical guarantees. Numerical simulations illustrate that Greenkhorn significantly outperforms the classical Sinkhorn algorithm in practice.
Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. Our contribution is twofold: (i) we establish the optimal sample complexity achievable in this problem and show that it is governed by a natural parameter, which we call the \emph{cycle sparsity}; (ii) we propose a provably fast combinatorial algorithm that implements the method of moments efficiently and achieves optimal sample complexity. Finally, we give experimental results that confirm our theoretical findings.
eterminantal point processes (DPPs) have wide-ranging applications in machine learning, where they are used to enforce the notion of diversity in subset selection problems. Many estimators have been proposed, but surprisingly the basic properties of the maximum likelihood estimator (MLE) have received little attention. The difficulty is that it is a non-concave maximization problem, and such functions are notoriously difficult to understand in high dimensions, despite their importance in modern machine learning. Here we study both the local and global geometry of the expected log-likelihood function. We prove several rates of convergence for the MLE and give a complete characterization of the case where these are parametric. We also exhibit a potential curse of dimensionality where the asymptotic variance of the MLE scales exponentially with the dimension of the problem. Moreover, we exhibit an exponential number of saddle points, and give evidence that these may be the only critical points.
We prove that Kendall's Rank correlation matrix converges to the Marcenko-Pastur law, under the assumption that the observations are i.i.d random vectors X_1, …, X_n with components that are independent and absolutely continuous with respect to the Lebesgue measure. This is the first result on the empirical spectral distribution of a multivariate U-statistic.
We consider the problem associated to recovering the block structure of an Ising model given independent observations on the binary hypercube. This new model, called the Ising blockmodel, is a perturbation of the mean field approximation of the Ising model known as the Curie-Weiss model: the sites are partitioned into two blocks of equal size and the interaction between those of the same block is stronger than across blocks, to account for more order within each block. We study probabilistic, statistical and computational aspects of this model in the high-dimensional case when the number of sites may be much larger than the sample size.
Given a matrix the seriation problem consists in permuting its rows in such way that all its columns have the same shape, for example, they are monotone increasing. We propose a statistical approach to this problem where the matrix of interest is observed with noise and study the corresponding minimax rate of estimation of the matrices. Specifically, when the columns are either unimodal or monotone, we show that the least squares estimator is optimal up to logarithmic factors and adapts to matrices with a certain natural structure. Finally, we propose a computationally efficient estimator in the monotonic case and study its performance both theoretically and experimentally. Our work is at the intersection of shape constrained estimation and recent work that involves permutation learning, such as graph denoising and ranking.
Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical and we simply compare our performance against that of the best fixed bid in hindsight. We show that sublinear regret is also achievable in this case and prove matching minimax lower bounds. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.
Motivated by its practical success, we show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates of estimation for a large class of image models such as bi-isotonic, H\"older smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid such as eigenvector delocalization and spectral decay. We also present extensions to more than two dimensions as well as several other graphs.
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
Invited book review for the Journal of the American Statistical Association.
Invited comment on the discussion paper "Hypothesis testing by convex optimization" by Alexander Goldenshluger, Anatoli Juditsky and Arkadi Nemirovski.
High-dimensional statistical tests often ignore correlations to gain simplicity and stability leading to null distributions that depend on functionals of correlation matrices such as their Frobenius norm and other \(\ell_r\) norms. Motivated by the computation of critical values of such tests, we investigate the difficulty of estimation the functionals of sparse correlation matrices. Specifically, we show that simple plug-in procedures based on thresholded estimators of correlation matrices are sparsity-adaptive and minimax optimal over a large class of correlation matrices. Akin to previous results on functional estimation, the minimax rates exhibit an elbow phenomenon. Our results are further illustrated in simulated data as well as an empirical study of data arising in financial econometrics.
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic multi-armed bandits under the constraint that the employed policy must split trials into a small number of batches. Our results show that a very small number of batches gives already close to minimax optimal regret bounds and we also evaluate the number of trials in each batch. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
We consider the problem of aggregating a general collection of affine estimators for fixed design regression. Relevant examples include some commonly used statistical estimators such as least squares, ridge and robust least squares estimators. Dalalyan and Salmon (2012) have established that, for this problem, exponentially weighted (EW) model selection aggregation leads to sharp oracle inequalities in expectation, but similar bounds in deviation were not previously known. While results (Dai, Rigollet, Zhang, 2012) indicate that the same aggregation scheme may not satisfy sharp oracle inequalities with high probability, we prove that a weaker notion of oracle inequality for EW that holds with high probability. Moreover, using a generalization of the newly introduced \(Q\)-aggregation scheme we also prove sharp oracle inequalities that hold with high probability. Finally, we apply our results to universal aggregation and show that our proposed estimator leads simultaneously to all the best known bounds for aggregation, including \(\ell_q\)-aggregation, \(q \in (0,1)\), with high probability.
We consider a general supervised learning problem with strongly convex and Lipschitz loss and study the problem of model selection aggregation. In particular, given a finite dictionary functions (learners) together with the prior, we generalize the results obtained by Dai, Rigollet and Zhang (2012) for Gaussian regression with squared loss and fixed design to this learning setup. Specifically, we prove that the Q-aggregation procedure outputs an estimator that satisfies optimal oracle inequalities both in expectation and with high probability. Our proof techniques somewhat depart from traditional proofs by making most of the standard arguments on the Laplace transform of the empirical process to be controlled.
In the context of sparse principal component detection, we bring evidence towards the existence of a statistical price to pay for computational efficiency. We measure the performance of a test by the smallest signal strength that it can detect and we propose a computationally efficient method based on semidefinite programming. We also prove that the statistical performance of this test cannot be strictly improved by any computationally efficient method. Our results can be viewed as complexity theoretic lower bounds conditionally on the assumptions that some instances of the planted clique problem cannot be solved in randomized polynomial time.
We study the stochastic multi-armed bandit problem when one knows the value \(\mu^{(\star)}\) of an optimal arm, as a well as a positive lower bound on the smallest positive gap \(\Delta\). We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows \(\Delta\), and bounded regret of order \(1/\Delta\) is not possible if one only knows \(\mu^{(\star)}\).
We perform a finite sample analysis of the detection levels for sparse principal components of a high-dimensional covariance matrix. Our minimax optimal test is based on a sparse eigenvalue statistic. Alas, computing this test is known to be NP-complete in general, and we describe a computationally efficient alternative test using convex relaxations. Our relaxation is also proved to detect sparse principal components at near optimal detection levels, and it performs well on simulated datasets. Moreover, using polynomial time reductions from theoretical computer science, we bring significant evidence that our results cannot be improved, thus revealing an inherent trade off between statistical and computational performance.
We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably localized static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition on the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples.
Discussion of ``Minimax Estimation of Large Covariance Matrices under L1-Norm'' by Tony Cai and Harrison Zhou.
Given a finite family of functions, the goal of model selection is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called Q-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy Q-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.
In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it does not require identifiability of the parameter or even that the model on the systematic component is true. It is shown that this problem can be solved by constrained and/or penalized likelihood maximization and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally all the bounds are proved to be optimal in a minimax sense.
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss \(\varphi\). Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its \(\varphi\)-type I error is below a pre-specified level and (ii), it has \(\varphi\)-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error.
In high-dimensional linear regression, the goal pursued here is to estimate an unknown regression function using linear combinations of a suitable set of covariates. One of the key assumptions for the success of any statistical procedure in this setup is to assume that the linear combination is sparse in some sense, for example, that it involves only few covariates. We consider a general, non necessarily linear, regression with Gaussian noise and study a related question that is to find a linear combination of approximating functions, which is at the same time sparse and has small mean squared error (MSE). We introduce a new estimation procedure, called \emph{Exponential Screening} that shows remarkable adaptation properties. It adapts to the linear combination that optimally balances MSE and sparsity, whether the latter is measured in terms of the number of non-zero entries in the combination (\(\ell_0\) norm) or in terms of the global weight of the combination (\(\ell_1\) norm). The power of this adaptation result is illustrated by showing that Exponential Screening solves optimally and simultaneously all the problems of aggregation in Gaussian regression that have been discussed in the literature. Moreover, we show that the performance of the Exponential Screening estimator cannot be improved in a minimax sense, even if the optimal sparsity is known in advance. The theoretical and numerical superiority of Exponential Screening compared to state-of-the-art sparse procedures is also discussed.
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i), its probability of type~I error is below a pre-specified level and (ii), it has probability of type ~II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical objective subject to an empirical constraint. The novelty of the method is that the classifier output by this problem is shown to satisfy the original constraint on type~I error. This strict enforcement of the constraint has interesting consequences on the control of the type~II error and we develop new techniques to handle this situation. Finally, connections with chance constrained optimization are evident and are investigated.
Me consider a bandit problem which involves sequential sampling from two populations (arms). Each arm produces a noisy reward realization which depends on an observable random covariate. The goal is to maximize cumulative expected reward. We derive general lower bounds on the performance of any admissible policy, and develop an algorithm whose performance achieves the order of said lower bound up to logarithmic terms. This is done by decomposing the global problem into suitably ``localized'' bandit problems. Proofs blend ideas from nonparametric statistics and traditional methods used in the bandit literature.
In the context of density level set estimation, we study the convergence of general plug-in methods under two main assumptions on the density for a given level \(\lambda\). More precisely, it is assumed that the density (i) is smooth in a neighborhood of \(\lambda\) and (ii) has \(\gamma\)-exponent at level \(\lambda\). Condition (i) ensures that the density can be estimated at a standard nonparametric rate and condition (ii) is similar to Tsybakov's margin assumption which is stated for the classification framework. Under these assumptions, we derive optimal rates of convergence for plug-in estimators. Explicit convergence rates are given for plug-in estimators based on kernel density estimators when the underlying measure is the Lebesgue measure. Lower bounds proving optimality of the rates in a minimax sense when the density is Holder smooth are also provided.
Given a finite collection of estimators or classifiers, we study the problem of model selection type aggregation, that is, we construct a new estimator or classifier, called aggregate, which is nearly as good as the best among them with respect to a given risk criterion. We define our aggregate by a simple recursive procedure which solves an auxiliary stochastic linear programming problem related to the original nonlinear one and constitutes a special case of the mirror averaging algorithm. We show that the aggregate satisfies sharp oracle inequalities under some general assumptions. The results are applied to several problems including regression, classification and density estimation.
We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples.
We study the problem of finding the best linear and convex combination of M estimators of a density with respect to the mean squared risk. We suggest aggregation procedures and we prove sharp oracle inequalities for their risks, i.e., oracle inequalities with leading constant 1. We also obtain lower bounds showing that these procedures attain optimal rates of aggregation. As an example, we consider aggregation of multivariate kernel density estimators with different bandwidths. We show that linear and convex aggregates mimic the kernel oracles in asymptotically exact sense. We prove that, for Pinsker's kernel, the proposed aggregates are sharp asymptotically minimax simultaneously over a large scale of Sobolev classes of densities. Finally, we provide simulations demonstrating performance of the convex aggregation procedure.
We study the problem of nonparametric estimation of a probability density of unknown smoothness in \(L_2(\mathbb{R})\). Expressing mean integrated squared error (MISE) in the Fourier domain, we show that it is close to mean squared error in the Gaussian sequence model. Then applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality. Two consequences of this oracle inequality are that the proposed estimator is sharp minimax adaptive over a scale of Sobolev classes of densities, and that its MISE is asymptotically smaller than or equal to that of kernel density estimators with any bandwidth provided that the kernel belongs to a large class of functions including many standard kernels.
Short note on aggregation published in Oberwolfach Reports following the Meeting on Statistical and Probabilistic Methods of Model Selection, October 2005.
We study the problem of the nonparametric estimation of a probability density in \(L_2(\mathbb{R})\). Expressing the mean integrated squared error in the Fourier domain, we show that it is close to the mean squared error in the Gaussian sequence model. Then, applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality and a kernel oracle inequality. As a consequence, the proposed estimator is sharp minimax adaptive (i.e. up to a constant) on a scale of Sobolev classes of densities.
I am/have been giving time to my community by serving in the following qualities.
The best way to reach me is by email.
2-279 in the Department of Mathematics
E17-467 in the Institute for Data Systems and Society