• Present 2016

#### Associate Professor

MIT, Mathematics

• 2016 2015

#### Assistant Professor

MIT, Mathematics

• 2014 2008

#### Assistant Professor

Princeton University, ORFE

• 2008 2007

#### Visiting Assistant Professor

Georgia Tech, Mathematics

### Education & Training

• Ph.D. 2006

Ph.D. in Mathematics

University of Paris VI

• B. Sc.2002

B. Sc. in Applied Mathematics

University of Paris VI

• B. Sc..2001

B. Sc. in Statistics

Institute for Statistics (ISUP)

### Honors, Awards and Grants

• 2017
NSF grant DMS-1712596
Statistical Estimation with Algebraic Structure. With Afonso Bandeira (NYU).
• 2017
Office of Naval Research
Statistical Learning Theory of Complex Causal Models (PI: C. Uhler).
• 2016
DARPA -- Fundamentals of Learning
Foundations of Scalable Statistical Learning (PI: A. Jadbabaie).
• 2016
NEC Corporation Fund for Research in Computers and Communications
Awarded by the MIT School of Science to support research on "Average-Case Complexity of Statistical Problems".
• 2015
Invited lecturer. Machine Learning Summer School.
The machine learning summer school (MLSS) series promulgates modern methods of statistical machine learning and inference. Machine learning summer schools present topics which are at the core of modern Machine Learning, from fundamentals to state-of-the-art practice. The speakers are leading experts in their field who talk with enthusiasm about their subjects. More information about MLSS here.
• 2013
Best Paper Award. Conference on Learning Theory (COLT)
Berthet and Rigollet received "best paper award" at the Conference On Learning Theory (COLT) for their paper "Complexity Theoretic Lower Bounds for Sparse Principal Component Detection". In this work, they show that employing computationally efficient statistical methods in the context of sparse principal component analysis leads to an ineluctable deterioration of statistical performance, regardless of the efficient method employed. In particular, their paper draw new connections between statistical learning and computational complexity. An extended version of the paper is available here.
• 2013
Howard B. Wentz Jr. Junior Faculty Award
The Howard B. Wentz, Jr. Junior Faculty Award from the School of Engineering and Applied Science (SEAS) at Princeton University “intended to recognize and assist promising junior faculty members”.
• 2013
NSF grant DMS-1317308
Statistical and Computational Tradeoffs in High Dimensional Learning. With Sanjeev Arora (Princeton).
• 2011
NSF CAREER Award DMS-1053987
Large Scale Stochastic Optimization and Statistics.
• 2010
250th Anniversary Fund for Innovation in Undergraduate Education
Competitive award given from the office of the President at Princeton University. The criteria for the award are: innovation and excellence, the positive impact of the course on the undergraduate curriculum, and its effectiveness in demonstrating the faculty's continuing commitment to undergraduate teaching.
• 2009
NSF grant DMS-0906424
Optimal sequential allocation in dynamic environments.
• 2009
Dean's commendation list for outstanding teaching.
Awarded on the basis of outstanding teaching evaluations for the course "Statistical Learning Theory and Nonparametric Estimation".

### Looking for students!

If you are a Ph.D. student at MIT interested in Mathematical Statistics, Machine Learning, Algorithms, Convex Optimization or Computational Complexity, please send me an email.

### Alumni

• Ph.D.2015

Lucy Xia (co-advised by J. Fan)

First position: Stein Fellow, Department of Statistics, Stanford

• Ph.D.2014

Quentin Berthet

First position: Postdoc, CMS, Caltech

• Ph.D. 2012

Xin Tong (co-advised by J. Fan)

First position: Instructor, Department of Mathematics, MIT

• Post-Doc 2016

Afonso Bandeira

First position: Assistant Professor, Courant Institute, NYU

• Post-Doc2016

Irene Waldspurger

First position: CNRS Researcher, Universite Paris Dauphine

• ### Statistical and Computational Tradeoffs in High-dimensional Statistics

Computational limitations of statistical problems have largely been ignored or simply overcome by ad hoc relaxations techniques. If optimal methods cannot be computed in reasonable time, what is the best possible statistical performance of a computationally efficient procedure?

In our papers on sparse PCA, we have redefined the classical notion of optimality to incorporate the computational aspects of statistical methods and showed the existence of a gap between the old and the new notion. This is the first result of its kind and it opens the door for a much better understanding of many more statistical problems, especially in high dimensions. In particular, it allows a better discrimination between statistical procedures that are currently sorted out using ad-hoc and often unrealistic assumptions.

Practically, our approach relies on the use of polynomial time reductions. While such ideas have been developed by theoretical computer scientists for decades, virtually all the current literature on reductions is about worst case complexity, while statistical problems fall naturally in the framework of average case complexity. Unfortunately, the latter is still poorly understood, as current reduction techniques tend to break the fragile distributional assumptions attached to a computational problem. Building on insight from statistics, we have introduced a notion of robustness that allows us to bypass this limitation and, for the first time, to present an average case reduction between two problems that are not artificial.

Looking forward, we aim to (i) push this analysis to other high-dimensional statstical problems and (ii) develop a toolbox to derive computational lower bounds in a more systematic fashion.

• ### Multiphase inference

Traditional sequential allocation problems such as bandit problems allow the user to collect information after each action. In practice, for example in clinical trial, such a feedaback cannot exist and one proceed in phases or stages. The question is then to find the size of each pahse and analyze the benefit of adding extra phases to the problem.

Consider the following basic question: given two populations (arms), at each time t=1,...,T, sample from only one of the populations and receive a random reward dictated by the properties of the sampled population. The objective is to devise a policy that maximizes the expected cumulative reward over the T rounds. While the terminology arm" comes from a well known analogy with slot machines the original motivation for this questions comes from clinical trials where each arm corresponds to a treatment and rewards measure the efficacy of this treatment on a patient. When devising a strategy to solve this problem, one faces the so-called exploration versus exploitation dilemma. Indeed, there is a clear tradeoff between discovering which treatment is the most effective (exploration) and administering the best treatment to as many patients as possible (exploitation).

In practice, it may not be possible to process past rewards after each patient to decide which treatment to administer next. Rather, clinical trials are performed in a small number of phases, typically two: in the first phase treatments are allocated randomly during a pilot study and in the second phase, treatments are re-allocated according to the results of the pilot study. In this two-phase approach, the pilot study focuses on exploration while he second one focuses on exploitation. This contrasts with the basic problem described above that consists in T phases, one per patient. This begs the following questions: how does one reunite the two frameworks? Do extra phases significantly improve upon the basic two-phase approach?

These questiosn extend beyond the framework of bandit problems to the case of inference, in particular for multiple hypothesis testing where the goal is to identify significant features.

• ### Algebraic regularization

A key aspect of high-dimensional statistics, specifically when the number of paramters to estimate is larger than the sample size, consists in assuming a lower dimensinal structure. For example, the sparsity assumption often employes assumes a lower dimensional linear structure. In several examples, the data follows a simple algebraic structure that can be exploited to bypass the curse of dimensionality.

A canonical example fo this research project is the multi-reference alignment problem where one observes noisy shifted versions of an unknown signal. The goal is to estimate the signal up to a gloabl shift. The algebraic stucture here is rather simple: it is the group of periodic shifts of vectors of a given length.

This problem has receive attention from the computational point of view where connections to the Unique Games Conjecture were established and exploited. However, the the statistical understanding of this problem is rather limited and we study optimal rates of estimations in this context.

### Filter by type:

Sort by year:

#### The sample complexity of multi-reference alignment

Amelia Perry, Jonathan Weed, Afonso Bandeira, Philippe Rigollet and Amit Singer (2017)
Preprint arXiv:1707.00943

#### Abstract

The growing role of data-driven approaches to scientific discovery has unveiled a large class of models that involve latent transformations with a rigid algebraic constraint. Among them, multi-reference alignment (MRA) is a simple model that captures fundamental aspects of the statistical and algorithmic challenges arising from this new paradigm. In this model, an unknown signal is subject to two types of corruption: a latent cyclic shift and the more traditional additive white noise. The goal is to recover the signal at a certain precision from independent samples. While at high signal-to-noise ratio (SNR), the number of observations needed to recover a generic signal is proportional to 1/SNR, we show that it rises to 1/SNR^3 in the more realistic low SNR regime. We propose an algorithm that achieves this optimal dependence on the SNR. Furthermore, we extend our results to cover a heterogeneous MRA model where the samples come from a mixture of signals, as is often the case in applications such as Cryo-Electron Microscopy, where molecules may have different conformations. We provide the first known procedure that provably achieves signal recovery in the low SNR regime for heterogeneous MRA.

#### High Dimensional Statistics

Philippe Rigollet and Jan-Christian Hütter (2017)
Lecture Notes

#### Abstract

This course is mainly about learning a regression function from a collection of observations. In this chapter, after defining this task formally, we give an overview of the course and the questions around regression. We adopt the statistical learning point of view where the task of prediction prevails. Last updated in June 2017.

#### Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration

Jason Altschuler, Jonathan Weed and Philippe Rigollet (2017)
Conference Proceedings of NIPS 2017 (to appear) Spotlight

#### Abstract

Computing optimal transport distances such as the earth mover's distance is a fundamental problem in machine learning, statistics, and computer vision. Despite the recent introduction of several algorithms with good empirical performance, it is unknown whether general optimal transport distances can be approximated in near-linear time. This paper demonstrates that this ambitious goal is in fact achieved by Cuturi's Sinkhorn Distances, and provides guidance towards parameter tuning for this algorithm. This result relies on a new analysis of Sinkhorn iterations that also directly suggests a new algorithm Greenkhorn with the same theoretical guarantees. Numerical simulations illustrate that Greenkhorn significantly outperforms the classical Sinkhorn algorithm in practice.

#### Learning determinantal point processes with moments and cycles

John Urschel, Victor-Emmanuel Brunel, Ankur Moitra and Philippe Rigollet (2017)
Conference Proceedings of ICML 2017, PMLR 70:3511-3520.

#### Abstract

Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. Our contribution is twofold: (i) we establish the optimal sample complexity achievable in this problem and show that it is governed by a natural parameter, which we call the \emph{cycle sparsity}; (ii) we propose a provably fast combinatorial algorithm that implements the method of moments efficiently and achieves optimal sample complexity. Finally, we give experimental results that confirm our theoretical findings.

#### Optimal rates of estimation for multi-reference alignment

Afonso Bandeira, Philippe Rigollet and Jonathan Weed (2017)
Preprint arXiv:1702.08546

#### Abstract

This paper describes optimal rates of adaptive estimation of a vector in the multi-reference alignment model, a problem with important applications in fields such as signal processing, image processing, and computer vision, among others. We describe how this model can be viewed as a multivariate Gaussian mixture model under the constraint that the centers belong to the orbit of a group. This enables us to derive matching upper and lower bounds that feature an interesting dependence on the signal-to-noise ratio of the model. Both upper and lower bounds are articulated around a tight local control of Kullback-Leibler divergences that showcases the central role of moment tensors in this problem.

#### Maximum likelihood estimation of determinantal point processes

Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet and John Urschel (2017)
Conference Proceedings of COLT 2017, PMLR 65:343-345.

#### Abstract

eterminantal point processes (DPPs) have wide-ranging applications in machine learning, where they are used to enforce the notion of diversity in subset selection problems. Many estimators have been proposed, but surprisingly the basic properties of the maximum likelihood estimator (MLE) have received little attention. The difficulty is that it is a non-concave maximization problem, and such functions are notoriously difficult to understand in high dimensions, despite their importance in modern machine learning. Here we study both the local and global geometry of the expected log-likelihood function. We prove several rates of convergence for the MLE and give a complete characterization of the case where these are parametric. We also exhibit a potential curse of dimensionality where the asymptotic variance of the MLE scales exponentially with the dimension of the problem. Moreover, we exhibit an exponential number of saddle points, and give evidence that these may be the only critical points.

#### Marcenko-Pastur Law for Kendall's Tau

Afonso Bandeira, Asad Lohdia and Philippe Rigollet (2017)
Journal Electron. Commun. Probab., 22(32), 1--7.

#### Abstract

We prove that Kendall's Rank correlation matrix converges to the Marcenko-Pastur law, under the assumption that the observations are i.i.d random vectors X_1, …, X_n with components that are independent and absolutely continuous with respect to the Lebesgue measure. This is the first result on the empirical spectral distribution of a multivariate U-statistic.

#### Exact recovery in the Ising blockmodel

Quentin Berthet, Philippe Rigollet and Piyush Srivastava (2016)
Journal Ann. Statist. (To appear)

#### Abstract

We consider the problem associated to recovering the block structure of an Ising model given independent observations on the binary hypercube. This new model, called the Ising blockmodel, is a perturbation of the mean field approximation of the Ising model known as the Curie-Weiss model: the sites are partitioned into two blocks of equal size and the interaction between those of the same block is stronger than across blocks, to account for more order within each block. We study probabilistic, statistical and computational aspects of this model in the high-dimensional case when the number of sites may be much larger than the sample size.

#### Optimal rates of Statistical Seriation

Nicolas Flammarion, Cheng Mao and Philippe Rigollet (2016)
Preprint arXiv:1607.02435

#### Abstract

Given a matrix the seriation problem consists in permuting its rows in such way that all its columns have the same shape, for example, they are monotone increasing. We propose a statistical approach to this problem where the matrix of interest is observed with noise and study the corresponding minimax rate of estimation of the matrices. Specifically, when the columns are either unimodal or monotone, we show that the least squares estimator is optimal up to logarithmic factors and adapts to matrices with a certain natural structure. Finally, we propose a computationally efficient estimator in the monotonic case and study its performance both theoretically and experimentally. Our work is at the intersection of shape constrained estimation and recent work that involves permutation learning, such as graph denoising and ranking.

#### Online learning in repeated auctions

Jonathan Weed, Vianney Perchet and Philippe Rigollet (2016)
Conference Proceedings of COLT 2016, PMLR 49:1562-1583.

#### Abstract

Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical and we simply compare our performance against that of the best fixed bid in hindsight. We show that sublinear regret is also achievable in this case and prove matching minimax lower bounds. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.

#### Optimal rates for total variation denoising

Jan-Christian Hütter and Philippe Rigollet (2016)
Conference Proceedings of COLT 2016, PMLR 49:1115-1146.

#### Abstract

Motivated by its practical success, we show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates of estimation for a large class of image models such as bi-isotonic, H\"older smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid such as eigenvector delocalization and spectral decay. We also present extensions to more than two dimensions as well as several other graphs.

#### Batched Bandit Problems

Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg (2016)
Journal Ann. Statist., 44(2), 660-681.

#### Abstract

Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.

#### Review of the book "Introduction to High-Dimensional Statistics" by C. Giraud.

Philippe Rigollet (2015)
Other JASA Book Reviews, 110(512), 1821

#### Abstract

Invited book review for the Journal of the American Statistical Association.

#### Comment on "Hypothesis testing by convex optimization"

Philippe Rigollet (2015)
Other Electron. J. Statist., 9(2), 1723-1726

#### Abstract

Invited comment on the discussion paper "Hypothesis testing by convex optimization" by Alexander Goldenshluger, Anatoli Juditsky and Arkadi Nemirovski.

#### Estimation of Functionals of Sparse Covariance Matrices

Jianqing Fan, Philippe Rigollet, and Weichen Wang (2015)
Journal Ann. Statist., 43(6), 2706-2737

#### Abstract

High-dimensional statistical tests often ignore correlations to gain simplicity and stability leading to null distributions that depend on functionals of correlation matrices such as their Frobenius norm and other $\ell_r$ norms. Motivated by the computation of critical values of such tests, we investigate the difficulty of estimation the functionals of sparse correlation matrices. Specifically, we show that simple plug-in procedures based on thresholded estimators of correlation matrices are sparsity-adaptive and minimax optimal over a large class of correlation matrices. Akin to previous results on functional estimation, the minimax rates exhibit an elbow phenomenon. Our results are further illustrated in simulated data as well as an empirical study of data arising in financial econometrics.

#### Batched Bandit Problems

Vianney Perchet, Philippe Rigollet , Sylvain Chassang and Erik Snowberg (2015)
Conference Proceedings of COLT 2015, PMLR 40:1456-1456.

#### Abstract

Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic multi-armed bandits under the constraint that the employed policy must split trials into a small number of batches. Our results show that a very small number of batches gives already close to minimax optimal regret bounds and we also evaluate the number of trials in each batch. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.

#### Aggregation of Affine Estimators

Dong Dai, Philippe Rigollet, Lucy Xia, and Tong Zhang (2014)
Journal Electon. J. Stat., 8, 302-327.

#### Abstract

We consider the problem of aggregating a general collection of affine estimators for fixed design regression. Relevant examples include some commonly used statistical estimators such as least squares, ridge and robust least squares estimators. Dalalyan and Salmon (2012) have established that, for this problem, exponentially weighted (EW) model selection aggregation leads to sharp oracle inequalities in expectation, but similar bounds in deviation were not previously known. While results (Dai, Rigollet, Zhang, 2012) indicate that the same aggregation scheme may not satisfy sharp oracle inequalities with high probability, we prove that a weaker notion of oracle inequality for EW that holds with high probability. Moreover, using a generalization of the newly introduced $Q$-aggregation scheme we also prove sharp oracle inequalities that hold with high probability. Finally, we apply our results to universal aggregation and show that our proposed estimator leads simultaneously to all the best known bounds for aggregation, including $\ell_q$-aggregation, $q \in (0,1)$, with high probability.

#### Optimal learning with Q-aggregation

Guillaume Lecue and Philippe Rigollet (2014)
Journal Ann. Statist., 42(1), 211-224.

#### Abstract

We consider a general supervised learning problem with strongly convex and Lipschitz loss and study the problem of model selection aggregation. In particular, given a finite dictionary functions (learners) together with the prior, we generalize the results obtained by Dai, Rigollet and Zhang (2012) for Gaussian regression with squared loss and fixed design to this learning setup. Specifically, we prove that the Q-aggregation procedure outputs an estimator that satisfies optimal oracle inequalities both in expectation and with high probability. Our proof techniques somewhat depart from traditional proofs by making most of the standard arguments on the Laplace transform of the empirical process to be controlled.

#### Complexity Theoretic Lower Bounds for Sparse Principal Component Detection

Quentin Berthet and Philippe Rigollet (2013)
ConferenceProceedings of COLT 2013, PMLR 30:1046-1066. Best Paper Award

#### Abstract

In the context of sparse principal component detection, we bring evidence towards the existence of a statistical price to pay for computational efficiency. We measure the performance of a test by the smallest signal strength that it can detect and we propose a computationally efficient method based on semidefinite programming. We also prove that the statistical performance of this test cannot be strictly improved by any computationally efficient method. Our results can be viewed as complexity theoretic lower bounds conditionally on the assumptions that some instances of the planted clique problem cannot be solved in randomized polynomial time.

#### Bounded regret in stochastic multi-armed bandits

Sebastien Bubeck, Vianney Perchet and Philippe Rigollet (2013)
ConferenceProceedings of COLT 2013, PMLR 30:122-134.

#### Abstract

We study the stochastic multi-armed bandit problem when one knows the value $\mu^{(\star)}$ of an optimal arm, as a well as a positive lower bound on the smallest positive gap $\Delta$. We propose a new randomized policy that attains a regret {\em uniformly bounded over time} in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows $\Delta$, and bounded regret of order $1/\Delta$ is not possible if one only knows $\mu^{(\star)}$.

#### Optimal detection of sparse principal components in high dimension

Quentin Berthet and Philippe Rigollet (2013)
JournalAnn. Statist., 41(1), 1780-1815.

#### Abstract

We perform a finite sample analysis of the detection levels for sparse principal components of a high-dimensional covariance matrix. Our minimax optimal test is based on a sparse eigenvalue statistic. Alas, computing this test is known to be NP-complete in general, and we describe a computationally efficient alternative test using convex relaxations. Our relaxation is also proved to detect sparse principal components at near optimal detection levels, and it performs well on simulated datasets. Moreover, using polynomial time reductions from theoretical computer science, we bring significant evidence that our results cannot be improved, thus revealing an inherent trade off between statistical and computational performance.

#### The multi-armed bandit problem with covariates

Vianney Perchet and Philippe Rigollet (2013)
JournalAnn. Statist., 41(2), 693-721.

#### Abstract

We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably localized static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.

#### Sparse estimation by exponential weighting

Philippe Rigollet and Alexandre Tsybakov (2012)
JournalStatist. Sci., 27(4), 558-575

#### Abstract

Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition on the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples.

#### Estimation of Covariance Matrices under Sparsity Constraints

Philippe Rigollet and Alexandre Tsybakov (2012)
JournalStatist. Sinica, 22(4), 1319-1378.

#### Abstract

Discussion of Minimax Estimation of Large Covariance Matrices under L1-Norm" by Tony Cai and Harrison Zhou.

#### Deviation Optimal Learning using Greedy Q-aggregation

Dong Dai, Philippe Rigollet and Tong Zhang (2012)
JournalAnn. Statist., 40(3), 1878-1905.

#### Abstract

Given a finite family of functions, the goal of model selection is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called Q-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy Q-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.

#### Kullback-Leibler aggregation and misspecified generalized linear models

Philippe Rigollet (2012)
JournalAnn. Statist., 40(2), 639-665.

#### Abstract

In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it does not require identifiability of the parameter or even that the model on the systematic component is true. It is shown that this problem can be solved by constrained and/or penalized likelihood maximization and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally all the bounds are proved to be optimal in a minimax sense.

#### Neyman-Pearson classification, convexity and stochastic constraints

Philippe Rigollet and Xin Tong (2011)
JournalJ. Mach. Learn. Res., 12(Oct):2831-2855.

#### Abstract

Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss $\varphi$. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its $\varphi$-type I error is below a pre-specified level and (ii), it has $\varphi$-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error.

#### Exponential Screening and optimal rates of sparse estimation

Philippe Rigollet and Alexandre Tsybakov (2011)
JournalAnn. Statist., 39(2), 731-771.

#### Abstract

In high-dimensional linear regression, the goal pursued here is to estimate an unknown regression function using linear combinations of a suitable set of covariates. One of the key assumptions for the success of any statistical procedure in this setup is to assume that the linear combination is sparse in some sense, for example, that it involves only few covariates. We consider a general, non necessarily linear, regression with Gaussian noise and study a related question that is to find a linear combination of approximating functions, which is at the same time sparse and has small mean squared error (MSE). We introduce a new estimation procedure, called \emph{Exponential Screening} that shows remarkable adaptation properties. It adapts to the linear combination that optimally balances MSE and sparsity, whether the latter is measured in terms of the number of non-zero entries in the combination ($\ell_0$ norm) or in terms of the global weight of the combination ($\ell_1$ norm). The power of this adaptation result is illustrated by showing that Exponential Screening solves optimally and simultaneously all the problems of aggregation in Gaussian regression that have been discussed in the literature. Moreover, we show that the performance of the Exponential Screening estimator cannot be improved in a minimax sense, even if the optimal sparsity is known in advance. The theoretical and numerical superiority of Exponential Screening compared to state-of-the-art sparse procedures is also discussed.

#### Neyman-Pearson classification under a strict constraint

Philippe Rigollet and Xin Tong (2011)
ConferenceProceedings of COLT 2011, PMLR 19:595-614.

#### Abstract

Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i), its probability of type~I error is below a pre-specified level and (ii), it has probability of type ~II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical objective subject to an empirical constraint. The novelty of the method is that the classifier output by this problem is shown to satisfy the original constraint on type~I error. This strict enforcement of the constraint has interesting consequences on the control of the type~II error and we develop new techniques to handle this situation. Finally, connections with chance constrained optimization are evident and are investigated.

#### Optimal rates of sparse estimation and universal aggregation

Philippe Rigollet and Alexandre Tsybakov (2010)
OtherOberwolfach reports, 7(1), 924-927 In: Modern Nonparametric Statistics: Going Beyond Asymptotic Minimax, Mar.-Apr. 2010.

#### Nonparametric Bandits with Covariates

Philippe Rigollet and Assaf Zeevi (2010)
ConferenceProceedings of COLT 2010. Omnipress, 54-66.

#### Abstract

Me consider a bandit problem which involves sequential sampling from two populations (arms). Each arm produces a noisy reward realization which depends on an observable random covariate. The goal is to maximize cumulative expected reward. We derive general lower bounds on the performance of any admissible policy, and develop an algorithm whose performance achieves the order of said lower bound up to logarithmic terms. This is done by decomposing the global problem into suitably localized'' bandit problems. Proofs blend ideas from nonparametric statistics and traditional methods used in the bandit literature.

#### Optimal rates for plug-in estimators of density level sets

Philippe Rigollet and Regis Vert (2009)
JournalBernoulli, 15(4), 1154-1178.

#### Abstract

In the context of density level set estimation, we study the convergence of general plug-in methods under two main assumptions on the density for a given level $\lambda$. More precisely, it is assumed that the density (i) is smooth in a neighborhood of $\lambda$ and (ii) has $\gamma$-exponent at level $\lambda$. Condition (i) ensures that the density can be estimated at a standard nonparametric rate and condition (ii) is similar to Tsybakov's margin assumption which is stated for the classification framework. Under these assumptions, we derive optimal rates of convergence for plug-in estimators. Explicit convergence rates are given for plug-in estimators based on kernel density estimators when the underlying measure is the Lebesgue measure. Lower bounds proving optimality of the rates in a minimax sense when the density is Holder smooth are also provided.

#### Learning by mirror averaging

Anatoli Juditsky, Philippe Rigollet and Alexandre Tsybakov (2009)
JournalAnn. Statist., 36(5), 2183-2206.

#### Abstract

Given a finite collection of estimators or classifiers, we study the problem of model selection type aggregation, that is, we construct a new estimator or classifier, called aggregate, which is nearly as good as the best among them with respect to a given risk criterion. We define our aggregate by a simple recursive procedure which solves an auxiliary stochastic linear programming problem related to the original nonlinear one and constitutes a special case of the mirror averaging algorithm. We show that the aggregate satisfies sharp oracle inequalities under some general assumptions. The results are applied to several problems including regression, classification and density estimation.

#### Generalization error bounds in semi-supervised classification under the cluster assumption

Philippe Rigollet (2009)
JournalJ. Mach. Learn. Res., 8(Jul), 1369-1392.

#### Abstract

We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples.

#### Linear and convex aggregation of density estimators

Philippe Rigollet and Alexandre Tsybakov (2007)
Journal Math. Methods of Statist., 15(3), 260-280

#### Abstract

We study the problem of finding the best linear and convex combination of M estimators of a density with respect to the mean squared risk. We suggest aggregation procedures and we prove sharp oracle inequalities for their risks, i.e., oracle inequalities with leading constant 1. We also obtain lower bounds showing that these procedures attain optimal rates of aggregation. As an example, we consider aggregation of multivariate kernel density estimators with different bandwidths. We show that linear and convex aggregates mimic the kernel oracles in asymptotically exact sense. We prove that, for Pinsker's kernel, the proposed aggregates are sharp asymptotically minimax simultaneously over a large scale of Sobolev classes of densities. Finally, we provide simulations demonstrating performance of the convex aggregation procedure.

#### Adaptive density estimation using the blockwise Stein method

Philippe Rigollet (2006)
Journal Bernoulli, 12(2), 351-370

#### Abstract

We study the problem of nonparametric estimation of a probability density of unknown smoothness in $L_2(\mathbb{R})$. Expressing mean integrated squared error (MISE) in the Fourier domain, we show that it is close to mean squared error in the Gaussian sequence model. Then applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality. Two consequences of this oracle inequality are that the proposed estimator is sharp minimax adaptive over a scale of Sobolev classes of densities, and that its MISE is asymptotically smaller than or equal to that of kernel density estimators with any bandwidth provided that the kernel belongs to a large class of functions including many standard kernels.

#### Mirror averaging, aggregation and model selection

Anatoli Juditsky, Philippe Rigollet and Alexandre Tsybakov (2005)
Other Oberwolfach reports, 2(4), 2688-2691. In: Meeting on Statistical and Probabilistic Methods of Model Selection, October 2005.

#### Abstract

Short note on aggregation published in Oberwolfach Reports following the Meeting on Statistical and Probabilistic Methods of Model Selection, October 2005.

#### Inegalites d'oracle pour l'estimation d'une densite de probabilite

Philippe Rigollet (2005)
Journal C. R. Math. Acad. Sci. Paris, 340(1), 59-62

#### Abstract

We study the problem of the nonparametric estimation of a probability density in $L_2(R)$. Expressing the mean integrated squared error in the Fourier domain, we show that it is close to the mean squared error in the Gaussian sequence model. Then, applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality and a kernel oracle inequality. As a consequence, the proposed estimator is sharp minimax adaptive (i.e. up to a constant) on a scale of Sobolev classes of densities.

### Currrent Teaching

• 2017 Spring

#### High-dimensional Statistics (18.657)

This course offers an introduction to the finite sample analysis of high-dimensional statistical methods. The goal is to present various proof techniques for state-of-the-art methods in regression, matrix estimation and principal component analysis (PCA) as well as optimality guarantees. The course ends with research questions that are currently open.

Topics covered include: 1. Sub-Gaussian concentration, 2. High-dimensional linear regression, 3. Misspecified linear models, 4. Matrix estimation, 5. Minimax lower bounds
This course assumes introductory knowledge of statistics and probability.

### Teaching History (Selected Courses)

• 2016 Fall

#### Statistics for Applications (18.650)

This class offers an in-depth introduction to the theoretical foundations of statistical methods that are useful in many applications. The goal is to understand the role of mathematics in the research and development of efficient statistical methods.

• 2015 Fall

#### Mathematics of Machine Learning

Broadly speaking, Machine Learning refers to the automated identification of patterns in data. As such it has been a fertile ground for new statistical and algorithmic developments. The purpose of this course is to provide a mathematically rigorous introduction to these developments with emphasis on methods and their analysis. The class will be split in three main parts: (1) The statistical theory of machine learning, (2) Algorithms and convexity and (3) Online learning.

The course website is accesible via this link.

• 2015 Spring

#### High-dimensional statistics (MIT)

This course offers an introduction to the finite sample analysis of high-dimensional statistical methods. The goal is to present various proof techniques for state-of-the-art methods in regression, matrix estimation and principal component analysis (PCA) as well as optimality guarantees. The course ends with research questions that are currently open.

Topics covered include: 1. Sub-Gaussian concentration, 2. High-dimensional regression, 3. Matrix estimation, 4. Sparse PCA, 5. Minimax lower bounds, 6. Computational limitations.
This course assumes introductory knowledge of statistics, probability and computational complexity.

• 2013 2008

#### Fundamentals of Statistics (Princeton University)

A first introduction to probability and statistics. This course provides the background to understand and produce rigorous statistical analysis including estimation, confidence intervals, hypothesis testing and regression. Applicability and limitations of these methods are illustrated in the light of modern data sets and manipulation of the statistical software R. Precepts are based on real data analysis.

• 2014 2009

#### Mathematical Statistics (Princeton University)

A graduate-level introduction to statistical theory and methods and some of the most important and commonly-used principles of statistical inference. Covers the statistical theory and methods for point estimation, confidence intervals, and hypothesis testing.

• 2014 2009

#### Statistical Learning and Nonparametric Estimation (Princeton University)

A theoretical introduction to statistical learning problems related to regression and classification. Topics covered include: Principle Component Analysis, nonparametric estimation, sparse regression, and Classification and Statistical learning,

## Associate Editor

• Electronic Journal of Statistics (2016-present)
• Bernoulli (2013-present)
• Statistical Inference for Stochastic Processes (2015-16)
• Journal of Statistical Planning and Inference (2012-15)

## Conferences Program Committee

• Conference on Learning Theory, COLT (2012-present)

## At My Office

2-279 in the Department of Mathematics

## Online

The best way to reach me is to send me an email at rigo...@math.mit.edu

photo courtesy of Peter Vanderwarker