Sylvain Arlot: Precise analysis of some Purely Random Forests

Random forests (Breiman, 2001) are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem. As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of Breiman’s random forests. In the regression framework, the quadratic risk of a purely random forest can be written (approximately) as the sum of two terms, which can be understood as an approximation error and an estimation error. In this talk, we study how each of these terms depends on the size of each tree and on the number of trees in the forest. Precise theoretical results are obtained for a toy model. On the one hand, if the regression function is smooth enough, the approximation error of an infinite forest decreases at a faster rate (with respect to the size of each tree) than the one of a single tree. On the other hand, the estimation error is of the same order of magnitude for a single tree and for an infinite forest. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees, because of their better approximation properties. These results on the approximation error and on the risk can be generalized to other purely random forests. Numerical experiments on some purely random forests close to Breiman’s algorithm (hold-out random forests) suggest that they behave similarly, which sheds light on how Breiman’s random forests work. This talk is based on joint works with Robin Genuer.

Pierre Belec: Uncertainty quantification for penalized M-estimators in high-dimensions

The talk will focus on convex penalized M-estimators in high-dimensions in linear models when the ratio of the ambient dimension and the sample size are of the same order. A typical example covered by the theory is the combination of the Huber loss to fit the data and the Elastic-Net penalty to promote sparsity. I will present new means of uncertainty quantification for such penalized M-estimators for anisotropic Gaussian design and possibly heavy-tailed noise with two applications: (1) asymptotic normality and confidence intervals for single components of the true regression vector, and (2) estimation of the out-of-sample error of the penalized M-estimator. I will present applications to parameter tuning, where the practitioner can choose to minimize the length of the resulting confidence interval or the out-of-sample error.

Sébastien Bubeck: A universal law of robustness via isoperimetry

Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary​ if one wants to interpolate the data smoothly​. Namely we show that smooth​ interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. Joint work with Mark Sellke.

Emmanuel Candès

Ronald deVore: Thoughts on Deep Learning

This talk will take a mathematical view of the problem of learning a function from data and then examine how this view fits into our current knowledge of deep learning.

Zijian Guo: Inference for High-dimensional Maximin Effects in Heterogeneous Regression Models Using a Sampling Approach

Heterogeneity is an important feature of modern data sets and a central task is to extract information from large-scale and heterogeneous data. In this paper, we consider multiple high-dimensional linear models and adopt the definition of maximin effect (Meinshausen, Buhlmann, AoS, 43(4), 1801–1830) to summarize the information contained in this heterogeneous model. We define the maximin effect for a targeted population whose covariate distribution is possibly different from that of the observed data. We further introduce a ridge-type maximin effect to simultaneously account for reward optimality and statistical stability. To identify the high-dimensional maximin effect, we estimate the regression covariance matrix by a debiased estimator and use it to construct the aggregation weights for the maximin effect. A main challenge for statistical inference is that the estimated weights might have a mixture distribution and the resulted maximin effect estimator is not necessarily asymptotic normal. To address this, we devise a novel sampling approach to construct the confidence interval for any linear contrast of high-dimensional maximin effects. The coverage and precision properties of the proposed confidence interval are studied. The proposed method is demonstrated over simulations and a genetic data set on yeast colony growth under different environments. The paper is available at https://arxiv.org/abs/2011.07568

Vladimir Koltchinskii: Estimation of smooth functionals of high-dimensional parameters: bias reduction and efficiency

Consider the problem of estimation of a smooth functional $f(\theta)$ of unknown high-dimensional parameter $\theta$ of statistical model based on i.i.d. observations $X_1,\dots, X_n.$ The goal is to develop an estimator of $f(\theta)$ with optimal error rates depending on the degree of smoothness of the functional. In the case of regular models of a fixed small dimension, a naive plug-in estimator $f(\hat \theta),$ with $\hat \theta$ being a maximum likelihood estimator of parameter $\theta,$ is known to be asymptotically efficient as $n\to\infty$ for any continuously differentiable $f.$ However, when the dimension $d$ of unknown parameter is large, this estimator usually fails due to its large bias and bias reduction methods (often, rather sophisticated) have to be developed to overcome this difficulty. We will discuss an approach to this problem based on a general method of construction of a sufficiently smooth functional $g(\theta)$ (depending on the degree of smoothness of the functional $f$) such that the estimator $g(\hat \theta)$ has a negligible bias. Such estimators also have minimax optimal error rates in classes of functionals of H\”older smoothness $s$ for Gaussian shift models as well as normal models with unknown mean and covariance. Moreover, if $d\leq n^{\alpha}$ for $\alpha\in (0,1)$ and $s>\frac{1}{1-\alpha},$ then $g(\hat \theta)$ is asymptotically efficient with $\sqrt{n}$-rate whereas, for $s<\frac{1}{1-\alpha},$ the optimal estimation  rate is slower than $n^{-1/2}.$  More recently, it was proved that a similar phase transition between parametric and slow rates holds for some more general models,in particular, log-concave. In addition, similar phenomena in functional estimation were shown for arbitrary models with a high-dimensional parameter provided that there exists an estimator $\hat \theta$ of $\theta$ such that the normal approximation of $\sqrt{n}(\hat \theta-\theta)$ holds when $d=o(n).$

Po-Ling Loh: A modern take on Huber regression

In the first part of the talk, we discuss the use of a penalized Huber M-estimator for high-dimensional linear regression. We explain how a fairly straightforward analysis yields high-probability error bounds that hold even when the additive errors are heavy-tailed. However, the parameter governing the shape of the Huber loss must be chosen in relation to the scale of the error distribution. We discuss how to use an adaptive technique, based on Lepski’s method, to overcome the difficulties traditionally faced by applying Huber M-estimation in a context where both location and scale are unknown. In the second part of the talk, we turn to a more complicated setting where both the covariates and responses may be heavy-tailed and/or adversarially contaminated. We show how to modify the Huber regression estimator by first applying an appropriate “filtering” procedure to the data based on the covariates. We prove that in low-dimensional settings, this filtered Huber regression estimator achieves near-optimal error rates. We further show that the commonly used least trimmed squares and least absolute deviation estimators may similarly be made robust to contaminated covariates via the same covariate filtering step. This is based on joint work with Ankit Pensia and Varun Jog.

Stanislav Minsker: Asymptotic properties of robust mean estimators

There are several known examples of the estimators of the mean that admit sub-Gaussian deviation guarantees under minimal assumptions on the underlying distribution. In this talk, we will focus on the asymptotic properties of such estimators. First, we will present an asymptotically efficient robust estimator of the univariate mean whose construction is inspired by robustness properties of self-normalized sums. Then we will discuss a more general class of algorithms that can be viewed as robust analogues of the classical empirical risk minimization.

Alex Munk: Nanostatistics

Conventional light microscopes have been used for centuries for the study of small length scales down to about 250 nanometers. At such a resolution level images are blurred and noisy and their recovery has been the focus of a multitude of statistical deconvolution techniques during the past. However, such conventional light microscopes have an intrinsic physical limit of resolution which was formulated by E. Abbe more than a century ago and can be roughly stated as: “Two objects cannot be optically discerned if they are spatially closer than half of the wavelength of the incoming light”. This resolution barrier was recently broken with the advent of modern super-resolution fluorescence microscopy techniques (nanoscopy), acknowledged with the Nobel prize in Chemistry 2014.  Nowadays, nanoscopy is an indispensable tool in medical and biological research for studying structure, communication, and dynamics of living cells. Current experimental advances go to the physical limits of nanoscopic imaging where the quantum nature of photons becomes predominant. Consequently, nanoscopy recordings are inherently random. This challenges established ways to formulate resolution as well as methods of data analysis, such as deconvolution for conventional light microscopy. In this talk we present a theory of resolution in terms of minimax hypothesis testing and provide multiscale testing based imaging techniques for spatially sparse signals. Finally, we address the task how to count molecules at a certain spot. We argue that it becomes necessary to model the temporal transitions of quantum states of fluorescent molecules in order to achieve the highest resolution and statistical precision possible. This leads to some novel issues for the analysis of hidden Markov models.

Richard Nickl: On Information theoretic limitations for inference in some statistical inverse problems arising with PDEs’

Inverse problems arising with partial differential equations are ubiquitous in science, physics and engineering, and other branches of data science. The Bayesian approach has been particularly popular since seminal work by Andrew Stuart (2010). Statistical guarantees for such methods are often obtained from Bernstein von Mises theorems, whose fundamental assumption is the existence of the inverse Fisher information (operator). We review recent progress that establishes such theorems for a class of PDEs arising with Schroedinger type equations. We then show that for an important class of a PDEs where the goal is to infer the conductivity/permeability in a diffusion type equation, the Fisher information operator is in fact not invertible, and that root-N consistent inference on regular smooth linear functional is impossible. This is proved using classical results by van der Vaart (1991) and a fine analysis of the underlying properties of the PDE. A picture emerges that allows to understand which PDEs permit Bernstein von Mises approximation and which ones may not.

Sofia Ohlede: Novel approaches for network data modelling and analysis

Aaditya Ramdas: Estimating means of bounded random variables by betting

We derive confidence intervals (CI) and time-uniform confidence sequences (CS) for the classical problem of estimating an unknown mean from bounded observations. We present a general approach for deriving concentration bounds, that can be seen as a generalization (and improvement) of the celebrated Chernoff method. At its heart, it is based on deriving a new class of composite nonnegative martingales, with strong connections to Kelly betting and Robbins’ method of mixtures. We show how to extend these ideas to sampling without replacement, another heavily studied problem. In all cases, our bounds are adaptive to the unknown variance, and empirically vastly outperform competing approaches based on Hoeffding or empirical Bernstein inequalities and their recent supermartingale generalizations. In short, we establish a new state-of-the-art for four fundamental problems: CSs and CIs for bounded means, with and without replacement. This work is joint with my student, Ian Waudby-Smith, a preprint is available here: https://arxiv.org/abs/2010.09686

Johannes Schmidt-Hieber: Statistical analysis of machine learning methods

Recently a lot of progress has been made regarding the theoretical understanding of machine learning methods. One of the very promising directions is the statistical approach, which interprets machine learning as a collection of statistical methods and builds on existing techniques in mathematical statistics to derive theoretical error bounds and to understand phenomena such as overparametrization. The talk surveys this field and describes future challenges.

Qi-Man Shao: Randomized Concentration Inequalities with Applications

Randomized concentration inequalities play an important role in dealing with distribution approximation for non-linear statistics. In this talk, we will review some recent developments on randomized concentration inequalities. Especially, a randomized concentration inequality with application to non-linear statistics in Rd will be discussed.

Nicolas Verzelen: Optimal change-point detection and localization

Given a times series, we consider the twin problems of detecting and localizing change-points in the distribution. This encompasses various models including mean univariate, sparse multivariate, non-parametric, or kernel change-points… Starting with the simple Gaussian univariate model, we derive optimal detection and localization rates and uncover phase transitions from regional testing problems to local estimation problems. Notably, our multi-scale procedures accommodate with the presence of possibly many low-energy and therefore undetectable change-points.  In a second part, we explain how the methodology extends to more generic change-point problems by drawing some connection with minimax theory in multiple testing. This is based on joint work with A. Carpentier, M. Fromont, M. Lerasle, E. Pilliat, and P. Reynaud-Bouret.

Rebecca Willett: Machine Learning and Inverse Problems: Convergence and Adaptation

Many challenging image processing tasks can be described by an ill-posed linear inverse problem: deblurring, deconvolution, inpainting, compressed sensing, and superresolution all lie in this framework. Recent advances in machine learning and image processing have illustrated that it is often possible to learn a regularizer from training data that can outperform more traditional approaches by large margins. In this talk, I will describe the central prevailing themes and tradeoffs of this emerging area.  The core ideas underlying these methods draw upon ideas from statistics, optimization, signal processing, and machine learning. We will explore a new class of approaches based on implicit “infinite-depth” networks that yield reconstruction frameworks with provable convergence guarantees and significant empirical performance gains. Finally, I will describe mechanisms for “model adaptation” — that is, given a model trained to solve one inverse problem with a known forward model, we propose novel procedures that adapt the network to a perturbed forward model. This is joint work with Davis Gilton and Greg Ongie.

Nikita Zhivotovskiy: Distribution-Free Robust Linear Regression 

We study random design linear regression with no assumptions on the distribution of the covariates and with a heavy-tailed response variable. When learning without assumptions on the covariates, we establish boundedness of the conditional second moment of the response variable as a necessary and sufficient condition for achieving deviation-optimal excess risk bounds. First, we prove an optimal version of the classical in-expectation bound for the truncated least squares estimator due to Györfi, Kohler, Krzyzak, and Walk. However, in spite of its optimal in-expectation performance, we show that this procedure fails with constant probability for some distributions. Combining the ideas of truncated least squares, median-of-means procedures, and aggregation theory, we construct a non-linear estimator achieving excess risk of order O(d/n) with the optimal sub-exponential tail. Joint work with Jaouad Mourtada (CREST, ENSAE) and Tomas Vaškevičius (University of Oxford).

Shuheng Zhou: Tensor models for high dimensional data with complex dependencies

Building models and methods for complex data is an important task for many scientific and application areas. I will discuss several interrelated yet distinct models and methods on graph and mean recovery problems with applications in spatio-temporal modeling and genomics. For genomics studies, many datasets exhibit dependencies among observations as well as among variables. This gives rise to the challenging problem of analyzing high-dimensional data with unknown mean and dependence structures. I will present a practical method utilizing generalized least squares and penalized inverse covariance estimation to address this challenge. I will then introduce a multiway tensor generalization of the bi-graphical lasso (Kalaitzis et. al. 2013), an estimator for precision matrices of matrix-normals based on the Cartesian product of graphs. Statistical consistency and rates of convergence are established for both Bi-graphical and TeraLasso estimators for precision estimation. Data examples from a meteorological study will be presented to show that we can extract meaningful conditional dependency structures from large and complex datasets. This talk is based on the joint work with my former Ph.D. students Kristjan Greenewald, Michael Hornstein, Roger Fan and collaborators Kerby Shedden and Alfred Hero.