We will hold a three-day workshop, "Mathematical Foundations of Learning Theory", on June 17-19.

Click here to register.

**Preliminary Schedule**

|             | June 17             | June 18            | June 19           |
|-------------|---------------------|--------------------|-------------------|
| 9:30-10:15  | Sanjeev Arora       | Sergio Verdú       | Jake Abernethy    |
| 10:15-11:00 | Elad Hazan          | Ruediger Urbanke   | Ohad Shamir       |
| 11:30-12:15 | Shie Mannor         | Ery Arias-Castro   | Lorenzo Rosasco   |
| 12:15-13:00 | Ankur Moitra        | Ulrike von Luxburg | Matthias Hein     |
| *lunch break* |                   |                    |                   |
| 15:00-15:45 | Dimitris Achlioptas | Volodya Vovk       | Rui Castro        |
| 15:45-16:30 | Ofer Dekel          | Ben Recht          | Aryeh Kontorovich |
| 17:00-17:45 | Emmanuel Abbe       | Peter Grünwald     |                   |
| 17:45-18:30 | Vitaly Feldman      | Daniel Hsu         |                   |

*ABSTRACTS:*

Emmanuel Abbe

I will start with general motivations connecting coding to clustering problems, and then spend some time on a recent result establishing the recovery threshold for the stochastic block model with two communities: if the edge probability within clusters is a log(n)/n and across clusters is b log(n)/n, recovery is possible if and only if (a+b)/2 - sqrt(ab) > 1.
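As a quick illustration, the recovery condition above is easy to check numerically. A minimal sketch (the function name is mine, not from the talk):

```python
import math

def exact_recovery_possible(a, b):
    """Check the two-community SBM recovery threshold:
    with within-cluster edge probability a*log(n)/n and
    across-cluster probability b*log(n)/n, recovery is
    possible iff (a+b)/2 - sqrt(a*b) > 1."""
    return (a + b) / 2 - math.sqrt(a * b) > 1

# Well-separated communities are recoverable...
print(exact_recovery_possible(10, 1))  # True: 5.5 - sqrt(10) ≈ 2.34 > 1
# ...but when a == b the graph carries no community signal.
print(exact_recovery_possible(2, 2))   # False: 2 - 2 = 0
```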

Dimitris Achlioptas

Realistic models of random graphs are important both for design (predicting the performance of different protocols/algorithms) and for inference (extracting latent group membership, clustering, etc.). There are by now thousands of papers defining random graph models. I will present a principled framework for deriving random graph models by generalizing the approach of Erdős and Rényi in defining their classic G(n,m) model. The central principle is to study the uniform measure over symmetric sets of graphs, i.e., sets that are invariant under a group of transformations. The main contribution is to derive natural sufficient conditions under which the uniform measure over a symmetric set of graphs collapses asymptotically to a distribution where edges appear independently (with different probabilities).

Based on joint work with Paris Siminelakis.

Ery Arias-Castro

Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is one of the main methods for nonlinear dimensionality reduction. We study its large-sample limit under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset. In other cases, we provide some simple examples where it fails to be consistent.

(Joint work with Bruno Pelletier, Université Rennes II, France)

Rui Castro

In many practical settings one can sequentially and adaptively guide the collection of future data based on information extracted from data collected previously. These sequential data collection procedures are known as sequential experimental design, active learning, or adaptive sensing/sampling. The intricate relation between data analysis and acquisition in adaptive sensing paradigms is extremely powerful, and allows for reliable estimation in situations where non-adaptive sensing would fail dramatically. In this talk we consider the problem of estimating the support of a structured sparse signal from coordinate-wise observations under the adaptive sensing paradigm. We give a general procedure for support-set estimation and show that, through the use of adaptive sensing, one can: (i) significantly mitigate the effect of observation noise compared to non-adaptive sensing, and (ii) capitalize on structural information to a much larger extent than is possible with non-adaptive sensing. In addition, we give performance lower bounds for both the adaptive and non-adaptive sensing paradigms, showing that the general procedure is optimal in a variety of scenarios.

(Joint work with Ervin Tánczos)

Ofer Dekel

Online learning is often modelled as a repeated game between a player and an adversary. The dynamics of this game are well-understood when we make the simplifying assumption that the adversary is non-adaptive (namely, that it does not modify its strategy in reaction to the player’s actions). Without this assumption, our theoretical understanding of the online learning game is much less developed. We begin this talk by mentioning two different formulations of this problem and highlighting the differences between them. Then, we narrow our discussion to the concrete setting where the player’s action set is finite. We examine how the difficulty of the problem changes as we vary the strength of the adversary (from non-adaptive to fully adaptive) and as we vary the amount of feedback given to the player. We present a rich and complete theoretical characterization of this setting and show that the different variants of this problem can be arranged in three tiers of difficulty: easy problems (those with sqrt(T) minimax regret), hard problems (with T^(2/3) minimax regret), and unlearnable problems (with linear minimax regret). We also show that the tier of hard problems, which was not previously identified in this setting, actually includes a variety of natural online learning problems. As a corollary, we also give the first formal confirmation that learning with bandit feedback can be asymptotically harder than learning with full information feedback.

Joint work with Raman Arora, Nicolo Cesa-Bianchi, Jian Ding, Elad Hazan, Tomer Koren, Yuval Peres, Ohad Shamir, and Ambuj Tewari.

Vitaly Feldman

Statistical inference procedures and machine learning algorithms need to estimate the error of hypotheses on the population from which the observed samples were drawn. Accuracy of the estimates is crucial for avoiding overfitting and false discovery. We formalize estimation of hypothesis errors and other population parameters using the statistical queries of Kearns (1993). A single statistical query provides an estimate of the true expectation of any function of a single sample. It is easy to obtain such an estimate using the empirical mean of the function values on random samples. Further, using n i.i.d. samples, exponentially many (in n) statistical queries can be answered this way, provided the queries are chosen independently of the samples. Yet this naive approach may fail dramatically if samples are reused to answer queries that depend on previous answers, in other words, when the queries are adaptive. Special cases of this problem have been shown to cause overfitting in many practical settings. We show that, remarkably, there exists a different way to evaluate statistical queries that allows answering an exponential number of even adaptively chosen queries. This technique, counter-intuitively, involves perturbing the empirical estimates and coordinating the answers using techniques developed for data-privacy preservation. Our results also provide practical guidance on how to prevent overfitting and false discovery in adaptive settings.

Joint work with Cynthia Dwork, Moritz Hardt, Omer Reingold and Aaron Roth
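The flavor of the perturbation idea can be sketched in a few lines: instead of returning the raw empirical mean, the mechanism adds noise to each answer, the basic device borrowed from data privacy. This is a toy sketch with names of my own choosing, not the paper's actual (more carefully coordinated) mechanism:

```python
import random

def noisy_sq_answer(samples, query, sigma, rng=random):
    """Answer a statistical query with a Gaussian-perturbed
    empirical mean, rather than the raw empirical mean."""
    empirical = sum(query(s) for s in samples) / len(samples)
    return empirical + rng.gauss(0.0, sigma)

# An analyst may choose each query after seeing previous answers;
# the added noise is what keeps such adaptive reuse from overfitting.
samples = [0.2, 0.4, 0.6, 0.8]
a1 = noisy_sq_answer(samples, lambda x: x, sigma=0.05)
a2 = noisy_sq_answer(samples, lambda x: float(x > a1), sigma=0.05)
```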

Peter Grünwald

'g-stochastic mixability' characterizes the easiness of a learning problem. Stochastic concepts of 'easy distributions' such as Tsybakov margin conditions and individual sequence concepts of 'easy loss functions' such as Vovk's mixable (and exp-concave) losses are both special cases. Under g-stochastic mixability, the excess risk of PAC-Bayes and other generalized Bayesian methods converges faster than O(1/sqrt n), if the learning rate LR is adapted to g. Here I present the Safe Bayesian algorithm, which *learns* the optimal LR from data without knowing g in batch and online stochastic settings, highlighting two applications:

(1) Safe Bayes repairs standard Bayes when the model is wrong, adaptively decreasing the learning rate when the data tell us to. We give a simple regression example where this dramatically boosts Bayesian performance.

(2) McAllester-style PAC-Bayesian confidence bounds: using Safe Bayes one gets qualitatively different bounds which do not require randomized classifiers and which can be substantially better. In some cases we get bounds of the form O(KLdivergence/n) + O(1/sqrt n), whereas previous bounds have the form O(sqrt(KLdivergence/n)).

The g-stochastic mixability concept extends joint work with B. Williamson, M. Reid and T. van Erven (NIPS 2012).

Elad Hazan

Numerous machine learning problems require an exploration basis - a mechanism to explore the action space. We define a novel convex-geometric notion of exploration basis with low variance called volumetric spanners, and give efficient algorithms to construct such bases.

We'll describe applications to optimal experiment design, for which we give the first near-optimal and efficient algorithm for the hard-margin criterion of active linear regression, and to bandit linear optimization, for which we give an efficient and near-optimal regret algorithm over general convex sets.

Based on joint work with Zohar Karnin and Raghu Meka.

Matthias Hein

Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In line with recent work on non-negative matrix factorization by Arora et al. (2012), we provide an algorithm that provably recovers the underlying factorization in the exact case. Moreover, we provide conditions under which this factorization is unique. For a random binary factor matrix, uniqueness holds if the span of the columns does not contain another binary vector. We extend a classical result of Odlyzko on this question using a variation of the Littlewood-Offord theorem.

Daniel Hsu

I'll describe a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes an action in response to the observed context, observing the reward only for that action. The method assumes access to an oracle for solving cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only $\tilde{O}(\sqrt{T})$ oracle calls across all $T$ rounds. Joint work with Alekh Agarwal, Satyen Kale, John Langford, Lihong Li, and Rob Schapire.

Aryeh Kontorovich

Although well-known by practitioners to be an effective classification tool, nearest-neighbor methods have been somewhat neglected by learning theory of late. The goal of this talk is to revive interest in this time-tested technique by recasting it in a modern perspective. We will present a paradigm of margin-regularized 1-nearest-neighbor classification which (i) is Bayes-consistent; (ii) yields simple, usable finite-sample error bounds; (iii) provides very efficient algorithms with a principled speed-accuracy tradeoff; and (iv) allows for near-optimal sample compression. Further extensions include multiclass, regression, and metric dimensionality reduction. I will argue that the regularized 1-nearest neighbor is superior to k-nearest neighbors in several crucial statistical and computational aspects.

Based on a series of works with: Lee-Ad Gottlieb, Robert Krauthgamer, Roi Weiss

Shie Mannor

We consider machine learning problems in high dimensions with arbitrary (possibly, severe or coordinated) errors in the data. We are interested in understanding how many corruptions we can tolerate, while getting a reasonable solution.

Many standard algorithms such as PCA, SVM, logistic regression, and others are extremely brittle and can collapse completely even in the presence of a few adversarially placed outliers.

We dive into the details of the problem of sparse regression, where the objective is to identify the correct support. We show that any natural convex relaxation is bound to fail and that the brute-force algorithm that enumerates over all supports is also not effective. We then explore the power of a simple idea: replace the essential linear-algebraic calculation (the inner product) with a robust counterpart, the trimmed inner product, which cannot be greatly affected by a controlled number of arbitrarily corrupted points.
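A minimal version of the trimmed inner product is easy to write down. This is my own sketch; in practice the number of discarded terms would be tied to the assumed corruption level:

```python
def trimmed_inner_product(x, y, k):
    """Inner product of x and y after discarding the k
    coordinate-wise products of largest magnitude, so that
    a few corrupted coordinates cannot dominate the sum."""
    prods = sorted((xi * yi for xi, yi in zip(x, y)), key=abs)
    return sum(prods[:len(prods) - k])

clean = [1.0, 1.0, 1.0, 1.0]
corrupted = [1.0, 1.0, 1.0, 100.0]  # one adversarial coordinate
print(sum(a * b for a, b in zip(clean, corrupted)))  # 103.0
print(trimmed_inner_product(clean, corrupted, k=1))  # 3.0
```

Trimming one term removes the adversarial coordinate's influence entirely, at the cost of a small bias on clean data.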

We provide the robust counterparts of three popular algorithms in the uncorrupted setting: Thresholding Regression, Lasso, and the Dantzig selector, and show that the counterparts obtained using the trimmed inner product are provably robust.

We then discuss the issue of learning on large data sets and propose a framework for distributed robust learning.

Ben Recht

I will present a new method to analyze iterative optimization algorithms built on the framework of Integral Quadratic Constraints (IQC) from robust control theory. IQCs provide sufficient conditions for the stability of complicated interconnected systems, and these conditions can be checked by semidefinite programming. I will discuss how to adapt IQC theory to study optimization algorithms, proving new inequalities about convex functions. Using these inequalities, I will derive upper bounds on convergence rates for the gradient method, the heavy-ball method, Nesterov's accelerated method, and related variants by solving small, simple semidefinite programming problems. I will close with a discussion of how to extend these techniques to study more complicated iterative optimization schemes.

Joint work with Laurent Lessard and Andrew Packard

Lorenzo Rosasco

Machine learning algorithms have to be predictive but at the same time efficient and scalable. Investigating the interplay between statistics and computation is emerging as an exciting new avenue for both theoretical and practical studies. A key challenge in this context is the problem of performing provably efficient and adaptive model selection.

Early stopping is one of the most appealing heuristics for this problem, since the computational resources required for learning are directly linked to the desired prediction properties. Regularization is achieved by iteratively passing over the data multiple times, typically choosing the number of passes (epochs) that provides the best prediction performance on a validation set. Despite being widely used in practice, the theoretical properties of early stopping have only recently been investigated.

In this talk we will discuss the connection between learning, stability, and ill-posedness to illustrate recent results that provide a theoretical foundation for early stopping in several iterative learning schemes, including batch and online (incremental) approaches.

Ohad Shamir

Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential-access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits); and more. However, we currently have little understanding of how such information constraints fundamentally affect performance, independent of the learning problem's semantics. For example, are there learning problems where *any* algorithm which has a small memory footprint (or can use only a bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this talk, I'll describe how a single set of results implies positive answers to the above, for a few different settings.

Sergio Verdú

In recent years, several interesting relationships have been discovered between the minimum mean-square error of estimation in additive Gaussian noise and information measures such as entropy, mutual information, and relative entropy. Thanks to these intersections, estimation theory enables some of the simplest known proofs of a number of classical Shannon-theoretic results. Conversely, surprising new results in nonlinear filtering are obtained using information theory.

Ulrike von Luxburg

Consider an unweighted k-nearest-neighbor graph that has been built on a random sample from some unknown density on R^d. Assume we are given the unweighted (!) adjacency matrix of the graph, but we do not know the point locations or any distance or similarity scores between the points. Is it then possible to estimate the underlying density p just from the adjacency matrix of the unweighted graph? As I will show in the talk, the answer is yes. I present a proof of this result, and also discuss relations to the problem of ordinal embedding.

Volodya Vovk

Venn predictors produce probability-type predictions for the labels of test objects which are guaranteed to be well calibrated under the standard assumption that the observations are generated independently from the same distribution. In this talk I will define Venn predictors, give a simple formalization of their property of being well calibrated, and introduce the method of Venn-Abers prediction allowing one to build Venn predictors on top of a wide range of machine-learning algorithms. In conclusion I will state some open problems in this area.