# Workshop: Foundations of Learning Theory

We will hold a three-day workshop, "Mathematical Foundations of Learning Theory", on June 17-19.

Preliminary Schedule

| | June 17 | June 18 | June 19 |
|---|---|---|---|
| 9:30-10:15 | Sanjeev Arora | Sergio Verdú | Jake Abernethy |
| 10:15-11:00 | Elad Hazan | Ruediger Urbanke | Ohad Shamir |
| 11:30-12:15 | Shie Mannor | Ery Arias-Castro | Lorenzo Rosasco |
| 12:15-13:00 | Ankur Moitra | Ulrike von Luxburg | Matthias Hein |
| lunch break | | | |
| 15:00-15:45 | Dimitris Achlioptas | Volodya Vovk | Rui Castro |
| 15:45-16:30 | Ofer Dekel | Ben Recht | Aryeh Kontorovich |
| 17:00-17:45 | Emmanuel Abbe | Peter Grünwald | |
| 17:45-18:30 | Vitaly Feldman | Daniel Hsu | |

ABSTRACTS:

Recovery threshold in the stochastic block model

Emmanuel Abbe

I will start with general motivations connecting coding to clustering problems, and spend some time on a recent result establishing the recovery threshold for the stochastic block model with two communities: if the edge probability within clusters is a*log(n)/n and across clusters is b*log(n)/n, recovery is possible if and only if (a+b)/2 - sqrt(ab) > 1.
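
As a quick sanity check on the threshold condition (the function name and example values are mine, not from the talk):

```python
import math

def exact_recovery_possible(a: float, b: float) -> bool:
    """Exact-recovery condition for the two-community stochastic block
    model with intra-/inter-cluster edge probabilities a*log(n)/n and
    b*log(n)/n: recovery is possible iff (a+b)/2 - sqrt(a*b) > 1."""
    return (a + b) / 2 - math.sqrt(a * b) > 1

# Well-separated communities (a=9, b=1): (9+1)/2 - 3 = 2 > 1
print(exact_recovery_possible(9, 1))   # True
# Weakly separated (a=5, b=1): 3 - sqrt(5) ≈ 0.76 < 1
print(exact_recovery_possible(5, 1))   # False
```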

Convex Random Graph Models

Dimitris Achlioptas

Realistic models of random graphs are important both for design (predicting the performance of different protocols/algorithms) and for inference (extracting latent group membership, clustering, etc.). There are by now thousands of papers defining random graph models. I will present a principled framework for deriving random graph models by generalizing the approach of Erdős-Rényi in defining their classic G(n,m) model. The central principle is to study the uniform measure over symmetric sets of graphs, i.e., sets that are invariant under a group of transformations. The main contribution is to derive natural sufficient conditions under which the uniform measure over a symmetric set of graphs collapses asymptotically to a distribution where edges appear independently (with different probabilities).

Based on joint work with Paris Siminelakis.

On the convergence of maximum variance unfolding
Ery Arias-Castro

Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is one of the main methods for nonlinear dimensionality reduction.  We study its large-sample limit under standard assumptions.  We find that it is consistent when the underlying submanifold is isometric to a convex subset.  In other cases, we provide some simple examples where it fails to be consistent.

(Joint work with Bruno Pelletier, Université Rennes II, France)

Adaptive Sensing for Estimation of Structured Sparse Signals

Rui Castro

(Joint work with Ervin Tánczos)

Ofer Dekel

Online learning is often modelled as a repeated game between a player and an adversary. The dynamics of this game are well-understood when we make the simplifying assumption that the adversary is non-adaptive (namely, that it does not modify its strategy in reaction to the player’s actions). Without this assumption, our theoretical understanding of the online learning game is much less developed. We begin this talk by mentioning two different formulations of this problem and highlighting the differences between them. Then, we narrow our discussion to the concrete setting where the player’s action set is finite. We examine how the difficulty of the problem changes as we vary the strength of the adversary (from non-adaptive to fully adaptive) and as we vary the amount of feedback given to the player. We present a rich and complete theoretical characterization of this setting and show that the different variants of this problem can be arranged in three tiers of difficulty: easy problems (those with sqrt(T) minimax regret), hard problems (with T^(2/3) minimax regret), and unlearnable problems (with linear minimax regret). We also show that the tier of hard problems, which was not previously identified in this setting, actually includes a variety of natural online learning problems. As a corollary, we also give the first formal confirmation that learning with bandit feedback can be asymptotically harder than learning with full information feedback.
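
The three tiers separate cleanly as the horizon grows; a tiny numeric illustration (constants suppressed, values arbitrary):

```python
# Minimax regret growth for the three tiers: easy ~ sqrt(T),
# hard ~ T^(2/3), unlearnable ~ T (regret linear in the horizon).
for T in (10**3, 10**6, 10**9):
    easy, hard, unlearnable = T ** 0.5, T ** (2 / 3), T
    # Per-round regret vanishes for the first two tiers, not the third.
    print(T, easy / T, hard / T, unlearnable / T)
```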

Joint work with Raman Arora, Nicolo Cesa-Bianchi, Jian Ding, Elad Hazan, Tomer Koren, Yuval Peres, Ohad Shamir, and Ambuj Tewari.

Using Data Privacy for Better Adaptive Predictions

Vitaly Feldman

Statistical inference procedures and machine learning algorithms need
to estimate the error of hypotheses on the population from which the
observed samples were drawn. Accuracy of the estimates is crucial for
avoiding overfitting and false discovery. We formalize estimation of
hypothesis errors and other population parameters using statistical
queries of Kearns (1993). A single statistical query provides an
estimate of the true expectation of any function of a single sample.

It is easy to obtain an estimate of the expectation of a function
using the empirical mean of the function values on random samples.
Further, using n i.i.d. samples, exponentially many (in n)
statistical queries can be answered this way, provided the queries are
chosen independently of the samples. Yet this naive approach may fail
dramatically if samples are reused to answer queries that depend on
the answers to previous queries; special cases of this problem have
been shown to cause overfitting in many practical settings. We show
that, remarkably, there exists a different way
to evaluate statistical queries that allows answering an exponential
number of even adaptively chosen queries. This technique,
counter-intuitively, involves perturbing the empirical estimates and
coordinating the answers using techniques developed for data privacy
preservation. Our results also provide practical guidance on how to
prevent overfitting and false discovery in adaptive settings.
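
A minimal sketch of the perturbation idea (Laplace noise on the empirical mean, as in differential privacy); this is an illustration only, not the paper's exact mechanism, and the function names are mine:

```python
import math
import random

def answer_statistical_query(samples, f, noise_scale):
    """Answer a statistical query E[f(x)] (f maps a sample into [0, 1])
    by perturbing the empirical mean with Laplace noise.  The noise
    masks the influence of any single sample, which is the key to
    keeping answers valid for adaptively chosen queries."""
    empirical = sum(f(x) for x in samples) / len(samples)
    u = random.random() - 0.5                      # Laplace via inverse CDF
    noise = -noise_scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return empirical + noise

random.seed(0)
data = [random.random() for _ in range(10000)]    # true mean of f is 0.5
estimate = answer_statistical_query(data, lambda x: x, noise_scale=0.01)
print(estimate)
```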

Joint work with Cynthia Dwork, Moritz Hardt, Omer Reingold and Aaron Roth

Learning the Learning Rate

Peter Grünwald

'g-stochastic mixability' characterizes the easiness of a learning problem. Stochastic concepts of 'easy distributions' such as Tsybakov margin conditions and individual sequence concepts of 'easy loss functions' such as Vovk's mixable (and exp-concave) losses are both special cases. Under g-stochastic mixability, the excess risk of PAC-Bayes and other generalized Bayesian methods converges faster than O(1/sqrt n), if the learning rate LR is adapted to g. Here I present the Safe Bayesian algorithm, which *learns* the optimal LR from data without knowing g, in both batch and online stochastic settings, and highlight two applications:

(1) Safe Bayes repairs standard Bayes when the model is wrong, adaptively decreasing the learning rate when the data tell us to. We give a simple regression example where this dramatically boosts Bayesian performance.
(2) McAllester-style PAC-Bayesian confidence bounds: using Safe Bayes one gets qualitatively different bounds which do not require randomized classifiers and which can be substantially better. In some cases we get bounds of the form O(KL/n) + O(1/sqrt n), whereas previous bounds are of the form O(sqrt(KL/n)), where KL denotes the KL divergence term.

The g-stochastic mixability concept extends joint work with B. Williamson, M. Reid and T. van Erven (NIPS 2012).

Volumetric Spanners: an Efficient Exploration Basis for Learning

Elad Hazan

Numerous machine learning problems require an exploration basis - a mechanism to explore the action space. We define a novel convex-geometric notion of exploration basis with low variance called volumetric spanners, and give efficient algorithms to construct such bases.
We'll describe applications to optimal experiment design, for which we give the first near-optimal and efficient algorithm for the hard-margin criterion of active linear regression, and to bandit linear optimization, for which we give an efficient and near-optimal regret algorithm over general convex sets.
Based on joint work with Zohar Karnin and Raghu Meka.

Matrix Factorization with Binary Components

Matthias Hein

Motivated by an application in computational biology, we consider low-rank matrix
factorization with {0, 1}-constraints on one of the factors and optionally convex
constraints on the second one. In line with recent work on non-negative matrix factorization by Arora et
al. (2012), we provide an algorithm that provably recovers the underlying factorization in the
exact case. Moreover, we provide conditions under which this factorization is unique. For a random binary
factor matrix, uniqueness holds if the span of its columns contains no other binary vector. We extend
a classical result of Odlyzko on this question using a variation of the Littlewood-Offord theorem.
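
In the exact case, the recovery problem is easy to state as a brute-force search (this toy enumeration is mine and is exponential in the size of the binary factor; the talk's algorithm is efficient, unlike this sketch):

```python
import itertools
import numpy as np

def recover_binary_factor(D, r, tol=1e-6):
    """Find W in {0,1}^(m x r) and real H with D = W @ H by enumerating
    all binary W -- a toy illustration, not the talk's algorithm."""
    m, _ = D.shape
    for bits in itertools.product([0, 1], repeat=m * r):
        W = np.array(bits, dtype=float).reshape(m, r)
        if np.linalg.matrix_rank(W) < r:
            continue
        H, *_ = np.linalg.lstsq(W, D, rcond=None)   # best H for this W
        if np.linalg.norm(W @ H - D) < tol:         # exact fit found
            return W, H
    return None

W_true = np.array([[1., 0.], [0., 1.], [1., 1.]])
H_true = np.array([[1., 2., 3., 4.], [5., 6., 7., 8.]])
W, H = recover_binary_factor(W_true @ H_true, r=2)
print(np.allclose(W @ H, W_true @ H_true))  # True (up to column order)
```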

Reducing contextual bandits to supervised learning

Daniel Hsu

I'll describe a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes an action in response to the observed context, observing the reward only for that action. The method assumes access to an oracle for solving cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only $\tilde{O}(\sqrt{T})$ oracle calls across all $T$ rounds. Joint work with Alekh Agarwal, Satyen Kale, John Langford, Lihong Li, and Rob Schapire.
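
The interaction protocol is easy to sketch; here a simple epsilon-greedy learner over per-context empirical means stands in for the paper's oracle-based method (all names and numbers below are mine, for illustration only):

```python
import random

def contextual_bandit(T=5000, eps=0.1, seed=0):
    """Bandit interaction loop: observe a context, pick an action, see
    the reward for that action only."""
    rng = random.Random(seed)
    contexts, actions = (0, 1), (0, 1)
    # True mean rewards, unknown to the learner; the optimal policy
    # plays action 1 in context 0 and action 0 in context 1.
    mean_reward = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.3}
    total = {key: 0.0 for key in mean_reward}
    count = {key: 0 for key in mean_reward}
    for _ in range(T):
        x = rng.choice(contexts)                   # observe context
        if rng.random() < eps:                     # explore
            a = rng.choice(actions)
        else:                                      # exploit empirical means
            a = max(actions, key=lambda a: total[x, a] / max(count[x, a], 1))
        r = 1.0 if rng.random() < mean_reward[x, a] else 0.0
        total[x, a] += r                           # feedback for a only
        count[x, a] += 1
    # Learned policy: empirically best action per context.
    return {x: max(actions, key=lambda a: total[x, a] / max(count[x, a], 1))
            for x in contexts}

print(contextual_bandit())
```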

Good Margins Make Good Neighbors

Aryeh Kontorovich

Although well-known by practitioners to be an effective classification
tool, nearest-neighbor methods have been somewhat neglected by
learning theory of late. The goal of this talk is to revive interest
in this time-tested technique by recasting it in a modern perspective.
We will present a paradigm of margin-regularized 1-nearest neighbor
classification which: (i) is Bayes-consistent; (ii) yields simple,
usable finite-sample error bounds; (iii) provides for very efficient
algorithms with a principled speed-accuracy tradeoff; and (iv) allows
for near-optimal sample compression. Further extensions include
multiclass, regression, and metric dimensionality reduction. I will
argue that the regularized 1-nearest neighbor is superior to k-nearest
neighbors in several crucial statistical and computational aspects.
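
The base predictor under discussion is simple to state (a plain, unregularized 1-NN sketch; the margin regularization and sample compression are the talk's contribution and are not shown here):

```python
import numpy as np

def nn1_classify(X_train, y_train, x):
    """Plain 1-nearest-neighbor: predict the label of the closest
    training point in Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[int(np.argmin(dists))]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(nn1_classify(X, y, np.array([0.05, 0.1])))  # 0
print(nn1_classify(X, y, np.array([0.95, 1.0])))  # 1
```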

Based on a series of works with: Lee-Ad Gottlieb, Robert Krauthgamer, Roi Weiss

Machine learning in high dimensions with adversarial corruption

Shie Mannor

We consider machine learning problems in high dimensions with arbitrary (possibly, severe or coordinated) errors in the data. We are interested in understanding how many corruptions we can tolerate, while getting a reasonable solution.
Many standard algorithms such as PCA, SVM, logistic regression, and others are extremely brittle and can collapse completely even in the presence of a few adversarially placed outliers.
We dive into the details of the problem of sparse regression, where the objective is to identify the correct support. We show that any natural convex relaxation is bound to fail, and that the brute-force algorithm that enumerates over all supports is also not effective. We explore the power of a simple idea: replace the essential linear algebraic calculation (the inner product) with a robust counterpart that cannot be greatly affected by a controlled number of arbitrarily corrupted points: the trimmed inner product.
We provide the robust counterparts of three popular algorithms in the uncorrupted setting: Thresholding Regression, Lasso, and the Dantzig selector, and show that the counterparts obtained using the trimmed inner product are provably robust.
We then discuss the issue of learning on large data sets and propose a framework for distributed robust learning.
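
The core primitive is easy to sketch (a minimal version; the variable names are mine):

```python
def trimmed_inner_product(x, y, k):
    """Inner product with the k largest-magnitude coordinate products
    discarded, so k adversarially corrupted coordinates cannot move
    the result arbitrarily."""
    products = sorted((xi * yi for xi, yi in zip(x, y)), key=abs)
    return sum(products[:len(products) - k])

x = [1.0] * 10
y = [0.5] * 10
y_corrupt = y[:]
y_corrupt[3] = 1e6          # one adversarially corrupted coordinate
# The plain inner product is destroyed; the trimmed one barely moves.
print(sum(a * b for a, b in zip(x, y_corrupt)))   # 1000004.5
print(trimmed_inner_product(x, y_corrupt, k=1))   # 4.5
```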

Analyzing Optimization Algorithms with Integral Quadratic Constraints

Ben Recht

I will present a new method to analyze iterative optimization algorithms built on the framework of Integral Quadratic Constraints (IQC) from robust control theory.  IQCs provide sufficient conditions for the stability of complicated interconnected systems, and these conditions can be checked by semidefinite programming.  I will discuss how to adapt IQC theory to study optimization algorithms, proving new inequalities about convex functions.  Using these inequalities, I will derive upper bounds on convergence rates for the gradient method, the heavy-ball method, Nesterov's accelerated method, and related variants by solving small, simple semidefinite programming problems.  I will close with a discussion of how to extend these techniques to study more complicated iterative optimization schemes.
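
As a concrete instance of the kind of certificate such an analysis produces: for a quadratic with curvature between m and L, gradient descent with step size α contracts by max(|1 - αm|, |1 - αL|) per iteration, a classical rate. A quick numerical check (the example problem is mine):

```python
import numpy as np

m, L = 1.0, 10.0
A = np.diag([m, 4.0, L])          # quadratic f(x) = x^T A x / 2
alpha = 2 / (m + L)               # classical optimal step size
predicted_rate = max(abs(1 - alpha * m), abs(1 - alpha * L))  # (L-m)/(L+m)

x = np.array([1.0, 1.0, 1.0])
rates = []
for _ in range(50):
    x_next = x - alpha * (A @ x)  # one gradient step on f
    rates.append(np.linalg.norm(x_next) / np.linalg.norm(x))
    x = x_next
# The observed per-step contraction never exceeds the predicted rate.
print(predicted_rate, max(rates))
```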

Joint work with Laurent Lessard and Andrew Packard

Early Stopping Regularization

Lorenzo Rosasco

Machine learning algorithms have to be predictive but at the same
time efficient and scalable. Investigating the interplay between
statistics and computation is emerging as an exciting new avenue for
both theoretical and practical studies. A key challenge in this
context is the problem of performing provably efficient and adaptive
model selection.

Early stopping is one of the most appealing heuristics for this
problem, since the computational resources required for learning are
directly linked to the desired prediction properties. Regularization
is achieved by iteratively passing over the data multiple times and
typically choosing the number of passes (epochs) that gives the best
prediction performance on a validation set. Despite being widely used
in practice, the theoretical properties of early stopping have only
recently been investigated.

In this talk we will discuss the connection between learning,
stability and ill-posedness to illustrate recent results that provide
a theoretical foundation for early stopping in several iterative
learning schemes, including batch and online (incremental) approaches.
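
A minimal numerical illustration of the heuristic (gradient descent on least squares, stopping at the epoch with the best validation error; the setup is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 30                          # few samples relative to features
w_true = rng.standard_normal(d)
X_train, X_val = rng.standard_normal((n, d)), rng.standard_normal((n, d))
y_train = X_train @ w_true + 0.5 * rng.standard_normal(n)
y_val = X_val @ w_true + 0.5 * rng.standard_normal(n)

w = np.zeros(d)
step = 1.0 / np.linalg.norm(X_train, 2) ** 2   # 1 / sigma_max^2: stable step
val_errors = []
for epoch in range(500):
    w -= step * X_train.T @ (X_train @ w - y_train)   # one pass over the data
    val_errors.append(np.mean((X_val @ w - y_val) ** 2))

# Early stopping: keep the iterate with the best validation error.
best_epoch = int(np.argmin(val_errors))
print(best_epoch, val_errors[best_epoch], val_errors[-1])
```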

Ohad Shamir

Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential-access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits); and more. However, we currently have little understanding of how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where *any* algorithm which has a small memory footprint (or can use only a bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this talk, I'll describe how a single set of results implies positive answers to the above for a few different settings.

Estimation in Information Theory

Sergio Verdú

In recent years, several interesting relationships have been discovered between the minimum mean-square error of estimation in additive Gaussian noise and information measures
such as entropy, mutual information and relative entropy. Thanks to these connections, estimation theory enables some of the simplest known proofs of a number of classical Shannon-theoretic results. Conversely, surprising new results in nonlinear filtering are obtained using information theory.
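
The best-known such relationship (added here for context; it is not stated in the abstract) is the I-MMSE formula of Guo, Shamai and Verdú: for a signal X observed in standard Gaussian noise at signal-to-noise ratio γ,

```latex
\frac{\mathrm{d}}{\mathrm{d}\gamma}\, I\bigl(X;\ \sqrt{\gamma}\,X + N\bigr)
  \;=\; \tfrac{1}{2}\,\mathrm{mmse}(\gamma),
\qquad N \sim \mathcal{N}(0,1),
```

where mmse(γ) is the minimum mean-square error of estimating X from the noisy observation.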

Density estimation from unweighted kNN graphs

Ulrike von Luxburg

Consider an unweighted k-nearest neighbor graph that has been built on
a random sample from some unknown density on R^d. Assume we are given
the unweighted (!) adjacency matrix of the graph, but we do not know
the point locations or any distance or similarity scores between the
points. Is it then possible to estimate the underlying density p, just
from the adjacency matrix of the unweighted graph? As I will show in
the talk, the answer is yes. I present a proof for this result, and
also discuss relations to the problem of ordinal embedding.
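
For contrast, when distances are available, the classical k-NN density estimate at x is k divided by n times the volume of the ball reaching the k-th neighbor; the talk's point is that even without access to those distances, the unweighted graph still determines p. A quick check of the classical estimator in one dimension (the setup is mine):

```python
import numpy as np

def knn_density_1d(sample, x, k):
    """Classical k-NN density estimate in 1-D: k / (n * 2 * r_k), where
    r_k is the distance from x to its k-th nearest sample point."""
    r_k = np.sort(np.abs(sample - x))[k - 1]
    return k / (len(sample) * 2 * r_k)

rng = np.random.default_rng(0)
sample = rng.uniform(0, 1, size=2000)   # true density is 1 on [0, 1]
print(knn_density_1d(sample, 0.5, k=50))
```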

Venn prediction

Volodya Vovk

Venn predictors produce probability-type predictions for the labels of test objects which are guaranteed to be well calibrated under the standard assumption that the observations are generated independently from the same distribution. In this talk I will define Venn predictors, give a simple formalization of their property of being well calibrated, and introduce the method of Venn-Abers prediction allowing one to build Venn predictors on top of a wide range of machine-learning algorithms. In conclusion I will state some open problems in this area.