# Research Program: "Mathematics of Machine Learning"

Learning theory is a field at the intersection of statistics, probability, computer science, and optimization, concerned with theoretical guarantees for machine learning algorithms. The dialogue between computation and statistics has been key to the field's enormous advances and growth, and it is becoming increasingly important as we enter the age of large-scale data. The goal of the Program is to bring together experts from fields spanning the broad range of modern learning theory, from statistics to optimization.

The research program will take place in Barcelona, Spain, from April 7th, 2014 to July 14th, 2014. Over the course of these fourteen weeks, the Centre de Recerca Matemàtica in Barcelona will host a number of visitors, and we expect a lively research atmosphere with frequent talks and short seminars. In addition, there will be several co-located conferences and a large 3-day workshop on learning theory.

General Information:
• The weekly seminars at CRM are open to all interested researchers. Please send an email to the organizers if you are planning to attend.
• The workshops and conferences will have a separate registration process. Some travel funds will be available for these as well.
• We have limited travel support for a few junior faculty, postdocs, and doctoral students. Please send us an email, or follow directions here.

More information about the location can be found on the website of the Centre de Recerca Matemàtica.

## Visitors

Visitors will be hosted by the Centre de Recerca Matemàtica from April 7th, 2014 to July 14th, 2014.

See the main page for weekly seminars.

For the list of conferences co-organized with the research program, see this updated list with dates.

Long- and short-term visitors

| Name | Affiliation | Tentative Dates |
|------|-------------|-----------------|
| Pierre Alquier | University College Dublin | |
| Sanjeev Arora | Princeton University | |
| Peter Auer | University of Leoben | |
| Francis Bach | Ecole Normale Superieure | |
| Nina Balcan | Georgia Tech | |
| Sebastien Bubeck | Princeton University | |
| Nicolò Cesa-Bianchi | Università degli Studi di Milano | |
| Ofer Dekel | Microsoft | |
| Pedro Delicado | Universitat Politècnica de Catalunya | |
| Luc Devroye | McGill University | |
| Gersende Fort | CNRS | |
| Aurélien Garivier | Université Paul Sabatier | |
| Ricard Gavaldà | Universitat Politècnica de Catalunya | |
| László Györfi | Budapest University of Technology and Economics | |
| Elad Hazan | Technion | |
| Tamás Linder | Queen's University | |
| Gábor Lugosi | Universitat Pompeu Fabra | |
| Stephane Mallat | Ecole Polytechnique | |
| Yishay Mansour | Tel Aviv University | |
| Catherine Matias | CNRS | |
| Ankur Moitra | MIT | |
| Omiros Papaspiliopoulos | Universitat Pompeu Fabra | |
| Alexander Rakhlin | University of Pennsylvania | |
| Ben Recht | UC Berkeley | |
| Philippe Rigollet | Princeton University | |
| Ohad Shamir | Weizmann Institute | |
| Shai Shalev-Shwartz | Hebrew University | |
| Karthik Sridharan | University of Pennsylvania | |
| Csaba Szepesvari | University of Alberta | |
| Martin Wainwright | UC Berkeley | |

# Weekly Seminars

For the up-to-date seminar information, visit this link.

UPCOMING SEMINARS
| Date | Speaker | Title |
|------|---------|-------|
| April 23 | Francis Bach | Beyond stochastic gradient descent for large-scale machine learning |
| April 30 | Balázs Kégl | Learning to discover: signal/background separation and the Higgs boson challenge |
| May 8, 10:45-12:00 | Sebastien Bubeck | On the influence of the seed graph in the preferential attachment model |
| May 22, 10:45-12:00 | Yishay Mansour | Robust Bayesian Inference |
| May 27, 10:45-12:00 | Ohad Shamir | Information Trade-offs in Machine Learning |
| June 5, 10:30-11:30 | Sasha Rakhlin | A Tale of Three Regression Problems |
| June 6, 10:30-11:30 | Amit Daniely | From average case complexity to improper learning complexity |
| June 11, 15:30-18:00 and June 12, 14:00-16:30 | Stéphane Mallat | Minicourse: Analyzing Deep Neural Networks for Learning |
Place: Centre de Recerca Matemàtica, Facultat de Ciències. UAB, Bellaterra (Barcelona)
Auditori Conference Room
Time: 10:30 am after the Research Program's weekly coffee time. No registration is needed.

# Workshop, June 17-19

Preliminary Schedule

| Time | June 17 | June 18 | June 19 |
|------|---------|---------|---------|
| 9:30-10:15 | Sanjeev Arora | Sergio Verdú | Jake Abernethy |
| 10:15-11:00 | Elad Hazan | Ruediger Urbanke | Ohad Shamir |
| 11:30-12:15 | Shie Mannor | Ery Arias-Castro | Lorenzo Rosasco |
| 12:15-13:00 | Ankur Moitra | Ulrike von Luxburg | Matthias Hein |
| lunch break | | | |
| 15:00-15:45 | Dimitris Achlioptas | Volodya Vovk | Rui Castro |
| 15:45-16:30 | Ofer Dekel | Ben Recht | Aryeh Kontorovich |
| 17:00-17:45 | Emmanuel Abbe | Peter Grünwald | |
| 17:45-18:30 | Vitaly Feldman | Daniel Hsu | |

ABSTRACTS:

Recovery threshold in the stochastic block model

Emmanuel Abbe

I will start with general motivations connecting coding to clustering problems, and spend some time on a recent result establishing the recovery threshold for the stochastic block model with two communities: if the edge probability within clusters is $a \log(n)/n$ and across clusters is $b \log(n)/n$, recovery is possible if and only if $(a+b)/2 - \sqrt{ab} > 1$.
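
A minimal Python sketch (function names ours) that samples such a two-community model and evaluates the stated threshold condition:

```python
import math
import random

def sample_sbm(n, a, b, seed=0):
    """Sample a two-community stochastic block model on n nodes:
    nodes 0..n//2-1 form one community, the rest the other.
    Within-community edge probability is a*log(n)/n and
    across-community probability is b*log(n)/n."""
    rng = random.Random(seed)
    p_in = a * math.log(n) / n
    p_out = b * math.log(n) / n
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            same = (i < n // 2) == (j < n // 2)
            if rng.random() < (p_in if same else p_out):
                edges.append((i, j))
    return edges

def exact_recovery_possible(a, b):
    """The threshold from the talk: exact recovery is possible
    if and only if (a + b)/2 - sqrt(a*b) > 1."""
    return (a + b) / 2 - math.sqrt(a * b) > 1
```

For instance, $(a, b) = (9, 1)$ lies above the threshold while $(3, 1)$ lies below it.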

Convex Random Graph Models

Dimitris Achlioptas

Realistic models of random graphs are important both for design (predicting the performance of different protocols/algorithms) and for inference (extracting latent group membership, clustering, etc.). There are by now thousands of papers defining random graph models. I will present a principled framework for deriving random graph models by generalizing the approach of Erdős-Rényi in defining their classic G(n,m) model. The central principle is to study the uniform measure over symmetric sets of graphs, i.e., sets that are invariant under a group of transformations. The main contribution is to derive natural sufficient conditions under which the uniform measure over a symmetric set of graphs collapses asymptotically to a distribution where edges appear independently (with different probabilities).

Based on joint work with Paris Siminelakis.
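
As a reference point, sampling from the classic Erdős-Rényi G(n, m) model, i.e. the uniform measure over the symmetric set of all n-vertex graphs with exactly m edges, is a few lines (a minimal sketch, names ours):

```python
import itertools
import random

def sample_gnm(n, m, seed=0):
    """Sample uniformly from G(n, m): every graph on n labeled
    vertices with exactly m edges is equally likely.  This is the
    uniform measure over a symmetric set of graphs that the talk's
    framework generalizes."""
    pairs = list(itertools.combinations(range(n), 2))  # all possible edges
    rng = random.Random(seed)
    return rng.sample(pairs, m)  # m distinct edges, uniformly
```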

On the convergence of maximum variance unfolding
Ery Arias-Castro

Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is one of the main methods for nonlinear dimensionality reduction.  We study its large-sample limit under standard assumptions.  We find that it is consistent when the underlying submanifold is isometric to a convex subset.  In other cases, we provide some simple examples where it fails to be consistent.

(Joint work with Bruno Pelletier, Université Rennes II, France)

Adaptive Sensing for Estimation of Structured Sparse Signals

Rui Castro

(Joint work with Ervin Tánczos)

Ofer Dekel

Online learning is often modelled as a repeated game between a player and an adversary. The dynamics of this game are well-understood when we make the simplifying assumption that the adversary is non-adaptive (namely, that it does not modify its strategy in reaction to the player’s actions). Without this assumption, our theoretical understanding of the online learning game is much less developed. We begin this talk by mentioning two different formulations of this problem and highlighting the differences between them. Then, we narrow our discussion to the concrete setting where the player’s action set is finite. We examine how the difficulty of the problem changes as we vary the strength of the adversary (from non-adaptive to fully adaptive) and as we vary the amount of feedback given to the player. We present a rich and complete theoretical characterization of this setting and show that the different variants of this problem can be arranged in three tiers of difficulty: easy problems (those with sqrt(T) minimax regret), hard problems (with T^(2/3) minimax regret), and unlearnable problems (with linear minimax regret). We also show that the tier of hard problems, which was not previously identified in this setting, actually includes a variety of natural online learning problems. As a corollary, we also give the first formal confirmation that learning with bandit feedback can be asymptotically harder than learning with full information feedback.

Joint work with Raman Arora, Nicolo Cesa-Bianchi, Jian Ding, Elad Hazan, Tomer Koren, Yuval Peres, Ohad Shamir, and Ambuj Tewari.
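
To make the "easy" tier concrete, here is a minimal Python sketch (function names ours) of the Hedge / multiplicative-weights player, which attains $O(\sqrt{T \log K})$ regret with full-information feedback against a non-adaptive adversary:

```python
import math
import random

def hedge(loss_rounds, eta):
    """Multiplicative-weights (Hedge) player over K actions with
    full-information feedback.  loss_rounds is a sequence of
    length-K loss vectors with entries in [0, 1]; returns the
    player's total expected loss."""
    K = len(loss_rounds[0])
    w = [1.0] * K
    total = 0.0
    for losses in loss_rounds:
        s = sum(w)
        probs = [wi / s for wi in w]                # play proportionally to weights
        total += sum(p * l for p, l in zip(probs, losses))
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, losses)]
    return total

def hedge_regret(loss_rounds, eta):
    """Expected regret against the best fixed action in hindsight;
    with eta = sqrt(8 ln K / T) this is at most sqrt(T ln K / 2)."""
    K = len(loss_rounds[0])
    best = min(sum(r[k] for r in loss_rounds) for k in range(K))
    return hedge(loss_rounds, eta) - best
```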

Using Data Privacy for Better Adaptive Predictions

Vitaly Feldman

Statistical inference procedures and machine learning algorithms need
to estimate the error of hypotheses on the population from which the
observed samples were drawn. Accuracy of the estimates is crucial for
avoiding overfitting and false discovery. We formalize estimation of
hypothesis errors and other population parameters using statistical
queries of Kearns (1993). A single statistical query provides an
estimate of the true expectation of any function of a single sample.

It is easy to obtain an estimate of the expectation of a function
using the empirical mean of the function values on random samples.
Further, using n i.i.d. samples, exponentially many (in n)
statistical queries can be answered this way if the queries are chosen
independently of the samples. Yet this naive approach may fail
dramatically if samples are reused to answer queries that depend on
previous answers; special cases of this problem have been shown to
cause overfitting in many
practical works. We show that remarkably, there exists a different way
to evaluate statistical queries that allows answering an exponential
number of even adaptively chosen queries. This technique,
counter-intuitively, involves perturbing the empirical estimates and
coordinating the answers using techniques developed for data privacy
preservation. Our results also provide practical guidance on how to
prevent overfitting and false discovery in adaptive settings.

Joint work with Cynthia Dwork, Moritz Hardt, Omer Reingold and Aaron Roth

Learning the Learning Rate

Peter Grünwald

'g-stochastic mixability' characterizes how easy a learning problem is. Stochastic concepts of 'easy distributions', such as Tsybakov margin conditions, and individual-sequence concepts of 'easy loss functions', such as Vovk's mixable (and exp-concave) losses, are both special cases. Under g-stochastic mixability, the excess risk of PAC-Bayes and other generalized Bayesian methods converges faster than $O(1/\sqrt{n})$ if the learning rate (LR) is adapted to g. Here I present the Safe Bayesian algorithm, which *learns* the optimal LR from data without knowing g, in batch and online stochastic settings, highlighting two applications:

(1) Safe Bayes repairs standard Bayes when the model is wrong, adaptively decreasing the learning rate when the data tell us to. We give a simple regression example where this dramatically boosts Bayesian performance.
(2) McAllester-style PAC-Bayesian confidence bounds: using Safe Bayes one gets qualitatively different bounds which do not require randomized classifiers and which can be substantially better. In some cases we get bounds of the form $O(\mathrm{KL}/n) + O(1/\sqrt{n})$, whereas previous bounds have $O(\sqrt{\mathrm{KL}/n})$.

The g-stochastic mixability concept extends joint work with B. Williamson, M. Reid and T. van Erven (NIPS 2012).

Volumetric Spanners: an Efficient Exploration Basis for Learning

Elad Hazan

Numerous machine learning problems require an exploration basis - a mechanism to explore the action space. We define a novel convex-geometric notion of exploration basis with low variance called volumetric spanners, and give efficient algorithms to construct such bases.
We will describe applications to optimal experiment design, where we give the first near-optimal and efficient algorithm for the hard-margin criterion of active linear regression, and to bandit linear optimization, for which we give an efficient and near-optimal regret algorithm over general convex sets.

Based on joint work with Zohar Karnin and Raghu Meka.

Matrix Factorization with Binary Components

Matthias Hein

Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In line with recent work on non-negative matrix factorization by Arora et al. (2012), we provide an algorithm that provably recovers the underlying factorization in the exact case. Moreover, we provide conditions under which this factorization is unique. For a random binary factor matrix, uniqueness holds if the span of the columns does not contain another binary vector. We extend a classical result of Odlyzko on this question using a variation of the Littlewood-Offord theorem.

Reducing contextual bandits to supervised learning

Daniel Hsu

I'll describe a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes an action in response to the observed context, observing the reward only for that action. The method assumes access to an oracle for solving cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only $\tilde{O}(\sqrt{T})$ oracle calls across all $T$ rounds. Joint work with Alekh Agarwal, Satyen Kale, John Langford, Lihong Li, and Rob Schapire.
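
To illustrate the oracle interface (not the talk's algorithm, which is statistically far stronger), here is a simple epsilon-greedy baseline in Python that consults a policy-producing oracle trained on logged bandit data; all names are ours:

```python
import random

def epsilon_greedy_cb(rounds, actions, oracle, eps=0.1, seed=0):
    """Epsilon-greedy contextual bandit built on a classification-style
    oracle -- a much weaker baseline than the talk's algorithm, shown
    only to make the oracle interface concrete.

    rounds:  iterable of (context, reward_fn) pairs; reward_fn(a)
             is revealed only for the single chosen action a.
    oracle:  maps logged (context, action, reward) triples to a
             policy, i.e. a function context -> action.
    Returns the total reward collected."""
    rng = random.Random(seed)
    logged = []
    policy = lambda ctx: actions[0]   # arbitrary initial policy
    total = 0.0
    for ctx, reward_fn in rounds:
        if rng.random() < eps:
            a = rng.choice(actions)   # explore
        else:
            a = policy(ctx)           # exploit
        r = reward_fn(a)              # bandit feedback: chosen action only
        logged.append((ctx, a, r))
        policy = oracle(logged)       # retrain on the log
        total += r
    return total
```

Note this calls the oracle every round; the point of the talk's result is achieving optimal regret with only $\tilde{O}(\sqrt{T})$ oracle calls in total.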

Good Margins Make Good Neighbors

Aryeh Kontorovich

Although well known by practitioners to be an effective classification tool, nearest-neighbor methods have been somewhat neglected by learning theory of late. The goal of this talk is to revive interest in this time-tested technique by recasting it in a modern perspective. We will present a paradigm of margin-regularized 1-nearest-neighbor classification which (i) is Bayes-consistent; (ii) yields simple, usable finite-sample error bounds; (iii) provides for very efficient algorithms with a principled speed-accuracy tradeoff; and (iv) allows for near-optimal sample compression. Further extensions include multiclass, regression, and metric dimensionality reduction. I will argue that the regularized 1-nearest neighbor is superior to k-nearest neighbors in several crucial statistical and computational aspects.

Based on a series of works with Lee-Ad Gottlieb, Robert Krauthgamer, and Roi Weiss.
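
As a point of comparison for the compression aspect, here is the classical condensed 1-nearest-neighbor rule of Hart (1968), a much cruder baseline than the talk's margin-regularized scheme, sketched in Python (names ours):

```python
def nn1_classify(train, x):
    """Plain 1-nearest-neighbor prediction over (point, label)
    pairs, where points are tuples of floats."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(train, key=lambda pl: d2(pl[0], x))[1]

def condense(train):
    """Hart's condensed 1-NN: keep a training point only if the
    points kept so far misclassify it.  The compressed set makes
    the same predictions on the training data, with (often) far
    fewer stored points."""
    kept = []
    for p, y in train:
        if not kept or nn1_classify(kept, p) != y:
            kept.append((p, y))
    return kept
```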

Machine learning in high dimensions with adversarial corruption

Shie Mannor

We consider machine learning problems in high dimensions with arbitrary (possibly severe or coordinated) errors in the data. We are interested in understanding how many corruptions we can tolerate while still getting a reasonable solution. Many standard algorithms, such as PCA, SVM, and logistic regression, are extremely brittle and can collapse completely even in the presence of a few adversarially placed outliers.

We dive into the details of the problem of sparse regression, where the objective is to identify the correct support. We show that any natural convex relaxation is bound to fail, and that the brute-force algorithm that enumerates over all supports is also not effective. We explore the power of a simple idea: replace the essential linear-algebraic calculation (the inner product) with a robust counterpart that cannot be greatly affected by a controlled number of arbitrarily corrupted points: the trimmed inner product. We provide robust counterparts of three algorithms popular in the uncorrupted setting: Thresholding Regression, the Lasso, and the Dantzig selector, and show that the counterparts obtained using the trimmed inner product are provably robust. We then discuss the issue of learning on large data sets and propose a framework for distributed robust learning.
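
A minimal sketch of a trimmed inner product (the exact trimming rule used in the talk may differ; this variant discards the k largest-magnitude coordinatewise products):

```python
def trimmed_inner_product(x, y, k):
    """Inner product with the k largest-magnitude coordinatewise
    products discarded, so that a small number of adversarially
    corrupted coordinates cannot dominate the sum."""
    prods = sorted((xi * yi for xi, yi in zip(x, y)), key=abs)
    return sum(prods[:len(prods) - k])
```

For example, a single corrupted coordinate of size 1000 shifts the plain inner product by 1000 but is simply dropped by the trimmed version with k = 1.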

Analyzing Optimization Algorithms with Integral Quadratic Constraints

Ben Recht

I will present a new method to analyze iterative optimization algorithms built on the framework of Integral Quadratic Constraints (IQC) from robust control theory.  IQCs provide sufficient conditions for the stability of complicated interconnected systems, and these conditions can be checked by semidefinite programming.  I will discuss how to adapt IQC theory to study optimization algorithms, proving new inequalities about convex functions.  Using these inequalities, I will derive upper bounds on convergence rates for the gradient method, the heavy-ball method, Nesterov's accelerated method, and related variants by solving small, simple semidefinite programming problems.  I will close with a discussion of how to extend these techniques to study more complicated iterative optimization schemes.

Joint work with Laurent Lessard and Andrew Packard

Early Stopping Regularization

Lorenzo Rosasco

Machine learning algorithms have to be predictive but at the same
time efficient and scalable. Investigating the interplay between
statistics and computation is emerging as an exciting new avenue for
both theoretical and practical studies. A key challenge in this
context is the problem of performing provably efficient and adaptive
model selection.

Early stopping is one of the most appealing heuristics for this
problem, since the  computational resources required for learning  are
directly linked to the desired prediction properties.  Regularization
is achieved  by iteratively passing  over the data multiple times and
typically choosing the number of passes (epochs) providing the best
prediction performance on a validation set.  Despite being widely used
in practice, the theoretical properties of early stopping have only
recently been investigated.

In this talk we will discuss the connection between learning,
stability and ill-posedness to illustrate recent results that provide
a theoretical foundation of early stopping for several iterative
learning schemes, including batch and online (incremental) approaches.
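
The validation-based recipe described above can be sketched in a few lines. This is a minimal illustration (gradient descent on least squares; names and step size are ours), not the schemes analyzed in the talk:

```python
def mse(w, X, y):
    """Mean squared error of linear predictor w on (X, y)."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - yi) ** 2
               for x, yi in zip(X, y)) / len(X)

def early_stopped_gd(X, y, Xval, yval, step=0.1, max_iters=200):
    """Gradient descent on the training squared loss; the number of
    iterations plays the role of the regularization parameter and
    is chosen by validation error (early stopping)."""
    d, n = len(X[0]), len(X)
    w = [0.0] * d
    best_w, best_err = list(w), mse(w, Xval, yval)
    for _ in range(max_iters):
        # gradient of (1/n) * sum_i (w . x_i - y_i)^2
        grad = [0.0] * d
        for x, yi in zip(X, y):
            r = sum(wi * xi for wi, xi in zip(w, x)) - yi
            for j in range(d):
                grad[j] += 2.0 * r * x[j] / n
        w = [wj - step * gj for wj, gj in zip(w, grad)]
        err = mse(w, Xval, yval)
        if err < best_err:            # keep the best pass seen so far
            best_err, best_w = err, list(w)
    return best_w, best_err
```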

Ohad Shamir

Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential-access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits), and more. However, we currently have little understanding of how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where *any* algorithm which has a small memory footprint (or can use only a bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this talk, I'll describe how a single set of results implies positive answers to the above for a few different settings.

Estimation in Information Theory

Sergio Verdú

In recent years, several interesting relationships have been discovered between the minimum mean-square error (MMSE) of estimation in additive Gaussian noise and information measures
such as entropy, mutual information and relative entropy. Thanks to these connections, estimation theory enables some of the simplest known proofs of a number of classical Shannon-theoretic results. Conversely, surprising new results in nonlinear filtering are obtained using information theory.
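
A central example of such a relationship is the well-known I-MMSE formula of Guo, Shamai and Verdú (notation ours): for $Y = \sqrt{\mathsf{snr}}\,X + N$ with $N \sim \mathcal{N}(0,1)$ independent of $X$,

```latex
\frac{\mathrm{d}}{\mathrm{d}\,\mathsf{snr}}\,
  I\!\left(X;\sqrt{\mathsf{snr}}\,X+N\right)
  = \frac{1}{2}\,\mathrm{mmse}(X,\mathsf{snr}),
\qquad
\mathrm{mmse}(X,\mathsf{snr})
  = \mathbb{E}\!\left[\bigl(X-\mathbb{E}[X \mid \sqrt{\mathsf{snr}}\,X+N]\bigr)^{2}\right].
```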

Density estimation from unweighted kNN graphs

Ulrike von Luxburg

Consider an unweighted k-nearest neighbor graph that has been built on
a random sample from some unknown density on R^d. Assume we are given
the unweighted (!) adjacency matrix of the graph, but we do not know
the point locations or any distance or similarity scores between the
points. Is it then possible to estimate the underlying density p, just
from the adjacency matrix of the unweighted graph? As I will show in
the talk, the answer is yes. I present a proof for this result, and
also discuss relations to the problem of ordinal embedding.
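
For contrast, here is the classical one-dimensional k-NN density estimate, which uses the very distances that the talk's setting withholds (a minimal sketch, names ours):

```python
def knn_density_1d(sample, x, k):
    """Classical one-dimensional k-NN density estimate,
    p_hat(x) = k / (2 * n * r_k(x)), where r_k(x) is the distance
    from x to its k-th nearest sample point.  This estimator needs
    the distances; the talk's result recovers the density from the
    unweighted k-NN adjacency matrix alone."""
    dists = sorted(abs(s - x) for s in sample)
    r_k = dists[k - 1]
    return k / (2.0 * len(sample) * r_k)
```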

Venn prediction

Volodya Vovk

Venn predictors produce probability-type predictions for the labels of test objects which are guaranteed to be well calibrated under the standard assumption that the observations are generated independently from the same distribution. In this talk I will define Venn predictors, give a simple formalization of their property of being well calibrated, and introduce the method of Venn-Abers prediction allowing one to build Venn predictors on top of a wide range of machine-learning algorithms. In conclusion I will state some open problems in this area.

Organizers

- Sebastien Bubeck, Princeton University
- Nicolo Cesa-Bianchi, Università degli Studi di Milano
- Gabor Lugosi, Pompeu Fabra University
- Sasha Rakhlin, University of Pennsylvania

The Program is supported in part by the National Science Foundation.