Publications




Information Complexity of Black-Box Convex Optimization: A New Look via Feedback Information Theory (with M. Raginsky). Allerton Conference on Communication, Control, and Computing, 2009.
Abstract:



Quantitative Analysis of Embedded Systems Using Game-Theoretic Learning (with S. Seshia). Submitted.
Abstract:



A Stochastic View of Optimal Regret through Minimax Duality (with J. Abernethy, A. Agarwal, and P. Bartlett). COLT 2009.
Abstract: We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional--the minimizer over the player's actions of expected loss--defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.



Beating the Adaptive Bandit with High Probability (with J. Abernethy). Information Theory and Applications Workshop, 2009; COLT 2009.
The tech report version with all the proofs is here.
Abstract: We provide a principled way of proving O(\sqrt{T}) high-probability guarantees for partial-information (bandit) problems over arbitrary convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of "local" norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given one of such algorithms as a black-box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability O(\sqrt{T}) bound holds whenever the black-box, the sampling scheme, and the estimates of missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an O(\sqrt{T}) high-probability bound. We also derive the result for the n-simplex, improving the O(\sqrt{nT\log(nT)}) bound of Auer et al by replacing the logT term with loglogT and closing the gap to the lower bound of \Omega(\sqrt{nT}). While O(\sqrt{T}) high-probability bounds should hold for general decision sets through our main result, construction of linear upper bounds depends on the particular geometry of the set; we believe that the sphere example already exhibits the necessary ingredients. The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of Abernethy et al) and the algorithms are efficient, given that the linear upper bounds on confidence can be computed.



Matrix Regularization Techniques for Online Multitask Learning (with A. Agarwal and P. Bartlett). Technical Report, 2008.
Abstract: In this paper we examine the problem of prediction with expert advice in a setup where the learner is presented with a sequence of examples coming from different tasks. In order for the learner to be able to benefit from performing multiple tasks simultaneously, we make assumptions of task relatedness by constraining the comparator to use a lesser number of best experts than the number of tasks. We show how this corresponds naturally to learning under spectral or structural matrix constraints, and propose regularization techniques to enforce the constraints. The regularization techniques proposed here are interesting in their own right and multitask learning is just one application for the ideas. A theoretical analysis of one such regularizer is performed, and a regret bound that shows benefits of this setup is reported.



Game-Theoretic Timing Analysis (with S. Seshia). IEEE/ACM Conference on Computer-Aided Design (ICCAD), 2008.
Abstract: Estimating the worst-case execution time (WCET) of tasks is a key step in the design of reliable real-time software and systems. In this paper, we present a new, game-theoretic approach to estimating WCET based on performing directed measurements on the target platform. We model the estimation problem as a game between our algorithm (player) and the environment of the program (adversary), where the player seeks to find the longest path through the program while the adversary sets environment parameters to thwart the player. We present both theoretical and experimental results demonstrating the utility of our approach. On the theoretical side, we prove that our algorithm can converge to find the longest path with high probability. Experimental results indicate that our approach is competitive with an existing technique based on static analysis and integer programming. Moreover, the approach can be easily applied to even complex hardware/software platforms.



Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization (with J. Abernethy and E. Hazan). COLT 2008.
Abstract: We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the non-stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.



High probability regret bounds for online optimization (with P. Bartlett, V. Dani, T. Hayes, S. Kakade, and A. Tewari). COLT 2008.
Abstract: We present a modification of the algorithm of Dani et al. for the online linear optimization problem in the bandit setting, which with high probability has regret at most $O^*(\sqrt{T})$ against an adaptive adversary. This improves on the previous algorithm of Dani et al whose regret is bounded \emph{in expectation } against an \emph{oblivious} adversary. We obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al and the remarkable technique of Auer et al. for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.



Optimal Strategies and Minimax Lower Bounds for Online Convex Games (with Jacob Abernethy, Peter Bartlett, and Ambuj Tewari). COLT 2008.
Abstract: A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction $x$ from a convex set, the environment plays a loss function $f$, and the learner's long-term goal is to minimize regret. Algorithms have been proposed by Zinkevich, when $f$ is assumed to be convex, and Hazan et al., when $f$ is assumed to be strongly convex, that have provably low regret. We consider these two settings and analyze such games from a minimax perspective, proving minimax strategies and lower bounds in each case. These results prove that the existing algorithms are essentially optimal.



Closing the Gap between Bandit and Full-Information Online Optimization: High-Probability Regret Bound (with Peter Bartlett and Ambuj Tewari), 2007
Abstract: We demonstrate a modification of the algorithm of Dani et al for the online linear optimization problem in the bandit setting, which allows us to achieve an $O(\sqrt{T\ln T})$ regret bound {\it in high probability}, as opposed to the {\it in expectation } result of Dani et al. Moreover, we obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al and the remarkable technique of Auer et al for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.
Note: A similar analysis has been carried out independently by Dani, Hayes & Kakade. A merged version of our results is being submitted.



Adaptive Online Gradient Descent (with Peter Bartlett and Elad Hazan), NIPS 2007. Technical report version available here.
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between $\sqrt{T}$ and $\log T$. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.



Online Discovery of Similarity Mappings (with Jacob Abernethy and Peter Bartlett), ICML 2007.
Abstract: We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the ``Online Prediction with Expert Advice'' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.



Multitask Learning with Expert Advice (with Jacob Abernethy and Peter Bartlett), COLT 2007. Technical report version available here.
Abstract: We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the "ideal" algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient approximate randomized algorithm based on Markov chain Monte Carlo techniques.



Stability of K-means Clustering (with Andrea Caponnetto). NIPS, 2006.
Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class $H_K$ and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of $H_K$ with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of $\Omega(n^{1/2})$ samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in $H_K$ implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K.



Stability Properties of Empirical Risk Minimization over Donsker Classes (with Andrea Caponnetto). Journal of Machine Learning Research. Vol. 7 (Dec), 2565--2583, 2006.
(Older version as a technical report: AI Memo 2005-018. May 2005)
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number $n$ of samples grows, the $L_2$-diameter of the set of almost-minimizers of empirical error with tolerance $\xi(n)=o(n^{-\frac{1}{2}})$ converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as $n$ increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than $n^{-1/2}$.



Risk Bounds for Mixture Density Estimation (with Dmitry Panchenko and Sayan Mukherjee). ESAIM Probability and Statistics. Vol. 9, 220-229, June 2005.
(Older version as a technical report: AI Memo 2004-001. Jan 2004)
Abstract: In this paper we focus on the problem of estimating a bounded density using a finite combination of densities from a given class. We consider the Maximum Likelihood Estimator (MLE) and the greedy procedure described by Li and Barron \cite{LiBarron99,Li99} under the additional assumption of boundedness of densities. We prove an $O(\frac{1}{\sqrt{n}})$ bound on the estimation error which does not depend on the number of densities in the estimated combination. Under the boundedness assumption, this improves the bound of Li and Barron by removing the $\log n$ factor and also generalizes it to the base classes with converging Dudley integral.



Stability Results in Learning Theory (with Sayan Mukherjee and Tomaso Poggio). Analysis and Applications, Special Issue on Learning Theory. Vol. 3, No. 4, 397-419. October 2005.
Abstract: The problem of proving generalization bounds for the performance of learning algorithms can be formulated as a problem of bounding the bias and variance of estimators of the expected error. We show how various {\it stability assumptions} can be employed for this purpose. We provide a necessary and sufficient stability condition for bounding the bias and variance for the Empirical Risk Minimization algorithm, and various sufficient conditions for bounding bias and variance of estimators for general algorithms. We discuss settings in which it is possible to obtain exponential bounds, and we prove an extension of the bounded-difference inequality for "almost always" stable algorithms.



On Stability and Concentration of Measure (with Sayan Mukherjee and Tomaso Poggio). CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, June 2004.



B (with Poggio, T., S. Mukherjee, R. Rifkin and A. Verri). In: Uncertainty in Geometric Computations, J. Winkler and M. Niranjan (eds.), Kluwer Academic Publishers, 131-141, 2002.
Abstract: In this chapter we discuss the role of $b$, which is the constant in the standard form of the solutions provided by the Support Vector Machine technique $f(\bx x)=\Sum_i^l $\alpha_i$ K({\bf x}, {\bf x}_i) + b$, which is a special case of Regularization Machines. In the process, we describe properties of Reproducing Kernel Hilbert Spaces induced by different classes of kernels.



Bagging Regularizes (with T. Poggio, R. Rifkin, S. Mukherjee). AI Memo 2002-003, Massachusetts Institute of Technology, Cambridge, MA, February 2002.
Abstract: Intuitively, we expect that averaging -- or bagging -- different regressors with low correlation should smooth their behavior and be somewhat similar to regularization. In this note we make this intuition precise. Using an almost classical definition of stability, we prove that a certain form of averaging provides generalization bounds with a rate of convergence of the same order as Tikhonov regularization -- similar to fashionable RKHS-based learning algorithms.



"Extra-label Information: Experiments with View-based Classification." (with G. Yeo and T. Poggio). Proceedings of the Sixth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES'2002), Podere d'Ombriano, Crema, Italy, September 16-18, 2002.
Abstract: Extra information is often readily available but not utilized in a classification paradigm. Here we explore using extra labels (profile faces and rotated faces) to aid in distinguishing faces versus non-faces. We propose a way to combine simple discriminant classifiers to build a more complex ones and justify the combination in a probabilistic setting.