Wanteng Ma and Tony Cai
We first develop a nonasymptotic theory for offline single-index regression for each arm, combining maximum rank correlation for index estimation with local polynomial regression. Building on this foundation, we propose a single-index bandit algorithm and establish its convergence rate. We further derive a matching lower bound, showing that the algorithm achieves minimax-optimal regret independent of the ambient dimension $d$, thereby overcoming the curse of dimensionality.
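The two offline ingredients can be illustrated with a minimal simulation (a sketch under our own toy model; the data-generating process, grid search, and all names below are illustrative assumptions, not the paper's implementation): maximum rank correlation recovers the index direction without knowledge of the link function, and local polynomial regression then estimates the link along the fitted index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-index model y = g(x^T beta) + noise. The link g = tanh is
# monotone, so the maximum rank correlation (MRC) criterion identifies
# beta up to scale; we normalize ||beta|| = 1.
d, n = 2, 300
theta_true = 0.7
beta_true = np.array([np.cos(theta_true), np.sin(theta_true)])
X = rng.normal(size=(n, d))
y = np.tanh(X @ beta_true) + 0.1 * rng.normal(size=n)

def mrc_objective(beta):
    """Fraction of concordant pairs: 1{y_i > y_j} 1{x_i^T b > x_j^T b}."""
    idx = X @ beta
    return np.mean((y[:, None] > y[None, :]) & (idx[:, None] > idx[None, :]))

# Step 1: index estimation by maximizing rank correlation. With d = 2 and
# unit-norm beta, a grid over angles in [0, pi) suffices for this sketch.
angles = np.linspace(0.0, np.pi, 361)
scores = [mrc_objective(np.array([np.cos(a), np.sin(a)])) for a in angles]
a_hat = angles[int(np.argmax(scores))]
beta_hat = np.array([np.cos(a_hat), np.sin(a_hat)])

# Step 2: local polynomial (here local linear) regression of y on the
# estimated index t = X @ beta_hat to recover the link function g.
def local_linear(t0, t, y, h):
    """Local linear estimate of E[y | index = t0] with bandwidth h."""
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)        # Gaussian kernel weights
    A = np.column_stack([np.ones_like(t), t - t0])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0]                                # intercept = g_hat(t0)

t_hat = X @ beta_hat
g_hat = local_linear(0.5, t_hat, y, h=0.3)        # estimate of g(0.5)
```

The O(n^2) pairwise objective and the one-dimensional angle grid are stand-ins for the paper's estimator and are only practical at toy scale; the point is that neither step requires knowing the link function or the smoothness in advance.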
We also establish an impossibility result for adaptation: without additional assumptions, no policy can adapt to unknown smoothness levels. Under a standard self-similarity condition, however, we construct a policy that remains minimax-optimal while automatically adapting to the unknown smoothness. Finally, in the regime where the dimension $d$ is allowed to grow, our algorithm continues to achieve minimax-optimal regret, and the analysis reveals a phase transition that characterizes the fundamental limits of single-index bandit learning.