Thompson Sampling for Contextual Bandits with Linear Payoffs

Navin Goyal; Shipra Agrawal

arXiv: 1209.3352 · v4 · pith:W5ZCGH6Pnew · submitted 2012-09-15 · cs.LG · cs.DS · stat.ML


classification cs.LG · cs.DS · stat.ML

keywords contextual · problem · sampling · sqrt · thompson · algorithm · bound · bandit


Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and it has recently generated significant interest after several studies demonstrated that it performs better empirically than state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high-probability regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ (or $\tilde{O}(d\sqrt{T \log(N)})$), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of $\sqrt{d}$ (or $\sqrt{\log(N)}$) of the information-theoretic lower bound for this problem.
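The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common Gaussian formulation of Thompson Sampling with linear payoffs, not necessarily the exact procedure analyzed in the paper. It assumes rewards of the form $b^\top \mu$ plus noise, keeps $B = I_d + \sum_t b_t b_t^\top$ and $f = \sum_t r_t b_t$, samples $\tilde{\mu} \sim \mathcal{N}(B^{-1} f, v^2 B^{-1})$, and plays the arm whose context maximizes $b^\top \tilde{\mu}$. The class name `LinearThompsonSampler`, the helper `run_demo`, the scale `v`, and the toy environment are all hypothetical choices made for illustration.

```python
import math
import random


def cholesky(a):
    """Return lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


class LinearThompsonSampler:
    """Sketch of Gaussian Thompson Sampling for linear payoffs (assumed form)."""

    def __init__(self, d, v=1.0, seed=0):
        self.d = d
        self.v = v  # posterior scale; a free parameter in this sketch
        self.rng = random.Random(seed)
        # Inverse of B = I + sum of b b^T; starts as the identity matrix.
        self.b_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.f = [0.0] * d  # running sum of reward * context

    def mean(self):
        """Posterior mean B^{-1} f."""
        return [sum(self.b_inv[i][j] * self.f[j] for j in range(self.d))
                for i in range(self.d)]

    def sample_parameter(self):
        """Draw from N(mean, v^2 B^{-1}) via a Cholesky factor of B^{-1}."""
        mu_hat = self.mean()
        low = cholesky(self.b_inv)
        z = [self.rng.gauss(0.0, 1.0) for _ in range(self.d)]
        return [mu_hat[i] + self.v * sum(low[i][k] * z[k] for k in range(i + 1))
                for i in range(self.d)]

    def choose(self, contexts):
        """Return the index of the context maximizing b . sampled parameter."""
        mu = self.sample_parameter()
        scores = [sum(x * m for x, m in zip(b, mu)) for b in contexts]
        return max(range(len(contexts)), key=scores.__getitem__)

    def update(self, context, reward):
        """Rank-one update of B^{-1} (Sherman-Morrison) and of f."""
        u = [sum(self.b_inv[i][j] * context[j] for j in range(self.d))
             for i in range(self.d)]
        denom = 1.0 + sum(context[i] * u[i] for i in range(self.d))
        for i in range(self.d):
            for j in range(self.d):
                self.b_inv[i][j] -= u[i] * u[j] / denom
        for i in range(self.d):
            self.f[i] += reward * context[i]


def run_demo(rounds=300, seed=0):
    """Toy environment (hypothetical): two fixed contexts, Gaussian noise."""
    true_mu = [0.8, 0.2]
    contexts = [[1.0, 0.0], [0.0, 1.0]]
    noise = random.Random(seed + 1)
    sampler = LinearThompsonSampler(d=2, v=1.0, seed=seed)
    counts = [0, 0]
    for _ in range(rounds):
        arm = sampler.choose(contexts)
        expected = sum(x * m for x, m in zip(contexts[arm], true_mu))
        sampler.update(contexts[arm], expected + noise.gauss(0.0, 0.1))
        counts[arm] += 1
    return counts, sampler.mean()


if __name__ == "__main__":
    print(run_demo())
```

The sketch maintains $B^{-1}$ directly with a rank-one update so that no matrix inversion is needed per round; a production version would more likely use a numerical library and a scale `v` chosen according to the analysis rather than a fixed constant.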


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Spectral Thompson sampling
cs.LG 2026-04 unverdicted novelty 6.0

SpectralTS achieves regret scaling as $d\sqrt{T \ln N}$ for graph-smooth bandit problems, matching known bounds while scaling better for large numbers of choices.

Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation
cs.LG 2025-07 unverdicted novelty 5.0

Greedy linear models without exploration consistently achieve top-tier performance in over 90% of offline dataset evaluations for linear bandit recommenders, with hyperparameter tuning favoring minimal exploration and...