Spectral Thompson sampling

Michal Valko; Remi Munos; Shipra Agrawal; Tomas Kocak

arxiv: 2604.13739 · v1 · submitted 2026-04-15 · 💻 cs.LG · stat.ML

Spectral Thompson sampling

Tomas Kocak , Michal Valko , Remi Munos , Shipra Agrawal This is my paper

Pith reviewed 2026-05-10 12:55 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords Thompson samplingbandit algorithmsgraph banditsregret boundsspectral methodsrecommender systemscomputational advertising

0 comments

The pith

Spectral Thompson sampling achieves a regret bound of order d sqrt(T log N) for graph-structured bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpectralTS, a Thompson sampling variant for bandits in which each choice is a node on a graph and neighboring nodes have similar expected payoffs. It proves that the algorithm's regret scales as d sqrt(T ln N) with high probability, where d is the graph's effective dimension that stays small on real-world instances and T, N are the horizon and number of choices. This bound is comparable to existing guarantees yet avoids the computational cost that grows directly with N, which matters for advertising and recommender systems that face thousands of options. Experiments on synthetic and real data confirm that the method remains competitive while running efficiently.

Core claim

In the graph bandit setting where the expected payoff of each arm is smooth across neighboring nodes, the SpectralTS algorithm attains a high-probability regret bound of d sqrt(T ln N). The quantity d is the effective dimension of the graph, T is the time horizon, and N is the total number of arms. This scaling matches the best known results for the problem while supplying a computationally lighter alternative to prior algorithms.

What carries the argument

Spectral Thompson Sampling (SpectralTS), which uses the graph's spectral structure to enforce payoff smoothness across neighboring nodes and to reduce the problem to an effective dimension d.

If this is right

Regret stays sublinear in T and grows only with the effective dimension d rather than the full number of arms N.
The algorithm applies to large-scale recommender systems and advertising where choices can be arranged on a graph.
Computational cost remains practical as N increases because the method avoids explicit dependence on the full arm count.
Performance on synthetic and real data matches or exceeds that of standard Thompson sampling and other baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spectral reduction could be applied to other bandit algorithms such as UCB when rewards are graph-smooth.
If the underlying graph must be learned from data, the regret bound would need an extra term that accounts for graph estimation error.
Graphs with very low effective dimension d would yield the largest practical savings compared with unstructured bandit methods.

Load-bearing premise

Neighboring nodes on the graph have similar expected payoffs, so that a small effective dimension d captures the structure.

What would settle it

A sequence of experiments on a fixed graph with verified payoff smoothness in which SpectralTS regret grows faster than order d sqrt(T ln N) for large T would contradict the bound.

Figures

Figures reproduced from arXiv: 2604.13739 by Michal Valko, Remi Munos, Shipra Agrawal, Tomas Kocak.

**Figure 1.** Figure 1: Barabasi-Albert random graph results ´ Furthermore, we performed the comparison of the algorithms on the MovieLens dataset (Lam and Herlocker, 2012) of the movie ratings. The graph in this dataset is the graph of 2019 movies with edges corresponding to the movie similarities. For each user we have a graph function, unknown to the algorithms, that assigns to each node, the rating of the particular user. A… view at source ↗

**Figure 2.** Figure 2: MovieLens data results 6 Conclusion We considered the spectral bandit setting with a reward function assumed to be smooth on a given graph and proposed Spectral Thompson Sampling (TS) for it. Our main contribution is stated in Theorem 1 where we prove that the regret bound scales with effective dimension d, typically much smaller than the ambient dimension D, which is the case of linear bandit algorithm… view at source ↗

read the original abstract

Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in the computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze SpectralTS algorithm for a bandit problem, where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has application both in recommender systems and advertising, the traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver the analysis showing that the regret of SpectralTS scales as d*sqrt(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d*sqrt(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpectralTS embeds graph bandits into a low-d spectral space and runs TS there, delivering a d sqrt(T ln N) regret bound that matches known rates while cutting computation for large N.

read the letter

The main thing here is a new algorithm that combines the spectral embedding of a graph Laplacian with Thompson sampling. It targets the setting where arms are nodes and expected rewards are smooth across edges, which shows up in recommenders and ads. The paper gives a high-probability regret bound of order d sqrt(T ln N), where d is the effective dimension from the spectrum, and claims this is comparable to prior work but scales better when N is huge and d stays small. That efficiency angle is the practical hook. The analysis adapts standard TS posterior concentration to the finite-dimensional spectral features without obvious extra looseness, and the stress-test note confirms no hidden circularity or bad assumptions in the scaling. The assumption that real graphs have small d is standard and reasonable, though the paper would benefit from more discussion on how d is picked or estimated from data in practice. The synthetic and real-world experiments are mentioned but not detailed in the abstract, so the empirical side is hard to judge from what's visible. This is solid work for the structured-bandit crowd. The math is grounded in existing TS tools plus spectral graph theory, the citation pattern looks clean, and there is no load-bearing fitting or internal contradiction. A serious editor should send it to referees; the core contribution is a clean algorithmic idea with an analysis that holds up on the terms given.

Referee Report

0 major / 3 minor

Summary. The paper introduces SpectralTS, a Thompson sampling algorithm for stochastic bandits where arms correspond to nodes of a graph and expected rewards are assumed smooth with respect to the graph structure. It defines an effective dimension d from the spectrum of the graph Laplacian (or normalized adjacency matrix) and provides an analysis establishing a high-probability regret bound of order d sqrt(T ln N). The algorithm is positioned as computationally more efficient than standard TS for large N while achieving regret comparable to known bounds; empirical results on synthetic and real-world data are also reported.

Significance. If the stated regret bound holds, the work supplies a theoretically supported, scalable alternative for graph-structured bandits that exploits low effective dimension to avoid poor scaling with N. The adaptation of posterior sampling and concentration arguments to a finite-dimensional spectral embedding is a clear strength, as is the explicit high-probability guarantee under the smoothness assumption. This is particularly relevant for recommender and advertising applications where graphs are large but spectrally compressible.

minor comments (3)

[§2] §2 (algorithm description): the precise form of the posterior update and the truncation to the d-dimensional spectral embedding should be stated explicitly, including how the covariance is maintained in the embedded space.
[§4] The proof sketch in §4 adapts standard TS tail bounds, but the union bound over the N arms and the dependence on the effective dimension d should be written out with the precise constants and failure probability to make the high-probability claim fully transparent.
[Experiments] Figure 1 and the real-world experiments: label the axes consistently with the regret definition used in the analysis (cumulative vs. per-step) and report the number of independent runs together with error bars.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work on SpectralTS and for recommending minor revision. The provided summary correctly captures the algorithm's regret scaling and computational advantages for graph-structured bandits.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives a high-probability regret bound of order d sqrt(T ln N) for SpectralTS by adapting standard Thompson Sampling concentration inequalities to a finite-dimensional spectral embedding of the graph Laplacian. The effective dimension d is defined directly from the eigenvalues of the graph operator, independently of the regret quantity. No step in the claimed derivation reduces by construction to a fitted parameter, a self-referential definition, or a load-bearing self-citation; the analysis relies on external sub-Gaussian tail bounds and union bounds that are not internal to the result. The derivation is therefore self-contained against standard probabilistic tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of payoff smoothness over graph neighbors and the existence of a small effective dimension d derived from the graph.

axioms (1)

domain assumption Expected payoffs of neighboring nodes in the graph are similar
Explicitly stated as the problem setting in the abstract.

pith-pipeline@v0.9.0 · 5483 in / 1049 out tokens · 41395 ms · 2026-05-10T12:55:33.208094+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page
[2]

Abbasi-Yadkori, Y.; P\' a l, D.; and Szepesv\' a ri, C. 2011. Improved Algorithms for Linear Stochastic Bandits . In Neural Information Processing Systems (NeurIPS)

work page 2011
[3]

Agrawal, S., and Goyal, N. 2012a. Analysis of Thompson Sampling for the multi-armed bandit problem . Conference on Learning Theory (COLT)

work page
[4]

Agrawal, S., and Goyal, N. 2012b. Thompson Sampling for Contextual Bandits with Linear Payoffs . CoRR, abs/1209.3352, http://arxiv.org/abs/1209.3352

work page Pith review arXiv
[5]

Agrawal, S., and Goyal, N. 2013a. Further Optimal Regret Bounds for Thompson Sampling . In International Conference on Artificial Intelligence and Statistics (AISTATS) , volume 31, 99--107

work page
[6]

Agrawal, S., and Goyal, N. 2013b. Thompson Sampling for Contextual Bandits with Linear Payoffs . In International Conference on Machine Learning (ICML)

work page
[7]

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem . Machine Learning 47(2-3):235--256

work page 2002
[8]

Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs . Journal of Machine Learning Research 3:397--422

work page 2002
[9]

Barab\' a si, A.-L., and Albert, R. 1999. Emergence of scaling in random networks . Science 286:11

work page 1999
[10]

Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples . Journal of Machine Learning Research 7:2399--2434

work page 2006
[11]

Chapelle, O., and Li, L. 2011. An Empirical Evaluation of Thompson Sampling . In Neural Information Processing Systems (NeurIPS)

work page 2011
[12]

P.; and Kakade, S

Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. The Price of Bandit Information for Online Optimization . In Neural Information Processing Systems (NeurIPS) . MIT Press

work page 2008
[13]

Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson Sampling: An Asymptotically Optimal Finite Time Analysis . Algorithmic Learning Theory (ALT)

work page 2012
[14]

Korda, N.; Kaufmann, E.; and Munos, R. 2013. Thompson Sampling for 1-Dimensional Exponential Family Bandits . In Neural Information Processing Systems (NeurIPS)

work page 2013
[15]

L.; and Peng, R

Koutis, I.; Miller, G. L.; and Peng, R. 2010. Approaching Optimality for Solving SDD Linear Systems . In IEEE Symposium on Foundations of Computer Science (FOCS) , 235--244. IEEE

work page 2010
[16]

Lam, S., and Herlocker, J. 2012. MovieLens 1M Dataset . http://www.grouplens.org/node/12

work page 2012
[17]

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation . International World Wide Web Conference (WWW)

work page 2010
[18]

C.; Korda, N.; Lee, A.; and Leslie, D

May, B. C.; Korda, N.; Lee, A.; and Leslie, D. S. 2012. Optimistic Bayesian sampling in contextual-bandit problems . Journal of Machine Learning Research 13(1):2069--2106

work page 2012
[19]

J., and Billsus, D

Pazzani, M. J., and Billsus, D. 2007. Content-Based Recommendation Systems . The adaptive web

work page 2007
[20]

Russo, D., and Van Roy , B. 2013. Eluder Dimension and the Sample Complexity of Optimistic Exploration . In Neural Information Processing Systems (NeurIPS)

work page 2013
[21]

Russo, D., and Van Roy , B. 2014. Learning to Optimize Via Posterior Sampling . Mathematics of Operations Research

work page 2014
[22]

Srinivas, N.; Krause, A.; Kakade, S.; and Seeger, M. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design . International Conference on Machine Learning (ICML) 1015--1022

work page 2010
[23]

Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples . Biometrika 25:285--294

work page 1933
[24]

Valko, M.; Munos, R.; Kveton, B.; and Koc\' a k, T. 2014. Spectral Bandits for Smooth Graph Functions . In International Conference on Machine Learning (ICML)

work page 2014
[25]

A.; and Guestrin, C

Yue, Y.; Hong, S. A.; and Guestrin, C. 2012. Hierarchical Exploration for Accelerating Contextual Bandits . In International Conference on Machine Learning (ICML) , 1895--1902. New York, NY, USA: ACM

work page 2012

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page

[2] [2]

Abbasi-Yadkori, Y.; P\' a l, D.; and Szepesv\' a ri, C. 2011. Improved Algorithms for Linear Stochastic Bandits . In Neural Information Processing Systems (NeurIPS)

work page 2011

[3] [3]

Agrawal, S., and Goyal, N. 2012a. Analysis of Thompson Sampling for the multi-armed bandit problem . Conference on Learning Theory (COLT)

work page

[4] [4]

Agrawal, S., and Goyal, N. 2012b. Thompson Sampling for Contextual Bandits with Linear Payoffs . CoRR, abs/1209.3352, http://arxiv.org/abs/1209.3352

work page Pith review arXiv

[5] [5]

Agrawal, S., and Goyal, N. 2013a. Further Optimal Regret Bounds for Thompson Sampling . In International Conference on Artificial Intelligence and Statistics (AISTATS) , volume 31, 99--107

work page

[6] [6]

Agrawal, S., and Goyal, N. 2013b. Thompson Sampling for Contextual Bandits with Linear Payoffs . In International Conference on Machine Learning (ICML)

work page

[7] [7]

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem . Machine Learning 47(2-3):235--256

work page 2002

[8] [8]

Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs . Journal of Machine Learning Research 3:397--422

work page 2002

[9] [9]

Barab\' a si, A.-L., and Albert, R. 1999. Emergence of scaling in random networks . Science 286:11

work page 1999

[10] [10]

Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples . Journal of Machine Learning Research 7:2399--2434

work page 2006

[11] [11]

Chapelle, O., and Li, L. 2011. An Empirical Evaluation of Thompson Sampling . In Neural Information Processing Systems (NeurIPS)

work page 2011

[12] [12]

P.; and Kakade, S

Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. The Price of Bandit Information for Online Optimization . In Neural Information Processing Systems (NeurIPS) . MIT Press

work page 2008

[13] [13]

Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson Sampling: An Asymptotically Optimal Finite Time Analysis . Algorithmic Learning Theory (ALT)

work page 2012

[14] [14]

Korda, N.; Kaufmann, E.; and Munos, R. 2013. Thompson Sampling for 1-Dimensional Exponential Family Bandits . In Neural Information Processing Systems (NeurIPS)

work page 2013

[15] [15]

L.; and Peng, R

Koutis, I.; Miller, G. L.; and Peng, R. 2010. Approaching Optimality for Solving SDD Linear Systems . In IEEE Symposium on Foundations of Computer Science (FOCS) , 235--244. IEEE

work page 2010

[16] [16]

Lam, S., and Herlocker, J. 2012. MovieLens 1M Dataset . http://www.grouplens.org/node/12

work page 2012

[17] [17]

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation . International World Wide Web Conference (WWW)

work page 2010

[18] [18]

C.; Korda, N.; Lee, A.; and Leslie, D

May, B. C.; Korda, N.; Lee, A.; and Leslie, D. S. 2012. Optimistic Bayesian sampling in contextual-bandit problems . Journal of Machine Learning Research 13(1):2069--2106

work page 2012

[19] [19]

J., and Billsus, D

Pazzani, M. J., and Billsus, D. 2007. Content-Based Recommendation Systems . The adaptive web

work page 2007

[20] [20]

Russo, D., and Van Roy , B. 2013. Eluder Dimension and the Sample Complexity of Optimistic Exploration . In Neural Information Processing Systems (NeurIPS)

work page 2013

[21] [21]

Russo, D., and Van Roy , B. 2014. Learning to Optimize Via Posterior Sampling . Mathematics of Operations Research

work page 2014

[22] [22]

Srinivas, N.; Krause, A.; Kakade, S.; and Seeger, M. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design . International Conference on Machine Learning (ICML) 1015--1022

work page 2010

[23] [23]

Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples . Biometrika 25:285--294

work page 1933

[24] [24]

Valko, M.; Munos, R.; Kveton, B.; and Koc\' a k, T. 2014. Spectral Bandits for Smooth Graph Functions . In International Conference on Machine Learning (ICML)

work page 2014

[25] [25]

A.; and Guestrin, C

Yue, Y.; Hong, S. A.; and Guestrin, C. 2012. Hierarchical Exploration for Accelerating Contextual Bandits . In International Conference on Machine Learning (ICML) , 1895--1902. New York, NY, USA: ACM

work page 2012