pith. sign in

arxiv: 2604.13739 · v1 · submitted 2026-04-15 · 💻 cs.LG · stat.ML

Spectral Thompson sampling

Pith reviewed 2026-05-10 12:55 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords Thompson samplingbandit algorithmsgraph banditsregret boundsspectral methodsrecommender systemscomputational advertising
0
0 comments X

The pith

Spectral Thompson sampling achieves a regret bound of order d sqrt(T log N) for graph-structured bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpectralTS, a Thompson sampling variant for bandits in which each choice is a node on a graph and neighboring nodes have similar expected payoffs. It proves that the algorithm's regret scales as d sqrt(T ln N) with high probability, where d is the graph's effective dimension that stays small on real-world instances and T, N are the horizon and number of choices. This bound is comparable to existing guarantees yet avoids the computational cost that grows directly with N, which matters for advertising and recommender systems that face thousands of options. Experiments on synthetic and real data confirm that the method remains competitive while running efficiently.

Core claim

In the graph bandit setting where the expected payoff of each arm is smooth across neighboring nodes, the SpectralTS algorithm attains a high-probability regret bound of d sqrt(T ln N). The quantity d is the effective dimension of the graph, T is the time horizon, and N is the total number of arms. This scaling matches the best known results for the problem while supplying a computationally lighter alternative to prior algorithms.

What carries the argument

Spectral Thompson Sampling (SpectralTS), which uses the graph's spectral structure to enforce payoff smoothness across neighboring nodes and to reduce the problem to an effective dimension d.

If this is right

  • Regret stays sublinear in T and grows only with the effective dimension d rather than the full number of arms N.
  • The algorithm applies to large-scale recommender systems and advertising where choices can be arranged on a graph.
  • Computational cost remains practical as N increases because the method avoids explicit dependence on the full arm count.
  • Performance on synthetic and real data matches or exceeds that of standard Thompson sampling and other baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral reduction could be applied to other bandit algorithms such as UCB when rewards are graph-smooth.
  • If the underlying graph must be learned from data, the regret bound would need an extra term that accounts for graph estimation error.
  • Graphs with very low effective dimension d would yield the largest practical savings compared with unstructured bandit methods.

Load-bearing premise

Neighboring nodes on the graph have similar expected payoffs, so that a small effective dimension d captures the structure.

What would settle it

A sequence of experiments on a fixed graph with verified payoff smoothness in which SpectralTS regret grows faster than order d sqrt(T ln N) for large T would contradict the bound.

Figures

Figures reproduced from arXiv: 2604.13739 by Michal Valko, Remi Munos, Shipra Agrawal, Tomas Kocak.

Figure 1
Figure 1. Figure 1: Barabasi-Albert random graph results ´ Furthermore, we performed the comparison of the algo￾rithms on the MovieLens dataset (Lam and Herlocker, 2012) of the movie ratings. The graph in this dataset is the graph of 2019 movies with edges corresponding to the movie sim￾ilarities. For each user we have a graph function, unknown to the algorithms, that assigns to each node, the rating of the particular user. A… view at source ↗
Figure 2
Figure 2. Figure 2: MovieLens data results 6 Conclusion We considered the spectral bandit setting with a reward func￾tion assumed to be smooth on a given graph and proposed Spectral Thompson Sampling (TS) for it. Our main contri￾bution is stated in Theorem 1 where we prove that the re￾gret bound scales with effective dimension d, typically much smaller than the ambient dimension D, which is the case of linear bandit algorithm… view at source ↗
read the original abstract

Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in the computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze SpectralTS algorithm for a bandit problem, where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has application both in recommender systems and advertising, the traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver the analysis showing that the regret of SpectralTS scales as d*sqrt(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d*sqrt(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces SpectralTS, a Thompson sampling algorithm for stochastic bandits where arms correspond to nodes of a graph and expected rewards are assumed smooth with respect to the graph structure. It defines an effective dimension d from the spectrum of the graph Laplacian (or normalized adjacency matrix) and provides an analysis establishing a high-probability regret bound of order d sqrt(T ln N). The algorithm is positioned as computationally more efficient than standard TS for large N while achieving regret comparable to known bounds; empirical results on synthetic and real-world data are also reported.

Significance. If the stated regret bound holds, the work supplies a theoretically supported, scalable alternative for graph-structured bandits that exploits low effective dimension to avoid poor scaling with N. The adaptation of posterior sampling and concentration arguments to a finite-dimensional spectral embedding is a clear strength, as is the explicit high-probability guarantee under the smoothness assumption. This is particularly relevant for recommender and advertising applications where graphs are large but spectrally compressible.

minor comments (3)
  1. [§2] §2 (algorithm description): the precise form of the posterior update and the truncation to the d-dimensional spectral embedding should be stated explicitly, including how the covariance is maintained in the embedded space.
  2. [§4] The proof sketch in §4 adapts standard TS tail bounds, but the union bound over the N arms and the dependence on the effective dimension d should be written out with the precise constants and failure probability to make the high-probability claim fully transparent.
  3. [Experiments] Figure 1 and the real-world experiments: label the axes consistently with the regret definition used in the analysis (cumulative vs. per-step) and report the number of independent runs together with error bars.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work on SpectralTS and for recommending minor revision. The provided summary correctly captures the algorithm's regret scaling and computational advantages for graph-structured bandits.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives a high-probability regret bound of order d sqrt(T ln N) for SpectralTS by adapting standard Thompson Sampling concentration inequalities to a finite-dimensional spectral embedding of the graph Laplacian. The effective dimension d is defined directly from the eigenvalues of the graph operator, independently of the regret quantity. No step in the claimed derivation reduces by construction to a fitted parameter, a self-referential definition, or a load-bearing self-citation; the analysis relies on external sub-Gaussian tail bounds and union bounds that are not internal to the result. The derivation is therefore self-contained against standard probabilistic tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of payoff smoothness over graph neighbors and the existence of a small effective dimension d derived from the graph.

axioms (1)
  • domain assumption Expected payoffs of neighboring nodes in the graph are similar
    Explicitly stated as the problem setting in the abstract.

pith-pipeline@v0.9.0 · 5483 in / 1049 out tokens · 41395 ms · 2026-05-10T12:55:33.208094+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

  2. [2]

    Abbasi-Yadkori, Y.; P\' a l, D.; and Szepesv\' a ri, C. 2011. Improved Algorithms for Linear Stochastic Bandits . In Neural Information Processing Systems (NeurIPS)

  3. [3]

    Agrawal, S., and Goyal, N. 2012a. Analysis of Thompson Sampling for the multi-armed bandit problem . Conference on Learning Theory (COLT)

  4. [4]

    Agrawal, S., and Goyal, N. 2012b. Thompson Sampling for Contextual Bandits with Linear Payoffs . CoRR, abs/1209.3352, http://arxiv.org/abs/1209.3352

  5. [5]

    Agrawal, S., and Goyal, N. 2013a. Further Optimal Regret Bounds for Thompson Sampling . In International Conference on Artificial Intelligence and Statistics (AISTATS) , volume 31, 99--107

  6. [6]

    Agrawal, S., and Goyal, N. 2013b. Thompson Sampling for Contextual Bandits with Linear Payoffs . In International Conference on Machine Learning (ICML)

  7. [7]

    Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem . Machine Learning 47(2-3):235--256

  8. [8]

    Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs . Journal of Machine Learning Research 3:397--422

  9. [9]

    Barab\' a si, A.-L., and Albert, R. 1999. Emergence of scaling in random networks . Science 286:11

  10. [10]

    Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples . Journal of Machine Learning Research 7:2399--2434

  11. [11]

    Chapelle, O., and Li, L. 2011. An Empirical Evaluation of Thompson Sampling . In Neural Information Processing Systems (NeurIPS)

  12. [12]

    P.; and Kakade, S

    Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. The Price of Bandit Information for Online Optimization . In Neural Information Processing Systems (NeurIPS) . MIT Press

  13. [13]

    Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson Sampling: An Asymptotically Optimal Finite Time Analysis . Algorithmic Learning Theory (ALT)

  14. [14]

    Korda, N.; Kaufmann, E.; and Munos, R. 2013. Thompson Sampling for 1-Dimensional Exponential Family Bandits . In Neural Information Processing Systems (NeurIPS)

  15. [15]

    L.; and Peng, R

    Koutis, I.; Miller, G. L.; and Peng, R. 2010. Approaching Optimality for Solving SDD Linear Systems . In IEEE Symposium on Foundations of Computer Science (FOCS) , 235--244. IEEE

  16. [16]

    Lam, S., and Herlocker, J. 2012. MovieLens 1M Dataset . http://www.grouplens.org/node/12

  17. [17]

    Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation . International World Wide Web Conference (WWW)

  18. [18]

    C.; Korda, N.; Lee, A.; and Leslie, D

    May, B. C.; Korda, N.; Lee, A.; and Leslie, D. S. 2012. Optimistic Bayesian sampling in contextual-bandit problems . Journal of Machine Learning Research 13(1):2069--2106

  19. [19]

    J., and Billsus, D

    Pazzani, M. J., and Billsus, D. 2007. Content-Based Recommendation Systems . The adaptive web

  20. [20]

    Russo, D., and Van Roy , B. 2013. Eluder Dimension and the Sample Complexity of Optimistic Exploration . In Neural Information Processing Systems (NeurIPS)

  21. [21]

    Russo, D., and Van Roy , B. 2014. Learning to Optimize Via Posterior Sampling . Mathematics of Operations Research

  22. [22]

    Srinivas, N.; Krause, A.; Kakade, S.; and Seeger, M. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design . International Conference on Machine Learning (ICML) 1015--1022

  23. [23]

    Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples . Biometrika 25:285--294

  24. [24]

    Valko, M.; Munos, R.; Kveton, B.; and Koc\' a k, T. 2014. Spectral Bandits for Smooth Graph Functions . In International Conference on Machine Learning (ICML)

  25. [25]

    A.; and Guestrin, C

    Yue, Y.; Hong, S. A.; and Guestrin, C. 2012. Hierarchical Exploration for Accelerating Contextual Bandits . In International Conference on Machine Learning (ICML) , 1895--1902. New York, NY, USA: ACM