Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces

Anuj Sharma; Ibne Farabi Shihab; Joyanta Jyoti Mondal

arxiv: 2606.27917 · v1 · pith:LDYBOMGOnew · submitted 2026-06-26 · 💻 cs.LG

Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces

Joyanta Jyoti Mondal , Ibne Farabi Shihab , Anuj Sharma This is my paper

Pith reviewed 2026-06-29 04:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords contextual banditsgraph structurespectral dimensionality reductionregret boundslinear UCBapproximate smoothnessnoisy graphs

0 comments

The pith

Projecting arm features onto a graph's low-frequency spectral subspace produces the first Õ(k√T) regret bound for contextual bandits, replacing dependence on ambient dimension d with dependence on subspace dimension k.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when arms lie on a graph and rewards are approximately smooth across connected arms, one can project the feature vectors into the k-dimensional low-frequency part of the graph spectrum and then run linear UCB only in that reduced space. The resulting algorithm avoids paying the full price of exploring in the original d dimensions. A central technical step is showing that any high-frequency reward component does not automatically produce linear-in-T regret; its cost is instead governed by how much it actually affects the specific sequence of arms that get played. The same analysis supplies an explicit additive penalty when the graph itself must be estimated from data or when smoothness holds only approximately. This matters for recommendation and retrieval tasks whose arms naturally form graphs, because the usual dimensionality-reduction methods discard that structure and therefore explore more than necessary.

Core claim

GraphDR-LinUCB projects each arm's feature vector onto the k-dimensional low-frequency spectral subspace of the graph Laplacian and then executes linear UCB inside the projected space; under the assumption that the reward function is approximately smooth on the graph, the regret is bounded by Õ(k√T) rather than the usual Õ(d√T), and a perturbation argument extends the bound to noisy or estimated graphs while charging an explicit term for any mismatch between the true reward smoothness and the chosen subspace plus any error in the estimated eigenvectors.

What carries the argument

GraphDR-LinUCB, the algorithm that projects arm features onto the graph's low-frequency spectral subspace before running linear UCB, together with the perturbation argument that controls the effect of noisy eigenspaces.

If this is right

Regret scales with the chosen subspace dimension k instead of the original feature dimension d.
High-frequency reward components incur a path-dependent cost rather than an automatic linear penalty.
An explicit additive term appears for any mismatch between reward smoothness and the chosen subspace or for error in the estimated graph eigenvectors.
A simple spectral comparison operator predicts which of two candidate subspaces will produce lower regret on a given dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same path-dependent costing argument could be applied to other structured bandit problems whose actions lie on a manifold or graph.
The predictive spectral comparison might allow automatic selection of the best reducer on new data without additional validation runs.
If the graph is only approximately known, the explicit penalty terms give a concrete budget for how much graph estimation error can be tolerated before the reduction loses its advantage.

Load-bearing premise

The reward function is approximately smooth on the graph, so the realized cost of any high-frequency component is set by its effect along the actual sequence of arms played rather than by a worst-case linear-in-T penalty that ignores the path.

What would settle it

An experiment in which the high-frequency reward component is shown to produce a linear-in-T penalty no matter which sequence of arms is played would falsify the claimed regret bound.

Figures

Figures reproduced from arXiv: 2606.27917 by Anuj Sharma, Ibne Farabi Shihab, Joyanta Jyoti Mondal.

**Figure 1.** Figure 1: GraphDR-LinUCB. The graph determines a bottom-k low-frequency Laplacian eigenbasis Ebk; arm features are projected to R k and LinUCB runs there. Only the high-frequency residual r⊥,k left outside the subspace can hurt the learner; its effect is governed by the realized graph quantities SRLT and CRWT . The shuffle control permutes node labels before computing Ebk, preserving the projected dimension but dest… view at source ↗

**Figure 2.** Figure 2: Main synthetic comparison with the graph [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Dimension scaling at fixed k ⋆ = 5. GraphDRLinUCB remains nearly flat as d = n grows; full LinUCB grows with ambient dimension. Random GraphDR-shuffled (control) JL+LinUCB PCA+LinUCB LinUCB-full GraphDR+LinUCB (ours) Method 10 2 10 3 Cumulativ e re gret R(T) 2613 2546 2364 2297 418 18 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Second graph family: random geometric graph. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

Contextual bandits with graph-structured arms arise in recommendation, citation retrieval, and social advertising, where arms connected on a graph tend to share reward signal. Standard dimensionality reduction ignores this structure, inflating exploration cost by a factor of $d/k$. We propose GraphDR-LinUCB, which projects arm features onto the graph's low-frequency spectral subspace and runs linear UCB in the resulting $k$-dimensional space. We prove the first $\wtO(k\sqrt{T})$ regret bound for spectral-projection-based contextual bandits, reducing dimension dependence from $d$ to $k$; a perturbation argument extends this to noisy graphs, with an explicit penalty for reward-smoothness mismatch and graph-estimation error. Our central theoretical finding is that the high-frequency reward component need not incur a worst-case linear-in-$T$ penalty: its actual cost depends on its realized impact along the played path, not on its total energy. A simple spectral comparison between subspaces ($\Gamma_k$) predicts which reducer wins on a given dataset, correctly calling five of six real-dataset outcomes without any fitted threshold. Across a synthetic benchmark and six real datasets (MovieLens, Amazon, LastFM, ogbn-arxiv, MIND), GraphDR-LinUCB reduces cumulative regret by $15\times$ over full-dimensional LinUCB and outperforms competing graph-aware methods on five of six; the single failure is precisely where the graph's spectral subspace is misaligned with the reward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a Õ(k√T) regret bound for spectral projection in graph contextual bandits by treating high-frequency reward costs as path-dependent rather than worst-case, plus a simple Γ_k predictor that matches most empirical outcomes.

read the letter

The central new piece is the claim that projecting onto the graph's k low-frequency eigenvectors lets you replace the usual d factor in contextual bandit regret with k, and that this still holds under approximate smoothness and noisy graphs via a perturbation argument. The twist is that the high-frequency component only incurs cost based on its actual effect along the sequence of played arms, not a blanket linear-in-T penalty. They also introduce an unfitted spectral comparator Γ_k that correctly predicts which reducer wins on five of six real datasets.

The empirical results are straightforward and useful: GraphDR-LinUCB cuts regret by roughly 15× versus full-dimensional LinUCB on MovieLens, Amazon, LastFM, ogbn-arxiv, and MIND, and beats the competing graph-aware baselines on five of the six. The single loss case lines up exactly with the subspace misalignment that Γ_k flags. That kind of out-of-sample predictor is rare and practical.

The soft spot is the path-dependent bounding step itself. The abstract states that the realized impact along the played path controls the high-frequency cost, but the precise definition of approximate smoothness, the exact statement of the main theorem, and the derivation that keeps perturbation terms sub-dominant are not visible here. If those steps introduce hidden d or T-linear factors under realistic noise, the claimed rate would not hold. The experiments do not include ablations on the smoothness parameter, so it is hard to judge robustness.

This is for researchers working on bandits with graph side information in recommendation or citation settings. A reader who needs dimension reduction that respects known structure will find the bound and the predictor worth checking. It deserves peer review because the regret claim is new on its face and the experiments are on real data, even if the proof needs careful verification.

Referee Report

1 major / 0 minor

Summary. The paper introduces GraphDR-LinUCB for contextual bandits whose arms lie on a graph. It projects arm features onto the graph's k-dimensional low-frequency eigenspace and runs linear UCB in the reduced space. The central theoretical claim is an Õ(k√T) regret bound (first such for spectral-projection methods), obtained by showing that the high-frequency reward component incurs only path-dependent cost under approximate smoothness rather than a worst-case linear-in-T penalty; a perturbation argument extends the bound to noisy graphs with explicit penalties for smoothness mismatch and estimation error. A spectral-comparison statistic Γ_k is shown to predict which reducer wins on a given instance (correct on 5/6 real datasets without fitted thresholds). Experiments on a synthetic benchmark and six real datasets (MovieLens, Amazon, LastFM, ogbn-arxiv, MIND) report 15× regret reduction versus full-dimensional LinUCB and outperformance versus competing graph-aware baselines on five of six instances.

Significance. If the path-dependent high-frequency analysis is valid, the result would be significant: it supplies the first explicit dimension reduction from d to k in the regret of spectral-projection contextual bandits and supplies a practical, threshold-free predictor Γ_k together with broad empirical validation. The explicit perturbation terms for graph noise and smoothness mismatch are also a strength, as is the fact that the single empirical failure is explained by misalignment of the spectral subspace with the reward.

major comments (1)

[Abstract] Abstract (central claim): the Õ(k√T) regret bound rests on converting the path-dependent cost of the high-frequency component (under approximate smoothness) into a sub-dominant term whose perturbation penalties for graph-estimation error and smoothness mismatch remain o(√T). The abstract states this conversion explicitly but supplies neither the definition of approximate smoothness nor the derivation steps that bound the realized path cost without re-introducing a d- or T-linear factor; this is load-bearing for the dimension-reduction claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on the central claim. The comment correctly notes that the abstract is high-level; we address it point-by-point below and will revise the abstract accordingly while preserving the manuscript's technical content.

read point-by-point responses

Referee: [Abstract] Abstract (central claim): the Õ(k√T) regret bound rests on converting the path-dependent cost of the high-frequency component (under approximate smoothness) into a sub-dominant term whose perturbation penalties for graph-estimation error and smoothness mismatch remain o(√T). The abstract states this conversion explicitly but supplies neither the definition of approximate smoothness nor the derivation steps that bound the realized path cost without re-introducing a d- or T-linear factor; this is load-bearing for the dimension-reduction claim.

Authors: We agree the abstract omits the explicit definition and derivation steps, as is conventional for abstracts. Approximate smoothness is defined in Definition 3.1: for reward function f, the high-frequency projection satisfies ||P_{>k} f||_G <= epsilon, where the graph Laplacian controls the decay. The path-dependent cost is formalized in Lemma 4.2: the high-frequency regret term equals sum_{t=1}^T <P_{>k} f, x_t>^2 / ||x_t||^2, which is bounded by the realized path energy rather than worst-case ||P_{>k} f|| * T. Under the smoothness assumption this sum is o(T) and, after the perturbation analysis in Theorem 5.1 (accounting for graph noise delta and mismatch eta), contributes only o(sqrt(T)) to the total regret, yielding the claimed Õ(k sqrt(T)) bound. The full derivation appears in Section 4. To make the abstract self-contained we will add a parenthetical definition of approximate smoothness and a citation to Theorem 4.3. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no reduction to inputs by construction

full rationale

The abstract and description present a theoretical proof establishing an Õ(k√T) regret bound via spectral projection onto the graph's low-frequency subspace, followed by a perturbation argument for noisy graphs and smoothness mismatch. The path-dependent cost under approximate smoothness is stated as the key theoretical finding (high-frequency component cost depends on realized impact along the played path), but this is framed as a derived property of the assumption rather than a definitional tautology or fitted input renamed as prediction. Γ_k is a simple spectral comparison without fitted thresholds that matches dataset outcomes, supplying independent grounding. No self-citation load-bearing steps, uniqueness theorems imported from authors, or ansatz smuggling via citation are present in the provided material. The central claim retains independent mathematical content from the stated assumptions and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption of approximate reward smoothness with respect to the graph and on the technical assumption that the low-frequency subspace can be estimated with controlled error; no free parameters or new invented entities are introduced in the abstract.

axioms (2)

domain assumption Approximate smoothness of the reward function on the graph
Invoked to ensure the high-frequency component incurs only path-dependent rather than worst-case linear regret.
domain assumption Controlled perturbation of the graph eigenspaces
Required for the noisy-graph extension and explicit penalty term.

pith-pipeline@v0.9.1-grok · 5810 in / 1492 out tokens · 53055 ms · 2026-06-29T04:21:22.468461+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 20 canonical work pages · 7 internal anchors

[1]

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems

2011
[2]

Agarwal, A.; Luo, H.; Neyshabur, B.; and Schapire, R. E. 2017. Corralling a band of bandit algorithms. In Conference on Learning Theory

2017
[3]

Auer, P. 2002. Using confidence bounds for exploitation-explorationtrade-offs.JournalofMachine Learning Research, 3: 397–422

2002
[4]

LaplacianEigenmaps forDimensionalityReductionandDataRepresentation

Belkin,M.;andNiyogi,P.2003. LaplacianEigenmaps forDimensionalityReductionandDataRepresentation. Neural Computation, 15(6): 1373–1396

2003
[5]

Cesa-Bianchi, N.; Gentile, C.; and Zappella, G. 2013. A gang of bandits. InAdvances in Neural Information Processing Systems

2013
[6]

Chapelle, O.; and Li, L. 2011. An empirical evalua- tion of Thompson sampling. InAdvances in Neural Information Processing Systems

2011
[7]

Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual Bandits with Linear Payoff Functions. In International Conference on Artificial Intelligence and Statistics

2011
[8]

Dai, Y.; Golrezaei, N.; and Jaillet, P. 2026. Policy Re- gretforEmbeddingModelRouting:ContextualBandits with Low-Rank Experts. arXiv:2606.14929

work page arXiv 2026
[9]

In Conference on Learning Theory

Dani,V.;Hayes,T.P.;andKakade,S.M.2008.Stochas- tic Linear Optimization under Bandit Feedback. In Conference on Learning Theory

2008
[10]

Davis, C.; and Kahan, W. M. 1970. The rotation of eigenvectors by a perturbation. III.SIAM Journal on Numerical Analysis, 7(1): 1–46

1970
[11]

Dudík, M.; Langford, J.; and Li, L. 2011. Doubly ro- bust policy evaluation and learning. InInternational Conference on Machine Learning

2011
[12]

J.; Gentile, C.; Mohri, M.; and Zimmert, J

Foster, D. J.; Gentile, C.; Mohri, M.; and Zimmert, J. 2020. Adapting to misspecification in contextual bandits. InAdvancesinNeuralInformationProcessing Systems

2020
[13]

Gentile, C.; Li, S.; and Zappella, G. 2014. Online clustering of bandits. InInternational Conference on Machine Learning

2014
[14]

Ghojogh, B.; Ghodsi, A.; Karray, F.; and Crowley, M
[15]

Laplacian-Based Dimensionality Reduction In- cluding Spectral Clustering, Laplacian Eigenmap, Lo- cality Preserving Projection, Graph Embedding, and DiffusionMap:TutorialandSurvey.arXiv:2106.02154

work page arXiv
[16]

Inductive Representation Learning on Large Graphs

Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017. Inductive Representation Learning on Large Graphs. arXiv:1706.02216

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020. Open Graph Benchmark:DatasetsforMachineLearningonGraphs. arXiv:2005.00687

work page arXiv 2020
[18]

Huang, R.; and Huang, Z. 2025. Nearly Tight Bounds for Cross-Learning Contextual Bandits with Graphical Feedback.arXiv preprint arXiv:2502.04678

work page arXiv 2025
[19]

Jamshidi, F.; Shahverdikondori, M.; and Kiyavash, N
[20]

arXiv:2503.07555

Graph-Dependent Regret Bounds in Multi- Armed Bandits with Interference. arXiv:2503.07555

work page arXiv
[21]

Jedra,Y.;Réveillard,W.;Stojanovic,S.;andProutiere, A. 2024. Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery. arXiv:2402.15739

work page arXiv 2024
[22]

Jun, K.-S.; Willett, R.; Wright, S.; and Nowak, R
[23]

Bilinear Bandits with Low-rank Structure

Bilinear Bandits with Low-rank Structure. arXiv:1901.02470

work page internal anchor Pith review Pith/arXiv arXiv 1901
[24]

Khosravi, H.; and Huo, X. 2026. Catching a Mov- ing Subspace: Low-Rank Bandits Beyond Stationarity. arXiv:2605.20269

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Semi-Supervised Classification with Graph Convolutional Networks

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

2020.Bandit Algo- rithms

Lattimore, T.; and Szepesvári, C. 2020.Bandit Algo- rithms. Cambridge University Press

2020
[27]

Lattimore, T.; Szepesvári, C.; and Weisz, G. 2019. Learning with good feature representations in bandits and in RL with a generative model. InInternational Conference on Machine Learning

2019
[28]

A contextual-bandit approach to personalized news ar- ticle recommendation

Li,L.;Chu,W.;Langford,J.;andSchapire,R.E.2010. A contextual-bandit approach to personalized news ar- ticle recommendation. InInternational Conference on World Wide Web

2010
[29]

Mannor, S.; and Shamir, O. 2011. From Ban- dits to Experts: On the Value of Side-Observations. arXiv:1106.2436

work page internal anchor Pith review Pith/arXiv arXiv 2011
[30]

Y.; Jordan, M

Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2001. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems

2001
[31]

Pacchiano, A.; Phan, M.; Abbasi-Yadkori, Y.; Rao, A.; Zimmert, J.; Lattimore, T.; and Szepesvari, C. 2020. ModelSelectioninContextualStochasticBanditProb- lems. arXiv:2003.01704

work page arXiv 2020
[32]

InAmericanControlConference(ACC)

Paschalidis,P.;Zhang,R.;andLi,N.2024.Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Re- gretAnalysis. InAmericanControlConference(ACC)

2024
[33]

Qi, H.; Guo, F.-Y.; Zhu, L.; Zhang, Q.; and Li, X
[34]

Graph Feedback Bandits on Similar Arms: With and Without Graph Structures.arXiv preprint arXiv:2501.14314

work page arXiv
[35]

Qiu, H.; Zhang, M.; and Cesa-Bianchi, N. 2026. Near-Optimal Regret for Distributed Adversarial Bandits: A Black-Box Approach.arXiv preprint arXiv:2602.06404

work page arXiv 2026
[36]

Russo,D.;VanRoy,B.;Kazerouni,A.;Osband,I.;and Wen, Z. 2017. A Tutorial on Thompson Sampling. arXiv:1707.02038

work page arXiv 2017
[37]

I.; Narang, S

Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs.IEEE Signal Processing Magazine, 30(3): 83–98

2013
[38]

Spectral bandits for smooth graph functions

Valko,M.;Munos,R.;Kveton,B.;andKocák,T.2014. Spectral bandits for smooth graph functions. InInter- national Conference on Machine Learning

2014
[39]

von Luxburg, U. 2007. A Tutorial on Spectral Cluster- ing. arXiv:0711.0189

work page internal anchor Pith review Pith/arXiv arXiv 2007
[40]

Wang, Y.; Li, J.; Kang, Y.; Gao, S.; and Xiao, Z. 2025. GeneralizedLow-RankMatrixContextualBanditswith Graph Information. arXiv:2507.17528

work page arXiv 2025
[41]

Wen, D.; Yin, H.; Zhang, X.; Zhao, P.; Zhang, L.; and Wei, Z. 2024. Revisiting Matrix Sketching in Linear Bandits:AchievingSublinearRegretviaDyadicBlock Sketching.arXiv preprint arXiv:2410.10258

work page arXiv 2024
[42]

Yang, K.; Toni, L.; and Dong, X. 2020. Laplacian- regularized graph bandits: Algorithms and theoretical analysis. InInternational Conference on Artificial In- telligence and Statistics

2020
[43]

Contextual Bandits with Random Projection

Yu,X.2019. ContextualBanditswithRandomProjec- tion. arXiv:1903.08600

work page internal anchor Pith review Pith/arXiv arXiv 2019
[44]

Zhao, H.; Ye, C.; Gu, Q.; and Zhang, T. 2024. Sharp Analysis for KL-Regularized Contextual Bandits and RLHF. arXiv:2411.04625. Proofs This appendix contains the proofs of all results stated in the Theory section, in the order they appear. Each proof begins withabriefsketchofthekeysteptoorientbeforetheformal argument. Proof of Theorem 1. Sketch.The project...

work page arXiv 2024
[45]

BuildbLfrom the training graph and computebEk
[46]

Fitapilotrewardmodel bθfromadisjointexplorationpre- fix, logged data, or cross-fitting (a node reward estimate in the one-hot case; a ridge or doubly robust model with general features)
[47]

Computebr⊥,k = (I− bΠk)bθ
[48]

Estimate [CRWk on held-out candidate sets by averaging the range ofx⊤ t,abr⊥,k over candidates
[49]

EstimatedSRLk on a short simulated or logged prefix
[50]

Se- lect GraphDR whenΓk >0; otherwise prefer PCA, full LinUCB, or a Laplacian-regularized baseline

Compute the subspace-capture marginΓk = U ⊤ k bθ 2 2 − P ⊤bθ 2 2 against the competing reducer’s basisP. Se- lect GraphDR whenΓk >0; otherwise prefer PCA, full LinUCB, or a Laplacian-regularized baseline. Full Diagnostic-Validation Protocol The protocol behind the predictive diagnostic evaluation is specified here in full so that the negative primary resu...

[1] [1]

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems

2011

[2] [2]

Agarwal, A.; Luo, H.; Neyshabur, B.; and Schapire, R. E. 2017. Corralling a band of bandit algorithms. In Conference on Learning Theory

2017

[3] [3]

Auer, P. 2002. Using confidence bounds for exploitation-explorationtrade-offs.JournalofMachine Learning Research, 3: 397–422

2002

[4] [4]

LaplacianEigenmaps forDimensionalityReductionandDataRepresentation

Belkin,M.;andNiyogi,P.2003. LaplacianEigenmaps forDimensionalityReductionandDataRepresentation. Neural Computation, 15(6): 1373–1396

2003

[5] [5]

Cesa-Bianchi, N.; Gentile, C.; and Zappella, G. 2013. A gang of bandits. InAdvances in Neural Information Processing Systems

2013

[6] [6]

Chapelle, O.; and Li, L. 2011. An empirical evalua- tion of Thompson sampling. InAdvances in Neural Information Processing Systems

2011

[7] [7]

Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual Bandits with Linear Payoff Functions. In International Conference on Artificial Intelligence and Statistics

2011

[8] [8]

Dai, Y.; Golrezaei, N.; and Jaillet, P. 2026. Policy Re- gretforEmbeddingModelRouting:ContextualBandits with Low-Rank Experts. arXiv:2606.14929

work page arXiv 2026

[9] [9]

In Conference on Learning Theory

Dani,V.;Hayes,T.P.;andKakade,S.M.2008.Stochas- tic Linear Optimization under Bandit Feedback. In Conference on Learning Theory

2008

[10] [10]

Davis, C.; and Kahan, W. M. 1970. The rotation of eigenvectors by a perturbation. III.SIAM Journal on Numerical Analysis, 7(1): 1–46

1970

[11] [11]

Dudík, M.; Langford, J.; and Li, L. 2011. Doubly ro- bust policy evaluation and learning. InInternational Conference on Machine Learning

2011

[12] [12]

J.; Gentile, C.; Mohri, M.; and Zimmert, J

Foster, D. J.; Gentile, C.; Mohri, M.; and Zimmert, J. 2020. Adapting to misspecification in contextual bandits. InAdvancesinNeuralInformationProcessing Systems

2020

[13] [13]

Gentile, C.; Li, S.; and Zappella, G. 2014. Online clustering of bandits. InInternational Conference on Machine Learning

2014

[14] [14]

Ghojogh, B.; Ghodsi, A.; Karray, F.; and Crowley, M

[15] [15]

Laplacian-Based Dimensionality Reduction In- cluding Spectral Clustering, Laplacian Eigenmap, Lo- cality Preserving Projection, Graph Embedding, and DiffusionMap:TutorialandSurvey.arXiv:2106.02154

work page arXiv

[16] [16]

Inductive Representation Learning on Large Graphs

Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017. Inductive Representation Learning on Large Graphs. arXiv:1706.02216

work page internal anchor Pith review Pith/arXiv arXiv 2017

[17] [17]

Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020. Open Graph Benchmark:DatasetsforMachineLearningonGraphs. arXiv:2005.00687

work page arXiv 2020

[18] [18]

Huang, R.; and Huang, Z. 2025. Nearly Tight Bounds for Cross-Learning Contextual Bandits with Graphical Feedback.arXiv preprint arXiv:2502.04678

work page arXiv 2025

[19] [19]

Jamshidi, F.; Shahverdikondori, M.; and Kiyavash, N

[20] [20]

arXiv:2503.07555

Graph-Dependent Regret Bounds in Multi- Armed Bandits with Interference. arXiv:2503.07555

work page arXiv

[21] [21]

Jedra,Y.;Réveillard,W.;Stojanovic,S.;andProutiere, A. 2024. Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery. arXiv:2402.15739

work page arXiv 2024

[22] [22]

Jun, K.-S.; Willett, R.; Wright, S.; and Nowak, R

[23] [23]

Bilinear Bandits with Low-rank Structure

Bilinear Bandits with Low-rank Structure. arXiv:1901.02470

work page internal anchor Pith review Pith/arXiv arXiv 1901

[24] [24]

Khosravi, H.; and Huo, X. 2026. Catching a Mov- ing Subspace: Low-Rank Bandits Beyond Stationarity. arXiv:2605.20269

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

Semi-Supervised Classification with Graph Convolutional Networks

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907

work page internal anchor Pith review Pith/arXiv arXiv 2017

[26] [26]

2020.Bandit Algo- rithms

Lattimore, T.; and Szepesvári, C. 2020.Bandit Algo- rithms. Cambridge University Press

2020

[27] [27]

Lattimore, T.; Szepesvári, C.; and Weisz, G. 2019. Learning with good feature representations in bandits and in RL with a generative model. InInternational Conference on Machine Learning

2019

[28] [28]

A contextual-bandit approach to personalized news ar- ticle recommendation

Li,L.;Chu,W.;Langford,J.;andSchapire,R.E.2010. A contextual-bandit approach to personalized news ar- ticle recommendation. InInternational Conference on World Wide Web

2010

[29] [29]

Mannor, S.; and Shamir, O. 2011. From Ban- dits to Experts: On the Value of Side-Observations. arXiv:1106.2436

work page internal anchor Pith review Pith/arXiv arXiv 2011

[30] [30]

Y.; Jordan, M

Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2001. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems

2001

[31] [31]

Pacchiano, A.; Phan, M.; Abbasi-Yadkori, Y.; Rao, A.; Zimmert, J.; Lattimore, T.; and Szepesvari, C. 2020. ModelSelectioninContextualStochasticBanditProb- lems. arXiv:2003.01704

work page arXiv 2020

[32] [32]

InAmericanControlConference(ACC)

Paschalidis,P.;Zhang,R.;andLi,N.2024.Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Re- gretAnalysis. InAmericanControlConference(ACC)

2024

[33] [33]

Qi, H.; Guo, F.-Y.; Zhu, L.; Zhang, Q.; and Li, X

[34] [34]

Graph Feedback Bandits on Similar Arms: With and Without Graph Structures.arXiv preprint arXiv:2501.14314

work page arXiv

[35] [35]

Qiu, H.; Zhang, M.; and Cesa-Bianchi, N. 2026. Near-Optimal Regret for Distributed Adversarial Bandits: A Black-Box Approach.arXiv preprint arXiv:2602.06404

work page arXiv 2026

[36] [36]

Russo,D.;VanRoy,B.;Kazerouni,A.;Osband,I.;and Wen, Z. 2017. A Tutorial on Thompson Sampling. arXiv:1707.02038

work page arXiv 2017

[37] [37]

I.; Narang, S

Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs.IEEE Signal Processing Magazine, 30(3): 83–98

2013

[38] [38]

Spectral bandits for smooth graph functions

Valko,M.;Munos,R.;Kveton,B.;andKocák,T.2014. Spectral bandits for smooth graph functions. InInter- national Conference on Machine Learning

2014

[39] [39]

von Luxburg, U. 2007. A Tutorial on Spectral Cluster- ing. arXiv:0711.0189

work page internal anchor Pith review Pith/arXiv arXiv 2007

[40] [40]

Wang, Y.; Li, J.; Kang, Y.; Gao, S.; and Xiao, Z. 2025. GeneralizedLow-RankMatrixContextualBanditswith Graph Information. arXiv:2507.17528

work page arXiv 2025

[41] [41]

Wen, D.; Yin, H.; Zhang, X.; Zhao, P.; Zhang, L.; and Wei, Z. 2024. Revisiting Matrix Sketching in Linear Bandits:AchievingSublinearRegretviaDyadicBlock Sketching.arXiv preprint arXiv:2410.10258

work page arXiv 2024

[42] [42]

Yang, K.; Toni, L.; and Dong, X. 2020. Laplacian- regularized graph bandits: Algorithms and theoretical analysis. InInternational Conference on Artificial In- telligence and Statistics

2020

[43] [43]

Contextual Bandits with Random Projection

Yu,X.2019. ContextualBanditswithRandomProjec- tion. arXiv:1903.08600

work page internal anchor Pith review Pith/arXiv arXiv 2019

[44] [44]

Zhao, H.; Ye, C.; Gu, Q.; and Zhang, T. 2024. Sharp Analysis for KL-Regularized Contextual Bandits and RLHF. arXiv:2411.04625. Proofs This appendix contains the proofs of all results stated in the Theory section, in the order they appear. Each proof begins withabriefsketchofthekeysteptoorientbeforetheformal argument. Proof of Theorem 1. Sketch.The project...

work page arXiv 2024

[45] [45]

BuildbLfrom the training graph and computebEk

[46] [46]

Fitapilotrewardmodel bθfromadisjointexplorationpre- fix, logged data, or cross-fitting (a node reward estimate in the one-hot case; a ridge or doubly robust model with general features)

[47] [47]

Computebr⊥,k = (I− bΠk)bθ

[48] [48]

Estimate [CRWk on held-out candidate sets by averaging the range ofx⊤ t,abr⊥,k over candidates

[49] [49]

EstimatedSRLk on a short simulated or logged prefix

[50] [50]

Se- lect GraphDR whenΓk >0; otherwise prefer PCA, full LinUCB, or a Laplacian-regularized baseline

Compute the subspace-capture marginΓk = U ⊤ k bθ 2 2 − P ⊤bθ 2 2 against the competing reducer’s basisP. Se- lect GraphDR whenΓk >0; otherwise prefer PCA, full LinUCB, or a Laplacian-regularized baseline. Full Diagnostic-Validation Protocol The protocol behind the predictive diagnostic evaluation is specified here in full so that the negative primary resu...