arxiv: 2605.09277 · v2 · submitted 2026-05-10 · 💻 cs.LG

Worst-Case Regret Bounds for Combinatorial Thompson Sampling in Sleeping Semi-Bandits

Zhiming Huang , Bingshan Hu , Jianping Pan This is my paper

Pith reviewed 2026-05-13 07:35 UTC · model grok-4.3

classification 💻 cs.LG

keywords combinatorial thompson samplingsleeping semi-banditsregret boundsworst-case analysisgaussian priorscoordinated explorationadversarial availability

0 comments

The pith

Combinatorial Thompson sampling with Gaussian priors achieves first worst-case regret bound of Õ(m√NT) for sleeping semi-bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper delivers the first worst-case regret analysis for combinatorial Thompson sampling using Gaussian priors in semi-bandits where arms can sleep or become unavailable over time. It proves that the standard CTS-G method has regret bounded above by Õ(m√NT) and establishes a matching lower bound of Ω̃(m√NT). The authors introduce CL-SG, a variant that samples one shared Gaussian seed per round to coordinate exploration, which improves the bound to Õ(√(mNT)) with a matching lower bound. These results address gaps in theory for settings like wireless mesh routing with fluctuating links and show better empirical performance than prior CTS variants.

Core claim

CTS-G attains a worst-case regret upper bound of Õ(m√NT) with a matching lower bound Ω̃(m√NT) in sleeping semi-bandits under adversarial availability. CL-SG, by sampling a single shared Gaussian seed each round, attains the improved bound Õ(√(mNT)) with matching lower bound Ω(√(mNT)).

What carries the argument

CL-SG's shared Gaussian seed sampling, which coordinates exploration across arms by using one random seed per round instead of independent samples.

If this is right

The regret for CL-SG scales as the square root of the product of the number of arms m and time horizon T, enabling better scaling in large systems.
Matching lower bounds confirm that the upper bounds cannot be improved asymptotically.
CL-SG outperforms CTS-G and CTS-B on real-world datasets.
The analysis applies to adversarial changes in arm availability while keeping combinatorial constraints fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

CL-SG's coordination trick could be applied to other Thompson sampling algorithms to reduce regret in non-combinatorial settings.
Wireless network designers can now select algorithms with proven regret guarantees for fluctuating link availability.
Future work might test these bounds in online settings with learned combinatorial constraints rather than fixed ones.

Load-bearing premise

The feasible actions must satisfy fixed combinatorial constraints that are known in advance and do not change over time.

What would settle it

Observe the cumulative regret of CL-SG in a synthetic sleeping semi-bandit instance with m arms and T rounds under adversarial availability; if it grows faster than order sqrt(mNT), the bound is falsified.

Figures

Figures reproduced from arXiv: 2605.09277 by Bingshan Hu, Jianping Pan, Zhiming Huang.

**Figure 3.** Figure 3: Comparison of the regret for both settings with [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 1.** Figure 1: Setting 1: Wireless Mesh Network. s 1 2 3 4 5 6 7 8 9 t 4 1 2 2 4 3 1 3 1 3 3 1 3 2 [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

**Figure 3.** Figure 3: Effect of different exploration rates γ = 0.01, 0.1, 0.5, and 1 for CTS-G and CL-SG. sample per round. CL-SG achieves near-optimal performance, with a worst-case upper bound of O˜( √ mNT) and a matching lower bound of Ω( ˜ √ mNT). Empirically, CL-SG significantly outperforms existing benchmarks such as CTS-B and CombUCB, demonstrating both improved accuracy and efficiency across diverse environments. Look… view at source ↗

**Figure 4.** Figure 4: Effect of different exploration rates γ = 0.01, 0.1, 0.5, and 1 for CTS-G and CL-SG. CL-SG and CTS-G achieve the lowest regret when γ = 0.01. If we continue to increase γ, both algorithms suffer a larger regret. This suggests that a certain lower level of randomness (exploration) is more effective in practice. VI. CONCLUSION This paper addresses a long-standing open problem by establishing the first worst-… view at source ↗

read the original abstract

We revisit combinatorial Thompson sampling (CTS) for semi-bandits with sleeping arms, where arm availability varies over time and actions must satisfy combinatorial constraints, as in wireless mesh routing with fluctuating link availability. Despite its practical relevance, CTS has been hindered by several long-standing problems: (i) the absence of worst-case regret guarantees in the semi-bandit setting even without sleeping arms, (ii) the lack of theory under adversarially varying availability, and (iii) the consistently weak empirical performance of CTS with Gaussian priors (CTS-G). This paper resolves these long-standing issues by providing the first worst-case regret analysis of CTS-G, proving an upper bound of $\tilde{O}(m\sqrt{NT})$ and a matching lower bound of $\tilde{\Omega}(m\sqrt{NT})$. To bridge the gap between theory and practice, we further propose CL-SG, a simple CTS-G variant that samples a single shared Gaussian seed each round to coordinate exploration across arms. We show that CL-SG achieves an improved regret bound of $\tilde{O}(\sqrt{mNT})$, together with a matching lower bound $\Omega(\sqrt{mNT})$. Experiments on real-world datasets demonstrate that CL-SG consistently outperforms strong baselines including CTS-G and CTS-B, and we open-source our implementation for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first worst-case regret bounds for combinatorial Thompson sampling with Gaussian priors under sleeping arms, plus a coordinated variant that improves the scaling from m sqrt(NT) to sqrt(m NT).

read the letter

The main thing to know is that this work supplies the first worst-case regret analysis for CTS-G in the semi-bandit setting with adversarial sleeping arms, including matching upper and lower bounds of order m sqrt(NT), and then introduces CL-SG, which uses one shared Gaussian seed per round to coordinate and reaches the tighter sqrt(m NT) scaling with its own matching lower bound. They also run experiments on real datasets where CL-SG beats the baselines and release the code. That combination of new bounds and practical checks is the concrete advance. The analysis builds on standard bandit techniques without obvious circularity, and the setting matches real uses like wireless routing with fluctuating links. The soft spot is the CL-SG coordination step. Sharing a single seed is meant to preserve correlation across arms, but adversarial availability can change which arms are present each round and potentially weaken or destroy that correlation. The abstract does not detail how the proof keeps the anti-concentration or variance properties intact under arbitrary restrictions, so that interaction between fixed combinatorial constraints and time-varying support is the part that needs the closest look. If the argument holds there, the improvement is real; if not, the bound may fall back to the weaker m-factor version. This is worth a serious referee because the claims are specific, the gap was genuine, and the applications are clear. I would bring it to a reading group to walk through the CL-SG proof and check whether the coordination survives the adversarial availability.

Referee Report

1 major / 1 minor

Summary. The paper claims to deliver the first worst-case regret analysis of combinatorial Thompson sampling with Gaussian priors (CTS-G) for sleeping semi-bandits, establishing a matching upper bound of Õ(m√(NT)) and lower bound of Ω̃(m√(NT)). It further introduces CL-SG, which draws one shared Gaussian seed per round to coordinate exploration, and proves that this variant attains the improved regret Õ(√(mNT)) with a matching lower bound Ω(√(mNT)). Empirical results on real-world datasets are reported to show CL-SG outperforming CTS-G and CTS-B baselines.

Significance. If the stated bounds are correct, the work closes two long-standing gaps: the absence of worst-case guarantees for CTS in semi-bandits (even without sleeping) and the lack of theory under adversarial availability. The matching upper/lower bounds for both algorithms, together with the open-sourced implementation, would constitute a substantive contribution to the combinatorial bandit literature.

major comments (1)

[Abstract / CL-SG analysis] Abstract / CL-SG analysis: the improved Õ(√(mNT)) regret for CL-SG is asserted to follow from sampling a single shared Gaussian seed per round. Under adversarial sleeping, the set of available arms (and therefore the effective support of the seed) changes arbitrarily each round. The manuscript must supply the explicit argument showing that the anti-concentration or variance properties of the shared seed survive this random restriction and do not collapse to the weaker m√(NT) scaling of plain CTS-G; without that step the central improvement claim is not yet secured.

minor comments (1)

[Abstract] The abstract uses m, N, T without immediate definition; an early sentence or footnote clarifying that m is the number of base arms, N the maximum action size, and T the horizon would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the CL-SG analysis requires an explicit argument showing that the shared seed's anti-concentration properties are preserved under adversarial sleeping, and we will add this in the revision.

read point-by-point responses

Referee: [Abstract / CL-SG analysis] Abstract / CL-SG analysis: the improved Õ(√(mNT)) regret for CL-SG is asserted to follow from sampling a single shared Gaussian seed per round. Under adversarial sleeping, the set of available arms (and therefore the effective support of the seed) changes arbitrarily each round. The manuscript must supply the explicit argument showing that the anti-concentration or variance properties of the shared seed survive this random restriction and do not collapse to the weaker m√(NT) scaling of plain CTS-G; without that step the central improvement claim is not yet secured.

Authors: We agree the current draft leaves this step implicit. In the revised manuscript we will add a dedicated lemma (new Lemma 4.3) proving that the shared standard-Gaussian seed g_t ~ N(0,1), when restricted to the adversarially chosen available set S_t, still satisfies the required anti-concentration: for any fixed direction v supported on S_t, P(|<g_t, v>| < ε) ≤ O(ε) uniformly in S_t. The proof uses the rotational invariance of the Gaussian together with the fact that the restriction operator is a coordinate projection whose operator norm is at most 1; consequently the marginal variance along any unit vector in the span of S_t remains exactly 1, preventing collapse to the per-arm independent scaling of CTS-G. This lemma is then plugged directly into the existing regret decomposition to recover the Õ(√(mNT)) bound. We will also include a short remark explaining why the same argument fails for independent per-arm seeds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; bounds derived via standard analysis techniques

full rationale

The paper presents the first worst-case regret analysis for CTS-G under sleeping semi-bandits, claiming an upper bound of Õ(m√NT) with matching lower bound, and an improved Õ(√(mNT)) bound for the CL-SG variant using a shared Gaussian seed. These derivations rely on standard semi-bandit regret decomposition, concentration inequalities, and Thompson sampling posterior analysis rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps in the provided claims reduce the target bounds to quantities defined by the paper's own inputs. The central claims remain independent of any circular construction, consistent with the reader's assessment of standard techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The regret analysis relies on standard concentration inequalities for Gaussian variables and combinatorial structure assumptions common to the bandit literature; no free parameters or new entities are introduced in the abstract.

axioms (1)

standard math Standard sub-Gaussian concentration bounds for Thompson sampling analysis
Invoked implicitly for deriving the regret upper bounds in the semi-bandit setting.

pith-pipeline@v0.9.0 · 5533 in / 1205 out tokens · 51214 ms · 2026-05-13T07:35:39.241557+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1]

Bridging the Regret Gap in Combinatorial Thompson Sampling: Worst-Case Guarantees and Algorithmic Refine- ment,

Z. Huang, B. Hu, and J. Pan, “Bridging the Regret Gap in Combinatorial Thompson Sampling: Worst-Case Guarantees and Algorithmic Refine- ment,” inProc. IEEE Conference on Computer Communications (IN- FOCOM). IEEE, 2026

work page 2026
[2]

Regret Bounds for Sleeping Experts and Bandits,

R. Kleinberg, A. Niculescu-Mizil, and Y . Sharma, “Regret Bounds for Sleeping Experts and Bandits,”Machine Learning, vol. 80, no. 2-3, pp. 245–272, 2010

work page 2010
[3]

Analysis of Thompson Sampling for Stochastic Sleeping Bandits,

A. Chatterjee, G. Ghalme, S. Jain, R. Vaish, and Y . Narahari, “Analysis of Thompson Sampling for Stochastic Sleeping Bandits,” inProc. Conference on Uncertainty in Artificial Intelligence (UAI), 2017

work page 2017
[4]

Intelligent Caching Algorithms in Heterogeneous Wireless Networks with Uncertainty,

B. Hu, Y . Chen, Z. Huang, N. A. Mehta, and J. Pan, “Intelligent Caching Algorithms in Heterogeneous Wireless Networks with Uncertainty,” in Proc. IEEE Conference on Distributed Computing Systems (ICDCS). IEEE, 2019

work page 2019
[5]

Hardness of Online Sleeping Combinatorial Optimization Problems,

S. Kale, C. Lee, and D. Pal, “Hardness of Online Sleeping Combinatorial Optimization Problems,” inProc. Advances in Neural Information Processing Systems (NeurIPS), 2016, pp. 2181–2189

work page 2016
[6]

Online Combinatorial Optimization with Stochastic Decision Sets and Adversarial Losses,

G. Neu and M. Valko, “Online Combinatorial Optimization with Stochastic Decision Sets and Adversarial Losses,” inProc. Advances in Neural Information Processing Systems (NeurIPS), 2014, pp. 2780– 2788

work page 2014
[7]

Combinatorial Sleeping Bandits with Fairness Constraints,

F. Li, J. Liu, and B. Ji, “Combinatorial Sleeping Bandits with Fairness Constraints,” inProc. IEEE Conference on Computer Communica- tions (INFOCOM). IEEE, May 2019, pp. 1702–1710

work page 2019
[8]

Combinatorial Sleeping Bandits with Fairness Constraints,

——, “Combinatorial Sleeping Bandits with Fairness Constraints,”IEEE Transactions on Network Science and Engineering (TNSE), vol. 7, no. 3, pp. 1799–1813, 2019

work page 2019
[9]

Achieving Regular and Fair Learning in Combi- natorial Multi-Armed Bandit,

X. Wu and B. Li, “Achieving Regular and Fair Learning in Combi- natorial Multi-Armed Bandit,” inProc. IEEE Conference on Computer Communications (INFOCOM). IEEE, 2024, pp. 361–370

work page 2024
[10]

On the Low-Complexity of Fair Learning for Combinatorial Multi-Armed Bandit,

X. Wu, B. Ji, and B. Li, “On the Low-Complexity of Fair Learning for Combinatorial Multi-Armed Bandit,” inProc. IEEE Conference on Computer Communications (INFOCOM), 2025

work page 2025
[11]

An Empirical Evaluation of Thompson Sampling,

O. Chapelle and L. Li, “An Empirical Evaluation of Thompson Sampling,” inProc. Advances in Neural Information Processing Sys- tems (NeurIPS), 2011, pp. 2249–2257

work page 2011
[12]

Thompson Sampling for Combinatorial Semi-Bandits,

S. Wang and W. Chen, “Thompson Sampling for Combinatorial Semi-Bandits,” inProc. International Conference on Machine Learn- ing (ICML), 2018, pp. 5101–5109

work page 2018
[13]

Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits,

P. Perrault, E. Boursier, M. Valko, and V . Perchet, “Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits,” inProc. Ad- vances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 5429–5440

work page 2020
[14]

On the Suboptimality of Thompson Sam- pling in High Dimensions,

R. Zhang and R. Combes, “On the Suboptimality of Thompson Sam- pling in High Dimensions,” inProc. Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 8345–8354

work page 2021
[15]

Thompson Sampling for Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox,

——, “Thompson Sampling for Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox,” inProc. Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[16]

Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits,

B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari, “Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits,” inProc. Artificial Intelligence and Statistics (AISTATS), 2015, pp. 535–543

work page 2015
[17]

Combinatorial Network Opti- mization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations,

Y . Gai, B. Krishnamachari, and R. Jain, “Combinatorial Network Opti- mization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations,”IEEE/ACM Transactions on Networking (TON), vol. 20, no. 5, pp. 1466–1478, 2012

work page 2012
[18]

Combinatorial Bandits Revisited,

R. Combes, M. S. T. M. Shahi, A. Proutiereet al., “Combinatorial Bandits Revisited,” inProc. Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 2116–2124

work page 2015
[19]

Combinatorial Multi-Armed Bandit: General Framework and Applications,

W. Chen, Y . Wang, and Y . Yuan, “Combinatorial Multi-Armed Bandit: General Framework and Applications,” inProc. International Confer- ence on Machine Learning (ICML), 2013, pp. 151–159

work page 2013
[20]

Tight Lower Bounds for Combinato- rial Multi-Armed Bandits,

N. Merlis and S. Mannor, “Tight Lower Bounds for Combinato- rial Multi-Armed Bandits,” inProc. Conference on Learning The- ory (COLT). PMLR, 2020, pp. 2830–2857

work page 2020
[21]

Combinatorial Semi-Bandit with Known Covariance,

R. Degenne and V . Perchet, “Combinatorial Semi-Bandit with Known Covariance,” inProc. Advances in Neural Information Processing Sys- tems (NeurIPS), vol. 29. Curran Associates, Inc., 2016

work page 2016
[22]

Lattimore and C

T. Lattimore and C. Szepesv ´ari,Bandit Algorithms. Cambridge University Press, 2020

work page 2020
[23]

Thompson Sampling for Combinatorial Net- work Optimization in Unknown Environments,

A. H ¨uy¨uk and C. Tekin, “Thompson Sampling for Combinatorial Net- work Optimization in Unknown Environments,”IEEE/ACM Transac- tions on Networking (TON), vol. 28, no. 6, pp. 2836–2849, 2020

work page 2020
[24]

Sleeping Combi- natorial Bandits,

K. Abhishek, G. Ghalme, S. Gujar, and Y . Narahari, “Sleeping Combi- natorial Bandits,”arXiv preprint arXiv:2106.01624, 2021

work page arXiv 2021
[25]

Near-Optimal Regret Bounds for Thomp- son Sampling (Corrected),

S. Agrawal and N. Goyal, “Near-Optimal Regret Bounds for Thomp- son Sampling (Corrected),” http://www.columbia.edu/ ∼sa3305/papers/ j3-corrected.pdf, 2017

work page 2017
[26]

Worst-Case Regret Bounds for Exploration via Randomized Value Functions,

D. Russo, “Worst-Case Regret Bounds for Exploration via Randomized Value Functions,” inProc. Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019
[27]

Near-Optimal Randomized Exploration for Tabular Markov Decision Processes,

Z. Xiong, R. Shen, Q. Cui, M. Fazel, and S. S. Du, “Near-Optimal Randomized Exploration for Tabular Markov Decision Processes,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[28]

Experiences from the Design, Deployment, and Usage of the UCSB MeshNet Testbed,

H. Lundgren, K. Ramachandran, E. Belding-Royer, K. Almeroth, M. Benny, A. Hewatt, A. Touma, and A. Jardosh, “Experiences from the Design, Deployment, and Usage of the UCSB MeshNet Testbed,” IEEE Wireless Communications, vol. 13, no. 2, pp. 18–29, 2006

work page 2006
[29]

UCSB/MeshNet,

I. Sheriff, E. Belding, K. C. Almeroth, and K. N. Ramachandran, “UCSB/MeshNet,” February 2007. [Online]. Available: https://doi.org/ 10.15783/C7WS3S APPENDIX Notations:LetF t−1 denote by the history of past ac- tions and rewards until the end of roundt−1. Recall that JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 8 EΘt[·] :=E[· |Θ t]andPr Θt(·)...

work page doi:10.15783/c7ws3s 2007