Recognition: unknown
Covariance-adapting algorithm for semi-bandits with application to sparse rewards
Pith reviewed 2026-05-10 12:16 UTC · model grok-4.3
The pith
A covariance-adapting algorithm matches the tightest lower bound for semi-bandit regret using estimates of outcome covariance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For stochastic combinatorial semi-bandits, the expected regret admits a lower bound parameterized by the covariance matrix of the joint outcome distribution under a general sub-exponential family. An algorithm is constructed that uses online covariance estimates to attain an asymptotically tight upper bound on regret. The results are extended to the sparse outcomes family, which is relevant for many recommender systems.
What carries the argument
The covariance matrix of outcomes, which serves as the parameter for both the new lower bound on regret and the adaptive algorithm that estimates it on the fly.
Load-bearing premise
The joint distribution of outcomes belongs to the proposed general family of sub-exponential distributions, and reliable covariance estimates can be formed from the observed data without introducing significant additional regret.
What would settle it
If the expected regret of the covariance-adapting algorithm does not approach the covariance-parameterized lower bound as the time horizon tends to infinity in simulations with known sub-exponential distributions, the asymptotic tightness claim would be falsified.
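The falsification test above can be sketched numerically. The toy below is an assumption-laden stand-in: a two-arm instance with Gaussian outcomes (the degenerate semi-bandit with singleton actions) and a variance-adaptive UCB-V-style index in place of the paper's covariance-adapting algorithm; it only illustrates how one would track regret against a variance-parameterized log T rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-arm instance; all parameter values are assumptions.
mu = np.array([0.5, 0.4])        # true means; gap = 0.1
sigma = np.array([0.3, 0.05])    # arm 2 has small variance
T = 20000

def run_variance_ucb(T):
    """Variance-adaptive UCB index (UCB-V-style), standing in for the
    paper's covariance-adapting algorithm."""
    n = np.zeros(2)    # pull counts
    s = np.zeros(2)    # reward sums
    s2 = np.zeros(2)   # squared-reward sums
    regret = 0.0
    for t in range(1, T + 1):
        if t <= 2:
            a = t - 1  # initialize: pull each arm once
        else:
            m = s / n
            var = np.maximum(s2 / n - m**2, 1e-12)  # empirical variances
            bonus = np.sqrt(2 * var * np.log(t) / n) + 3 * np.log(t) / n
            a = int(np.argmax(m + bonus))
        x = rng.normal(mu[a], sigma[a])
        n[a] += 1
        s[a] += x
        s2[a] += x * x
        regret += mu.max() - mu[a]
    return regret

r = run_variance_ucb(T)
# If a variance-parameterized rate is tight, r / log(T) should settle near
# a constant driven by sigma[1]**2 rather than by a worst-case proxy.
print(r, r / np.log(T))
```

Falsification would look like r / log T drifting upward as T grows instead of flattening toward the covariance-parameterized constant.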
Original abstract
We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in the standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the expected regret on this family, that is parameterized by the unknown covariance matrix of outcomes, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems.
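A note on why the covariance matrix is the tighter quantity: the sub-Gaussian proxy matrix C always dominates the true covariance Σ*. The standard argument (not the paper's exact statement) can be sketched as:

```latex
% Fix $x \in \mathbb{R}^n$. Sub-Gaussianity with proxy matrix $C$ gives,
% for all $\lambda \in \mathbb{R}$,
\mathbb{E}\!\left[ e^{\lambda x^{\top}(X - \mu^{*})} \right]
  \le e^{\lambda^{2} x^{\top} C x / 2}.
% Expanding both sides to second order around $\lambda = 0$
% (the first-order term vanishes since $\mathbb{E}[X - \mu^{*}] = 0$):
1 + \tfrac{\lambda^{2}}{2}\, x^{\top}\Sigma^{*} x + o(\lambda^{2})
  \;\le\; 1 + \tfrac{\lambda^{2}}{2}\, x^{\top} C x + o(\lambda^{2}),
% hence $x^{\top}\Sigma^{*} x \le x^{\top} C x$ for every $x$,
% i.e. $\Sigma^{*} \preceq C$.
```

Any bound parameterized by Σ* is therefore at least as sharp as one parameterized by C, strictly sharper whenever the inequality is strict in some direction.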
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies stochastic combinatorial semi-bandits under a new general family of sub-exponential distributions (containing bounded and Gaussian cases). It proves a lower bound on expected regret parameterized by the unknown covariance matrix of the outcomes (claimed tighter than any sub-Gaussian proxy), constructs a covariance-adapting algorithm, and gives a matching asymptotic upper bound on regret. The results are extended to sparse outcome distributions with recommender-system applications.
Significance. If the asymptotic tightness holds, the work is significant because it replaces crude sub-Gaussian assumptions with a covariance-parameterized bound that can be substantially tighter on sparse or correlated instances. The matching lower/upper bounds and the explicit handling of covariance estimation constitute a clear technical advance over prior semi-bandit analyses that require known distribution parameters.
major comments (2)
- [§4 and §5] §4 (Algorithm construction) and §5 (Regret analysis): the claim that covariance estimation contributes only o(T) or lower-order terms to the regret is load-bearing for the tightness result, yet the manuscript does not exhibit an explicit bound on the length of the initial estimation phase or on the additional regret incurred when the sub-exponential tails are only marginally better than exponential; a concrete calculation showing that this phase is o(T) uniformly over the family is required.
- [Theorem 1] Theorem 1 (lower bound): the lower bound is stated to be parameterized by the covariance matrix Σ, but the proof sketch does not clarify whether the change-of-measure argument remains valid when the algorithm’s covariance estimator is formed from the same data stream that is used to define the regret; an explicit independence argument or a two-phase construction would remove the potential circularity.
minor comments (2)
- Notation: the symbol for the covariance estimator is introduced without a clear distinction from the true Σ in several displayed equations; consistent use of a hat or subscript would improve readability.
- Figure 1 (regret plots): the x-axis scaling and the number of independent runs are not stated in the caption; this makes it difficult to assess whether the observed slopes match the claimed asymptotic rates.
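On the first major comment: the claim that the estimation phase costs only o(T) ultimately rests on covariance concentration. A minimal numerical check of that decay rate (illustrative only; the dimension, the Gaussian sampler, and the operator-norm metric are assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical covariance error shrinks with sample size, so an estimation
# phase of polylog(T) length contributes o(T) regret.
d = 4
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d  # a fixed "true" covariance for the demo

def cov_error(n, reps=20):
    """Average operator-norm error of the empirical covariance from n samples."""
    errs = []
    for _ in range(reps):
        X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        Sigma_hat = np.cov(X, rowvar=False, bias=True)
        errs.append(np.linalg.norm(Sigma_hat - Sigma, ord=2))
    return float(np.mean(errs))

errs = {n: cov_error(n) for n in (100, 10_000)}
print(errs)  # error shrinks roughly like sqrt(d / n)
```

The referee's point stands for the tail regime: when the sub-exponential tails are only marginally better than exponential, the constants in this rate degrade, which is exactly why an explicit uniform calculation is requested.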
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We address each major comment below and plan to incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§4 and §5] §4 (Algorithm construction) and §5 (Regret analysis): the claim that covariance estimation contributes only o(T) or lower-order terms to the regret is load-bearing for the tightness result, yet the manuscript does not exhibit an explicit bound on the length of the initial estimation phase or on the additional regret incurred when the sub-exponential tails are only marginally better than exponential; a concrete calculation showing that this phase is o(T) uniformly over the family is required.
Authors: We agree that an explicit bound on the estimation phase would clarify the asymptotic tightness. In the revised manuscript, we will provide a concrete calculation of the initial phase length (e.g., using concentration inequalities for sub-exponential variables) and show that the additional regret is o(T) uniformly over the family, including cases where the sub-exponential parameter is close to the boundary. This will be added to §5. revision: yes
- Referee: [Theorem 1] Theorem 1 (lower bound): the lower bound is stated to be parameterized by the covariance matrix Σ, but the proof sketch does not clarify whether the change-of-measure argument remains valid when the algorithm’s covariance estimator is formed from the same data stream that is used to define the regret; an explicit independence argument or a two-phase construction would remove the potential circularity.
Authors: The lower bound applies to any algorithm, and the change-of-measure is between two instances with the same covariance but different means, so the estimator's distribution is identical under both measures, preserving the argument. Nevertheless, to address the concern directly, we will revise the proof in the appendix to include an explicit two-phase construction: a short initial phase for covariance estimation (with negligible regret) followed by the main phase where the algorithm uses the fixed estimate, ensuring independence. This maintains the asymptotic lower bound. revision: yes
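The two-phase construction proposed in the response can be sketched as a skeleton. Everything concrete here is an assumption for illustration: singleton actions instead of combinatorial ones, a log²T estimation-phase length, and a diagonal-only confidence width; only the structure (estimate, freeze, then play) reflects the rebuttal.

```python
import numpy as np

rng = np.random.default_rng(2)

def two_phase(T, mu, Sigma):
    """Estimate-then-freeze skeleton: phase 1 explores uniformly to
    estimate the covariance; phase 2 plays an index policy that only
    reads the frozen estimate, so the estimate is independent of the
    phase-2 data stream used to define the regret."""
    k = len(mu)
    n0 = int(np.ceil(np.log(T) ** 2))  # o(T) estimation-phase length
    # Phase 1: forced exploration; full outcome vectors are observed.
    X0 = rng.multivariate_normal(mu, Sigma, size=n0)
    Sigma_hat = np.cov(X0, rowvar=False, bias=True)  # frozen after phase 1
    mu_hat = X0.mean(axis=0)
    counts = np.full(k, n0)
    regret = n0 * (mu.max() - mu.mean())  # crude phase-1 regret proxy
    # Phase 2: optimistic index using only the frozen covariance estimate.
    for t in range(n0, T):
        width = np.sqrt(2 * np.diag(Sigma_hat) * np.log(t + 1) / counts)
        a = int(np.argmax(mu_hat + width))
        x = rng.multivariate_normal(mu, Sigma)[a]
        counts[a] += 1
        mu_hat[a] += (x - mu_hat[a]) / counts[a]
        regret += mu.max() - mu[a]
    return regret

mu = np.array([0.6, 0.5, 0.4])
Sigma = 0.05 * np.eye(3)
r = two_phase(5000, mu, Sigma)
print(r)
```

Because Sigma_hat is never updated in phase 2, the change-of-measure argument can treat it as a fixed, data-independent input there, which is the independence the referee asks for.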
Circularity Check
No circularity: lower bound uses true covariance; estimation handled separately in asymptotics
full rationale
The paper derives a lower bound parameterized by the true (unknown) covariance matrix of the sub-exponential family, which is an external property of the instance distribution and not defined in terms of the algorithm or its estimates. The subsequent algorithm plugs in covariance estimates formed from observed data, and the asymptotic upper-bound analysis treats estimation error as a lower-order term. No equation or step equates the final regret expression to a fitted parameter or self-citation by construction; the covariance remains an independent input quantity. This satisfies the default expectation of a non-circular derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The joint distribution of outcomes belongs to a general family of sub-exponential distributions that contains bounded and Gaussian cases.
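The paper's exact family is not reproduced here; one standard Bernstein-type formalization consistent with the axiom is:

```latex
% A random vector $X$ with mean $\mu^{*}$ is $(\Sigma, b)$-sub-exponential
% if, for every direction $x$,
\mathbb{E}\!\left[ e^{\lambda x^{\top}(X - \mu^{*})} \right]
  \le \exp\!\left( \frac{\lambda^{2}\, x^{\top}\Sigma\, x}{2\,(1 - b\,|\lambda|)} \right),
\qquad |\lambda| < \tfrac{1}{b}.
% Gaussians satisfy this with $b = 0$ (recovering sub-Gaussianity with
% variance proxy $\Sigma$), and bounded outcomes satisfy it via
% Bernstein's inequality, so the family contains both cases named above.
```

Whether the paper's family matches this condition exactly is an assumption of this note; the axiom only requires that it contain the bounded and Gaussian cases.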
discussion (0)