The Sample Complexity of Multiclass and Sparse Contextual Bandits

Alexander Rakhlin; Alon Cohen; Fan Chen; Liad Erez; Shay Moran; Tomer Koren; Yishay Mansour

arxiv: 2605.29645 · v1 · pith:KQQ6JBP3new · submitted 2026-05-28 · 💻 cs.LG · cs.AI· stat.ML

The Sample Complexity of Multiclass and Sparse Contextual Bandits

Liad Erez , Fan Chen , Alon Cohen , Tomer Koren , Yishay Mansour , Shay Moran , Alexander Rakhlin This is my paper

Pith reviewed 2026-06-29 09:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords contextual banditssample complexitysparse rewardsmulticlass classificationdecision estimation coefficientexploration by optimizationcombinatorial semi-bandits

0 comments

The pith

Algorithms output an epsilon-optimal policy for s-sparse contextual bandits using O((s/eps squared plus |A|/eps) log |Pi|) samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies contextual bandits where each context yields an s-sparse reward vector whose L1 norm is at most s, much smaller than the action set size. It constructs algorithms that, with high probability, return a policy within epsilon of the best in a given class Pi after roughly (s over epsilon squared plus |A| over epsilon) times log of |Pi| over delta samples. The bound removes an extra polynomial factor in |A| that appeared in earlier analyses and matches a lower bound up to logarithmic terms. The same rate extends to Natarajan dimension classes and to contextual combinatorial semi-bandits.

Core claim

For the class of s-sparse reward models the induced decision-estimation coefficient is bounded by a quantity linear in s; an exploration-by-optimization procedure whose sample complexity is controlled by this coefficient therefore returns an epsilon-optimal policy from Pi after the stated number of samples. A separate low-variance exploration method yields concrete, efficiently implementable algorithms that achieve the same bound and extend directly to list classification.

What carries the argument

The decision-estimation coefficient of the model class induced by s-sparse rewards, which scales linearly with s and directly governs the number of samples needed for exploration-by-optimization.

If this is right

Bandit multiclass classification with zero-one rewards admits the improved sample complexity.
The same rate holds when the policy class has finite Natarajan dimension.
Contextual combinatorial semi-bandits obtain improved guarantees via the low-variance method.
Prior algorithms that incurred an extra Theta(|A| to the ninth) factor are no longer necessary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Sparsity may render very large discrete action spaces practical for contextual decision tasks when rewards concentrate on few actions per context.
The low-variance exploration technique could be adapted to other structured observation models that admit low-variance unbiased estimators.
Empirical tests on multiclass datasets with known sparsity patterns would reveal whether the theoretical improvement translates to faster convergence in practice.

Load-bearing premise

The induced model class for s-sparse rewards admits a sharp decision-estimation coefficient bound that scales with s.

What would settle it

A concrete instance of an s-sparse contextual bandit problem together with a policy class Pi for which every algorithm requires asymptotically more than Omega((s/epsilon squared) log |Pi|) samples to output an epsilon-optimal policy would falsify the upper bound.

read the original abstract

We study contextual bandits in the stochastic i.i.d.\ setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $\epsilon$-optimal policy compared to policy class $\Pi$ using $\tilde{O} ((s/\epsilon^2 + |A|/\epsilon)\log |\Pi|/\delta)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $\Theta(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a clean linear-in-|A| and s sample complexity for sparse contextual bandits with a matching lower bound, closing the |A|^9 gap from prior work.

read the letter

The main advance here is replacing the |A|^9 blowup in earlier sparse contextual bandit bounds with Õ(s/ε² + |A|/ε) times log factors. They get this via a DEC argument tailored to the s-sparse reward model class plus a separate low-variance exploration algorithm that avoids the min-max. Both routes give the same rate, and they add a matching lower bound that rules out better |A| dependence. The extension to Natarajan classes and to combinatorial semi-bandits is a nice bonus and shows the technique travels.

The two complementary proofs are the real strength; one is information-theoretic and the other produces concrete algorithms. The citation to Foster et al. on DEC is used cleanly without circularity, and the sparse structure is handled directly rather than through a generic reduction.

The only soft spot visible from the abstract and stress-test is that the exact DEC calculation for the sparse class is not reproduced here, so the precise constants and log factors would need checking in the full text. No hidden |A| terms or realizability issues are flagged, and the lower bound appears to land where it should.

This is for people working on structured bandits or multiclass classification with large action sets. It is worth a serious referee because the rate improvement is sharp and the two proof paths make the result more robust.

Referee Report

0 major / 3 minor

Summary. The paper studies sample complexity for stochastic contextual bandits with s-sparse rewards (L1-norm at most s per context). It claims algorithms that output an ε-optimal policy from class Π w.h.p. using Õ((s/ε² + |A|/ε) log |Π|/δ) samples. Results are derived via two complementary routes—an information-theoretic analysis showing the s-sparse model class induces a DEC scaling linearly in s (via the Foster et al. framework) plus a concrete low-variance exploration algorithm—and are extended to Natarajan dimension classes and combinatorial semi-bandits, with a matching lower bound (up to logs) that removes the |A|^9 factor from prior work.

Significance. If the stated rates hold, the work closes a substantial gap in the literature on structured contextual bandits and bandit multiclass classification. Credit is due for supplying both an information-theoretic DEC derivation for the sparse case and a matching algorithmic construction, together with an explicit lower bound ruling out better |A| dependence; these elements together make the headline rate credible and falsifiable.

minor comments (3)

[Abstract] The sample-complexity expression writes log |Π|/δ; confirm whether this is log(|Π|/δ) or log |Π| / δ and state the precise high-probability failure probability.
[Abstract] The extension paragraph states the bound carries over to general Natarajan classes; add a sentence or corollary making the dependence on the Natarajan dimension explicit.
Notation for the policy class Π and the induced model class should be introduced once in the main body with a clear mapping from policies to reward distributions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of the paper, as well as for the recommendation of minor revision. The report correctly captures the main results on the sample complexity of s-sparse contextual bandits, the use of the DEC framework, the low-variance exploration algorithm, the extensions, and the matching lower bound that improves upon prior |A|^9 dependence. Since the referee report lists no specific major comments, we have no point-by-point responses to provide at this time.

Circularity Check

0 steps flagged

Minor self-citation to prior work; central DEC bound newly derived

full rationale

The paper cites Erez et al. (2024, 2025) solely to identify the |A|^9 gap being closed. The core contribution is an independent derivation of a DEC bound scaling with s for the s-sparse reward model class, using the external Foster et al. (2021, 2022) framework, plus a matching lower bound and a separate low-variance algorithm. No step reduces the claimed sample complexity to a fitted quantity, self-definition, or load-bearing self-citation chain. The derivation remains self-contained against the external DEC benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on the standard stochastic i.i.d. contextual bandit model plus the s-sparse reward assumption; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Contexts are drawn i.i.d. from an unknown distribution and rewards are stochastic given context and action.
Stated in the first sentence of the abstract as the setting under study.
domain assumption The reward vector for every context has L1 norm at most s.
Central modeling choice that enables the improved rate; appears in the second sentence.

pith-pipeline@v0.9.1-grok · 5904 in / 1355 out tokens · 22686 ms · 2026-06-29T09:15:52.415427+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 15 canonical work pages · 2 internal anchors

[1]

Decision making in changing environments: Robustness, query-based learning, and differential privacy,

F. Chen and A. Rakhlin. Decision making in changing environments: Robustness, query-based learning, and differential privacy.arXiv preprint arXiv:2501.14928,

work page arXiv
[2]

F. Chen, S. Mei, and Y. Bai. Unified algorithms for RL with decision-estimation coefficients: PAC, reward- free, preference-based learning, and beyond.arXiv preprint arXiv:2209.11745,

work page arXiv
[3]

F. Chen, D. J. Foster, Y. Han, J. Qian, A. Rakhlin, and Y. Xu. Assouad, Fano, and Le Cam with inter- action: A unifying lower bound framework and characterization for bandit learnability.arXiv preprint arXiv:2410.05117,

work page arXiv
[4]

Cohen, L

A. Cohen, L. Erez, S. Hanneke, T. Koren, Y. Mansour, S. Moran, and Q. Zhang. Sample complexity of agnostic multiclass classification: Natarajan dimension strikes back.arXiv preprint arXiv:2511.12659,

work page arXiv
[5]

Efficient Optimal Learning for Contextual Bandits

M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits.arXiv preprint arXiv:1106.2369,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Erez and T

L. Erez and T. Koren. Bandit multiclass list classification.arXiv preprint arXiv:2502.09257,

work page arXiv
[7]

Bandit-feedbackonlinemulticlassclassification: Variants and tradeoffs.arXiv preprint arXiv:2402.07453,

Y.Filmus,S.Hanneke,I.Mehalel,andS.Moran. Bandit-feedbackonlinemulticlassclassification: Variants and tradeoffs.arXiv preprint arXiv:2402.07453,

work page arXiv
[8]

Foster, D

D. Foster, D. J. Foster, N. Golowich, and A. Rakhlin. On the complexity of multi-agent decision making: Fromlearningingamestopartialmonitoring. InTheThirtySixthAnnualConferenceonLearningTheory, pages 2678–2792. PMLR, 2023a. D. J. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv preprint arXiv:2002....

work page arXiv 2002
[9]

Practical Contextual Bandits with Regression Oracles

D.J.Foster,A.Agarwal,M.Dudík,H.Luo,andR.E.Schapire. Practicalcontextualbanditswithregression oracles.arXiv preprint arXiv:1803.01088,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

D. J. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective.arXiv preprint arXiv:2010.03104,

work page arXiv 2010
[11]

The statistical complexity of interactive decision making,

D.J.Foster,S.M.Kakade,J.Qian,andA.Rakhlin. Thestatisticalcomplexityofinteractivedecisionmaking. arXiv preprint arXiv:2112.13487,

work page arXiv
[12]

D. J. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making with the decision- estimation coefficient. InThe Thirty Sixth Annual Conference on Learning Theory, pages 3969–4043. PMLR, 2023b. D. J. Foster, N. Golowich, J. Qian, A. Rakhlin, and A. Sekhari. Model-free reinforcement learning with the decision-estimation coefficient....

work page arXiv
[13]

L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2071–2080. JMLR. org,

2071
[14]

Decisionmakinginhybridenvironments: Amodelaggregationapproach

H.Liu,C.-Y.Wei,andJ.Zimmert. Decisionmakinginhybridenvironments: Amodelaggregationapproach. arXiv preprint arXiv:2502.05974, 2025a. H.Liu,C.-Y.Wei,andJ.Zimmert. Animprovedmodel-freedecision-estimationcoefficientwithapplications in adversarial mdps.arXiv preprint arXiv:2510.08882, 2025b. P. M. Long. New bounds on the price of bandit feedback for mistake-bo...

work page arXiv
[15]

doi: 10.1016/J.TCS.2019.11.017. H. B. McMahan and M. J. Streeter. Tighter bounds for multi-armed bandits with expert advice. InCOLT,

work page doi:10.1016/j.tcs.2019.11.017 2019
[16]

Moran, O

S. Moran, O. Sharon, I. Tsubari, and S. Yosebashvili. List online classification. InThe Thirty Sixth Annual Conference on Learning Theory, pages 1885–1913. PMLR,

1913
[17]

L. Qin, S. Chen, and X. Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. InProceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM,

2014
[18]

Multiclassonlinelearnabilityunderbanditfeedback.arXiv preprint arXiv:2308.04620,

A.Raman,V.Raman,U.Subedi,andA.Tewari. Multiclassonlinelearnabilityunderbanditfeedback.arXiv preprint arXiv:2308.04620,

work page arXiv
[19]

Problem setup.Let={𝑎∈ {0,1} 𝐾 | ∥𝑎∥1 =𝑚}be the prediction space where1≤𝑚≤𝐾is an integer

A Extension to combinatorial semi-bandits Our approach outlined in Section 4 has the additional benefit of extending to the more general setting of contextualcombinatorialsemi-bandits(CCSB,ErezandKoren(2025)),inwhichthepredictionsaresubsets (orlists)ofactionsofafixedsize,andtherewardofagivensubsetisthesumofrewardsofindividualactions in this subset. Proble...

2025
[20]

Theorem 5.LetΠ:→be a finite policy class, and assume the rewards𝑟∈ [0,1] 𝐾 satisfy ∥𝑟 ∥1 ≤𝑠

The sample complexity guarantee for Algorithm 3 is given in the following theorem. Theorem 5.LetΠ:→be a finite policy class, and assume the rewards𝑟∈ [0,1] 𝐾 satisfy ∥𝑟 ∥1 ≤𝑠. If we set 𝑇= eΘ 𝐾min(𝑠, 𝑚) 𝑚𝜀 log |Π| 𝛿 , 𝑛= eΘ 𝐾min(𝑠, 𝑚) 𝑚𝜀 + 𝑠min(𝑠, 𝑚) 𝜀2 log |Π| 𝛿 , 8Note that contextual bandits is the special case where𝑚=1. 16 𝜂=𝛾𝑚/(𝐾min(𝑠, 𝑚))and𝛾=1/2,...

2025
[21]

Combining this with the above implies that with probability1−2𝛿, 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 ≤ 32𝑠𝑚𝑇 𝐾 log𝑇+6𝐵log 1 𝛿 Dividing through by𝑚𝑇/𝐾and using Lemma 6 we obtain with probability1−3𝛿, 𝔼(𝑥,𝑟)∼  𝐾∑︁ 𝑗=1 𝑟(𝑗) 1{𝑗∈𝜋(𝑥)} 𝛾𝑚/𝐾+ (1−𝛾)𝑄 𝑥, 𝑗 (ˆ𝑝)  ≤ 2𝐾 𝑚𝑇 +2𝜂𝐵 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 + 2𝐾 𝜂𝑚𝑇 log|Π| +4𝐵log 1 𝛿 ≤3 32𝑠log𝑇+ 6𝐵𝐾 𝑚𝑇 log 1 𝛿 + 4𝐵𝐾 𝑚𝑇 log |Π| 𝛿 ≤96𝑠log...

2022
[22]

∑︁ 𝑎∈ Pr[𝑎 𝑖 =𝑎] 𝑟(𝑎)1{𝜋(𝑥)=𝑎} 𝛾/|| + (1−𝛾)𝑄 𝑥,𝑎 (ˆ𝑝) # =𝔼 (𝑥,𝑟)∼

Combining this with the above implies that with probability1−2𝛿, 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 ≤ 32𝑠𝑇 || log𝑇+ 6|| 𝛾 log 1 𝛿 Dividing through by𝑇/||and using Lemma 6 we obtain with probability1−3𝛿, 𝔼(𝑥,𝑟)∼ "∑︁ 𝑎∈ 𝑟(𝑎) 2 1{𝜋(𝑥)=𝑎} 𝛾/|| + (1−𝛾)𝑄 𝑥,𝑎 (ˆ𝑝) # ≤ 2|| 𝑇 + 2𝜂|| 2 𝛾𝑇 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 + 2|| 𝜂𝑇 log|Π| + 4|| 2 𝛾𝑇 log 1 𝛿 ≤3 32𝑠log𝑇+ 6|| 2 𝛾𝑇 log 1 𝛿 + 4|...

2025
[23]

𝑇∑︁ 𝑡=1 𝑋𝑡 ≤ 1+ 1 𝑎 𝑇∑︁ 𝑡=1 𝔼[𝑋𝑡 | 𝑡−1 ] +𝑎𝐵log 1 𝛿 # ≥1−𝛿, and Pr

The reward functiongiventhecontext𝑥 2 isidenticallyzero,while𝑟 𝑥1 ∈ {0,1}  assignsarewardof1toauniqueaction 𝑎★ ∈and0for𝑎≠𝑎 ★, andissampleduniformlyatrandompriortotheinteraction. Notethatinorderfor Algto output an𝜀-optimal policy with probability at least3/4, it must produceˆ𝜋∈Πfor whichˆ𝜋(𝑥1)=𝑎 ★ with probability at least3/4. Now, denote by𝑎1, 𝑎2, . . ....

2011
[24]

□ Lemma 5.Let𝑎 1,

The second inequality follows similarly by considering 𝑍𝑡 =−𝑌 𝑡. □ Lemma 5.Let𝑎 1, . . . , 𝑎𝑛 ∈ [0,1]. Then 𝑛∑︁ 𝑖=1 𝑎𝑖 1+ Í𝑖−1 𝑗=1 𝑎 𝑗 ≤2 log 1+ 𝑛∑︁ 𝑖=1 𝑎𝑖 ! . Proof.Denote𝑠 𝑖 = Í𝑖 𝑗=1 𝑎 𝑗 for𝑖∈ [𝑛]and𝑠 0 =0. Using the fact that𝑧≤2 log(1+𝑧)for all𝑧∈ [0,1], we have 𝑎𝑖 1+𝑠 𝑖−1 ≤2 log 1+ 𝑎𝑖 1+𝑠 𝑖−1 =2 [log(1+𝑠 𝑖) −log (1+𝑠 𝑖−1 )] . Summing over𝑖∈ [𝑛], we not...

2015

[1] [1]

Decision making in changing environments: Robustness, query-based learning, and differential privacy,

F. Chen and A. Rakhlin. Decision making in changing environments: Robustness, query-based learning, and differential privacy.arXiv preprint arXiv:2501.14928,

work page arXiv

[2] [2]

F. Chen, S. Mei, and Y. Bai. Unified algorithms for RL with decision-estimation coefficients: PAC, reward- free, preference-based learning, and beyond.arXiv preprint arXiv:2209.11745,

work page arXiv

[3] [3]

F. Chen, D. J. Foster, Y. Han, J. Qian, A. Rakhlin, and Y. Xu. Assouad, Fano, and Le Cam with inter- action: A unifying lower bound framework and characterization for bandit learnability.arXiv preprint arXiv:2410.05117,

work page arXiv

[4] [4]

Cohen, L

A. Cohen, L. Erez, S. Hanneke, T. Koren, Y. Mansour, S. Moran, and Q. Zhang. Sample complexity of agnostic multiclass classification: Natarajan dimension strikes back.arXiv preprint arXiv:2511.12659,

work page arXiv

[5] [5]

Efficient Optimal Learning for Contextual Bandits

M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits.arXiv preprint arXiv:1106.2369,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Erez and T

L. Erez and T. Koren. Bandit multiclass list classification.arXiv preprint arXiv:2502.09257,

work page arXiv

[7] [7]

Bandit-feedbackonlinemulticlassclassification: Variants and tradeoffs.arXiv preprint arXiv:2402.07453,

Y.Filmus,S.Hanneke,I.Mehalel,andS.Moran. Bandit-feedbackonlinemulticlassclassification: Variants and tradeoffs.arXiv preprint arXiv:2402.07453,

work page arXiv

[8] [8]

Foster, D

D. Foster, D. J. Foster, N. Golowich, and A. Rakhlin. On the complexity of multi-agent decision making: Fromlearningingamestopartialmonitoring. InTheThirtySixthAnnualConferenceonLearningTheory, pages 2678–2792. PMLR, 2023a. D. J. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv preprint arXiv:2002....

work page arXiv 2002

[9] [9]

Practical Contextual Bandits with Regression Oracles

D.J.Foster,A.Agarwal,M.Dudík,H.Luo,andR.E.Schapire. Practicalcontextualbanditswithregression oracles.arXiv preprint arXiv:1803.01088,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

D. J. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective.arXiv preprint arXiv:2010.03104,

work page arXiv 2010

[11] [11]

The statistical complexity of interactive decision making,

D.J.Foster,S.M.Kakade,J.Qian,andA.Rakhlin. Thestatisticalcomplexityofinteractivedecisionmaking. arXiv preprint arXiv:2112.13487,

work page arXiv

[12] [12]

D. J. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making with the decision- estimation coefficient. InThe Thirty Sixth Annual Conference on Learning Theory, pages 3969–4043. PMLR, 2023b. D. J. Foster, N. Golowich, J. Qian, A. Rakhlin, and A. Sekhari. Model-free reinforcement learning with the decision-estimation coefficient....

work page arXiv

[13] [13]

L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2071–2080. JMLR. org,

2071

[14] [14]

Decisionmakinginhybridenvironments: Amodelaggregationapproach

H.Liu,C.-Y.Wei,andJ.Zimmert. Decisionmakinginhybridenvironments: Amodelaggregationapproach. arXiv preprint arXiv:2502.05974, 2025a. H.Liu,C.-Y.Wei,andJ.Zimmert. Animprovedmodel-freedecision-estimationcoefficientwithapplications in adversarial mdps.arXiv preprint arXiv:2510.08882, 2025b. P. M. Long. New bounds on the price of bandit feedback for mistake-bo...

work page arXiv

[15] [15]

doi: 10.1016/J.TCS.2019.11.017. H. B. McMahan and M. J. Streeter. Tighter bounds for multi-armed bandits with expert advice. InCOLT,

work page doi:10.1016/j.tcs.2019.11.017 2019

[16] [16]

Moran, O

S. Moran, O. Sharon, I. Tsubari, and S. Yosebashvili. List online classification. InThe Thirty Sixth Annual Conference on Learning Theory, pages 1885–1913. PMLR,

1913

[17] [17]

L. Qin, S. Chen, and X. Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. InProceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM,

2014

[18] [18]

Multiclassonlinelearnabilityunderbanditfeedback.arXiv preprint arXiv:2308.04620,

A.Raman,V.Raman,U.Subedi,andA.Tewari. Multiclassonlinelearnabilityunderbanditfeedback.arXiv preprint arXiv:2308.04620,

work page arXiv

[19] [19]

Problem setup.Let={𝑎∈ {0,1} 𝐾 | ∥𝑎∥1 =𝑚}be the prediction space where1≤𝑚≤𝐾is an integer

A Extension to combinatorial semi-bandits Our approach outlined in Section 4 has the additional benefit of extending to the more general setting of contextualcombinatorialsemi-bandits(CCSB,ErezandKoren(2025)),inwhichthepredictionsaresubsets (orlists)ofactionsofafixedsize,andtherewardofagivensubsetisthesumofrewardsofindividualactions in this subset. Proble...

2025

[20] [20]

Theorem 5.LetΠ:→be a finite policy class, and assume the rewards𝑟∈ [0,1] 𝐾 satisfy ∥𝑟 ∥1 ≤𝑠

The sample complexity guarantee for Algorithm 3 is given in the following theorem. Theorem 5.LetΠ:→be a finite policy class, and assume the rewards𝑟∈ [0,1] 𝐾 satisfy ∥𝑟 ∥1 ≤𝑠. If we set 𝑇= eΘ 𝐾min(𝑠, 𝑚) 𝑚𝜀 log |Π| 𝛿 , 𝑛= eΘ 𝐾min(𝑠, 𝑚) 𝑚𝜀 + 𝑠min(𝑠, 𝑚) 𝜀2 log |Π| 𝛿 , 8Note that contextual bandits is the special case where𝑚=1. 16 𝜂=𝛾𝑚/(𝐾min(𝑠, 𝑚))and𝛾=1/2,...

2025

[21] [21]

Combining this with the above implies that with probability1−2𝛿, 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 ≤ 32𝑠𝑚𝑇 𝐾 log𝑇+6𝐵log 1 𝛿 Dividing through by𝑚𝑇/𝐾and using Lemma 6 we obtain with probability1−3𝛿, 𝔼(𝑥,𝑟)∼  𝐾∑︁ 𝑗=1 𝑟(𝑗) 1{𝑗∈𝜋(𝑥)} 𝛾𝑚/𝐾+ (1−𝛾)𝑄 𝑥, 𝑗 (ˆ𝑝)  ≤ 2𝐾 𝑚𝑇 +2𝜂𝐵 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 + 2𝐾 𝜂𝑚𝑇 log|Π| +4𝐵log 1 𝛿 ≤3 32𝑠log𝑇+ 6𝐵𝐾 𝑚𝑇 log 1 𝛿 + 4𝐵𝐾 𝑚𝑇 log |Π| 𝛿 ≤96𝑠log...

2022

[22] [22]

∑︁ 𝑎∈ Pr[𝑎 𝑖 =𝑎] 𝑟(𝑎)1{𝜋(𝑥)=𝑎} 𝛾/|| + (1−𝛾)𝑄 𝑥,𝑎 (ˆ𝑝) # =𝔼 (𝑥,𝑟)∼

Combining this with the above implies that with probability1−2𝛿, 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 ≤ 32𝑠𝑇 || log𝑇+ 6|| 𝛾 log 1 𝛿 Dividing through by𝑇/||and using Lemma 6 we obtain with probability1−3𝛿, 𝔼(𝑥,𝑟)∼ "∑︁ 𝑎∈ 𝑟(𝑎) 2 1{𝜋(𝑥)=𝑎} 𝛾/|| + (1−𝛾)𝑄 𝑥,𝑎 (ˆ𝑝) # ≤ 2|| 𝑇 + 2𝜂|| 2 𝛾𝑇 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 + 2|| 𝜂𝑇 log|Π| + 4|| 2 𝛾𝑇 log 1 𝛿 ≤3 32𝑠log𝑇+ 6|| 2 𝛾𝑇 log 1 𝛿 + 4|...

2025

[23] [23]

𝑇∑︁ 𝑡=1 𝑋𝑡 ≤ 1+ 1 𝑎 𝑇∑︁ 𝑡=1 𝔼[𝑋𝑡 | 𝑡−1 ] +𝑎𝐵log 1 𝛿 # ≥1−𝛿, and Pr

The reward functiongiventhecontext𝑥 2 isidenticallyzero,while𝑟 𝑥1 ∈ {0,1}  assignsarewardof1toauniqueaction 𝑎★ ∈and0for𝑎≠𝑎 ★, andissampleduniformlyatrandompriortotheinteraction. Notethatinorderfor Algto output an𝜀-optimal policy with probability at least3/4, it must produceˆ𝜋∈Πfor whichˆ𝜋(𝑥1)=𝑎 ★ with probability at least3/4. Now, denote by𝑎1, 𝑎2, . . ....

2011

[24] [24]

□ Lemma 5.Let𝑎 1,

The second inequality follows similarly by considering 𝑍𝑡 =−𝑌 𝑡. □ Lemma 5.Let𝑎 1, . . . , 𝑎𝑛 ∈ [0,1]. Then 𝑛∑︁ 𝑖=1 𝑎𝑖 1+ Í𝑖−1 𝑗=1 𝑎 𝑗 ≤2 log 1+ 𝑛∑︁ 𝑖=1 𝑎𝑖 ! . Proof.Denote𝑠 𝑖 = Í𝑖 𝑗=1 𝑎 𝑗 for𝑖∈ [𝑛]and𝑠 0 =0. Using the fact that𝑧≤2 log(1+𝑧)for all𝑧∈ [0,1], we have 𝑎𝑖 1+𝑠 𝑖−1 ≤2 log 1+ 𝑎𝑖 1+𝑠 𝑖−1 =2 [log(1+𝑠 𝑖) −log (1+𝑠 𝑖−1 )] . Summing over𝑖∈ [𝑛], we not...

2015