pith. sign in

arxiv: 2605.29645 · v1 · pith:KQQ6JBP3new · submitted 2026-05-28 · 💻 cs.LG · cs.AI· stat.ML

The Sample Complexity of Multiclass and Sparse Contextual Bandits

Pith reviewed 2026-06-29 09:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML
keywords contextual banditssample complexitysparse rewardsmulticlass classificationdecision estimation coefficientexploration by optimizationcombinatorial semi-bandits
0
0 comments X

The pith

Algorithms output an epsilon-optimal policy for s-sparse contextual bandits using O((s/eps squared plus |A|/eps) log |Pi|) samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies contextual bandits where each context yields an s-sparse reward vector whose L1 norm is at most s, much smaller than the action set size. It constructs algorithms that, with high probability, return a policy within epsilon of the best in a given class Pi after roughly (s over epsilon squared plus |A| over epsilon) times log of |Pi| over delta samples. The bound removes an extra polynomial factor in |A| that appeared in earlier analyses and matches a lower bound up to logarithmic terms. The same rate extends to Natarajan dimension classes and to contextual combinatorial semi-bandits.

Core claim

For the class of s-sparse reward models the induced decision-estimation coefficient is bounded by a quantity linear in s; an exploration-by-optimization procedure whose sample complexity is controlled by this coefficient therefore returns an epsilon-optimal policy from Pi after the stated number of samples. A separate low-variance exploration method yields concrete, efficiently implementable algorithms that achieve the same bound and extend directly to list classification.

What carries the argument

The decision-estimation coefficient of the model class induced by s-sparse rewards, which scales linearly with s and directly governs the number of samples needed for exploration-by-optimization.

If this is right

  • Bandit multiclass classification with zero-one rewards admits the improved sample complexity.
  • The same rate holds when the policy class has finite Natarajan dimension.
  • Contextual combinatorial semi-bandits obtain improved guarantees via the low-variance method.
  • Prior algorithms that incurred an extra Theta(|A| to the ninth) factor are no longer necessary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sparsity may render very large discrete action spaces practical for contextual decision tasks when rewards concentrate on few actions per context.
  • The low-variance exploration technique could be adapted to other structured observation models that admit low-variance unbiased estimators.
  • Empirical tests on multiclass datasets with known sparsity patterns would reveal whether the theoretical improvement translates to faster convergence in practice.

Load-bearing premise

The induced model class for s-sparse rewards admits a sharp decision-estimation coefficient bound that scales with s.

What would settle it

A concrete instance of an s-sparse contextual bandit problem together with a policy class Pi for which every algorithm requires asymptotically more than Omega((s/epsilon squared) log |Pi|) samples to output an epsilon-optimal policy would falsify the upper bound.

read the original abstract

We study contextual bandits in the stochastic i.i.d.\ setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $\epsilon$-optimal policy compared to policy class $\Pi$ using $\tilde{O} ((s/\epsilon^2 + |A|/\epsilon)\log |\Pi|/\delta)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $\Theta(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper studies sample complexity for stochastic contextual bandits with s-sparse rewards (L1-norm at most s per context). It claims algorithms that output an ε-optimal policy from class Π w.h.p. using Õ((s/ε² + |A|/ε) log |Π|/δ) samples. Results are derived via two complementary routes—an information-theoretic analysis showing the s-sparse model class induces a DEC scaling linearly in s (via the Foster et al. framework) plus a concrete low-variance exploration algorithm—and are extended to Natarajan dimension classes and combinatorial semi-bandits, with a matching lower bound (up to logs) that removes the |A|^9 factor from prior work.

Significance. If the stated rates hold, the work closes a substantial gap in the literature on structured contextual bandits and bandit multiclass classification. Credit is due for supplying both an information-theoretic DEC derivation for the sparse case and a matching algorithmic construction, together with an explicit lower bound ruling out better |A| dependence; these elements together make the headline rate credible and falsifiable.

minor comments (3)
  1. [Abstract] The sample-complexity expression writes log |Π|/δ; confirm whether this is log(|Π|/δ) or log |Π| / δ and state the precise high-probability failure probability.
  2. [Abstract] The extension paragraph states the bound carries over to general Natarajan classes; add a sentence or corollary making the dependence on the Natarajan dimension explicit.
  3. Notation for the policy class Π and the induced model class should be introduced once in the main body with a clear mapping from policies to reward distributions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of the paper, as well as for the recommendation of minor revision. The report correctly captures the main results on the sample complexity of s-sparse contextual bandits, the use of the DEC framework, the low-variance exploration algorithm, the extensions, and the matching lower bound that improves upon prior |A|^9 dependence. Since the referee report lists no specific major comments, we have no point-by-point responses to provide at this time.

Circularity Check

0 steps flagged

Minor self-citation to prior work; central DEC bound newly derived

full rationale

The paper cites Erez et al. (2024, 2025) solely to identify the |A|^9 gap being closed. The core contribution is an independent derivation of a DEC bound scaling with s for the s-sparse reward model class, using the external Foster et al. (2021, 2022) framework, plus a matching lower bound and a separate low-variance algorithm. No step reduces the claimed sample complexity to a fitted quantity, self-definition, or load-bearing self-citation chain. The derivation remains self-contained against the external DEC benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on the standard stochastic i.i.d. contextual bandit model plus the s-sparse reward assumption; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Contexts are drawn i.i.d. from an unknown distribution and rewards are stochastic given context and action.
    Stated in the first sentence of the abstract as the setting under study.
  • domain assumption The reward vector for every context has L1 norm at most s.
    Central modeling choice that enables the improved rate; appears in the second sentence.

pith-pipeline@v0.9.1-grok · 5904 in / 1355 out tokens · 22686 ms · 2026-06-29T09:15:52.415427+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1]

    Decision making in changing environments: Robustness, query-based learning, and differential privacy,

    F. Chen and A. Rakhlin. Decision making in changing environments: Robustness, query-based learning, and differential privacy.arXiv preprint arXiv:2501.14928,

  2. [2]

    F. Chen, S. Mei, and Y. Bai. Unified algorithms for RL with decision-estimation coefficients: PAC, reward- free, preference-based learning, and beyond.arXiv preprint arXiv:2209.11745,

  3. [3]

    F. Chen, D. J. Foster, Y. Han, J. Qian, A. Rakhlin, and Y. Xu. Assouad, Fano, and Le Cam with inter- action: A unifying lower bound framework and characterization for bandit learnability.arXiv preprint arXiv:2410.05117,

  4. [4]

    Cohen, L

    A. Cohen, L. Erez, S. Hanneke, T. Koren, Y. Mansour, S. Moran, and Q. Zhang. Sample complexity of agnostic multiclass classification: Natarajan dimension strikes back.arXiv preprint arXiv:2511.12659,

  5. [5]

    Efficient Optimal Learning for Contextual Bandits

    M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits.arXiv preprint arXiv:1106.2369,

  6. [6]

    Erez and T

    L. Erez and T. Koren. Bandit multiclass list classification.arXiv preprint arXiv:2502.09257,

  7. [7]

    Bandit-feedbackonlinemulticlassclassification: Variants and tradeoffs.arXiv preprint arXiv:2402.07453,

    Y.Filmus,S.Hanneke,I.Mehalel,andS.Moran. Bandit-feedbackonlinemulticlassclassification: Variants and tradeoffs.arXiv preprint arXiv:2402.07453,

  8. [8]

    Foster, D

    D. Foster, D. J. Foster, N. Golowich, and A. Rakhlin. On the complexity of multi-agent decision making: Fromlearningingamestopartialmonitoring. InTheThirtySixthAnnualConferenceonLearningTheory, pages 2678–2792. PMLR, 2023a. D. J. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv preprint arXiv:2002....

  9. [9]

    Practical Contextual Bandits with Regression Oracles

    D.J.Foster,A.Agarwal,M.Dudík,H.Luo,andR.E.Schapire. Practicalcontextualbanditswithregression oracles.arXiv preprint arXiv:1803.01088,

  10. [10]

    D. J. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective.arXiv preprint arXiv:2010.03104,

  11. [11]

    The statistical complexity of interactive decision making,

    D.J.Foster,S.M.Kakade,J.Qian,andA.Rakhlin. Thestatisticalcomplexityofinteractivedecisionmaking. arXiv preprint arXiv:2112.13487,

  12. [12]

    D. J. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making with the decision- estimation coefficient. InThe Thirty Sixth Annual Conference on Learning Theory, pages 3969–4043. PMLR, 2023b. D. J. Foster, N. Golowich, J. Qian, A. Rakhlin, and A. Sekhari. Model-free reinforcement learning with the decision-estimation coefficient....

  13. [13]

    L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2071–2080. JMLR. org,

  14. [14]

    Decisionmakinginhybridenvironments: Amodelaggregationapproach

    H.Liu,C.-Y.Wei,andJ.Zimmert. Decisionmakinginhybridenvironments: Amodelaggregationapproach. arXiv preprint arXiv:2502.05974, 2025a. H.Liu,C.-Y.Wei,andJ.Zimmert. Animprovedmodel-freedecision-estimationcoefficientwithapplications in adversarial mdps.arXiv preprint arXiv:2510.08882, 2025b. P. M. Long. New bounds on the price of bandit feedback for mistake-bo...

  15. [15]

    doi: 10.1016/J.TCS.2019.11.017. H. B. McMahan and M. J. Streeter. Tighter bounds for multi-armed bandits with expert advice. InCOLT,

  16. [16]

    Moran, O

    S. Moran, O. Sharon, I. Tsubari, and S. Yosebashvili. List online classification. InThe Thirty Sixth Annual Conference on Learning Theory, pages 1885–1913. PMLR,

  17. [17]

    L. Qin, S. Chen, and X. Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. InProceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM,

  18. [18]

    Multiclassonlinelearnabilityunderbanditfeedback.arXiv preprint arXiv:2308.04620,

    A.Raman,V.Raman,U.Subedi,andA.Tewari. Multiclassonlinelearnabilityunderbanditfeedback.arXiv preprint arXiv:2308.04620,

  19. [19]

    Problem setup.Let={𝑎∈ {0,1} 𝐾 | ∥𝑎∥1 =𝑚}be the prediction space where1≤𝑚≤𝐾is an integer

    A Extension to combinatorial semi-bandits Our approach outlined in Section 4 has the additional benefit of extending to the more general setting of contextualcombinatorialsemi-bandits(CCSB,ErezandKoren(2025)),inwhichthepredictionsaresubsets (orlists)ofactionsofafixedsize,andtherewardofagivensubsetisthesumofrewardsofindividualactions in this subset. Proble...

  20. [20]

    Theorem 5.LetΠ:→be a finite policy class, and assume the rewards𝑟∈ [0,1] 𝐾 satisfy ∥𝑟 ∥1 ≤𝑠

    The sample complexity guarantee for Algorithm 3 is given in the following theorem. Theorem 5.LetΠ:→be a finite policy class, and assume the rewards𝑟∈ [0,1] 𝐾 satisfy ∥𝑟 ∥1 ≤𝑠. If we set 𝑇= eΘ 𝐾min(𝑠, 𝑚) 𝑚𝜀 log |Π| 𝛿 , 𝑛= eΘ 𝐾min(𝑠, 𝑚) 𝑚𝜀 + 𝑠min(𝑠, 𝑚) 𝜀2 log |Π| 𝛿 , 8Note that contextual bandits is the special case where𝑚=1. 16 𝜂=𝛾𝑚/(𝐾min(𝑠, 𝑚))and𝛾=1/2,...

  21. [21]

    Combining this with the above implies that with probability1−2𝛿, 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 ≤ 32𝑠𝑚𝑇 𝐾 log𝑇+6𝐵log 1 𝛿 Dividing through by𝑚𝑇/𝐾and using Lemma 6 we obtain with probability1−3𝛿, 𝔼(𝑥,𝑟)∼  𝐾∑︁ 𝑗=1 𝑟(𝑗) 1{𝑗∈𝜋(𝑥)} 𝛾𝑚/𝐾+ (1−𝛾)𝑄 𝑥, 𝑗 (ˆ𝑝)  ≤ 2𝐾 𝑚𝑇 +2𝜂𝐵 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 + 2𝐾 𝜂𝑚𝑇 log|Π| +4𝐵log 1 𝛿 ≤3 32𝑠log𝑇+ 6𝐵𝐾 𝑚𝑇 log 1 𝛿 + 4𝐵𝐾 𝑚𝑇 log |Π| 𝛿 ≤96𝑠log...

  22. [22]

    ∑︁ 𝑎∈ Pr[𝑎 𝑖 =𝑎] 𝑟(𝑎)1{𝜋(𝑥)=𝑎} 𝛾/|| + (1−𝛾)𝑄 𝑥,𝑎 (ˆ𝑝) # =𝔼 (𝑥,𝑟)∼

    Combining this with the above implies that with probability1−2𝛿, 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 ≤ 32𝑠𝑇 || log𝑇+ 6|| 𝛾 log 1 𝛿 Dividing through by𝑇/||and using Lemma 6 we obtain with probability1−3𝛿, 𝔼(𝑥,𝑟)∼ "∑︁ 𝑎∈ 𝑟(𝑎) 2 1{𝜋(𝑥)=𝑎} 𝛾/|| + (1−𝛾)𝑄 𝑥,𝑎 (ˆ𝑝) # ≤ 2|| 𝑇 + 2𝜂|| 2 𝛾𝑇 𝑇∑︁ 𝑡=1 𝑢⊤ 𝑡 𝑝𝑡 + 2|| 𝜂𝑇 log|Π| + 4|| 2 𝛾𝑇 log 1 𝛿 ≤3 32𝑠log𝑇+ 6|| 2 𝛾𝑇 log 1 𝛿 + 4|...

  23. [23]

    𝑇∑︁ 𝑡=1 𝑋𝑡 ≤ 1+ 1 𝑎 𝑇∑︁ 𝑡=1 𝔼[𝑋𝑡 | 𝑡−1 ] +𝑎𝐵log 1 𝛿 # ≥1−𝛿, and Pr

    The reward functiongiventhecontext𝑥 2 isidenticallyzero,while𝑟 𝑥1 ∈ {0,1}  assignsarewardof1toauniqueaction 𝑎★ ∈and0for𝑎≠𝑎 ★, andissampleduniformlyatrandompriortotheinteraction. Notethatinorderfor Algto output an𝜀-optimal policy with probability at least3/4, it must produceˆ𝜋∈Πfor whichˆ𝜋(𝑥1)=𝑎 ★ with probability at least3/4. Now, denote by𝑎1, 𝑎2, . . ....

  24. [24]

    □ Lemma 5.Let𝑎 1,

    The second inequality follows similarly by considering 𝑍𝑡 =−𝑌 𝑡. □ Lemma 5.Let𝑎 1, . . . , 𝑎𝑛 ∈ [0,1]. Then 𝑛∑︁ 𝑖=1 𝑎𝑖 1+ Í𝑖−1 𝑗=1 𝑎 𝑗 ≤2 log 1+ 𝑛∑︁ 𝑖=1 𝑎𝑖 ! . Proof.Denote𝑠 𝑖 = Í𝑖 𝑗=1 𝑎 𝑗 for𝑖∈ [𝑛]and𝑠 0 =0. Using the fact that𝑧≤2 log(1+𝑧)for all𝑧∈ [0,1], we have 𝑎𝑖 1+𝑠 𝑖−1 ≤2 log 1+ 𝑎𝑖 1+𝑠 𝑖−1 =2 [log(1+𝑠 𝑖) −log (1+𝑠 𝑖−1 )] . Summing over𝑖∈ [𝑛], we not...