pith · machine review for the scientific record

arxiv: 2605.02141 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · math.ST · stat.ML · stat.TH

Recognition: 2 theorem links

· Lean Theorem

On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.ST · stat.ML · stat.TH
keywords regularization · complexity · sample · tilde · epsilon · offline · bandits · multi-armed

The pith

In offline KL-regularized MABs, KL-PCB achieves sample complexity Õ(η S A C^π*/ε) in the large-regularization regime (η = Õ(ε^{-1})) and Õ(S A C^π*/ε²) in the small-regularization regime (η = Ω̃(ε^{-1})), with matching lower bounds across the full range of regularization strengths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In offline multi-armed bandits, an algorithm must pick actions using only a fixed dataset collected by some unknown behavior policy. Adding KL regularization penalizes deviation from a reference policy and is common in practice for stability. The authors analyze an existing algorithm called KL-PCB and derive how many samples it needs to reach a target performance gap ε. In the large-regularization regime, where the regularization parameter η is at most roughly 1/ε (up to logarithmic factors), the sample requirement grows linearly with η. In the small-regularization regime, where η is at least roughly 1/ε, it grows like 1/ε². They also prove matching lower bounds, showing these rates cannot be improved in the worst case. The coverage coefficient C^π* measures how well the offline data covers the optimal policy.
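For orientation, a minimal sketch of the objective these claims are stated for, assuming one common convention in this literature; the 1/η normalization and the max-ratio form of C^π* are assumptions, since the abstract only names η, S, A, C^π*, and ε.

    % Hedged sketch (LaTeX): one common formulation of KL-regularized offline MABs.
    % The 1/\eta normalization and the max-ratio form of C^{\pi^*} are assumptions.
    \begin{align}
      J_\eta(\pi) &= \mathbb{E}_{s \sim \rho}\Big[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[r(s,a)\big]
          - \tfrac{1}{\eta}\, \mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big) \Big], \\
      C^{\pi^*} &= \max_{s,a} \frac{\pi^*(a \mid s)}{\pi_b(a \mid s)},
      \qquad \text{target: } J_\eta(\pi^*) - J_\eta(\hat\pi) \le \epsilon,
    \end{align}

where π_b is the behavior policy that produced the offline data and π̂ is the policy returned by the learner.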

Core claim

We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of Õ(η S A C^π*/ε) under large regularization η = Õ(ε^{-1}), and a sample complexity of Ω̃(S A C^π*/ε²) under small regularization η = Ω̃(ε^{-1}). We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths.

Load-bearing premise

The analysis relies on the offline dataset having a bounded policy coverage coefficient C^π* at the optimal policy, and on the regularization parameter η lying well below or well above 1/ε; the precise dependence of the bounds on these quantities is not derived for intermediate η values.
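To make the two regimes and the transitional gap concrete, here is a minimal numeric sketch of the claimed rates with constants and poly-logarithmic factors suppressed; treating η = 1/ε as the exact switching point is an assumption, inferred only from where the two expressions coincide.

    def claimed_sample_complexity(eta, S, A, C_star, eps):
        """Hedged sketch of the paper's two claimed regimes (constants and
        poly-log factors suppressed; the eta = 1/eps switching point is an
        assumption, not stated in the abstract)."""
        if eta <= 1.0 / eps:                      # "large regularization": eta = O~(1/eps)
            return eta * S * A * C_star / eps
        return S * A * C_star / eps ** 2          # "small regularization": eta = Omega~(1/eps)

    # At the boundary eta = 1/eps the two expressions agree up to constants,
    # which is what makes the transitional regime plausible but not proven here.
    S, A, C_star, eps = 10, 5, 2.0, 0.01
    for eta in (1.0, 10.0, 1.0 / eps, 1e4):
        n = claimed_sample_complexity(eta, S, A, C_star, eps)
        print(f"eta = {eta:8.1f}   n ~ {n:,.0f}")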

Figures

Figures reproduced from arXiv: 2605.02141 by Heyang Zhao, Kaixuan Ji, Qingyue Zhao, Qiwei Di, Quanquan Gu.

Figure 1. An illustration of the counting argument for the case … (view at source ↗)
Original abstract

Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(\eta SAC^{\pi^*}/\epsilon)$ under large regularization $\eta = \tilde{O}(\epsilon^{-1})$, and a sample complexity of $\tilde{\Omega}(SAC^{\pi^*}/\epsilon^2)$ under small regularization $\eta = \tilde{\Omega}(\epsilon^{-1})$, where $\eta$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{\pi^*}$ policy coverage coefficient at the optimal policy $\pi^*$, $\epsilon$ is the desired sub-optimality, and $\tilde{O}$ and $\tilde{\Omega}$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper analyzes the sample complexity of offline multi-armed bandits under KL regularization. It provides a sharp analysis of the KL-PCB algorithm, deriving an upper bound of Õ(η S A C^{π*}/ε) when the regularization parameter η is large (η = Õ(ε^{-1})) and a rate of Ω̃(S A C^{π*}/ε²) when η is small (η = Ω̃(ε^{-1})), along with a pair of lower bounds that are claimed to match these upper bounds over the entire range of η values, yielding a nearly complete characterization.

Significance. If the claimed matching bounds hold without hidden restrictions, the work would deliver a precise, regime-dependent understanding of how KL regularization strength governs sample complexity in offline MABs. This extends prior analyses of KL-PCB and supplies falsifiable lower bounds, which is a positive contribution to the theory of offline decision-making. However, the absence of an explicit upper bound or interpolation for the transitional regime η ≈ Θ(ε^{-1}) limits the completeness of the characterization.

major comments (1)
  1. [Abstract] Abstract: The upper bound for KL-PCB is stated only for the two extreme regimes (η = Õ(ε^{-1}) yielding Õ(η S A C^{π*}/ε) and η = Ω̃(ε^{-1}) yielding Ω̃(S A C^{π*}/ε²)), yet the central claim is a 'nearly complete characterization' via matching bounds 'over the entire range of regularization strengths.' No general upper bound or continuous interpolation is supplied for the transitional regime η = Θ(ε^{-1}), so the lower bounds cannot close the characterization for all η as asserted.
minor comments (2)
  1. [Abstract] Abstract: The sentence 'achieves a sample complexity of Ω̃(S A C^{π*}/ε²) under small regularization' is imprecise because Ω̃ conventionally denotes lower bounds; clarify whether this rate is achieved by the algorithm or is instead the matching lower bound.
  2. [Abstract] Abstract: The phrase 'remains largely from fully characterized' appears to be a typographical error; the intended phrase is presumably 'remains far from fully characterized.'

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address the major comment regarding the abstract and the completeness of the characterization below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The upper bound for KL-PCB is stated only for the two extreme regimes (η = Õ(ε^{-1}) yielding Õ(η S A C^{π*}/ε) and η = Ω̃(ε^{-1}) yielding Ω̃(S A C^{π*}/ε²)), yet the central claim is a 'nearly complete characterization' via matching bounds 'over the entire range of regularization strengths.' No general upper bound or continuous interpolation is supplied for the transitional regime η = Θ(ε^{-1}), so the lower bounds cannot close the characterization for all η as asserted.

    Authors: We thank the referee for this observation. Our manuscript provides upper bounds on the sample complexity of the KL-PCB algorithm in two specific regimes of the regularization parameter η: Õ(η S A C^{π*}/ε) for η = Õ(ε^{-1}) and Õ(S A C^{π*}/ε²) for η = Ω̃(ε^{-1}). We note that the abstract contains a typographical error in the second regime, where Õ should be used instead of Ω̃ to indicate the achieved upper bound. The pair of lower bounds we present match these upper bounds within their respective regimes and are valid across the full spectrum of η. Nevertheless, we do not supply an upper bound or an interpolating expression for the transitional regime η = Θ(ε^{-1}). We therefore agree that this prevents a fully closed characterization for all values of η. In the revised version, we will amend the abstract to accurately reflect the regimes where matching occurs and include a brief discussion of the transitional case as an open question. This revision will ensure our claims are precisely aligned with the theorems and proofs in the paper. revision: partial

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. No explicit free parameters beyond the given η, ε, S, A, C^π* are mentioned. Standard concentration inequalities are implicitly used but not listed.
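The "standard concentration inequalities" alluded to are presumably of the Hoeffding/Azuma/Freedman family that appears in the reference list ([5], [18], [44]); as one representative, hedged example, since the abstract does not reveal which inequality the analysis actually invokes:

    % A generic Hoeffding-type bound (LaTeX) for i.i.d. bounded variables; whether
    % the paper uses this, Bernstein, or Freedman-style bounds is an assumption.
    \Pr\!\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mathbb{E}[X_1] \right| \ge t \right)
      \le 2 \exp\!\left( - \frac{2 n t^2}{(b-a)^2} \right),
    \qquad X_i \in [a, b] \ \text{i.i.d.}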

pith-pipeline@v0.9.0 · 5579 in / 1212 out tokens · 31457 ms · 2026-05-08T19:33:32.789185+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 15 canonical work pages · 3 internal anchors

[1] Agarwal, A., Kakade, S. and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory. PMLR.
[2] Assouad, P. (1983). Deux remarques sur l'estimation. Comptes rendus des séances de l'Académie des sciences. Série 1, Mathématique 296 1021–1024.
[3] Audibert, J.-Y. and Bubeck, S. (2010). Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
[4] Aziz, M., Anderton, J., Kaufmann, E. and Aslam, J. (2018). Pure exploration in infinitely-armed bandit models with fixed-confidence. In Algorithmic Learning Theory. PMLR.
[5] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal.
[6] Bondy, J. A. and Murty, U. S. R. (1979). Graph Theory with Applications. North-Holland.
[7] Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17 122–142.
[8] Chaudhuri, A. R. and Kalyanakrishnan, S. (2017). PAC identification of a bandit arm relative to a reward quantile. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
[9] Chaudhuri, A. R. and Kalyanakrishnan, S. (2018). Quantile-regret minimisation in infinitely many-armed bandits. In UAI.
[10] Chen, F., Foster, D. J., Han, Y., Qian, J., Rakhlin, A. and Xu, Y. (2024). Assouad, Fano, and Le Cam with interaction: A unifying lower bound framework and characterization for bandit learnability. Advances in Neural Information Processing Systems 37 75585–75641.
[11] Cheng, C.-A., Xie, T., Jiang, N. and Agarwal, A. (2022). Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning. PMLR.
[12] Cover, T. M. (1999). Elements of Information Theory. John Wiley & Sons. De Heide, R., Cheshire, J., Ménard, P. and Carpentier, A. (2021). Bandits with many optimal arms. Advances in Neural Information Processing Systems 34 22457–22469.
[13] Degenne, R. and Koolen, W. M. (2019). Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems 32.
[14] Di, Q., Zhao, H., He, J. and Gu, Q. (2024). Pessimistic nonlinear least-squares value iteration for offline reinforcement learning. In The Twelfth International Conference on Learning Representations.
[15] Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D. and Kiela, D. (2024). Model alignment as prospect theoretic optimization. In International Conference on Machine Learning. PMLR.
[16] Foster, D. J., Kakade, S. M., Qian, J. and Rakhlin, A. (2021). The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487.
[17] Foster, D. J., Mhammedi, Z. and Rohatgi, D. (2025). Is a good foundation necessary for efficient reinforcement learning? The computational role of the base model in exploration. In The Thirty Eighth Annual Conference on Learning Theory. PMLR.
[18] Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability 100–118.
[19] Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In Conference on Learning Theory. PMLR.
[20] Garivier, A. and Kaufmann, E. (2021). Nonasymptotic sequential tests for overlapping hypotheses applied to near-optimal arm identification in bandit models. Sequential Analysis 40 61–96.
[21] Garivier, A., Ménard, P. and Stoltz, G. (2019). Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research 44 377–399.
[22] Gilbert, E. N. (1952). A comparison of signalling alphabets. The Bell System Technical Journal 31 504–522.
[23] Gu, Y., Han, Y. and Qian, J. (2025). Evolution of information in interactive decision making: A case study for multi-armed bandits. arXiv preprint arXiv:2503.00273.
[24] Guruswami, V., Rudra, A. and Sudan, M. (2019). Essential coding theory. Draft available at http://cse.buffalo.edu/faculty/atri/courses/coding-theory/book.
[25] Ji, K., Zhao, Q., Zhao, H., Di, Q. and Gu, Q. (2026). Near-optimal regret for KL-regularized multi-armed bandits. arXiv preprint arXiv:2603.02155.
[26] Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning. PMLR.
[27] Kumar, A., Zhou, A., Tucker, G. and Levine, S. (2020). Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 1179–1191.
[28] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 4–22.
[29] Lai, X., Tian, Z., Chen, Y., Yang, S., Peng, X. and Jia, J. (2024). Step-DPO: Step-wise preference optimization for long-chain reasoning of LLMs. arXiv preprint arXiv:2406.18629.
[30] Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press. Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals of Statistics 38–53.
[31] Levine, S., Kumar, A., Tucker, G. and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
[32] Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J. and Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 421–436.
[33] Li, G., Shi, L., Chen, Y., Chi, Y. and Wei, Y. (2024). Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics 52 233–260.
[34] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5 623–648.
[35] Meng, Y., Xia, M. and Chen, D. (2024). SimPO: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems 37 124198–124235.
[36] Nayak, A., Yang, T., Yagan, O., Joshi, G. and Chi, Y. (2025). Achieving logarithmic regret in KL-regularized zero-sum Markov games. arXiv preprint arXiv:2510.13060.
[37] Ozdaglar, A. E., Pattathil, S., Zhang, J. and Zhang, K. (2023). Revisiting the linear-programming framework for offline RL with general function approximation. In International Conference on Machine Learning. PMLR.
[38] Polyanskiy, Y. and Wu, Y. (2025). Information Theory: From Coding to Learning. Cambridge University Press.
[39] Rafailov, R., Hejna, J., Park, R. and Finn, C. (2024). From r to Q*: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358.
[40] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S. and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36.
[41] Rashidinejad, P., Zhu, B., Ma, C., Jiao, J. and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems 34 11702–11716.
[42] Rashidinejad, P., Zhu, H., Yang, K., Russell, S. and Jiao, J. (2022). Optimal conservative offline RL with general function approximation via augmented Lagrangian. arXiv preprint arXiv:2211.00716.
[43] Ren, T., Li, J., Dai, B., Du, S. S. and Sanghavi, S. (2021). Nearly horizon-free offline reinforcement learning. Advances in Neural Information Processing Systems 34 15621–15634.
[44] Shamir, O. (2011). A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392.
[45] Shamir, O. (2015). On the complexity of bandit linear optimization. In Conference on Learning Theory. PMLR.
[46] Shi, L., Li, G., Wei, Y., Chen, Y. and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In International Conference on Machine Learning. PMLR.
[47] Sidford, A., Wang, M., Wu, X., Yang, L. F. and Ye, Y. (2018). Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. arXiv preprint arXiv:1806.01492.
[48] Tiapkin, D., Belomestny, D., Calandriello, D., Moulines, E., Munos, R., Naumov, A., Perrault, P., Tang, Y., Valko, M. and Ménard, P. (2023). Fast rates for maximum entropy exploration. In International Conference on Machine Learning. PMLR.
[49] Tsybakov, A. (2008). Introduction to Nonparametric Estimation. Springer Series in Statistics.
[50] Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage. arXiv preprint arXiv:2107.06226.
[51] Varshamov, R. R. (1957). Estimate of the number of signals in error correcting codes. Doklady Akad. Nauk SSSR 117 739–741.
[52] Wang, R., Foster, D. and Kakade, S. M. (2021). What are the statistical limits of offline RL with linear function approximation? In International Conference on Learning Representations.
[53] Wang, Z., Zhou, D., Lui, J. C. and Sun, W. (2025). Model-based RL as a minimalist approach to horizon-free and second-order bounds. In The Thirteenth International Conference on Learning Representations.
[54] Weng, W., He, Y. and Zhou, X. (2025). Improved bounds for private and robust alignment. arXiv preprint arXiv:2512.23816.
[55] Wu, Y., Thareja, R., Vepakomma, P. and Orabona, F. (2025b). Offline and online KL-regularized RLHF under differential privacy. arXiv preprint arXiv:2510.13512.
[56] Wu, Y., Tucker, G. and Nachum, O. (2019). Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
[57] Xie, T., Foster, D. J., Krishnamurthy, A., Rosset, C., Awadallah, A. H. and Rakhlin, A. (2025). Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF. In The Thirteenth International Conference on Learning Representations.
[58] Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N. and Zhang, T. (2024). Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. In Forty-first International Conference on Machine Learning.
[59] Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L. and Zhang, T. (2023). Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game. In International Conference on Learning Representations (ICLR).
[60] Yan, Y., Li, G., Chen, Y. and Fan, J. (2023). The efficacy of pessimism in asynchronous Q-learning. IEEE Transactions on Information Theory 69 7185–7219.
[61] Yin, M., Duan, Y., Wang, M. and Wang, Y.-X. (2022). Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. In International Conference on Learning Representations.
[62] Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics. Springer, 423–435.
[63] Zhan, W., Huang, B., Huang, A., Jiang, N. and Lee, J. (2022). Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory. PMLR.
[64] Zhang, T. (2023). Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press.
[65] Zhang, Y., Chen, C. and Jiang, N. (2026). Beyond pessimism: Offline learning in KL-regularized games. arXiv preprint arXiv:2604.06738.
[66] Zhao, Q., Ji, K., Zhao, H., Zhang, T. and Gu, Q. (2026). Towards a sharp analysis of offline policy learning for f-divergence-regularized contextual bandits. In The Fourteenth International Conference on Learning Representations.
[67] Zhu, Y. and Nowak, R. (2020). On regret with multiple best arms. Advances in Neural Information Processing Systems 33 9050–9060.