Pith · machine review for the scientific record

arxiv: 2605.04979 · v1 · submitted 2026-05-06 · 💻 cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:06 UTC · model grok-4.3

classification: cs.AI · cs.LG
keywords: tree MDP · bandit algorithms · online learning · PAC learning · regret minimization · shared data bounds · policy optimization · hidden information games

The pith

Bandit algorithms can learn policies in tree MDPs by treating each policy as an arm and using shared-data confidence bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard bandit methods such as LUCB and UCB can be applied directly to tree MDPs for both PAC learning and regret minimization, even though the number of policies is exponential in the number of states. The key step is to build confidence bounds from data that policies share because every state has a unique trajectory from the starting state. This sharing keeps both memory use and per-step computation polynomial in the problem size. The resulting bounds on sample complexity and regret are instance-dependent and add a gap term for each terminal state rather than for each policy. The approach is evaluated on hidden-information games where it outperforms prior methods.
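To make the scale concrete, here is a small counting exercise (an illustration only, not taken from the paper, which also allows stochastic transitions): in a complete T-MDP where every internal state has two actions, each leading deterministically to its own child, the number of deterministic policies dwarfs the number of states almost immediately.

```python
# Toy count (illustration only): states grow with the tree size while
# deterministic policies grow exponentially in the number of internal states,
# which is why per-policy confidence intervals are infeasible but per-state
# (shared-data) statistics remain cheap to maintain.

def tree_mdp_counts(depth: int, num_actions: int = 2):
    """Counts for a complete T-MDP in which every internal state has
    `num_actions` actions, each leading to its own child state."""
    internal = sum(num_actions ** d for d in range(depth))   # non-terminal states
    terminal = num_actions ** depth                          # leaf (terminal) states
    policies = num_actions ** internal                       # one action per internal state
    return internal, terminal, policies

for depth in range(1, 6):
    internal, terminal, policies = tree_mdp_counts(depth)
    print(f"depth {depth}: {internal} internal states, "
          f"{terminal} terminals, {policies} deterministic policies")
```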

Core claim

Treating each policy as a bandit arm allows LUCB and UCB to be applied to T-MDPs. The main technical step is the construction of confidence bounds from observations that are shared across policies due to the single-trajectory structure; these bounds support correct decisions while requiring only polynomial memory and per-step time. The analysis produces instance-dependent upper bounds on sample complexity and regret that sum gap terms over terminal states rather than over the full set of policies.

What carries the argument

Shared-data confidence bounds that pool observations along the unique trajectories reaching each state, allowing exponential policies to be handled with polynomial resources.
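A minimal sketch of the kind of computation this makes possible, under assumptions of our own (a generic Hoeffding-style bonus `beta` and a hypothetical `children` map from state-action pairs to successor states); the paper's actual shared-data confidence bounds and its LUCB/UCB selection rules are more specific than this.

```python
import math
from collections import defaultdict

class SharedStats:
    """Per-edge and per-terminal statistics pooled across all policies.

    `children` maps (state, action) -> list of successor states; terminal
    states have no entries. Memory is O(|S||A|), never O(|policies|).
    """

    def __init__(self, children):
        self.children = children
        self.edge_count = defaultdict(int)     # visits of (s, a, s')
        self.sa_count = defaultdict(int)       # visits of (s, a)
        self.return_sum = defaultdict(float)   # summed returns at each terminal
        self.term_count = defaultdict(int)     # visits of each terminal

    def update(self, trajectory, ret):
        """trajectory: list of (s, a, s') edges ending in a terminal; ret in [0, 1]."""
        for s, a, s2 in trajectory:
            self.edge_count[(s, a, s2)] += 1
            self.sa_count[(s, a)] += 1
        terminal = trajectory[-1][2]
        self.return_sum[terminal] += ret
        self.term_count[terminal] += 1

    def beta(self, n, delta=0.05):
        # Placeholder optimism bonus; the paper derives its own bounds.
        return math.sqrt(math.log(2.0 / delta) / (2.0 * max(n, 1)))

    def optimistic_policy(self, s):
        """Single bottom-up pass: optimistic value of the subtree rooted at s and
        the action choices of the policy attaining it, without enumerating policies."""
        actions = [a for (state, a) in self.children if state == s]
        if not actions:                                   # terminal state
            n = self.term_count[s]
            mean = self.return_sum[s] / n if n else 0.0
            return min(1.0, mean + self.beta(n)), {}
        best_val, best_pol = -1.0, {}
        for a in actions:
            succs = self.children[(s, a)]
            n_sa = self.sa_count[(s, a)]
            val, pol = 0.0, {}
            for s2 in succs:
                p_hat = (self.edge_count[(s, a, s2)] / n_sa) if n_sa else 1.0 / len(succs)
                child_val, child_pol = self.optimistic_policy(s2)
                val += p_hat * child_val
                pol.update(child_pol)
            val = min(1.0, val + self.beta(n_sa))
            if val > best_val:
                best_val, best_pol = val, {s: a, **pol}
        return best_val, best_pol
```

In a UCB-style loop one would play the returned policy, observe the trajectory and its return, and call `update`; the recursion visits each edge once, so per-step computation stays polynomial in the number of states even though the policy set is exponential.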

If this is right

  • Sample-complexity bounds for PAC learning sum gap terms from terminal states.
  • Regret bounds in the online setting follow the same terminal-state structure.
  • Both algorithms run with polynomial memory and per-step computation.
  • The methods outperform prior approaches on hidden-information games.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sharing idea could reduce policy complexity in other MDPs whose state reachability has limited branching.
  • Bounds that depend only on terminal gaps suggest that learning effort concentrates near the leaves of the decision tree.
  • The reduction from policies to terminals may guide algorithm design for perfect-recall games of larger size.

Load-bearing premise

The unique-trajectory property of T-MDPs lets valid confidence bounds be built from data pooled across policies while preserving correctness.
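The decomposition behind that premise, which the paper states as Proposition 1 (notation lightly adapted here), makes the pooling explicit: a policy's value is a fixed combination of per-terminal quantities, so simultaneous confidence intervals for those shared quantities bracket every policy's value at once.

```latex
% X(\pi): terminal states reachable under policy \pi;
% q(\sigma): probability of reaching terminal \sigma along its unique trajectory;
% \rho(\sigma): expected return collected at \sigma.
\[
  V^{\pi}(s_1) \;=\; \sum_{\sigma \in X(\pi)} q(\sigma)\,\rho(\sigma)
  \qquad \text{for every } \pi \in \Pi .
\]
% A union bound over the O(|S|) shared estimates (edges and terminals), rather
% than over the exponentially many policies, therefore suffices for valid
% confidence bounds on every V^{\pi}(s_1) simultaneously.
```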

What would settle it

An implementation that either uses super-polynomial memory on a T-MDP with a modest number of states or produces sample complexity that exceeds the derived bound on a concrete instance with known gaps.

Figures

Figures reproduced from arXiv: 2605.04979 by Anvay Shah, Ramsundar Anandanarayanan, Sharayu Moharir, Shivaram Kalyanakrishnan.

Figure 1: Transition from non-terminal state s to state s′ upon taking action a. State s′ could be non-terminal or terminal.
Figure 2: Performance for Leduc Poker and RBT. The top row (a-d) corresponds to the PAC setting, and the bottom row (e-h) to regret.
Figure 3: (a) A visualisation of the uncertainty in RBT after playing a sense move. The sense region (in yellow) reveals information …
Read the original abstract

A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms -- \textsc{Lucb} and \textsc{Ucb} -- can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a ``gap term'' from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper addresses online learning in Tree MDPs (T-MDPs), finite-horizon MDPs where every state is reachable from the start via exactly one trajectory. It shows that standard bandit algorithms (LUCB and UCB) can be applied by treating policies as arms, with the central innovation being the construction of confidence bounds that exploit shared observations across policies due to the unique-trajectory structure. This enables polynomial memory and per-step computation despite the exponential policy space. Instance-dependent bounds on sample complexity and regret are derived that sum gap terms over terminal states (rather than policies). The approach is evaluated empirically on hidden-information games, where it outperforms available alternatives.

Significance. If the claims hold, the work provides an efficient and theoretically grounded method for online learning in structured MDPs with exponentially large policy sets by reducing the problem to tree DP over shared edge statistics. The instance-dependent bounds scaling with terminals (not policies) and the polynomial-time implementation via shared-data CIs are notable strengths, as is the empirical validation on games. This could influence regret minimization and PAC learning in perfect-recall game abstractions and related sequential decision settings.

minor comments (3)
  1. [Abstract] The statement that the algorithms 'consistently outperform available alternatives' would be strengthened by including at least one quantitative highlight (e.g., a sample-complexity reduction factor or regret ratio on a representative game).
  2. [Section 3 (Algorithm and Confidence Bounds)] The description of how the union bound over O(|S|) edges yields simultaneous CIs for all policies (via additive path returns) is central but would benefit from an explicit statement of the concentration inequality used and the precise failure-probability allocation; one standard allocation is sketched after this list.
  3. [Section 4 (Theoretical Analysis)] In the regret analysis, clarify whether the 'gap term from every terminal state' bound assumes a fixed horizon or accounts for varying path lengths; a short remark on this would remove ambiguity.
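For illustration of the second comment, one standard allocation is sketched below; this is an assumption about a conventional construction, not a restatement of the paper's actual inequality.

```latex
% Assume each shared statistic e (an edge probability or a terminal's mean
% return, each bounded in [0, 1]) has been observed n_e times. Giving every
% one of the O(|S|) statistics a failure budget of \delta / |S| and applying
% Hoeffding's inequality,
\[
  \Pr\!\left( \left|\hat{\mu}_e - \mu_e\right|
      \ge \sqrt{\frac{1}{2 n_e}\,\ln\frac{2|S|}{\delta}} \right)
  \;\le\; \frac{\delta}{|S|},
\]
% so a union bound makes all intervals hold simultaneously with probability at
% least 1 - \delta, and the additive path returns propagate them to every policy.
```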

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work on online learning in Tree MDPs, including the reduction to bandit algorithms via shared-data confidence bounds, the polynomial memory and computation guarantees, and the instance-dependent bounds that depend on terminal-state gaps. We appreciate the recommendation for minor revision. As no specific major comments were provided in the report, we have no points requiring rebuttal or substantive changes at this stage and will incorporate any minor editorial suggestions in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation applies known LUCB and UCB algorithms to T-MDPs by treating policies as arms, with confidence bounds constructed from per-edge statistics that exploit the unique-trajectory property. Instance-dependent sample-complexity and regret bounds are obtained by summing gap terms over terminal states (one per distinct path) rather than over the exponential policy set; this follows directly from additive path returns and a union bound over O(|S|) edges, without any fitted parameter being renamed as a prediction or any self-citation serving as the load-bearing justification. No equation or claim reduces to its own inputs by construction, and the approach remains self-contained against the stated MDP structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the structural definition of T-MDPs (unique trajectory to each state) and the assumption that this structure permits correct shared confidence bounds. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Every state in a T-MDP is reachable from the start through exactly one state-action trajectory.
    This property is invoked to justify treating policies as arms while sharing observations across policies that share prefixes.
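A small sanity check of this assumption on a given transition structure, as a sketch (using the same hypothetical `children` map as the earlier snippet, from (state, action) pairs to lists of successors):

```python
from collections import defaultdict, deque

def is_tree_mdp(children, start):
    """True iff every state is reachable from `start` via exactly one
    state-action trajectory: `start` has no incoming edge, every other state
    has exactly one incoming (state, action) edge, and all states are reachable."""
    incoming = defaultdict(int)
    for (s, _a), succs in children.items():
        for s2 in succs:
            incoming[s2] += 1
    states = {start} | {s for (s, _a) in children} | set(incoming)
    if incoming.get(start, 0) != 0:
        return False
    if any(incoming[s] != 1 for s in states if s != start):
        return False
    # breadth-first search from the start to confirm full reachability
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for (state, _a), succs in children.items():
            if state != s:
                continue
            for s2 in succs:
                if s2 not in seen:
                    seen.add(s2)
                    queue.append(s2)
    return seen == states
```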

pith-pipeline@v0.9.0 · 5518 in / 1371 out tokens · 32755 ms · 2026-05-08T18:06:08.571042+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 21 canonical work pages · 1 internal anchor
