Pith · machine review for the scientific record

arxiv: 2605.04979 · v1 · submitted 2026-05-06 · 💻 cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:06 UTC · model grok-4.3

classification: cs.AI · cs.LG
keywords: tree MDP · bandit algorithms · online learning · PAC learning · regret minimization · shared data bounds · policy optimization · hidden information games

The pith

Bandit algorithms can learn policies in tree MDPs by treating each policy as an arm and using shared-data confidence bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard bandit methods such as LUCB and UCB can be applied directly to tree MDPs for both PAC learning and regret minimization, even though the number of policies is exponential in the number of states. The key step is to build confidence bounds from data that policies share because every state has a unique trajectory from the starting state. This sharing keeps both memory use and per-step computation polynomial in the problem size. The resulting bounds on sample complexity and regret are instance-dependent and add a gap term for each terminal state rather than for each policy. The approach is evaluated on hidden-information games where it outperforms prior methods.
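To make the scale concrete, here is a small counting exercise (an illustration only, not taken from the paper, which also allows stochastic transitions): in a complete T-MDP where every internal state has two actions, each leading deterministically to its own child, the number of deterministic policies dwarfs the number of states almost immediately.

```python
# Toy count (illustration only): states grow with the tree size while
# deterministic policies grow exponentially in the number of internal states,
# which is why per-policy confidence intervals are infeasible but per-state
# (shared-data) statistics remain cheap to maintain.

def tree_mdp_counts(depth: int, num_actions: int = 2):
    """Counts for a complete T-MDP in which every internal state has
    `num_actions` actions, each leading to its own child state."""
    internal = sum(num_actions ** d for d in range(depth))   # non-terminal states
    terminal = num_actions ** depth                          # leaf (terminal) states
    policies = num_actions ** internal                       # one action per internal state
    return internal, terminal, policies

for depth in range(1, 6):
    internal, terminal, policies = tree_mdp_counts(depth)
    print(f"depth {depth}: {internal} internal states, "
          f"{terminal} terminals, {policies} deterministic policies")
```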

Core claim

Treating each policy as a bandit arm allows LUCB and UCB to be applied to T-MDPs. The main technical step is the construction of confidence bounds from observations that are shared across policies due to the single-trajectory structure; these bounds support correct decisions while requiring only polynomial memory and per-step time. The analysis produces instance-dependent upper bounds on sample complexity and regret that sum gap terms over terminal states rather than over the full set of policies.

What carries the argument

Shared-data confidence bounds that pool observations along the unique trajectories reaching each state, allowing exponential policies to be handled with polynomial resources.
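A minimal sketch of the kind of computation this makes possible, under assumptions of our own (a generic Hoeffding-style bonus `beta` and a hypothetical `children` map from state-action pairs to successor states); the paper's actual shared-data confidence bounds and its LUCB/UCB selection rules are more specific than this.

```python
import math
from collections import defaultdict

class SharedStats:
    """Per-edge and per-terminal statistics pooled across all policies.

    `children` maps (state, action) -> list of successor states; terminal
    states have no entries. Memory is O(|S||A|), never O(|policies|).
    """

    def __init__(self, children):
        self.children = children
        self.edge_count = defaultdict(int)     # visits of (s, a, s')
        self.sa_count = defaultdict(int)       # visits of (s, a)
        self.return_sum = defaultdict(float)   # summed returns at each terminal
        self.term_count = defaultdict(int)     # visits of each terminal

    def update(self, trajectory, ret):
        """trajectory: list of (s, a, s') edges ending in a terminal; ret in [0, 1]."""
        for s, a, s2 in trajectory:
            self.edge_count[(s, a, s2)] += 1
            self.sa_count[(s, a)] += 1
        terminal = trajectory[-1][2]
        self.return_sum[terminal] += ret
        self.term_count[terminal] += 1

    def beta(self, n, delta=0.05):
        # Placeholder optimism bonus; the paper derives its own bounds.
        return math.sqrt(math.log(2.0 / delta) / (2.0 * max(n, 1)))

    def optimistic_policy(self, s):
        """Single bottom-up pass: optimistic value of the subtree rooted at s and
        the action choices of the policy attaining it, without enumerating policies."""
        actions = [a for (state, a) in self.children if state == s]
        if not actions:                                   # terminal state
            n = self.term_count[s]
            mean = self.return_sum[s] / n if n else 0.0
            return min(1.0, mean + self.beta(n)), {}
        best_val, best_pol = -1.0, {}
        for a in actions:
            succs = self.children[(s, a)]
            n_sa = self.sa_count[(s, a)]
            val, pol = 0.0, {}
            for s2 in succs:
                p_hat = (self.edge_count[(s, a, s2)] / n_sa) if n_sa else 1.0 / len(succs)
                child_val, child_pol = self.optimistic_policy(s2)
                val += p_hat * child_val
                pol.update(child_pol)
            val = min(1.0, val + self.beta(n_sa))
            if val > best_val:
                best_val, best_pol = val, {s: a, **pol}
        return best_val, best_pol
```

In a UCB-style loop one would play the returned policy, observe the trajectory and its return, and call `update`; the recursion visits each edge once, so per-step computation stays polynomial in the number of states even though the policy set is exponential.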

If this is right

  • Sample-complexity bounds for PAC learning sum gap terms from terminal states.
  • Regret bounds in the online setting follow the same terminal-state structure.
  • Both algorithms run with polynomial memory and per-step computation.
  • The methods outperform prior approaches on hidden-information games.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sharing idea could reduce policy complexity in other MDPs whose state reachability has limited branching.
  • Bounds that depend only on terminal gaps suggest that learning effort concentrates near the leaves of the decision tree.
  • The reduction from policies to terminals may guide algorithm design for perfect-recall games of larger size.

Load-bearing premise

The unique-trajectory property of T-MDPs lets valid confidence bounds be built from data pooled across policies while preserving correctness.
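The decomposition behind that premise, which the paper states as Proposition 1 (notation lightly adapted here), makes the pooling explicit: a policy's value is a fixed combination of per-terminal quantities, so simultaneous confidence intervals for those shared quantities bracket every policy's value at once.

```latex
% X(\pi): terminal states reachable under policy \pi;
% q(\sigma): probability of reaching terminal \sigma along its unique trajectory;
% \rho(\sigma): expected return collected at \sigma.
\[
  V^{\pi}(s_1) \;=\; \sum_{\sigma \in X(\pi)} q(\sigma)\,\rho(\sigma)
  \qquad \text{for every } \pi \in \Pi .
\]
% A union bound over the O(|S|) shared estimates (edges and terminals), rather
% than over the exponentially many policies, therefore suffices for valid
% confidence bounds on every V^{\pi}(s_1) simultaneously.
```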

What would settle it

An implementation that either uses super-polynomial memory on a T-MDP with a modest number of states or produces sample complexity that exceeds the derived bound on a concrete instance with known gaps.

Figures

Figures reproduced from arXiv: 2605.04979 by Anvay Shah, Ramsundar Anandanarayanan, Sharayu Moharir, Shivaram Kalyanakrishnan.

Figure 1: Transition from non-terminal state s to state s′ upon taking action a. State s′ could be non-terminal or terminal.
Figure 2: Performance for Leduc Poker and RBT. The top row (a-d) corresponds to the PAC setting, and the bottom row (e-h) to regret.
Figure 3: (a) A visualisation of the uncertainty in RBT after playing a sense move. The sense region (in yellow) reveals information …
Read the original abstract

A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms -- \textsc{Lucb} and \textsc{Ucb} -- can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a ``gap term'' from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper addresses online learning in Tree MDPs (T-MDPs), finite-horizon MDPs where every state is reachable from the start via exactly one trajectory. It shows that standard bandit algorithms (LUCB and UCB) can be applied by treating policies as arms, with the central innovation being the construction of confidence bounds that exploit shared observations across policies due to the unique-trajectory structure. This enables polynomial memory and per-step computation despite the exponential policy space. Instance-dependent bounds on sample complexity and regret are derived that sum gap terms over terminal states (rather than policies). The approach is evaluated empirically on hidden-information games, where it outperforms available alternatives.

Significance. If the claims hold, the work provides an efficient and theoretically grounded method for online learning in structured MDPs with exponentially large policy sets by reducing the problem to tree DP over shared edge statistics. The instance-dependent bounds scaling with terminals (not policies) and the polynomial-time implementation via shared-data CIs are notable strengths, as is the empirical validation on games. This could influence regret minimization and PAC learning in perfect-recall game abstractions and related sequential decision settings.

minor comments (3)
  1. [Abstract] The statement that the algorithms 'consistently outperform available alternatives' would be strengthened by including at least one quantitative highlight (e.g., a sample-complexity reduction factor or regret ratio on a representative game).
  2. [Section 3 (Algorithm and Confidence Bounds)] The description of how the union bound over O(|S|) edges yields simultaneous CIs for all policies (via additive path returns) is central but would benefit from an explicit statement of the concentration inequality used and the precise failure-probability allocation; one standard allocation is sketched after this list.
  3. [Section 4 (Theoretical Analysis)] In the regret analysis, clarify whether the 'gap term from every terminal state' bound assumes a fixed horizon or accounts for varying path lengths; a short remark on this would remove ambiguity.
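For illustration of the second comment, one standard allocation is sketched below; this is an assumption about a conventional construction, not a restatement of the paper's actual inequality.

```latex
% Assume each shared statistic e (an edge probability or a terminal's mean
% return, each bounded in [0, 1]) has been observed n_e times. Giving every
% one of the O(|S|) statistics a failure budget of \delta / |S| and applying
% Hoeffding's inequality,
\[
  \Pr\!\left( \left|\hat{\mu}_e - \mu_e\right|
      \ge \sqrt{\frac{1}{2 n_e}\,\ln\frac{2|S|}{\delta}} \right)
  \;\le\; \frac{\delta}{|S|},
\]
% so a union bound makes all intervals hold simultaneously with probability at
% least 1 - \delta, and the additive path returns propagate them to every policy.
```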

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work on online learning in Tree MDPs, including the reduction to bandit algorithms via shared-data confidence bounds, the polynomial memory and computation guarantees, and the instance-dependent bounds that depend on terminal-state gaps. We appreciate the recommendation for minor revision. As no specific major comments were provided in the report, we have no points requiring rebuttal or substantive changes at this stage and will incorporate any minor editorial suggestions in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation applies known LUCB and UCB algorithms to T-MDPs by treating policies as arms, with confidence bounds constructed from per-edge statistics that exploit the unique-trajectory property. Instance-dependent sample-complexity and regret bounds are obtained by summing gap terms over terminal states (one per distinct path) rather than over the exponential policy set; this follows directly from additive path returns and a union bound over O(|S|) edges, without any fitted parameter being renamed as a prediction or any self-citation serving as the load-bearing justification. No equation or claim reduces to its own inputs by construction, and the approach remains self-contained against the stated MDP structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the structural definition of T-MDPs (unique trajectory to each state) and the assumption that this structure permits correct shared confidence bounds. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Every state in a T-MDP is reachable from the start through exactly one state-action trajectory.
    This property is invoked to justify treating policies as arms while sharing observations across policies that share prefixes.
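A small sanity check of this assumption on a given transition structure, as a sketch (using the same hypothetical `children` map as the earlier snippet, from (state, action) pairs to lists of successors):

```python
from collections import defaultdict, deque

def is_tree_mdp(children, start):
    """True iff every state is reachable from `start` via exactly one
    state-action trajectory: `start` has no incoming edge, every other state
    has exactly one incoming (state, action) edge, and all states are reachable."""
    incoming = defaultdict(int)
    for (s, _a), succs in children.items():
        for s2 in succs:
            incoming[s2] += 1
    states = {start} | {s for (s, _a) in children} | set(incoming)
    if incoming.get(start, 0) != 0:
        return False
    if any(incoming[s] != 1 for s in states if s != start):
        return False
    # breadth-first search from the start to confirm full reachability
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for (state, _a), succs in children.items():
            if state != s:
                continue
            for s2 in succs:
                if s2 not in seen:
                    seen.add(s2)
                    queue.append(s2)
    return seen == states
```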

pith-pipeline@v0.9.0 · 5518 in / 1371 out tokens · 32755 ms · 2026-05-08T18:06:08.571042+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 21 canonical work pages · 1 internal anchor
