Recognition: 3 theorem links
On-line Learning in Tree MDPs by Treating Policies as Bandit Arms
Pith reviewed 2026-05-08 18:06 UTC · model grok-4.3
The pith
Bandit algorithms can learn policies in tree MDPs by treating each policy as an arm and using shared-data confidence bounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Treating each policy as a bandit arm allows LUCB and UCB to be applied to T-MDPs. The main technical step is the construction of confidence bounds from observations that are shared across policies due to the single-trajectory structure; these bounds support correct decisions while requiring only polynomial memory and per-step time. The analysis produces instance-dependent upper bounds on sample complexity and regret that sum gap terms over terminal states rather than over the full set of policies.
What carries the argument
Shared-data confidence bounds that pool observations along the unique trajectories reaching each state, allowing exponential policies to be handled with polynomial resources.
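As an illustration of this pooling idea (a hypothetical sketch, not the paper's code; all names here are invented), the unique-trajectory property means one backward pass over shared per-edge estimates scores the best of exponentially many policies in time polynomial in the number of states:

```python
# Minimal sketch, assuming a T-MDP stored as a dict of subtrees.
# tree[s] maps each action a to the list of child states reachable via (s, a);
# states absent from tree are terminals. p_hat holds per-edge transition
# estimates shared by every policy whose path uses that edge; r_hat holds
# estimated terminal returns.

def best_value(tree, s, p_hat, r_hat):
    """Optimal empirical value of the subtree rooted at s, by backward induction."""
    if s not in tree:  # terminal state: return its estimated return
        return r_hat[s]
    return max(
        sum(p_hat[(s, a, c)] * best_value(tree, c, p_hat, r_hat) for c in children)
        for a, children in tree[s].items()
    )

# Two-level example: s1 --a--> {t1, t2}, s1 --b--> {t3}
tree = {"s1": {"a": ["t1", "t2"], "b": ["t3"]}}
p_hat = {("s1", "a", "t1"): 0.5, ("s1", "a", "t2"): 0.5, ("s1", "b", "t3"): 1.0}
r_hat = {"t1": 1.0, "t2": 0.0, "t3": 0.6}
print(best_value(tree, "s1", p_hat, r_hat))  # action b (0.6) beats action a (0.5)
```

The recursion touches each edge once, which is the polynomial-resource claim in miniature: policies are never enumerated, only edges.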
If this is right
- Sample-complexity bounds for PAC learning sum gap terms from terminal states.
- Regret bounds in the online setting follow the same terminal-state structure.
- Both algorithms run with polynomial memory and per-step computation.
- The methods outperform prior approaches on hidden-information games.
Where Pith is reading between the lines
- The same sharing idea could reduce policy complexity in other MDPs whose state reachability has limited branching.
- Bounds that depend only on terminal gaps suggest that learning effort concentrates near the leaves of the decision tree.
- The reduction from policies to terminals may guide algorithm design for perfect-recall games of larger size.
Load-bearing premise
The unique-trajectory property of T-MDPs lets valid confidence bounds be built from data pooled across policies while preserving correctness.
What would settle it
An implementation that either uses super-polynomial memory on a T-MDP with a modest number of states or produces sample complexity that exceeds the derived bound on a concrete instance with known gaps.
Figures
read the original abstract
A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms -- \textsc{Lucb} and \textsc{Ucb} -- can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a ``gap term'' from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses online learning in Tree MDPs (T-MDPs), finite-horizon MDPs where every state is reachable from the start via exactly one trajectory. It shows that standard bandit algorithms (LUCB and UCB) can be applied by treating policies as arms, with the central innovation being the construction of confidence bounds that exploit shared observations across policies due to the unique-trajectory structure. This enables polynomial memory and per-step computation despite the exponential policy space. Instance-dependent bounds on sample complexity and regret are derived that sum gap terms over terminal states (rather than policies). The approach is evaluated empirically on hidden-information games, where it outperforms available alternatives.
Significance. If the claims hold, the work provides an efficient and theoretically grounded method for online learning in structured MDPs with exponentially large policy sets by reducing the problem to tree DP over shared edge statistics. The instance-dependent bounds scaling with terminals (not policies) and the polynomial-time implementation via shared-data CIs are notable strengths, as is the empirical validation on games. This could influence regret minimization and PAC learning in perfect-recall game abstractions and related sequential decision settings.
minor comments (3)
- [Abstract] The statement that the algorithms 'consistently outperform available alternatives' would be strengthened by at least one quantitative highlight (e.g., a sample-complexity reduction factor or regret ratio on a representative game).
- [Section 3 (Algorithm and Confidence Bounds)] The description of how the union bound over O(|S|) edges yields simultaneous CIs for all policies (via additive path returns) is central but would benefit from an explicit statement of the concentration inequality used and the precise failure probability allocation.
- [Section 4 (Theoretical Analysis)] In the regret analysis, clarify whether the 'gap term from every terminal state' bound assumes a fixed horizon or accounts for varying path lengths; a short remark on this would remove ambiguity.
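To make the second comment concrete, here is one way such a construction could look (an illustrative sketch only: the paper's exact concentration inequality and failure-probability allocation are not stated here, so this assumes a Hoeffding bound per edge with δ split uniformly over the edges):

```python
# Hypothetical sketch of simultaneous per-edge confidence intervals.
# A union bound over n_edges edges, each given failure probability
# delta / n_edges, makes all edge CIs hold at once; additive path
# returns then bound a policy's CI by the sum along its unique path.
import math

def edge_ci(m, delta, n_edges):
    """Hoeffding half-width for one edge's mean after m samples in [0, 1]."""
    if m == 0:
        return float("inf")
    return math.sqrt(math.log(2 * n_edges / delta) / (2 * m))

def policy_ci(path_counts, delta, n_edges):
    """Upper-bound a policy's half-width by summing over edges on its path."""
    return sum(edge_ci(m, delta, n_edges) for m in path_counts)

print(policy_ci([100, 100], delta=0.05, n_edges=10))  # ~0.346
```

Under this allocation the bound degrades only logarithmically in the number of edges, which is why the exponential policy count never enters.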
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work on online learning in Tree MDPs, including the reduction to bandit algorithms via shared-data confidence bounds, the polynomial memory and computation guarantees, and the instance-dependent bounds that depend on terminal-state gaps. We appreciate the recommendation for minor revision. As no specific major comments were provided in the report, we have no points requiring rebuttal or substantive changes at this stage and will incorporate any minor editorial suggestions in the revised manuscript.
Circularity Check
No significant circularity
full rationale
The derivation applies known LUCB and UCB algorithms to T-MDPs by treating policies as arms, with confidence bounds constructed from per-edge statistics that exploit the unique-trajectory property. Instance-dependent sample-complexity and regret bounds are obtained by summing gap terms over terminal states (one per distinct path) rather than over the exponential policy set; this follows directly from additive path returns and a union bound over O(|S|) edges, without any fitted parameter being renamed as a prediction or any self-citation serving as the load-bearing justification. No equation or claim reduces to its own inputs by construction, and the approach remains self-contained against the stated MDP structure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Every state in a T-MDP is reachable from the start through exactly one state-action trajectory.
Lean theorems connected to this paper
-
Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation / washburn_uniqueness_aczel (no analog: this is a tree-DP value decomposition, not a cost-uniqueness statement)
unclear: relation between the paper passage and the cited Recognition theorem.
V^π(s_1) = Σ_{σ∈X(π)} q(σ)·ρ(σ) ... reduces estimating values for all policies to estimating the q's for all elements of Σ.
-
Foundation.Atomicity / Cost (no RS theorem about Bernstein concentration; RS cost reasoning is ratio-symmetric, not Bernoulli-tail-bounded)
unclear: relation between the paper passage and the cited Recognition theorem.
β(m,δ) = sqrt((8/(3m)) ln(1/δ)); confidence bounds via NCD Bernoullis, twin variables, and Bernstein's inequality.
-
n/a (no φ-ladder or 8-tick structure; gaps are purely value differences Δ_σ in a finite tree)
unclear: relation between the paper passage and the cited Recognition theorem.
instance-dependent upper bounds ... sum a 'gap term' from every terminal state, rather than every policy.
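The exploration bonus quoted in the ledger above, β(m,δ) = sqrt((8/(3m)) ln(1/δ)), can be evaluated directly. This tiny sketch reproduces only the stated formula; the constants' derivation is in the paper, not here:

```python
# Numerical check of the quoted bonus beta(m, delta) = sqrt((8/(3m)) ln(1/delta)).
# It shrinks as O(sqrt(log(1/delta) / m)), so more samples per path tighten
# the bound and smaller failure probabilities widen it only logarithmically.
import math

def beta(m, delta):
    return math.sqrt((8.0 / (3.0 * m)) * math.log(1.0 / delta))

print(beta(1000, 0.01))  # ~0.111
```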
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Peter Auer. 2002. Using Confidence Bounds for Exploitation-Exploration Trade-offs. J. Mach. Learn. Res. 3 (2002), 397–422. https://jmlr.org/papers/v3/auer02a.html
2002
-
[2]
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn. 47, 2-3 (2002), 235–256. https://doi.org/10.1023/A:1013689704352
-
[3]
Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. 2013. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn. 91, 3 (2013), 325–349. https://doi.org/10.1007/S10994-013-5368-1
-
[4]
Yu Bai, Chi Jin, Song Mei, and Tiancheng Yu. 2022. Near-optimal learning of extensive-form games with imperfect information. In International Conference on Machine Learning. PMLR, 1337–1382.
2022
-
[5]
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. 2016. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
2016
-
[6]
Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. 2017. Heads-up limit hold’em poker is solved. Commun. ACM 60, 11 (Oct 2017), 81–88. https://doi.org/10.1145/3131284
-
[7]
Noam Brown, Christian Kroer, and Tuomas Sandholm. 2017. Dynamic thresholding and pruning for regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
2017
-
[8]
Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. 2019. Deep counterfactual regret minimization. In International Conference on Machine Learning. PMLR, 793–802.
2019
-
[9]
Noam Brown and Tuomas Sandholm. 2018. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 6374 (2018), 418–424. https://doi.org/10.1126/science.aao1733
-
[10]
Noam Brown and Tuomas Sandholm. 2019. Superhuman AI for multiplayer poker. Science 365, 6456 (2019), 885–890.
2019
-
[11]
Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, and Simon S. Du. Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs. arXiv:2506.06521 [cs.LG]. https://arxiv.org/abs/2506.06521
-
[13]
Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. 2019. Policy Certificates: Towards Accountable Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR...
2019
-
[14]
Chris Dann, Teodor V. Marinov, Mehryar Mohri, and Julian Zimmert. 2021. Beyond value-function gaps: improved instance-dependent regret bounds for episodic reinforcement learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS ’21). Curran Associates Inc., Red Hook, NY, USA, Article 1, 12 pages.
2021
-
[15]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...
-
[16]
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 2006. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. J. Mach. Learn. Res. 7 (2006), 1079–1105. https://jmlr.org/papers/v7/evendar06a.html
2006
-
[17]
Gabriele Farina and Tuomas Sandholm. 2021. Model-free online learning in unknown sequential decision making problems and games. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 5381–5390.
2021
-
[18]
Claude-Nicolas Fiechter. 1994. Efficient Reinforcement Learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, COLT 1994, New Brunswick, NJ, USA, July 12-15, 1994, Manfred K. Warmuth (Ed.). ACM, 88–97. https://doi.org/10.1145/180139.181019
-
[19]
Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. 2012. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United St...
2012
- [20]
-
[21]
Ryan W. Gardner, Gino Perrotta, Anvay Shah, Shivaram Kalyanakrishnan, Kevin A. Wang, Gregory Clark, Timo Bertram, Johannes Fürnkranz, Martin Müller, Brady P. Garrison, Prithviraj Dasgupta, and Saeid Rezaei. 2022. The Machine Reconnaissance Blind Chess Tournament of NeurIPS 2022. In Proceedings of the NeurIPS 2022 Competitions Track (Proceedings of Machine Learning Research,...
2022
-
[22]
Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. 2020. Reward-Free Exploration for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 4870–4879. http://proceedings.mlr.press/v119/jin20d.html
2020
-
[23]
Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. 2020. Planning in Markov Decision Processes with Gap-Dependent Sample Complexity. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1253–1263. https://proceedings...
2020
-
[24]
Shivaram Kalyanakrishnan, Sheel Shah, and Santhosh Kumar Guguloth. 2025. A View of the Certainty-Equivalence Method for PAC RL as an Application of the Trajectory Tree Method. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (Detroit, MI, USA) (AAMAS ’25). International Foundation for Autonomous Agents and Multi...
2025
-
[25]
Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. 2012. PAC Subset Selection in Stochastic Multi-armed Bandits. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress. http://icml.cc/2012/papers/359.pdf
2012
-
[26]
Emilie Kaufmann and Shivaram Kalyanakrishnan. 2013. Information Complexity in Bandit Subset Selection. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA (JMLR Workshop and Conference Proceedings, Vol. 30), Shai Shalev-Shwartz and Ingo Steinwart (Eds.). JMLR.org, 228–251. http://proceedings.mlr.press/v30/Kaufmann13.html
2013
-
[27]
Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. 2021. Adaptive Reward-Free Exploration. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory (Proceedings of Machine Learning Research, Vol. 132), Vitaly Feldman, Katrina Ligett, and Sivan Sabato (Eds.). PMLR, 865–891. h...
2021
-
[28]
Michael J. Kearns and Satinder Singh. 2002. Near-Optimal Reinforcement Learning in Polynomial Time. Mach. Learn. 49, 2-3 (2002), 209–232. https://doi.org/10.1023/A:1017984413808
-
[29]
Jens Kober, J. Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. Int. J. Robotics Res. 32, 11 (2013), 1238–1274. https://doi.org/10.1177/0278364913495721
-
[30]
Levente Kocsis and Csaba Szepesvári. 2006. Bandit Based Monte-Carlo Planning. In Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings (Lecture Notes in Computer Science, Vol. 4212), Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou (Eds.). Springer, 282–293. https://doi.org/...
-
[31]
Harold W. Kuhn. 1953. Extensive games and the problem of information. Contributions to the Theory of Games 2, 28 (1953), 193–216.
1953
-
[32]
Harold W. Kuhn and Albert W. Tucker. 1950. A Simplified Two-Person Poker. Contributions to the Theory of Games, Volume 1 of Annals of Mathematics Studies, 24, 97–103. Princeton, New Jersey: Princeton University Press (1950).
1950
-
[33]
Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. 2009. Monte Carlo sampling for regret minimization in extensive games. Advances in Neural Information Processing Systems 22 (2009).
2009
-
[34]
Tor Lattimore and Marcus Hutter. 2014. Near-optimal PAC bounds for discounted MDPs. Theor. Comput. Sci. 558 (2014), 125–143. https://doi.org/10.1016/J.TCS.2014.09.029
-
[35]
Aymen Al Marjani, Andrea Tirinzoni, and Emilie Kaufmann. 2023. Towards Instance-Optimality in Online PAC Reinforcement Learning. CoRR abs/2311.05638 (2023). https://doi.org/10.48550/ARXIV.2311.05638 arXiv:2311.05638
-
[36]
Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. 2021. Fast active learning for pure exploration in reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and...
2021
-
[37]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013). arXiv:1312.5602 http://arxiv.org/abs/1312.5602
2013
-
[38]
John E. Moody and Matthew Saffell. 1998. Reinforcement Learning for Trading. In Advances in Neural Information Processing Systems 11, [NIPS Conference, Denver, Colorado, USA, November 30 - December 5, 1998], Michael J. Kearns, Sara A. Solla, and David A. Cohn (Eds.). The MIT Press, 917–923. http://papers.nips.cc/paper/1551-reinforcement-learning-for-trading
1998
-
[39]
Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. 2017. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 6337 (2017), 508–513.
2017
-
[40]
Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry. 2003. Autonomous Helicopter Flight via Reinforcement Learning. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], Sebastian Thrun, Lawrence K. Saul, and Bernhard Sch...
2003
-
[41]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Adva...
2022
-
[42]
Alessandro Panconesi and Aravind Srinivasan. 1997. Randomized Distributed Edge Coloring via an Extension of the Chernoff–Hoeffding Bounds. SIAM J. Comput. 26, 2 (1997), 350–368. https://doi.org/10.1137/S0097539793250767
-
[43]
Gino Perrotta, Ryan W. Gardner, Corey Lowman, Mohammad Taufeeque, Nitish Tongia, Shivaram Kalyanakrishnan, Gregory Clark, Kevin Wang, Eitan Rothberg, Brady P. Garrison, Prithviraj Dasgupta, Callum Canavan, and Lucas McCabe. 2022. The Second NeurIPS Tournament of Reconnaissance Blind Chess. In Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track (Pro...
2022
-
[44]
Brian Sheppard. 2002. World-championship-caliber Scrabble. SCRABBLE® is a registered trademark. All intellectual property rights in and to the game are owned in the USA by Hasbro Inc., in Canada by Hasbro Canada Corporation, and throughout the rest of the world by J.W. Spear & Sons Limited of Maidenhead, Berkshire, England, a subsidiary of Mattel Inc. Artificial In...
-
[45]
Yoav Shoham and Kevin Leyton-Brown. 2008. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.
2008
-
[46]
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of Go without human knowledge. Nat. 550, 7676 (2017), 354–359...
-
[47]
Max Simchowitz and Kevin Jamieson. 2019. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 104, 10 pages.
2019
-
[48]
Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and D. Chris Rayner. 2012. Bayes’ Bluff: Opponent Modelling in Poker. CoRR abs/1207.1411 (2012). arXiv:1207.1411 http://arxiv.org/abs/1207.1411
- [49]
-
[50]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction, 2nd Edition. MIT Press. http://www.incompleteideas.net/book/the-book-2nd.html
2018
- [51]
-
[52]
Yuandong Tian, Qucheng Gong, and Yu Jiang. 2020. Joint Policy Search for Multi-agent Collaboration with Imperfect Information. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Flor...
2020
-
[53]
Andrea Tirinzoni, Aymen Al-Marjani, and Emilie Kaufmann. 2022. Near instance-optimal PAC reinforcement learning for deterministic MDPs. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 639, 14 pages.
2022
-
[54]
Andrea Tirinzoni, Aymen Al Marjani, and Emilie Kaufmann. 2023. Optimistic PAC Reinforcement Learning: the Instance-Dependent View. In International Conference on Algorithmic Learning Theory, February 20-23, 2023, Singapore (Proceedings of Machine Learning Research, Vol. 201), Shipra Agrawal and Francesco Orabona (Eds.). PMLR, 1460–1480. https://proceedings...
2023
-
[55]
Andrew Wagenmaker and Kevin G. Jamieson. 2022. Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Be...
2022
-
[56]
Andrew J. Wagenmaker, Max Simchowitz, and Kevin Jamieson. 2022. Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. In Conference on Learning Theory, 2-5 July 2022, London, UK (Proceedings of Machine Learning Research, Vol. 178), Po-Ling Loh and Maxim Raginsky (Eds.). PMLR, 358–418. https://proceedings.mlr.press/v178/wagenmaker22a.html
2022
-
[57]
Peter R. Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik Subramanian, Thomas J. Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert, Florian Fuchs, Leilani Gilpin, Piyush Khandelwal, Varun Raj Kompella, HaoChih Lin, Patrick MacAlpine, Declan Oller, Takuma Seno, Craig Sherstan, Michael D. Thomure, Houmehr Aghabozorgi, Leon Barrett, ...
-
[58]
Haike Xu, Tengyu Ma, and Simon Du. 2021. Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap. In Proceedings of Thirty Fourth Conference on Learning Theory (Proceedings of Machine Learning Research, Vol. 134), Mikhail Belkin and Samory Kpotufe (Eds.). PMLR, 4438–. https://proceedings.mlr.press/v134/xu21a.html
2021
-
[60]
Andrea Zanette, Mykel J. Kochenderfer, and Emma Brunskill. 2019. Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo La...
2019
-
[61]
Zihan Zhang, Yuxin Chen, Jason D. Lee, and Simon S. Du. 2024. Settling the sample complexity of online reinforcement learning. In Proceedings of Thirty Seventh Conference on Learning Theory (Proceedings of Machine Learning Research, Vol. 247), Shipra Agrawal and Aaron Roth (Eds.). PMLR, 5213–5219. https://proceedings.mlr.press/v247/zhang24a.html
2024
-
[62]
Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. 2007. Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems 20 (2007).
2007