Offline Two-Player Zero-Sum Markov Games with KL Regularization
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-14 19:20 UTC · model grok-4.3
The pith
KL regularization by itself stabilizes learning of Nash equilibria in offline zero-sum Markov games and delivers fast Õ(1/n) convergence under unilateral concentrability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ROSE is a regularized framework in which KL-regularized value estimation yields Õ(1/n) convergence to Nash equilibria under unilateral concentrability. SOS-MD is the corresponding model-free algorithm: it alternates least-squares value updates with self-play mirror-descent policy steps, and its last iterate attains the same statistical rate up to an Õ(1/√T) optimization error after T iterations.
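Written out, the claimed guarantee combines a statistical term and an optimization term. Schematically (constants, horizon factors, and the concentrability coefficient are suppressed, and the paper's exact statement may differ):

```latex
\mathrm{NashGap}\big(\hat{\pi}_T\big)
  \;\le\; \underbrace{\widetilde{\mathcal{O}}\!\left(\tfrac{1}{n}\right)}_{\text{statistical error}}
  \;+\; \underbrace{\widetilde{\mathcal{O}}\!\left(\tfrac{1}{\sqrt{T}}\right)}_{\text{optimization error}},
```

where n is the number of offline samples and T the number of self-play iterations.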
What carries the argument
KL-regularized sequential equilibrium (ROSE) together with the SOS-MD self-play mirror-descent procedure that uses least-squares value estimation.
If this is right
- The last iterate of SOS-MD converges at the same fast statistical rate as the ROSE framework once optimization error becomes negligible.
- No explicit pessimism term is required once KL regularization is present.
- The method works in a model-free setting using only least-squares value estimation and self-play updates.
- Convergence holds for the full sequence of iterates rather than only the average.
- The same Õ(1/n) rate applies to both the theoretical ROSE object and the practical SOS-MD algorithm.
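The mirror-descent self-play step described above can be sketched on a zero-sum matrix game. This is an illustrative toy, not the paper's SOS-MD (which additionally performs least-squares value estimation from offline data): the function names, the uniform reference policies, and the parameters eta (step size) and lam (KL strength) are assumptions.

```python
import numpy as np

def kl_md_step(pi, q, eta, lam, pi_ref):
    """One KL-regularized mirror-descent step on the probability simplex.

    Closed form: pi_new proportional to
        pi^(1 - eta*lam) * pi_ref^(eta*lam) * exp(eta * q).
    """
    logits = (1.0 - eta * lam) * np.log(pi) + eta * lam * np.log(pi_ref) + eta * q
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

def self_play(payoff, eta=0.1, lam=1.0, iters=500, x0=None, y0=None):
    """Simultaneous self-play on a zero-sum matrix game; returns the last iterate."""
    m, n = payoff.shape
    x = np.full(m, 1.0 / m) if x0 is None else np.asarray(x0, dtype=float)
    y = np.full(n, 1.0 / n) if y0 is None else np.asarray(y0, dtype=float)
    x_ref, y_ref = np.full(m, 1.0 / m), np.full(n, 1.0 / n)  # uniform references
    for _ in range(iters):
        qx = payoff @ y        # max player's action values against current y
        qy = -payoff.T @ x     # min player's action values (negated payoff)
        x, y = (kl_md_step(x, qx, eta, lam, x_ref),
                kl_md_step(y, qy, eta, lam, y_ref))
    return x, y
```

On a symmetric game such as matching pennies with uniform reference policies, the regularized equilibrium is the uniform strategy, and the last iterate (not just the average) converges to it, which is the behavior the last-iterate claims concern.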
Where Pith is reading between the lines
- Regularization may be able to replace pessimism penalties in other offline multi-agent settings where explicit bonuses are currently used.
- The unilateral concentrability condition could be relaxed further if both players' data coverage is jointly controlled.
- The approach invites direct empirical tests on benchmark Markov game environments to measure how much the 1/n rate improves sample efficiency over baselines.
- Extending the same KL-regularized self-play template to non-zero-sum or cooperative Markov games is a natural next direction.
Load-bearing premise
The offline data must satisfy unilateral concentrability with respect to the policies being learned; if coverage fails from one side, the fast 1/n rate no longer holds.
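For concreteness, one common form of the unilateral concentrability coefficient in the offline Markov-game literature (the paper's precise definition may differ) fixes a target equilibrium pair $(\mu^\star, \nu^\star)$ and a data distribution $\rho$, and requires coverage only against unilateral deviations:

```latex
C_{\mathrm{uni}}
  \;=\; \max\left\{
    \sup_{\mu,\, s,a,b} \frac{d^{\mu,\nu^\star}(s,a,b)}{\rho(s,a,b)},\;\;
    \sup_{\nu,\, s,a,b} \frac{d^{\mu^\star,\nu}(s,a,b)}{\rho(s,a,b)}
  \right\} \;<\; \infty,
```

where $d^{\mu,\nu}$ denotes the occupancy measure of the policy pair $(\mu,\nu)$. Joint concentrability would instead take the supremum over both policies simultaneously, a strictly stronger requirement.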
What would settle it
Run SOS-MD on an offline dataset that violates unilateral concentrability for one player and check whether the observed convergence rate degrades to the slower Õ(1/√n) regime seen in unregularized methods.
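Concretely, that check amounts to estimating the slope of estimation error versus sample size n on log-log axes: a 1/n rate shows up as a slope near -1, a 1/√n rate as a slope near -0.5. A minimal sketch (fit_rate is a hypothetical helper, not from the paper):

```python
import numpy as np

def fit_rate(ns, errors):
    """Fit log(error) = alpha * log(n) + c by least squares.

    The returned alpha is the empirical convergence exponent:
    close to -1.0 for an O(1/n) rate, close to -0.5 for O(1/sqrt(n)).
    """
    ns = np.asarray(ns, dtype=float)
    errors = np.asarray(errors, dtype=float)
    alpha, _intercept = np.polyfit(np.log(ns), np.log(errors), 1)
    return alpha
```

Running this on error curves from datasets with and without unilateral coverage would make the predicted rate gap directly observable.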
Original abstract
We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under \textit{unilateral concentrability}, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies offline learning of Nash equilibria in two-player zero-sum Markov games. It claims that KL regularization alone suffices to stabilize learning and replace explicit pessimism. It introduces the theoretical ROSE framework achieving a fast Õ(1/n) convergence rate under unilateral concentrability (improving on standard Õ(1/√n) rates), and proposes the practical SOS-MD algorithm based on least-squares value estimation and self-play updates, proving that its last iterate attains the same statistical rate up to a vanishing Õ(1/√T) optimization error.
Significance. If the rates and derivations hold, the work provides a meaningful simplification for offline multi-agent RL by showing regularization can handle distribution shift without explicit pessimism, while delivering faster statistical rates under a unilateral concentrability assumption. The explicit separation of statistical and optimization errors in SOS-MD is a strength, and the focus on last-iterate convergence is practically relevant.
major comments (2)
- [Theorem statements and assumption section] The central Õ(1/n) rate for ROSE (and last-iterate SOS-MD) is load-bearing on the unilateral concentrability assumption with respect to the learned policies. The manuscript should explicitly compare this assumption's strength to standard concentrability coefficients used in prior offline MG works (e.g., in the definition of the concentrability coefficient C and how it enters the error bounds).
- [Proof of ROSE convergence rate] The claim that KL regularization 'suffices' to stabilize learning without pessimism requires verifying that all distribution-shift error terms are controlled solely by the KL term and unilateral concentrability; the abstract states proofs exist, but the bounding steps for the value estimation error under the regularized objective need to be checked for circularity or hidden dependence on the data distribution.
minor comments (2)
- [Algorithm and objective definitions] Notation for the KL regularization strength parameter should be introduced consistently (e.g., as λ or β) and its dependence on n clarified in the rate statements.
- [Introduction or related work] The manuscript would benefit from a table comparing the new rates to prior offline MG results (with and without regularization) to highlight the improvement.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review. We appreciate the positive assessment of the work's significance in simplifying offline multi-agent RL through KL regularization. We address each major comment below and will revise the manuscript to incorporate clarifications where appropriate.
Point-by-point responses
Referee: [Theorem statements and assumption section] The central Õ(1/n) rate for ROSE (and last-iterate SOS-MD) is load-bearing on the unilateral concentrability assumption with respect to the learned policies. The manuscript should explicitly compare this assumption's strength to standard concentrability coefficients used in prior offline MG works (e.g., in the definition of the concentrability coefficient C and how it enters the error bounds).
Authors: We agree that an explicit comparison would improve clarity. In the revised manuscript, we will add a dedicated paragraph in the assumptions section (Section 3) that defines the standard concentrability coefficient C from prior offline Markov game literature and contrasts it directly with unilateral concentrability. We will note that unilateral concentrability is a weaker condition, requiring coverage only with respect to one player's policy against arbitrary policies of the opponent, rather than joint coverage over all policy pairs. This milder assumption, when paired with KL regularization, suffices to control distribution shift and yields the improved Õ(1/n) rate. We will also explicitly show how the concentrability factor scales the error terms in the bound of Theorem 1. revision: yes
Referee: [Proof of ROSE convergence rate] The claim that KL regularization 'suffices' to stabilize learning without pessimism requires verifying that all distribution-shift error terms are controlled solely by the KL term and unilateral concentrability; the abstract states proofs exist, but the bounding steps for the value estimation error under the regularized objective need to be checked for circularity or hidden dependence on the data distribution.
Authors: The full proofs in Appendix B demonstrate that distribution-shift terms are controlled exclusively by the KL regularization in the objective together with unilateral concentrability, without explicit pessimism or circular reasoning. The argument proceeds by first establishing stability of the regularized iterates (ensuring they remain in a region covered by the assumption), then applying concentrability to bound the occupancy-measure mismatch in the value estimation error; the data distribution enters only through the fixed concentrability coefficient, with no hidden dependence. To address the concern directly, we will insert a concise proof outline immediately following Theorem 1 in the main text that highlights these sequential bounding steps. revision: partial
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper introduces ROSE as a theoretical object defined by the KL-regularized equilibrium and proves its Õ(1/n) rate directly from the regularized objective plus the unilateral concentrability assumption on the offline data. SOS-MD is then analyzed as a practical algorithm whose last iterate matches the same statistical rate up to an explicit, vanishing optimization error term Õ(1/√T). No equation reduces a prediction to a fitted quantity by construction, no uniqueness theorem is imported from prior self-work, and the central claims rest on standard analysis of regularized MD under a stated data-coverage assumption rather than on any self-referential fit or renaming. The derivation chain is therefore independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- KL regularization strength
axioms (1)
- domain assumption: unilateral concentrability
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Linked passage: "KL regularization alone suffices to stabilize learning... fast Õ(1/n) convergence rate under unilateral concentrability"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.