pith. machine review for the scientific record.

arxiv: 2605.13025 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.GT

Recognition: 1 theorem link · Lean Theorem

Offline Two-Player Zero-Sum Markov Games with KL Regularization

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:20 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords offline reinforcement learning · Markov games · Nash equilibria · KL regularization · zero-sum games · self-play · mirror descent · unilateral concentrability

The pith

KL regularization by itself stabilizes learning of Nash equilibria in offline zero-sum Markov games and delivers fast Õ(1/n) convergence under unilateral concentrability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that KL regularization is sufficient to control distribution shift in offline two-player zero-sum Markov games, eliminating the need for separate pessimism penalties while still guaranteeing convergence to equilibrium. A sympathetic reader would care because this replaces a tuning-heavy technique with a simpler regularizer and improves the statistical rate from the usual Õ(1/√n) to Õ(1/n) when the data satisfies unilateral concentrability. The authors first define the ROSE framework that achieves this rate in theory, then give the practical SOS-MD algorithm whose last iterate matches the same rate up to a small optimization error that vanishes with more self-play steps.
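The regularized objective itself is not reproduced on this page. As a hedged sketch of the standard shape a KL-regularized zero-sum objective takes (λ for the regularization strength and the reference policies π¹_ref, π²_ref are illustrative symbols, not notation taken from the paper):

    \[
    \max_{\pi^1}\;\min_{\pi^2}\;\; V^{\pi^1,\pi^2}(\rho)\;-\;\lambda\,\mathrm{KL}\!\left(\pi^1\,\middle\|\,\pi^1_{\mathrm{ref}}\right)\;+\;\lambda\,\mathrm{KL}\!\left(\pi^2\,\middle\|\,\pi^2_{\mathrm{ref}}\right),
    \]

where V^{π¹,π²}(ρ) is the value of the joint policy from the initial distribution ρ. The KL terms anchor each player to a reference policy, and the paper's claim is that this anchoring alone absorbs the distribution-shift control that pessimism bonuses usually provide.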

Core claim

ROSE is a regularized framework in which KL-regularized value estimation yields Õ(1/n) convergence to Nash equilibria under unilateral concentrability; SOS-MD is the corresponding model-free algorithm that alternates least-squares value updates with self-play mirror-descent policy steps and whose last iterate attains the same statistical rate up to Õ(1/√T) optimization error after T iterations.

What carries the argument

KL-regularized sequential equilibrium (ROSE) together with the SOS-MD self-play mirror-descent procedure that uses least-squares value estimation.
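Because the page only describes SOS-MD verbally, a minimal runnable sketch may help fix ideas. This is not the paper's algorithm: it collapses the Markov game to a single-state matrix game, the "least-squares value estimation" degenerates to per-cell sample means over an offline dataset, and the mirror-descent step is plain multiplicative weights; the names lam, eta, T and all numerical choices are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy payoff matrix for the max player (the min player receives its negation).
    A_true = np.array([[1.0, -1.0],
                       [-0.5, 0.5]])

    # Offline dataset: noisy payoffs for action pairs drawn by a uniform behavior policy.
    n = 5000
    i_idx = rng.integers(0, 2, size=n)
    j_idx = rng.integers(0, 2, size=n)
    r = A_true[i_idx, j_idx] + 0.1 * rng.standard_normal(n)

    # "Least-squares value estimation" reduces here to per-cell sample means.
    A_hat = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            mask = (i_idx == i) & (j_idx == j)
            A_hat[i, j] = r[mask].mean() if mask.any() else 0.0

    # KL-regularized self-play mirror descent; the last iterate is the output.
    lam, eta, T = 0.1, 0.5, 500          # regularization strength, step size, iterations
    ref1 = ref2 = np.array([0.5, 0.5])   # reference policies for the KL terms
    p = np.array([0.5, 0.5])             # max player's policy
    q = np.array([0.5, 0.5])             # min player's policy
    for _ in range(T):
        # Gradients of p^T A_hat q - lam*KL(p||ref1) and of -(p^T A_hat q) - lam*KL(q||ref2).
        g_p = A_hat @ q - lam * (np.log(p / ref1) + 1.0)
        g_q = -A_hat.T @ p - lam * (np.log(q / ref2) + 1.0)
        # Entropic mirror-descent (multiplicative-weights) step, then renormalize.
        p = p * np.exp(eta * g_p)
        p /= p.sum()
        q = q * np.exp(eta * g_q)
        q /= q.sum()

    print("last-iterate policies:", p, q)

The point of the sketch is structural: value estimation happens once from the offline data, and the policy pair is then refined purely by regularized self-play, which is the separation of statistical and optimization error the core claim describes.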

If this is right

  • The last iterate of SOS-MD converges at the same fast statistical rate as the ROSE framework once optimization error becomes negligible.
  • No explicit pessimism term is required once KL regularization is present.
  • The method works in a model-free setting using only least-squares value estimation and self-play updates.
  • Convergence holds for the full sequence of iterates rather than only the average.
  • The same Õ(1/n) rate applies to both the theoretical ROSE object and the practical SOS-MD algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularization may be able to replace pessimism penalties in other offline multi-agent settings where explicit bonuses are currently used.
  • The unilateral concentrability condition could be relaxed further if both players' data coverage is jointly controlled.
  • The approach invites direct empirical tests on benchmark Markov game environments to measure how much the 1/n rate improves sample efficiency over baselines.
  • Extending the same KL-regularized self-play template to non-zero-sum or cooperative Markov games is a natural next direction.

Load-bearing premise

The offline data must satisfy unilateral concentrability with respect to the policies being learned; if coverage fails on one player's side, the fast 1/n rate no longer holds.
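The page never states the coverage condition formally. One standard formalization from the offline Markov-game literature, offered as a hedged stand-in for whatever exact definition the paper uses:

    \[
    C_{\mathrm{uni}} \;:=\; \max\!\left\{\;\sup_{\nu}\,\max_{s,a,b}\frac{d^{\pi^\ast,\nu}(s,a,b)}{\rho(s,a,b)},\;\;\sup_{\pi}\,\max_{s,a,b}\frac{d^{\pi,\nu^\ast}(s,a,b)}{\rho(s,a,b)}\;\right\} \;<\;\infty,
    \]

where (π*, ν*) is the target (regularized) equilibrium, d^{π,ν} is the occupancy measure of the joint policy, and ρ is the offline data distribution. Coverage is required only for unilateral deviations from the equilibrium pair, not for all joint policy pairs, which is why the condition is weaker than full concentrability.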

What would settle it

Run SOS-MD on an offline dataset that violates unilateral concentrability on one player's side and check whether the observed convergence rate degrades to the slower Õ(1/√n) regime that appears in unregularized methods.
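A concrete way to read off the observed rate from such an experiment is to regress log suboptimality on log sample size and compare the slope to -1 versus -0.5. The sketch below is only the slope diagnostic; run_sos_md, nash_gap, and dataset_of_size in the comment are hypothetical placeholders for the paper's algorithm and evaluator, not real functions.

    import numpy as np

    def estimated_rate(sample_sizes, gaps):
        """Least-squares slope of log(gap) against log(n); near -1 suggests the fast regime."""
        x = np.log(np.asarray(sample_sizes, dtype=float))
        y = np.log(np.asarray(gaps, dtype=float))
        slope, _intercept = np.polyfit(x, y, 1)
        return slope

    # Placeholder data decaying like 1/n, standing in for measured Nash gaps
    # gap_n = nash_gap(run_sos_md(dataset_of_size(n))) under covered vs. uncovered data.
    ns = [1_000, 2_000, 4_000, 8_000, 16_000]
    gaps = [1.0 / n for n in ns]
    print(estimated_rate(ns, gaps))   # ~ -1.0 here; ~ -0.5 would indicate the slow regime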

read the original abstract

We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under \textit{unilateral concentrability}, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies offline learning of Nash equilibria in two-player zero-sum Markov games. It claims that KL regularization alone suffices to stabilize learning and replace explicit pessimism. It introduces the theoretical ROSE framework achieving a fast Õ(1/n) convergence rate under unilateral concentrability (improving on standard Õ(1/√n) rates), and proposes the practical SOS-MD algorithm based on least-squares value estimation and self-play updates, proving that its last iterate attains the same statistical rate up to a vanishing Õ(1/√T) optimization error.

Significance. If the rates and derivations hold, the work provides a meaningful simplification for offline multi-agent RL by showing regularization can handle distribution shift without explicit pessimism, while delivering faster statistical rates under a unilateral concentrability assumption. The explicit separation of statistical and optimization errors in SOS-MD is a strength, and the focus on last-iterate convergence is practically relevant.

major comments (2)
  1. [Theorem statements and assumption section] The central Õ(1/n) rate for ROSE (and last-iterate SOS-MD) is load-bearing on the unilateral concentrability assumption with respect to the learned policies. The manuscript should explicitly compare this assumption's strength to standard concentrability coefficients used in prior offline MG works (e.g., in the definition of the concentrability coefficient C and how it enters the error bounds).
  2. [Proof of ROSE convergence rate] The claim that KL regularization 'suffices' to stabilize learning without pessimism requires verifying that all distribution-shift error terms are controlled solely by the KL term and unilateral concentrability; the abstract states proofs exist, but the bounding steps for the value estimation error under the regularized objective need to be checked for circularity or hidden dependence on the data distribution.
minor comments (2)
  1. [Algorithm and objective definitions] Notation for the KL regularization strength parameter should be introduced consistently (e.g., as λ or β) and its dependence on n clarified in the rate statements.
  2. [Introduction or related work] The manuscript would benefit from a table comparing the new rates to prior offline MG results (with and without regularization) to highlight the improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review. We appreciate the positive assessment of the work's significance in simplifying offline multi-agent RL through KL regularization. We address each major comment below and will revise the manuscript to incorporate clarifications where appropriate.

read point-by-point responses
  1. Referee: [Theorem statements and assumption section] The central Õ(1/n) rate for ROSE (and last-iterate SOS-MD) is load-bearing on the unilateral concentrability assumption with respect to the learned policies. The manuscript should explicitly compare this assumption's strength to standard concentrability coefficients used in prior offline MG works (e.g., in the definition of the concentrability coefficient C and how it enters the error bounds).

    Authors: We agree that an explicit comparison would improve clarity. In the revised manuscript, we will add a dedicated paragraph in the assumptions section (Section 3) that defines the standard concentrability coefficient C from prior offline Markov game literature and contrasts it directly with unilateral concentrability. We will note that unilateral concentrability is a weaker condition, requiring coverage only with respect to one player's policy against arbitrary policies of the opponent, rather than joint coverage over all policy pairs. This milder assumption, when paired with KL regularization, suffices to control distribution shift and yields the improved Õ(1/n) rate. We will also explicitly show how the concentrability factor scales the error terms in the bound of Theorem 1. revision: yes

  2. Referee: [Proof of ROSE convergence rate] The claim that KL regularization 'suffices' to stabilize learning without pessimism requires verifying that all distribution-shift error terms are controlled solely by the KL term and unilateral concentrability; the abstract states proofs exist, but the bounding steps for the value estimation error under the regularized objective need to be checked for circularity or hidden dependence on the data distribution.

    Authors: The full proofs in Appendix B demonstrate that distribution-shift terms are controlled exclusively by the KL regularization in the objective together with unilateral concentrability, without explicit pessimism or circular reasoning. The argument proceeds by first establishing stability of the regularized iterates (ensuring they remain in a region covered by the assumption), then applying concentrability to bound the occupancy-measure mismatch in the value estimation error; the data distribution enters only through the fixed concentrability coefficient, with no hidden dependence. To address the concern directly, we will insert a concise proof outline immediately following Theorem 1 in the main text that highlights these sequential bounding steps. revision: partial
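To make the claimed proof structure concrete, here is a hedged sketch of the shape such a decomposition typically takes in KL-regularized offline analyses; this is not the paper's Theorem 1, and the dependence on λ, C_uni, and the constants is an illustrative assumption:

    \[
    \mathrm{NashGap}(\hat{\pi}_T)\;\lesssim\;
    \underbrace{C_{\mathrm{uni}}\cdot\widetilde{\mathcal{O}}\!\left(\tfrac{1}{\lambda n}\right)}_{\text{statistical, via coverage}}
    \;+\;
    \underbrace{\widetilde{\mathcal{O}}\!\left(\tfrac{1}{\sqrt{T}}\right)}_{\text{optimization, via self-play}},
    \]

with the data distribution entering only through the fixed coefficient C_uni, consistent with the two bounding steps the authors describe.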

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper introduces ROSE as a theoretical object defined by the KL-regularized equilibrium and proves its Õ(1/n) rate directly from the regularized objective plus the unilateral concentrability assumption on the offline data. SOS-MD is then analyzed as a practical algorithm whose last iterate matches the same statistical rate up to an explicit, vanishing optimization error term Õ(1/√T). No equation reduces a prediction to a fitted quantity by construction, no uniqueness theorem is imported from the authors' own prior work, and the central claims rest on standard analysis of regularized mirror descent under a stated data-coverage assumption rather than on any self-referential fit or renaming. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the unilateral concentrability assumption and standard properties of KL divergence and least-squares estimation; no new invented entities are introduced.

free parameters (1)
  • KL regularization strength
    The coefficient controlling the strength of the KL penalty; its value is chosen to balance regularization and performance but is not fitted to the target result in the abstract.
axioms (1)
  • domain assumption Unilateral concentrability
    The offline data distribution is assumed to cover the relevant state-action pairs for one player sufficiently well; this is invoked to obtain the 1/n rate.

pith-pipeline@v0.9.0 · 5478 in / 1300 out tokens · 42996 ms · 2026-05-14T19:20:58.057786+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 5 canonical work pages · 3 internal anchors
