pith. machine review for the scientific record.

arxiv: 2605.13025 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.GT

Recognition: 1 theorem link · Lean Theorem

Offline Two-Player Zero-Sum Markov Games with KL Regularization

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:20 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords offline reinforcement learning · Markov games · Nash equilibria · KL regularization · zero-sum games · self-play · mirror descent · unilateral concentrability

The pith

KL regularization by itself stabilizes learning of Nash equilibria in offline zero-sum Markov games and delivers fast Õ(1/n) convergence under unilateral concentrability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that KL regularization is sufficient to control distribution shift in offline two-player zero-sum Markov games, eliminating the need for separate pessimism penalties while still guaranteeing convergence to equilibrium. A sympathetic reader would care because this replaces a tuning-heavy technique with a simpler regularizer and improves the statistical rate from the usual Õ(1/√n) to Õ(1/n) when the data satisfies unilateral concentrability. The authors first define the ROSE framework that achieves this rate in theory, then give the practical SOS-MD algorithm whose last iterate matches the same rate up to a small optimization error that vanishes with more self-play steps.
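The regularized objective itself is not reproduced on this page. As a hedged sketch of the standard shape a KL-regularized zero-sum objective takes (λ for the regularization strength and the reference policies π¹_ref, π²_ref are illustrative symbols, not notation taken from the paper):

    \[
    \max_{\pi^1}\;\min_{\pi^2}\;\; V^{\pi^1,\pi^2}(\rho)\;-\;\lambda\,\mathrm{KL}\!\left(\pi^1\,\middle\|\,\pi^1_{\mathrm{ref}}\right)\;+\;\lambda\,\mathrm{KL}\!\left(\pi^2\,\middle\|\,\pi^2_{\mathrm{ref}}\right),
    \]

where V^{π¹,π²}(ρ) is the value of the joint policy from the initial distribution ρ. The KL terms anchor each player to a reference policy, and the paper's claim is that this anchoring alone absorbs the distribution-shift control that pessimism bonuses usually provide.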

Core claim

ROSE is a regularized framework in which KL-regularized value estimation yields Õ(1/n) convergence to Nash equilibria under unilateral concentrability; SOS-MD is the corresponding model-free algorithm that alternates least-squares value updates with self-play mirror-descent policy steps and whose last iterate attains the same statistical rate up to Õ(1/√T) optimization error after T iterations.

What carries the argument

KL-regularized sequential equilibrium (ROSE) together with the SOS-MD self-play mirror-descent procedure that uses least-squares value estimation.
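Because the page only describes SOS-MD verbally, a minimal runnable sketch may help fix ideas. This is not the paper's algorithm: it collapses the Markov game to a single-state matrix game, the "least-squares value estimation" degenerates to per-cell sample means over an offline dataset, and the mirror-descent step is plain multiplicative weights; the names lam, eta, T and all numerical choices are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy payoff matrix for the max player (the min player receives its negation).
    A_true = np.array([[1.0, -1.0],
                       [-0.5, 0.5]])

    # Offline dataset: noisy payoffs for action pairs drawn by a uniform behavior policy.
    n = 5000
    i_idx = rng.integers(0, 2, size=n)
    j_idx = rng.integers(0, 2, size=n)
    r = A_true[i_idx, j_idx] + 0.1 * rng.standard_normal(n)

    # "Least-squares value estimation" reduces here to per-cell sample means.
    A_hat = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            mask = (i_idx == i) & (j_idx == j)
            A_hat[i, j] = r[mask].mean() if mask.any() else 0.0

    # KL-regularized self-play mirror descent; the last iterate is the output.
    lam, eta, T = 0.1, 0.5, 500          # regularization strength, step size, iterations
    ref1 = ref2 = np.array([0.5, 0.5])   # reference policies for the KL terms
    p = np.array([0.5, 0.5])             # max player's policy
    q = np.array([0.5, 0.5])             # min player's policy
    for _ in range(T):
        # Gradients of p^T A_hat q - lam*KL(p||ref1) and of -(p^T A_hat q) - lam*KL(q||ref2).
        g_p = A_hat @ q - lam * (np.log(p / ref1) + 1.0)
        g_q = -A_hat.T @ p - lam * (np.log(q / ref2) + 1.0)
        # Entropic mirror-descent (multiplicative-weights) step, then renormalize.
        p = p * np.exp(eta * g_p)
        p /= p.sum()
        q = q * np.exp(eta * g_q)
        q /= q.sum()

    print("last-iterate policies:", p, q)

The point of the sketch is structural: value estimation happens once from the offline data, and the policy pair is then refined purely by regularized self-play, which is the separation of statistical and optimization error the core claim describes.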

If this is right

  • The last iterate of SOS-MD converges at the same fast statistical rate as the ROSE framework once optimization error becomes negligible.
  • No explicit pessimism term is required once KL regularization is present.
  • The method works in a model-free setting using only least-squares value estimation and self-play updates.
  • Convergence holds for the full sequence of iterates rather than only the average.
  • The same Õ(1/n) rate applies to both the theoretical ROSE object and the practical SOS-MD algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularization may be able to replace pessimism penalties in other offline multi-agent settings where explicit bonuses are currently used.
  • The unilateral concentrability condition could be relaxed further if both players' data coverage is jointly controlled.
  • The approach invites direct empirical tests on benchmark Markov game environments to measure how much the 1/n rate improves sample efficiency over baselines.
  • Extending the same KL-regularized self-play template to non-zero-sum or cooperative Markov games is a natural next direction.

Load-bearing premise

The offline data must satisfy unilateral concentrability with respect to the policies being learned; if coverage fails on one player's side, the fast 1/n rate no longer holds.
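The page never states the coverage condition formally. One standard formalization from the offline Markov-game literature, offered as a hedged stand-in for whatever exact definition the paper uses:

    \[
    C_{\mathrm{uni}} \;:=\; \max\!\left\{\;\sup_{\nu}\,\max_{s,a,b}\frac{d^{\pi^\ast,\nu}(s,a,b)}{\rho(s,a,b)},\;\;\sup_{\pi}\,\max_{s,a,b}\frac{d^{\pi,\nu^\ast}(s,a,b)}{\rho(s,a,b)}\;\right\} \;<\;\infty,
    \]

where (π*, ν*) is the target (regularized) equilibrium, d^{π,ν} is the occupancy measure of the joint policy, and ρ is the offline data distribution. Coverage is required only for unilateral deviations from the equilibrium pair, not for all joint policy pairs, which is why the condition is weaker than full concentrability.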

What would settle it

Run SOS-MD on an offline dataset that violates unilateral concentrability on one player's side and check whether the observed convergence rate degrades to the slower Õ(1/√n) regime that appears in unregularized methods.
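A concrete way to read off the observed rate from such an experiment is to regress log suboptimality on log sample size and compare the slope to -1 versus -0.5. The sketch below is only the slope diagnostic; run_sos_md, nash_gap, and dataset_of_size in the comment are hypothetical placeholders for the paper's algorithm and evaluator, not real functions.

    import numpy as np

    def estimated_rate(sample_sizes, gaps):
        """Least-squares slope of log(gap) against log(n); near -1 suggests the fast regime."""
        x = np.log(np.asarray(sample_sizes, dtype=float))
        y = np.log(np.asarray(gaps, dtype=float))
        slope, _intercept = np.polyfit(x, y, 1)
        return slope

    # Placeholder data decaying like 1/n, standing in for measured Nash gaps
    # gap_n = nash_gap(run_sos_md(dataset_of_size(n))) under covered vs. uncovered data.
    ns = [1_000, 2_000, 4_000, 8_000, 16_000]
    gaps = [1.0 / n for n in ns]
    print(estimated_rate(ns, gaps))   # ~ -1.0 here; ~ -0.5 would indicate the slow regime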

read the original abstract

We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under \textit{unilateral concentrability}, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies offline learning of Nash equilibria in two-player zero-sum Markov games. It claims that KL regularization alone suffices to stabilize learning and replace explicit pessimism. It introduces the theoretical ROSE framework achieving a fast Õ(1/n) convergence rate under unilateral concentrability (improving on standard Õ(1/√n) rates), and proposes the practical SOS-MD algorithm based on least-squares value estimation and self-play updates, proving that its last iterate attains the same statistical rate up to a vanishing Õ(1/√T) optimization error.

Significance. If the rates and derivations hold, the work provides a meaningful simplification for offline multi-agent RL by showing regularization can handle distribution shift without explicit pessimism, while delivering faster statistical rates under a unilateral concentrability assumption. The explicit separation of statistical and optimization errors in SOS-MD is a strength, and the focus on last-iterate convergence is practically relevant.

major comments (2)
  1. [Theorem statements and assumption section] The central Õ(1/n) rate for ROSE (and last-iterate SOS-MD) is load-bearing on the unilateral concentrability assumption with respect to the learned policies. The manuscript should explicitly compare this assumption's strength to standard concentrability coefficients used in prior offline MG works (e.g., in the definition of the concentrability coefficient C and how it enters the error bounds).
  2. [Proof of ROSE convergence rate] The claim that KL regularization 'suffices' to stabilize learning without pessimism requires verifying that all distribution-shift error terms are controlled solely by the KL term and unilateral concentrability; the abstract states proofs exist, but the bounding steps for the value estimation error under the regularized objective need to be checked for circularity or hidden dependence on the data distribution.
minor comments (2)
  1. [Algorithm and objective definitions] Notation for the KL regularization strength parameter should be introduced consistently (e.g., as λ or β) and its dependence on n clarified in the rate statements.
  2. [Introduction or related work] The manuscript would benefit from a table comparing the new rates to prior offline MG results (with and without regularization) to highlight the improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review. We appreciate the positive assessment of the work's significance in simplifying offline multi-agent RL through KL regularization. We address each major comment below and will revise the manuscript to incorporate clarifications where appropriate.

read point-by-point responses
  1. Referee: [Theorem statements and assumption section] The central Õ(1/n) rate for ROSE (and last-iterate SOS-MD) is load-bearing on the unilateral concentrability assumption with respect to the learned policies. The manuscript should explicitly compare this assumption's strength to standard concentrability coefficients used in prior offline MG works (e.g., in the definition of the concentrability coefficient C and how it enters the error bounds).

    Authors: We agree that an explicit comparison would improve clarity. In the revised manuscript, we will add a dedicated paragraph in the assumptions section (Section 3) that defines the standard concentrability coefficient C from prior offline Markov game literature and contrasts it directly with unilateral concentrability. We will note that unilateral concentrability is a weaker condition, requiring coverage only with respect to one player's policy against arbitrary policies of the opponent, rather than joint coverage over all policy pairs. This milder assumption, when paired with KL regularization, suffices to control distribution shift and yields the improved Õ(1/n) rate. We will also explicitly show how the concentrability factor scales the error terms in the bound of Theorem 1. revision: yes

  2. Referee: [Proof of ROSE convergence rate] The claim that KL regularization 'suffices' to stabilize learning without pessimism requires verifying that all distribution-shift error terms are controlled solely by the KL term and unilateral concentrability; the abstract states proofs exist, but the bounding steps for the value estimation error under the regularized objective need to be checked for circularity or hidden dependence on the data distribution.

    Authors: The full proofs in Appendix B demonstrate that distribution-shift terms are controlled exclusively by the KL regularization in the objective together with unilateral concentrability, without explicit pessimism or circular reasoning. The argument proceeds by first establishing stability of the regularized iterates (ensuring they remain in a region covered by the assumption), then applying concentrability to bound the occupancy-measure mismatch in the value estimation error; the data distribution enters only through the fixed concentrability coefficient, with no hidden dependence. To address the concern directly, we will insert a concise proof outline immediately following Theorem 1 in the main text that highlights these sequential bounding steps. revision: partial
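To make the claimed proof structure concrete, here is a hedged sketch of the shape such a decomposition typically takes in KL-regularized offline analyses; this is not the paper's Theorem 1, and the dependence on λ, C_uni, and the constants is an illustrative assumption:

    \[
    \mathrm{NashGap}(\hat{\pi}_T)\;\lesssim\;
    \underbrace{C_{\mathrm{uni}}\cdot\widetilde{\mathcal{O}}\!\left(\tfrac{1}{\lambda n}\right)}_{\text{statistical, via coverage}}
    \;+\;
    \underbrace{\widetilde{\mathcal{O}}\!\left(\tfrac{1}{\sqrt{T}}\right)}_{\text{optimization, via self-play}},
    \]

with the data distribution entering only through the fixed coefficient C_uni, consistent with the two bounding steps the authors describe.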

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper introduces ROSE as a theoretical object defined by the KL-regularized equilibrium and proves its Õ(1/n) rate directly from the regularized objective plus the unilateral concentrability assumption on the offline data. SOS-MD is then analyzed as a practical algorithm whose last iterate matches the same statistical rate up to an explicit, vanishing optimization error term Õ(1/√T). No equation reduces a prediction to a fitted quantity by construction, no uniqueness theorem is imported from the authors' own prior work, and the central claims rest on standard analysis of regularized mirror descent under a stated data-coverage assumption rather than on any self-referential fit or renaming. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the unilateral concentrability assumption and standard properties of KL divergence and least-squares estimation; no new invented entities are introduced.

free parameters (1)
  • KL regularization strength
    The coefficient controlling the strength of the KL penalty; its value is chosen to balance regularization and performance but is not fitted to the target result in the abstract.
axioms (1)
  • domain assumption Unilateral concentrability
    The offline data distribution is assumed to cover the relevant state-action pairs for one player sufficiently well; this is invoked to obtain the 1/n rate.

pith-pipeline@v0.9.0 · 5478 in / 1300 out tokens · 42996 ms · 2026-05-14T19:20:58.057786+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 5 canonical work pages · 3 internal anchors
