arxiv: 2605.00264 · v1 · submitted 2026-04-30 · 💻 cs.LG · cs.GT

Recognition: unknown

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

Claire Chen , Yuheng Zhang

Pith reviewed 2026-05-09 20:05 UTC · model grok-4.3

keywords offline multi-agent reinforcement learning · general-sum games · KL regularization · Nash equilibrium · coarse correlated equilibrium · pessimism-free methods · distribution shift

The pith

KL regularization alone stabilizes offline learning and recovers equilibria in general-sum games without pessimistic penalties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that in offline multi-agent reinforcement learning for general-sum games, KL regularization can control the distribution shift between logged data and target policies, allowing equilibrium recovery without manual pessimistic penalties. A sympathetic reader would care because current methods often add pessimistic terms to handle uncertainty, which complicates the algorithms, whereas here a single regularizer is claimed to deliver both stability and improved rates. The authors define the General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of order 1/n up to logarithmic factors. They also introduce the General-sum Anchored Mirror Descent (GAMD) algorithm for computing Coarse Correlated Equilibria at the standard rate. If the claim holds, offline learning in these games can be done more simply and efficiently with standard KL regularization as the sole stabilizer.

Core claim

We demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of Õ(1/n). For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of Õ(1/√n + 1/T). These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.

What carries the argument

General-sum Anchored Nash Equilibrium (GANE), which uses KL regularization to anchor policies to the logged dataset and recover regularized Nash equilibria while controlling distribution shift.
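
To make the anchoring idea concrete, here is a minimal sketch of a single player's KL-regularized best response against fixed expected payoffs. It relies on the standard closed form for maximizing ⟨π, u⟩ − β·KL(π ‖ π_ref), namely π(a) ∝ π_ref(a)·exp(u(a)/β). The direction of the KL term, the temperature name `beta`, the payoff vector, and the anchor policy are all assumptions made for illustration; this summary does not state the paper's exact GANE objective.

```python
import math

def kl_regularized_response(payoffs, reference, beta):
    """Return the maximizer of <pi, payoffs> - beta * KL(pi || reference).

    Assumes `reference` has full support (no zero entries) and beta > 0.
    """
    # Work in log space and subtract the max logit for numerical stability.
    logits = [math.log(r) + u / beta for u, r in zip(payoffs, reference)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical inputs, not taken from the paper.
payoffs = [1.0, 0.0, 2.0]      # expected payoff of each action
reference = [0.6, 0.3, 0.1]    # anchor policy, e.g. estimated from logged data

# Strong regularization keeps the policy close to the anchor.
near_anchor = kl_regularized_response(payoffs, reference, beta=100.0)

# Weak regularization lets the policy concentrate on the highest-payoff action.
near_greedy = kl_regularized_response(payoffs, reference, beta=0.05)
```

The two calls show the trade-off that the regularization strength controls: a large `beta` reproduces the anchor almost exactly, while a small `beta` approaches the unregularized best response.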

If this is right

Regularized Nash equilibria can be recovered from offline datasets at an accelerated rate of Õ(1/n) using only KL regularization.
The GAMD algorithm provides a practical way to compute Coarse Correlated Equilibria, converging at Õ(1/√n + 1/T).
No additional pessimistic penalties are needed to stabilize learning in general-sum games.
KL regularization acts as a sufficient standalone mechanism for pessimism-free offline multi-agent learning.
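
As a rough illustration of what a KL-anchored mirror descent loop can look like, the sketch below runs a generic anchored multiplicative-weights update in a two-player general-sum matrix game. It is not the paper's GAMD, whose update rule is not given in this summary. The update π_next(a) ∝ π(a)^(1−ηβ)·π_ref(a)^(ηβ)·exp(η·u(a)) is the standard mirror-descent step on the KL-regularized objective; the payoff matrices `A` and `B`, the uniform anchor, and the values of `eta`, `beta`, and `steps` are hypothetical.

```python
import math

def softmax(logits):
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def anchored_step(policy, reference, payoffs, eta, beta):
    """One mirror-descent step on <pi, payoffs> - beta * KL(pi || reference).

    Requires full-support policy and reference (no zero entries).
    """
    logits = [
        (1.0 - eta * beta) * math.log(p) + eta * beta * math.log(r) + eta * u
        for p, r, u in zip(policy, reference, payoffs)
    ]
    return softmax(logits)

def expected_payoffs(A, B, row, col):
    """Per-action expected payoffs for the row player (A) and column player (B)."""
    u_row = [sum(A[i][j] * col[j] for j in range(len(col))) for i in range(len(row))]
    u_col = [sum(B[i][j] * row[i] for i in range(len(row))) for j in range(len(col))]
    return u_row, u_col

def run_anchored_dynamics(A, B, ref_row, ref_col, eta=0.1, beta=1.0, steps=1000):
    """Both players update simultaneously, each anchored to its own reference."""
    row, col = list(ref_row), list(ref_col)
    for _ in range(steps):
        u_row, u_col = expected_payoffs(A, B, row, col)
        row, col = (anchored_step(row, ref_row, u_row, eta, beta),
                    anchored_step(col, ref_col, u_col, eta, beta))
    return row, col

def fixed_point_residual(A, B, ref_row, ref_col, row, col, beta):
    """Largest gap between each policy and its KL-regularized best response."""
    u_row, u_col = expected_payoffs(A, B, row, col)
    br_row = softmax([math.log(r) + u / beta for r, u in zip(ref_row, u_row)])
    br_col = softmax([math.log(r) + u / beta for r, u in zip(ref_col, u_col)])
    return max(abs(a - b) for a, b in zip(row + col, br_row + br_col))

# Hypothetical prisoner's-dilemma payoffs (action 0 = cooperate, 1 = defect).
A = [[3.0, 0.0], [5.0, 1.0]]   # row player's payoffs
B = [[3.0, 5.0], [0.0, 1.0]]   # column player's payoffs
uniform = [0.5, 0.5]           # stand-in for an anchor estimated from logged data

row, col = run_anchored_dynamics(A, B, uniform, uniform)
residual = fixed_point_residual(A, B, uniform, uniform, row, col, beta=1.0)
```

In this small game the iterates settle at a point where each policy equals its own KL-regularized best response, so `residual` is close to zero. Convergence here depends on the regularization being strong relative to the payoff coupling; it should not be read as a general guarantee, and the sketch says nothing about the statistical rates claimed in the paper.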

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This could simplify implementation of offline MARL systems by removing the need to tune pessimism parameters.
The anchoring idea via KL might extend to single-agent offline RL or other multi-agent settings with distribution shift.
In applications like autonomous driving or robotics with multiple agents, this could lead to more robust learned policies from logged data.
It raises the possibility that other regularizations could achieve similar effects, warranting comparison studies.

Load-bearing premise

The assumption that KL regularization alone controls distribution shift and recovers equilibria in general-sum games without needing any additional pessimistic terms or game-specific structure.

What would settle it

A counterexample consisting of a general-sum game and a logged dataset where policies optimized with KL regularization fail to approach the regularized Nash equilibrium as the dataset size n increases, with error not scaling as 1/n.
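
One simple way to run such a check empirically is to fit the slope of log-error against log-n: a slope near −1 is consistent with a 1/n rate, and a slope near −1/2 with a 1/√n rate. The sketch below is a generic least-squares fit; the function name `loglog_slope` is hypothetical, and the error sequences are synthetic placeholders, not results from the paper.

```python
import math

def loglog_slope(sample_sizes, errors):
    """Least-squares slope of log(error) against log(n)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic error curves, for illustration only.
sample_sizes = [100, 1000, 10000, 100000]
fast_errors = [5.0 / n for n in sample_sizes]             # decays like 1/n
slow_errors = [5.0 / math.sqrt(n) for n in sample_sizes]  # decays like 1/sqrt(n)

fast_slope = loglog_slope(sample_sizes, fast_errors)
slow_slope = loglog_slope(sample_sizes, slow_errors)
```

With measured equilibrium gaps in place of the synthetic curves, a fitted slope that stays well above −1 as n grows would be evidence against the accelerated rate, although logarithmic factors hidden by the Õ notation can bias the slope at small n.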

Original abstract

Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of $\widetilde{O}(1/n)$. For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of $\widetilde{O}(1/\sqrt{n}+1/T)$. These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

KL regularization alone is claimed to replace pessimism for offline equilibrium recovery in general-sum games, with an accelerated rate for anchored Nash, but the coverage conditions need explicit checking.

The main takeaway is that this paper argues KL regularization by itself stabilizes offline learning in general-sum multi-agent settings and recovers equilibria without added pessimistic penalties. The authors introduce GANE to reach an accelerated statistical rate for regularized Nash equilibria, and GAMD as a practical mirror descent procedure that reaches a coarse correlated equilibrium at the usual rate. The novelty is the standalone use of KL regularization for these anchored concepts rather than a mix with other terms. Its strength is a cleaner mechanism that could simplify code and tuning compared to standard pessimistic approaches in the area.

The stress-test concern is coverage. If the logged dataset puts zero mass on actions that become best responses under the regularized payoffs, the KL term alone may not stop the recovered policy from deviating while still satisfying the objective. The abstract presents the rates as following directly, so the full paper must lay out the precise dataset assumptions on support and behavior policy. If those assumptions are mild and standard, the claim holds; if they are strong or unstated, the pessimism-free label weakens. The derivation of the Õ(1/n) rate will also need verification for any hidden dependence on game structure or data richness.

This work targets researchers in offline MARL and general-sum equilibrium learning who want simpler regularizers. A reader already familiar with regularized Nash and CCE concepts would get the most from the details and comparisons. It deserves a serious referee because the idea is distinct and the rates are concrete enough to test. Send it for review, but ask the authors to state the coverage conditions up front and sketch the key proof steps for the accelerated rate.
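
The coverage concern in the letter can be made concrete with a small support check. The sketch below computes the worst-case density ratio between a target policy and a behavior policy over the target's support, returning infinity when the target plays an action the logged data never contains. This is a generic single-policy coverage measure chosen for illustration; the function name `coverage_ratio` and the example policies are hypothetical, and the paper's actual coverage assumption is not spelled out in this summary.

```python
import math

def coverage_ratio(target, behavior):
    """Max of target(a) / behavior(a) over actions the target actually plays.

    Returns math.inf if the target puts mass on an action with zero
    behavior probability, i.e. the logged data cannot cover the target.
    """
    worst = 0.0
    for t, b in zip(target, behavior):
        if t == 0.0:
            continue  # actions outside the target's support do not matter
        if b == 0.0:
            return math.inf
        worst = max(worst, t / b)
    return worst

# Hypothetical policies over three actions.
covered = coverage_ratio([0.5, 0.5, 0.0], [0.25, 0.25, 0.5])
uncovered = coverage_ratio([0.5, 0.0, 0.5], [0.5, 0.5, 0.0])
```

A finite ratio corresponds to the benign case where the behavior policy has positive mass wherever the target equilibrium does; an infinite ratio is the failure case the letter asks the authors to rule out explicitly.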

Referee Report

2 major / 0 minor

Summary. The paper claims that KL regularization alone suffices for pessimism-free offline learning in general-sum games, without manual pessimistic penalties. It introduces General-sum Anchored Nash Equilibrium (GANE) to recover regularized Nash equilibria at an accelerated statistical rate of Õ(1/n), and General-sum Anchored Mirror Descent (GAMD) as an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of Õ(1/√n + 1/T). These results are positioned as establishing KL regularization as a standalone mechanism achieving equivalent or better rates than pessimistic baselines in multi-player settings.

Significance. If the central claims hold under verifiable assumptions, the work would be significant for offline multi-agent RL: it offers a simpler alternative to pessimism-based methods in general-sum games (where equilibria are non-unique and distribution shift is acute), while delivering an accelerated Õ(1/n) rate for regularized Nash recovery. This could streamline algorithm design and improve statistical efficiency, provided the KL term provably controls deviation without additional structure.

major comments (2)

[Abstract] The central claim that 'KL regularization suffices to stabilize learning and achieve equilibrium recovery' at Õ(1/n) for GANE is stated without any derivation, explicit assumptions on the logged dataset, or coverage conditions. This is load-bearing because, without positive mass on the support of the target regularized equilibrium under the behavior policy, the KL penalty alone permits arbitrary deviation on unobserved best-response actions while still satisfying the regularized objective, undermining the rate and pessimism-free guarantee.
[Abstract] The GAMD convergence claim to CCE at Õ(1/√n + 1/T) is presented as standard, but the manuscript provides no visible proof sketch or comparison to existing pessimistic mirror-descent analyses; this gap prevents verification that the anchoring mechanism preserves the rate without introducing hidden pessimism or game-specific restrictions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below with clarifications on assumptions and proofs from the full manuscript, and indicate where revisions will strengthen the presentation.

Referee: [Abstract] The central claim that 'KL regularization suffices to stabilize learning and achieve equilibrium recovery' at Õ(1/n) for GANE is stated without any derivation, explicit assumptions on the logged dataset, or coverage conditions. This is load-bearing because, without positive mass on the support of the target regularized equilibrium under the behavior policy, the KL penalty alone permits arbitrary deviation on unobserved best-response actions while still satisfying the regularized objective, undermining the rate and pessimism-free guarantee.

Authors: We agree the abstract is concise and omits explicit mention of coverage. Section 2.2 and Assumption 3.1 of the manuscript require that the behavior policy has positive mass on the support of the target regularized equilibrium; this ensures the KL term penalizes deviations on unobserved actions. Theorem 3.1 derives the Õ(1/n) rate for GANE under this coverage using standard concentration bounds on the anchored objective. The claim holds conditionally on this assumption, which is standard in offline RL and prevents the deviation issue raised. We will revise the abstract to note the coverage condition on the logged dataset.
revision: partial

Referee: [Abstract] The GAMD convergence claim to CCE at Õ(1/√n + 1/T) is presented as standard, but the manuscript provides no visible proof sketch or comparison to existing pessimistic mirror-descent analyses; this gap prevents verification that the anchoring mechanism preserves the rate without introducing hidden pessimism or game-specific restrictions.

Authors: Section 4 presents GAMD and states the Õ(1/√n + 1/T) rate to CCE; the full proof appears in Appendix C. The anchoring is incorporated into the Bregman divergence of the mirror-descent update, preserving the standard rate for CCE in general-sum games while handling offline distribution shift via KL regularization alone. Section 5 compares to pessimistic mirror-descent baselines and shows equivalent rates without manual penalties or game-specific restrictions. We will add a concise proof sketch to the main text and expand the comparison section to explicitly verify that anchoring introduces no hidden pessimism.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines new objects (GANE as a regularized Nash equilibrium concept and GAMD as an anchored mirror descent algorithm) and states new statistical rates (Õ(1/n) for GANE, Õ(1/√n + 1/T) for GAMD) as derived results. No load-bearing step reduces by construction to a fitted parameter, a self-citation chain, or a renamed input; the KL-regularization claim is presented as a demonstrated property of the new formulation rather than an identity or tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate free parameters, axioms, or invented entities. The central claim implicitly rests on standard RL assumptions (bounded payoffs, finite action spaces) plus the novel claim that KL regularization suffices without pessimism; no explicit free parameters or invented entities beyond the named algorithms are stated.

pith-pipeline@v0.9.0 · 5422 in / 1277 out tokens · 31261 ms · 2026-05-09T20:05:59.933334+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Offline Two-Player Zero-Sum Markov Games with KL Regularization
cs.LG 2026-05 unverdicted novelty 8.0

KL regularization enables Õ(1/n) convergence for offline Nash equilibria in zero-sum Markov games under unilateral concentrability via the ROSE framework and SOS-MD algorithm.
