pith. machine review for the scientific record. sign in

arxiv: 2605.00264 · v1 · submitted 2026-04-30 · 💻 cs.LG · cs.GT

Recognition: unknown

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

Authors on Pith no claims yet

Pith reviewed 2026-05-09 20:05 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords offline multi-agent reinforcement learninggeneral-sum gamesKL regularizationNash equilibriumcoarse correlated equilibriumpessimism-free methodsdistribution shift
0
0 comments X

The pith

KL regularization alone stabilizes offline learning and recovers equilibria in general-sum games without pessimistic penalties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that in offline multi-agent reinforcement learning for general-sum games, KL regularization can control distribution shift between logged data and target policies, allowing equilibrium recovery without manual pessimistic penalties. A sympathetic reader would care because current methods often add pessimistic terms to handle uncertainty, complicating the algorithms, whereas here a simple regularization suffices for both stability and improved rates. The authors define the General-sum Anchored Nash Equilibrium (GANE) to recover regularized Nash at accelerated statistical rates of order 1/n. They also introduce the General-sum Anchored Mirror Descent (GAMD) algorithm for computing Coarse Correlated Equilibria at standard rates. If the claim holds, it means offline learning in these games can be done more simply and efficiently by leveraging standard KL regularization as the key stabilizer.

Core claim

We demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of O(1/n). For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of O(1/sqrt(n)+1/T). These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.

What carries the argument

General-sum Anchored Nash Equilibrium (GANE), which uses KL regularization to anchor policies to the logged dataset and recover regularized Nash equilibria while controlling distribution shift.

If this is right

  • Regularized Nash equilibria can be recovered from offline datasets at accelerated rates of O(1/n) using only KL regularization.
  • The GAMD algorithm provides a practical way to compute Coarse Correlated Equilibria converging at O(1/sqrt(n) + 1/T).
  • No additional pessimistic penalties are needed to stabilize learning in general-sum games.
  • KL regularization acts as a sufficient standalone mechanism for pessimism-free offline multi-agent learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could simplify implementation of offline MARL systems by removing the need to tune pessimism parameters.
  • The anchoring idea via KL might extend to single-agent offline RL or other multi-agent settings with distribution shift.
  • In applications like autonomous driving or robotics with multiple agents, this could lead to more robust learned policies from logged data.
  • It raises the possibility that other regularizations could achieve similar effects, warranting comparison studies.

Load-bearing premise

The assumption that KL regularization alone controls distribution shift and recovers equilibria in general-sum games without needing any additional pessimistic terms or game-specific structure.

What would settle it

A counterexample consisting of a general-sum game and a logged dataset where policies optimized with KL regularization fail to approach the regularized Nash equilibrium as the dataset size n increases, with error not scaling as 1/n.

read the original abstract

Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of $\widetilde{O}(1/n)$. For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of $\widetilde{O}(1/\sqrt{n}+1/T)$. These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that KL regularization alone suffices for pessimism-free offline learning in general-sum games, without manual pessimistic penalties. It introduces General-sum Anchored Nash Equilibrium (GANE) to recover regularized Nash equilibria at an accelerated statistical rate of Õ(1/n), and General-sum Anchored Mirror Descent (GAMD) as an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of Õ(1/√n + 1/T). These results are positioned as establishing KL regularization as a standalone mechanism achieving equivalent or better rates than pessimistic baselines in multi-player settings.

Significance. If the central claims hold under verifiable assumptions, the work would be significant for offline multi-agent RL: it offers a simpler alternative to pessimism-based methods in general-sum games (where equilibria are non-unique and distribution shift is acute), while delivering an accelerated Õ(1/n) rate for regularized Nash recovery. This could streamline algorithm design and improve statistical efficiency, provided the KL term provably controls deviation without additional structure.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'KL regularization suffices to stabilize learning and achieve equilibrium recovery' at Õ(1/n) for GANE is stated without any derivation, explicit assumptions on the logged dataset, or coverage conditions. This is load-bearing because, without positive mass on the support of the target regularized equilibrium under the behavior policy, the KL penalty alone permits arbitrary deviation on unobserved best-response actions while still satisfying the regularized objective, undermining the rate and pessimism-free guarantee.
  2. [Abstract] Abstract: The GAMD convergence claim to CCE at Õ(1/√n + 1/T) is presented as standard, but the manuscript provides no visible proof sketch or comparison to existing pessimistic mirror-descent analyses; this gap prevents verification that the anchoring mechanism preserves the rate without introducing hidden pessimism or game-specific restrictions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below with clarifications on assumptions and proofs from the full manuscript, and indicate where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'KL regularization suffices to stabilize learning and achieve equilibrium recovery' at Õ(1/n) for GANE is stated without any derivation, explicit assumptions on the logged dataset, or coverage conditions. This is load-bearing because, without positive mass on the support of the target regularized equilibrium under the behavior policy, the KL penalty alone permits arbitrary deviation on unobserved best-response actions while still satisfying the regularized objective, undermining the rate and pessimism-free guarantee.

    Authors: We agree the abstract is concise and omits explicit mention of coverage. Section 2.2 and Assumption 3.1 of the manuscript require that the behavior policy has positive mass on the support of the target regularized equilibrium; this ensures the KL term penalizes deviations on unobserved actions. Theorem 3.1 derives the Õ(1/n) rate for GANE under this coverage using standard concentration bounds on the anchored objective. The claim holds conditionally on this assumption, which is standard in offline RL and prevents the deviation issue raised. We will revise the abstract to note the coverage condition on the logged dataset. revision: partial

  2. Referee: [Abstract] Abstract: The GAMD convergence claim to CCE at Õ(1/√n + 1/T) is presented as standard, but the manuscript provides no visible proof sketch or comparison to existing pessimistic mirror-descent analyses; this gap prevents verification that the anchoring mechanism preserves the rate without introducing hidden pessimism or game-specific restrictions.

    Authors: Section 4 presents GAMD and states the Õ(1/√n + 1/T) rate to CCE; the full proof appears in Appendix C. The anchoring is incorporated into the Bregman divergence of the mirror-descent update, preserving the standard rate for CCE in general-sum games while handling offline distribution shift via KL regularization alone. Section 5 compares to pessimistic mirror-descent baselines and shows equivalent rates without manual penalties or game-specific restrictions. We will add a concise proof sketch to the main text and expand the comparison section to explicitly verify that anchoring introduces no hidden pessimism. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines new objects (GANE as a regularized Nash equilibrium concept and GAMD as an anchored mirror descent algorithm) and states new statistical rates (Õ(1/n) for GANE, Õ(1/√n + 1/T) for GAMD) as derived results. No load-bearing step reduces by construction to a fitted parameter, a self-citation chain, or a renamed input; the KL-regularization claim is presented as a demonstrated property of the new formulation rather than an identity or tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate free parameters, axioms, or invented entities. The central claim implicitly rests on standard RL assumptions (bounded payoffs, finite action spaces) plus the novel claim that KL regularization suffices without pessimism; no explicit free parameters or invented entities beyond the named algorithms are stated.

pith-pipeline@v0.9.0 · 5422 in / 1277 out tokens · 31261 ms · 2026-05-09T20:05:59.933334+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Offline Two-Player Zero-Sum Markov Games with KL Regularization

    cs.LG 2026-05 unverdicted novelty 8.0

    KL regularization enables Õ(1/n) convergence for offline Nash equilibria in zero-sum Markov games under unilateral concentrability via the ROSE framework and SOS-MD algorithm.

Reference graph

Works this paper leans on

300 extracted references · 7 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Concurrent submission to NeurIPS 2026 , year=

    Fast Rates in -Potential Games via Regularized Mirror Descent , author=. Concurrent submission to NeurIPS 2026 , year=

  2. [2]

    arXiv preprint arXiv:2310.06243 , year=

    Sample-efficient multi-agent rl: An optimization perspective , author=. arXiv preprint arXiv:2310.06243 , year=

  3. [3]

    Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

    Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback , author=. arXiv preprint arXiv:2603.28281 , year=

  4. [4]

    Journal of the ACM (JACM) , volume=

    Settling the complexity of computing two-player Nash equilibria , author=. Journal of the ACM (JACM) , volume=. 2009 , publisher=

  5. [5]

    Journal of Computer and system Sciences , volume=

    On the complexity of the parity argument and other inefficient proofs of existence , author=. Journal of Computer and system Sciences , volume=. 1994 , publisher=

  6. [6]

    Communications of the ACM , volume=

    The complexity of computing a Nash equilibrium , author=. Communications of the ACM , volume=. 2009 , publisher=

  7. [7]

    International conference on machine learning , pages=

    Pessimistic q-learning for offline reinforcement learning: Towards optimal sample complexity , author=. International conference on machine learning , pages=. 2022 , organization=

  8. [8]

    Operations research , volume=

    Model-based reinforcement learning for offline zero-sum Markov games , author=. Operations research , volume=. 2024 , publisher=

  9. [9]

    Beyond Pessimism: Offline Learning in KL-regularized Games

    Beyond Pessimism: Offline Learning in KL-regularized Games , author=. arXiv preprint arXiv:2604.06738 , year=

  10. [10]

    Advances in neural information processing systems , volume=

    Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

  11. [11]

    2016 , publisher=

    Twenty lectures on algorithmic game theory , author=. 2016 , publisher=

  12. [12]

    2006 , publisher=

    Prediction, learning, and games , author=. 2006 , publisher=

  13. [13]

    Theory of computing , volume=

    The multiplicative weights update method: a meta-algorithm and applications , author=. Theory of computing , volume=. 2012 , publisher=

  14. [14]

    Machine Intelligence Research , volume=

    Offline pre-trained multi-agent decision transformer , author=. Machine Intelligence Research , volume=. 2023 , publisher=

  15. [15]

    arXiv preprint arXiv:2102.04402 , year=

    Contrasting centralized and decentralized critics in multi-agent reinforcement learning , author=. arXiv preprint arXiv:2102.04402 , year=

  16. [16]

    International Journal of Group Decision and Negotiation , volume=

    Automated negotiation: prospects, methods and challenges , author=. International Journal of Group Decision and Negotiation , volume=

  17. [17]

    2001 , publisher=

    Strategic negotiation in multiagent environments , author=. 2001 , publisher=

  18. [18]

    Communications of the ACM , volume=

    Algorithmic game theory , author=. Communications of the ACM , volume=. 2010 , publisher=

  19. [19]

    Econometrica: Journal of the Econometric Society , pages=

    A theory of auctions and competitive bidding , author=. Econometrica: Journal of the Econometric Society , pages=. 1982 , publisher=

  20. [20]

    Games and Economic Behavior , volume=

    On the value of information in a strategic conflict , author=. Games and Economic Behavior , volume=. 1990 , publisher=

  21. [21]

    1995 , publisher=

    Repeated games with incomplete information , author=. 1995 , publisher=

  22. [22]

    Bayesian

    Games with incomplete information played by “Bayesian” players, I--III Part I. The basic model , author=. Management science , volume=. 1967 , publisher=

  23. [23]

    Mathematics of operations research , volume=

    Optimal auction design , author=. Mathematics of operations research , volume=. 1981 , publisher=

  24. [24]

    The Journal of finance , volume=

    Counterspeculation, auctions, and competitive sealed tenders , author=. The Journal of finance , volume=. 1961 , publisher=

  25. [25]

    Proceedings of the national academy of sciences , volume=

    Stochastic games , author=. Proceedings of the national academy of sciences , volume=. 1953 , publisher=

  26. [26]

    Behavior Regularized Offline Reinforcement Learning

    Behavior regularized offline reinforcement learning , author=. arXiv preprint arXiv:1911.11361 , year=

  27. [27]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  28. [28]

    Foundations and Trends

    Online learning and online convex optimization , author=. Foundations and Trends. 2025 , publisher=

  29. [29]

    2000 , publisher=

    Empirical Processes in M-estimation , author=. 2000 , publisher=

  30. [30]

    IEEE transactions on information theory , volume=

    Minimum complexity density estimation , author=. IEEE transactions on information theory , volume=. 2002 , publisher=

  31. [31]

    , author=

    On general minimax theorems. , author=

  32. [32]

    Dynamic Games and Applications , volume=

    Upper and lower values in zero-sum stochastic games with asymmetric information , author=. Dynamic Games and Applications , volume=. 2021 , publisher=

  33. [33]

    Games and Economic Behavior , volume=

    Adaptive game playing using multiplicative weights , author=. Games and Economic Behavior , volume=. 1999 , publisher=

  34. [34]

    Iterative Nash Policy Optimization: Aligning

    Yuheng Zhang and Dian Yu and Baolin Peng and Linfeng Song and Ye Tian and Mingyue Huo and Nan Jiang and Haitao Mi and Dong Yu , booktitle=. Iterative Nash Policy Optimization: Aligning

  35. [35]

    1994 , publisher=

    A course in game theory , author=. 1994 , publisher=

  36. [36]

    1998 , publisher=

    Dynamic noncooperative game theory , author=. 1998 , publisher=

  37. [37]

    Handbook of reinforcement learning and control , pages=

    Multi-agent reinforcement learning: A selective overview of theories and algorithms , author=. Handbook of reinforcement learning and control , pages=. 2021 , publisher=

  38. [38]

    Machine learning proceedings 1994 , pages=

    Markov games as a framework for multi-agent reinforcement learning , author=. Machine learning proceedings 1994 , pages=. 1994 , publisher=

  39. [39]

    ArXiv Preprint arXiv:2102.00479 , year=

    Fast rates for the regret of offline reinforcement learning , author=. arXiv preprint arXiv:2102.00479 , year=

  40. [40]

    International Conference on Machine Learning , pages=

    Pessimistic minimax value iteration: Provably efficient equilibrium learning from offline datasets , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  41. [41]

    Advances in Neural Information Processing Systems , volume=

    When are offline two-player zero-sum Markov games solvable? , author=. Advances in Neural Information Processing Systems , volume=

  42. [42]

    International conference on machine learning , pages=

    Is pessimism provably efficient for offline rl? , author=. International conference on machine learning , pages=. 2021 , organization=

  43. [43]

    Advances in neural information processing systems , volume=

    Bellman-consistent pessimism for offline reinforcement learning , author=. Advances in neural information processing systems , volume=

  44. [44]

    International Conference on Machine Learning , pages=

    Offline learning in markov games with general function approximation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  45. [45]

    International conference on machine learning , pages=

    A theory of regularized markov decision processes , author=. International conference on machine learning , pages=. 2019 , organization=

  46. [46]

    Asadi and Idan Shenfeld and Youssef Mroueh , booktitle=

    Gholamali Aminian and Amir R. Asadi and Idan Shenfeld and Youssef Mroueh , booktitle=

  47. [47]

    ArXiv Preprint , year=

    G\"odel's Poetry , author=. ArXiv Preprint , year=

  48. [48]

    2025 , journal=

    ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis , author=. 2025 , journal=

  49. [49]

    ArXiv Preprint , year=

    Hilbert: Recursively Building Formal Proofs with Informal Reasoning , author=. ArXiv Preprint , year=

  50. [50]

    ArXiv Preprint , year=

    APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning , author=. ArXiv Preprint , year=

  51. [51]

    ArXiv Preprint , year=

    Solving formal math problems by decomposition and iterative reflection , author=. ArXiv Preprint , year=

  52. [52]

    ArXiv Preprint , year=

    Formal theorem proving by rewarding llms to decompose proofs hierarchically , author=. ArXiv Preprint , year=

  53. [53]

    ArXiv Preprint , year=

    Lemmanaid: Neuro-Symbolic Lemma Conjecturing , author=. ArXiv Preprint , year=

  54. [54]

    2022 , journal =

    Sivaraman, Aishwarya and Sanchez-Stern, Alex and Chen, Bretton and Lerner, Sorin and Millstein, Todd , title =. 2022 , journal =

  55. [55]

    ArXiv Preprint

    LEGO-Prover: Neural Theorem Proving with Growing Libraries , author=. ArXiv Preprint

  56. [56]

    ArXiv Preprint

    LeanConjecturer: Automatic Generation of Mathematical Conjectures for Theorem Proving. ArXiv Preprint. 2025

  57. [57]

    Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

    Kazumi Kasaura and Naoto Onda and Yuta Oriike and Masaya Taniguchi and Akiyoshi Sannai and Sho Sonoda. Discovering New Theorems via LLMs with In-Context Proof Learning in Lean. ArXiv Preprint. 2025

  58. [58]

    ArXiv Preprint , year=

    Aristotle: Imo-level automated theorem proving , author=. ArXiv Preprint , year=

  59. [59]

    ArXiv Preprint , year=

    Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction , author=. ArXiv Preprint , year=

  60. [60]

    ArXiv Preprint , year=

    Goedel-prover: A frontier model for open-source automated theorem proving , author=. ArXiv Preprint , year=

  61. [61]

    Nature , year=

    Olympiad-level formal mathematical reasoning with reinforcement learning , author=. Nature , year=

  62. [62]

    ArXiv Preprint , year=

    Gold-medalist performance in solving olympiad geometry with alphageometry2 , author=. ArXiv Preprint , year=

  63. [63]

    ArXiv Preprint , year=

    Seed-prover: Deep and broad reasoning for automated theorem proving , author=. ArXiv Preprint , year=

  64. [64]

    ArXiv Preprint , year=

    Minif2f: a cross-system benchmark for formal olympiad-level mathematics , author=. ArXiv Preprint , year=

  65. [65]

    ArXiv Preprint , year=

    Formalmath: Benchmarking formal mathematical reasoning of large language models , author=. ArXiv Preprint , year=

  66. [66]

    ArXiv Preprint , year=

    Proofnet: Autoformalizing and formally proving undergraduate-level mathematics , author=. ArXiv Preprint , year=

  67. [67]

    Advances in Neural Information Processing Systems , year=

    Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition , author=. Advances in Neural Information Processing Systems , year=

  68. [68]

    10 amazon statistics you need to know in 2022

    Mohsin, Maryam. 10 amazon statistics you need to know in 2022. Oberlo. 2022

  69. [69]

    and Deng, Yanzhen and Laber, Eric B

    Murphy, Susan A. and Deng, Yanzhen and Laber, Eric B. and Maei, Hamid Reza and Sutton, Richard S. and Witkiewitz, Katie. A Batch, Off-Policy, Actor-Critic Algorithm for Optimizing the Average Reward. ArXiv Preprint. 2016

  70. [70]

    A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

    Xie, Tengyang and Liu, Bo and Xu, Yangyang and Ghavamzadeh, Mohammad and Chow, Yinlam and Lyu, Daoming and Yoon, Daesub. A Block Coordinate Ascent Algorithm for Mean-Variance Optimization. Advances in Neural Information Processing Systems. 2018

  71. [71]

    A Closer Look at Deep Policy Gradients

    Ilyas, Andrew and Engstrom, Logan and Santurkar, Shibani and Tsipras, Dimitris and Janoos, Firdaus and Rudolph, Larry and Madry, Aleksander. A Closer Look at Deep Policy Gradients. Proceedings of the International Conference on Learning Representations. 2020

  72. [72]

    and Castro, Pablo Samuel

    Lyle, Clare and Bellemare, Marc G. and Castro, Pablo Samuel. A Comparative Analysis of Expected and Distributional Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence. 2019

  73. [73]

    A Concentration Bound for TD (0) with Function Approximation

    Chandak, Siddharth and Borkar, Vivek S. A Concentration Bound for TD (0) with Function Approximation. ArXiv Preprint. 2023

  74. [74]

    and Precup, Doina

    Perkins, Theodore J. and Precup, Doina. A Convergent Form of Approximate Policy Iteration. Advances in Neural Information Processing Systems. 2002

  75. [75]

    and Szepesv \' a ri, Csaba and Maei, Hamid Reza

    Sutton, Richard S. and Szepesv \' a ri, Csaba and Maei, Hamid Reza. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. Advances in Neural Information Processing Systems. 2008

  76. [76]

    A Convergent Off-Policy Temporal Difference Algorithm

    Diddigi, Raghuram Bharadwaj and Kamanchi, Chandramouli and Bhatnagar, Shalabh. A Convergent Off-Policy Temporal Difference Algorithm. Proceedings of the European Conference on Artificial Intelligence. 2020

  77. [77]

    A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms

    Zhang, Shangtong and Laroche, Romain and van Seijen, Harm and Whiteson, Shimon and des Combes, Remi Tachet. A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms. Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. 2022

  78. [78]

    A Deeper Look at Planning as Learning from Replay

    Vanseijen, Harm and Sutton, Rich. A Deeper Look at Planning as Learning from Replay. Proceedings of the International Conference on Machine Learning. 2015

  79. [79]

    A Definition of Continual Reinforcement Learning

    Abel, David and Barreto, Andr \'e and Van Roy, Benjamin and Precup, Doina and van Hasselt, Hado and Singh, Satinder. A Definition of Continual Reinforcement Learning. Advances in Neural Information Processing Systems. 2023

  80. [80]

    and Dabney, Will and Munos, R \' e mi

    Bellemare, Marc G. and Dabney, Will and Munos, R \' e mi. A Distributional Perspective on Reinforcement Learning. Proceedings of the International Conference on Machine Learning. 2017

Showing first 80 references.