One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective

Aaron Courville; Anna Dawid; Eliška Greplová; Juan Agustín Duque; Sergio García Heredia; Thomas Spriggs; Vinicius Hernandes

arxiv: 2607.02292 · v1 · pith:SUKQDMMNnew · submitted 2026-07-02 · 💻 cs.LG · cond-mat.dis-nn · quant-ph

Juan Agustín Duque, Sergio García Heredia, Vinicius Hernandes, Eliška Greplová, Thomas Spriggs, Aaron Courville, Anna Dawid

Pith reviewed 2026-07-03 16:34 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · quant-ph

keywords neural quantum states · policy gradient · trust region optimization · autoregressive models · variational Monte Carlo · quantum many-body systems · reinforcement learning

The pith

Viewing energy minimization for neural quantum states as a policy gradient problem yields a trust-region optimizer that scales to billion-parameter models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that variational optimization of autoregressive neural quantum states can be recast as an advantage policy-gradient task defined over the Born distribution. This reframing directly motivates a trust-region algorithm, Proximal Wavefunction Optimization, that clips probability ratios in the amplitude channel and phase increments in the phase channel. PWO reuses samples across multiple gradient steps, avoids matrix inversion, and supplies first-order scalability together with guarantees that the variational energy bound remains intact. On Ising and frustrated J1-J2 lattices in one and two dimensions the method shows improved stability and wall-clock convergence relative to Adam, minSR, and SPRING. The same procedure is used to fine-tune a 1.5-billion-parameter RWKV-7 model, extending the reachable size of NQS optimization by more than three orders of magnitude.

Core claim

Variational energy minimization for autoregressive neural quantum states is equivalent to an advantage policy-gradient problem over the Born distribution; a trust-region update that clips amplitude probability ratios and phase increments produces a stable optimizer that preserves the variational upper bound on energy while scaling to models with over a billion parameters.
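As a sketch of why this equivalence is plausible, here is the standard variational Monte Carlo algebra (not the paper's own derivation, which is not reproduced on this page). Assume real parameters $\theta$, a wavefunction $\psi_\theta(s) = \sqrt{p_\theta(s)}\,e^{i\phi_\theta(s)}$ with normalized Born distribution $p_\theta$, local energy $E_{\mathrm{loc}}(s) = \sum_{s'} H_{ss'}\,\psi_\theta(s')/\psi_\theta(s)$, and $E = \mathbb{E}_{s\sim p_\theta}[E_{\mathrm{loc}}(s)]$. The symbols are chosen here for illustration and may differ from the paper's notation.

$$
\nabla_\theta E \;=\; \mathbb{E}_{s\sim p_\theta}\Big[\big(\operatorname{Re}E_{\mathrm{loc}}(s) - E\big)\,\nabla_\theta \log p_\theta(s) \;+\; 2\,\operatorname{Im}E_{\mathrm{loc}}(s)\,\nabla_\theta \phi_\theta(s)\Big]
$$

The first term has the form of a score-function policy gradient with per-sample cost $\operatorname{Re}E_{\mathrm{loc}}(s)$ and baseline $E$, so $-(\operatorname{Re}E_{\mathrm{loc}}(s) - E)$ plays the role of an advantage. The second term acts only on the phase and has no direct counterpart in standard reinforcement learning, which is consistent with the summary's description of a separate phase channel.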

What carries the argument

Proximal Wavefunction Optimization (PWO), a trust-region algorithm that clips probability-ratio changes in the amplitude channel and phase increments in the phase channel.
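To make the two clipping channels concrete, the following is a minimal NumPy sketch of a PPO-style clipped surrogate, written under stated assumptions rather than taken from the paper. The function name `pwo_style_surrogate`, the thresholds `eps` and `delta`, the pessimistic `max` form, the phase wrapping, and the phase coefficient `2 * Im(E_loc)` (borrowed from the standard VMC gradient for a wavefunction of the form sqrt(p) * exp(i * phi)) are all illustrative choices; the paper's actual objective may differ.

```python
import numpy as np

def pwo_style_surrogate(logp_new, logp_old, phase_new, phase_old,
                        e_loc, eps=0.2, delta=0.1):
    """Illustrative clipped surrogate to be minimised (NOT the paper's exact loss).

    logp_*  : log Born probabilities of the sampled configurations
    phase_* : wavefunction phases of the same configurations
    e_loc   : complex local energies evaluated under the old parameters
    """
    e_loc = np.asarray(e_loc, dtype=complex)

    # Amplitude channel: cost relative to the batch-mean energy (a baseline).
    cost = e_loc.real - e_loc.real.mean()
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    # Pessimistic bound for minimisation: take the larger of the two costs.
    amp = np.maximum(ratio * cost, np.clip(ratio, 1 - eps, 1 + eps) * cost)

    # Phase channel: wrap the increment into (-pi, pi], then clip it.
    dphi = np.angle(np.exp(1j * (np.asarray(phase_new) - np.asarray(phase_old))))
    coeff = 2.0 * e_loc.imag  # assumed weight, from the standard VMC gradient
    ph = np.maximum(coeff * dphi, coeff * np.clip(dphi, -delta, delta))

    return float(np.mean(amp + ph))
```

With identical old and new parameters the surrogate is zero by construction. A sample whose probability ratio has already moved past `1 + eps` in the favourable direction contributes only the capped value, which is what allows the same batch to be reused for several gradient steps without the update running away.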

If this is right

PWO improves stability and wall-clock convergence over Adam, minSR, and SPRING on Ising and J1-J2 spin systems.
The algorithm enables optimization of neural quantum states at a scale exceeding one billion parameters.
Sample reuse across multiple updates reduces wall-clock cost while maintaining theoretical guarantees.
Clipping in separate amplitude and phase channels combines first-order scalability with trust-region safety.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the policy-gradient equivalence holds, analogous trust-region clipping could be applied to other variational Monte Carlo methods that admit exact sampling.
The demonstrated scaling suggests that autoregressive architectures could be tested on higher-dimensional or frustrated quantum systems previously inaccessible to NQS.
The separation of amplitude and phase clipping channels may generalize to wavefunction ansatzes where the phase is represented separately from the modulus.

Load-bearing premise

That reframing variational energy minimization as an advantage policy-gradient problem produces a trust-region update that preserves the variational energy bound without introducing uncontrolled bias from the clipping operations.

What would settle it

An experiment in which PWO either fails to improve stability or convergence speed over Adam on the same large-scale Ising instances, or in which the 1.5-billion-parameter model training produces an energy estimate that violates the variational upper bound by more than sampling noise.
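The second half of that test can be phrased as a simple statistical check. The sketch below assumes independent samples (which exact autoregressive sampling provides), a trusted reference ground-state energy `e_ground` from an exact or otherwise reliable solver, and a hypothetical threshold `n_sigma`; none of these names or values come from the paper.

```python
import numpy as np

def violates_variational_bound(e_loc, e_ground, n_sigma=3.0):
    """Flag an energy estimate that sits below the reference ground-state
    energy by more than n_sigma standard errors of the mean."""
    e_real = np.asarray(e_loc).real
    mean = float(e_real.mean())
    # Standard error of the mean for independent samples.
    sem = float(e_real.std(ddof=1) / np.sqrt(e_real.size))
    return mean < e_ground - n_sigma * sem, mean, sem
```

A `True` result on converged runs would indicate estimator bias or a numerical problem, since an unbiased energy estimate of a normalized variational state cannot lie below the true ground state except through sampling noise.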

Figures

Figures reproduced from arXiv: 2607.02292 by Aaron Courville, Anna Dawid, Eliška Greplová, Juan Agustín Duque, Sergio García Heredia, Thomas Spriggs, Vinicius Hernandes.

**Figure 2.** Comparison of PWO, Adam, minSR, and SPRING on the transverse-field Ising model over …

**Figure 3.** Comparison of PWO, Adam, minSR, and SPRING on the Heisenberg …

**Figure 4.** Two-dimensional frustrated J1–J2 Heisenberg model on the 10 × 10 lattice. Left: mean real energy. Right: V-score. PWO reaches lower energies faster than Adam and maintains a lower variance-based error signal over the same wall-clock budget.

**Figure 5.** Wall-clock scaling comparison across model sizes and optimization methods. Boxplots …

**Figure 6.** Fine-tuning curves of a 1.5B-parameter RWKV-7 LLM on the 1-D Ising model.

**Figure 7.** Per-iteration computational cost and normalized iteration speed for PWO and the baseline …

**Figure 8.** Individual-seed learning curves for all Hamiltonians. Each row corresponds to one Hamiltonian, and …

**Figure 9.** Wall-clock scaling comparison across number of samples and optimization methods.

**Figure 10.** Wall-clock scaling comparison across system size and optimization methods. Boxplots …

**Figure 11.** Individual-seed fine-tuning curves for the …

Original abstract

Neural quantum states (NQS) provide a flexible and scalable framework for approximating quantum many-body wavefunctions. Among NQS parameterizations, autoregressive models are especially attractive because they enable exact, independent sampling from the Born distribution, avoiding the autocorrelation and mixing issues of Markov chain methods. Yet their optimization remains comparatively underexplored: Adam is a scalable method but ignores function space geometry, while stochastic reconfiguration is principled but costly and numerically fragile in large models. To address this gap, we show that variational energy minimization can be viewed as an advantage policy-gradient problem over the Born distribution, motivating trust-region optimization for NQS training. We introduce Proximal Wavefunction Optimization (PWO), a principled trust-region algorithm that clips probability-ratio changes in the amplitude channel and phase increments in the phase channel. PWO avoids explicit matrix inversion, reuses samples across multiple updates, and combines the scalability of first-order optimization with theoretical guarantees. Across Ising and frustrated $J_1$-$J_2$ one- and two-dimensional spin systems, PWO improves stability and wall-clock convergence over Adam, minSR, and SPRING. Finally, we fine-tune a $1.5$B-parameter RWKV-7 model, demonstrating NQS optimization at a scale over three orders of magnitude beyond prior work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PWO gives a workable way to train much larger autoregressive NQS by clipping amplitude ratios and phase increments separately, but the effect of that clipping on the variational bound is not yet clear.

The main takeaway is that this paper reframes variational energy minimization for neural quantum states as an advantage policy-gradient problem over the Born distribution and then introduces Proximal Wavefunction Optimization, which clips amplitude probability ratios and phase increments separately. That combination lets the authors fine-tune a 1.5-billion-parameter RWKV-7 model, which is the concrete scaling advance.

They handle the practical side well. The method avoids matrix inversion, reuses samples across updates, and shows better wall-clock stability than Adam, minSR, or SPRING on the Ising and J1-J2 lattices they test. Those are real engineering wins for anyone trying to push autoregressive NQS past the sizes where stochastic reconfiguration becomes fragile.

The soft spot is the clipping step itself. The stress-test note is right to flag that standard PPO-style clipping already biases the gradient, and splitting the rule across amplitude and phase channels adds another layer whose effect on the energy estimator is not obviously controlled. The abstract gives no derivation or bound showing that the clipped updates still produce an energy that is guaranteed to sit above the true ground state, or that any bias vanishes in the large-sample limit. If the full paper only has empirical curves without that check, the stability gains stay observational rather than theoretically grounded.

This is for people who optimize large variational wavefunctions and are tired of trading off scalability against stability. A reader who works on quantum many-body methods will find the scaling result worth examining even if the theory around the dual clipping needs tightening. I would send it to peer review because the size jump is large enough that referees should see the full derivations and numbers.

Referee Report

2 major / 1 minor

Summary. The manuscript reframes variational energy minimization for neural quantum states (NQS) as an advantage policy-gradient problem over the Born distribution and introduces Proximal Wavefunction Optimization (PWO), a trust-region algorithm that applies separate clipping to probability ratios in the amplitude channel and to phase increments in the phase channel. It claims that PWO improves stability and wall-clock convergence over Adam, minSR, and SPRING on Ising and frustrated J1-J2 models in 1D and 2D, and that it enables optimization of a 1.5B-parameter RWKV-7 model, more than three orders of magnitude beyond prior NQS work. It also states that the method combines first-order scalability with theoretical guarantees.

Significance. If the dual-channel clipping can be shown to preserve the variational upper bound on energy with controlled or vanishing bias, the approach would meaningfully advance scalable, geometry-aware optimization for autoregressive NQS at large parameter counts. The reported scaling to 1.5B parameters and sample reuse across updates would constitute a concrete practical advance if the theoretical grounding holds.

major comments (2)

[Abstract] Abstract and method description: the claim that PWO supplies 'theoretical guarantees' while using PPO-style clipping on separate amplitude and phase channels is load-bearing for the central contribution, yet no derivation, bias bound, or limit argument is supplied showing that the clipped updates keep the energy estimator above the true ground-state energy or that any introduced bias vanishes.
[Method (PWO definition)] The weakest assumption identified in the stress test is not addressed: standard PPO clipping is known to bias policy gradients; the manuscript must demonstrate (via an explicit inequality or expectation argument) that the amplitude-ratio clipping plus independent phase clipping does not inject uncontrolled estimator error that violates the variational principle.
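The bias referred to in these comments can be made concrete for the amplitude channel. The sketch below assumes a standard PPO-style pessimistic surrogate written for minimisation, `max(r * c, clip(r, 1 - eps, 1 + eps) * c)`, where `r` is the Born-probability ratio and `c` is the per-sample cost (for example, real local energy minus the mean energy). The function name `active_gradient_mask` and the default `eps` are hypothetical, and the paper's actual rule may differ.

```python
import numpy as np

def active_gradient_mask(ratio, cost, eps=0.2):
    """True where max(r*c, clip(r)*c) still depends on r, i.e. where the
    sample contributes a non-zero gradient to the clipped surrogate."""
    ratio = np.asarray(ratio, dtype=float)
    cost = np.asarray(cost, dtype=float)
    # Low-cost sample whose probability has already been raised enough.
    too_high = (ratio > 1 + eps) & (cost < 0)
    # High-cost sample whose probability has already been lowered enough.
    too_low = (ratio < 1 - eps) & (cost > 0)
    return ~(too_high | too_low)
```

Samples where the mask is `False` are silently dropped from the gradient, so the clipped update is in general not an unbiased estimate of the importance-weighted energy gradient once the parameters move away from the sampling point. Tracking `1 - mask.mean()` during training is one simple way to see how much of each batch is affected.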

minor comments (1)

[Abstract] The abstract supplies no quantitative results, error bars, hyperparameter values for clipping or sample reuse, or description of the baselines, making the performance claims impossible to assess from the provided text.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. The two major comments both concern the strength of the theoretical claims around PWO's clipping procedure. We address them directly below and indicate the revisions we will make.

Referee: [Abstract] Abstract and method description: the claim that PWO supplies 'theoretical guarantees' while using PPO-style clipping on separate amplitude and phase channels is load-bearing for the central contribution, yet no derivation, bias bound, or limit argument is supplied showing that the clipped updates keep the energy estimator above the true ground-state energy or that any introduced bias vanishes.

Authors: The referee is correct that the manuscript does not contain an explicit derivation or bias bound for the dual-channel clipping. The phrase 'theoretical guarantees' in the abstract and introduction is intended to refer to the fact that PWO inherits the trust-region structure of PPO, which limits policy deviation and reuses samples. However, because amplitude and phase are clipped independently, we do not supply a proof that the variational upper bound on energy is preserved or that bias vanishes. In the revised manuscript we will qualify or remove this phrasing and replace it with a statement that PWO is a practical trust-region method whose bias properties are left for future analysis, consistent with the empirical focus of the work.

revision: yes

Referee: [Method (PWO definition)] The weakest assumption identified in the stress test is not addressed: standard PPO clipping is known to bias policy gradients; the manuscript must demonstrate (via an explicit inequality or expectation argument) that the amplitude-ratio clipping plus independent phase clipping does not inject uncontrolled estimator error that violates the variational principle.

Authors: No such explicit inequality or expectation argument appears in the current manuscript. We acknowledge that standard PPO clipping introduces bias and that the separate phase clipping adds an additional degree of freedom whose effect on the energy estimator has not been bounded. The paper therefore cannot claim that the variational principle is strictly preserved. In revision we will add a short paragraph in the method section noting this limitation and stating that the algorithm is motivated by PPO but treated as a heuristic whose bias is controlled empirically through the reported stability results.

revision: yes

standing simulated objections not resolved

Deriving a rigorous inequality or expectation argument showing that the dual-channel clipping preserves the variational upper bound without uncontrolled bias.

Circularity Check

0 steps flagged

No circularity: RL reframing is motivational framing, not a self-referential reduction

The paper presents viewing energy minimization as an advantage policy-gradient problem over the Born distribution as a way to motivate the introduction of PWO with trust-region clipping on amplitude and phase channels. No equations, fitted parameters, or self-citations are shown to reduce the claimed stability gains, wall-clock improvements, or scaling results to inputs by construction. The central claims rest on empirical comparisons across Ising and J1-J2 systems plus a large-scale RWKV-7 demonstration, which are independent of the framing. No self-definitional, fitted-input-called-prediction, or load-bearing self-citation patterns appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no concrete free parameters, axioms, or invented entities; the work appears to rest on standard variational Monte Carlo assumptions and the transfer of trust-region methods from RL.

pith-pipeline@v0.9.1-grok · 5799 in / 1101 out tokens · 31129 ms · 2026-07-03T16:34:11.919731+00:00 · methodology


[1] [1]

doi:10.4324/9780203769560 , address =

Social Dilemmas: Theoretical Issues and Research Findings , year =. doi:10.4324/9780203769560 , address =

work page doi:10.4324/9780203769560

[2] [2]

and Stornati, Paolo and Koch, Rouven and Büttner, Miriam and Okuła, Robert and Muñoz-Gil, Gorka and Vargas-Hernández, Rodrigo A

Dawid, Anna and Arnold, Julian and Requena, Borja and Gresch, Alexander and Płodzień, Marcin and Donatella, Kaelan and Nicoli, Kim A. and Stornati, Paolo and Koch, Rouven and Büttner, Miriam and Okuła, Robert and Muñoz-Gil, Gorka and Vargas-Hernández, Rodrigo A. and Cervera-Lierta, Alba and Carrasquilla, Juan and Dunjko, Vedran and Gabrié, Marylou and Hue...

work page doi:10.1017/9781009504942

[3] [3]

Kearns and Satinder Singh , editor =

Michael J. Kearns and Satinder Singh , editor =. Bias--Variance Error Bounds for Temporal Difference Updates , booktitle =. 2000 , url =

2000

[4] [4]

, title =

Nash, John F. , title =. Proceedings of the National Academy of Sciences of the United States of America , volume =. 1950 , doi =

1950

[5] [5]

The Annals of Mathematical Statistics , year =

Herbert Robbins and Sutton Monro , title =. The Annals of Mathematical Statistics , year =

[6] [6]

Stochastic Games , volume =

Shapley, Lloyd , journal =. Stochastic Games , volume =

[7] [7]

Journal of Law and Economics , volume=

The Problem of Social Cost , author=. Journal of Law and Economics , volume=. 1960 , publisher=

1960

[8] [8]

Watkins, Christopher J. C. H. and Dayan, Peter , biburl =. Q-learning , url =. Machine Learning , keywords =. doi:10.1007/BF00992698 , interhash =

work page doi:10.1007/bf00992698

[9] [9]

Simple statistical gradient-following algorithms for connectionist reinforcement learning , volume =

Williams, Ronald , journal =. Simple statistical gradient-following algorithms for connectionist reinforcement learning , volume =

[10] [10]

Oxford Economic Papers , year =

Barrett, Scott , title =. Oxford Economic Papers , year =

[11] [11]

1996 , journal=

Multiagent reinforcement learning in the Iterated Prisoner's Dilemma , author=. 1996 , journal=

1996

[12] [12]

2000 , issn =

Does voluntary participation undermine the Coase Theorem? , journal =. 2000 , issn =. doi:https://doi.org/10.1016/S0047-2727(99)00089-4 , url =

work page doi:10.1016/s0047-2727(99)00089-4 2000

[13] Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 2012.

[14] Carleo, Giuseppe and Troyer, Matthias. Solving the quantum many-body problem with artificial neural networks. Science. doi:10.1126/science.aag2302

[15] AI for Global Climate Cooperation: Modeling Global Climate Negotiations, Agreements, and Long-Term Cooperation in RICE-N. Social Science Research Network. doi:10.48550/arXiv.2208.07004

[16] An Empirical Study on Google Research Football Multi-agent Reinforcement Learning. Machine Intelligence Research. doi:10.1007/s11633-023-1426-8

[17] Chen, Ao and Heyl, Markus. Empowering deep neural quantum states through efficient optimization. Nature Physics. doi:10.1038/s41567-024-02566-1

[18] Playing Atari with Deep Reinforcement Learning. 2013.

[19] Trust Region Policy Optimization. 2017.

[20] Proximal Policy Optimization Algorithms. 2017.

[21] Adam: A Method for Stochastic Optimization. 2017.

[22] Deal or No Deal? End-to-End Learning for Negotiation Dialogues. 2017.

[23] Maintaining cooperation in complex social dilemmas using deep reinforcement learning. 2018.

[24] Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. 2018.

[25] Emergent Communication through Negotiation. 2018.

[26] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. 2018.

[27] Learning with Opponent-Learning Awareness. arXiv:1709.04326.

[28] DiCE: The Infinitely Differentiable Monte-Carlo Estimator. 2018.

[29] High-Dimensional Continuous Control Using Generalized Advantage Estimation. 2018.

[30] Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya.

[31] Stabilizing Transformers for Reinforcement Learning. 2019.

[32] V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control. 2019.

[33] Denoising Diffusion Probabilistic Models. 2020.

[34] The AI Economist: Optimal Economic Policy Design via Two-level Deep Reinforcement Learning. 2021.

[35] Stable Opponent Shaping in Differentiable Games. 2021.

[36] COLA: Consistent Learning with Opponent-Learning Awareness. 2022.

[37] Proximal Learning With Opponent-Learning Awareness. 2022.

[38] Model-Free Opponent Shaping. 2022.

[39] A Generalist Agent. 2022.

[40] Melting Pot 2.0. 2023.

[41] Deep Reinforcement Learning for Active High Frequency Trading. 2023.

[42] Meta-Value Learning: a General Framework for Learning with Learning Awareness. 2023.

[43] Q-learners Can Provably Collude in the Iterated Prisoner's Dilemma. 2023.

[44] From Architectures to Applications: A Review of Neural Quantum States. 2024.

[45] Best Response Shaping. 2024.

[46] LOQA: Learning with Opponent Q-Learning Awareness. 2024.

[47] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.

[48] Dissecting Deep RL with High Update Ratios: Combatting Value Overestimation and Divergence. 2024.

[49] Scaling Opponent Shaping to High Dimensional Games. 2024.

[50] Advantage Alignment Algorithms. 2025.

[51] InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma. 2025.

[52] VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks. 2025.

[53] John von Neumann and Oskar Morgenstern. 1944.

[54] Prisoner's Dilemma: A Study in Conflict and Cooperation. 1965.

[55] Axelrod, Robert.

[56] Algorithmic Game Theory. 2007.

[57] Reinforcement Learning: Theory and Algorithms. 2021.

[58] Climate Change 2022: Impacts, Adaptation and Vulnerability. 2022. doi:10.1017/9781009325844

[59] Toward Natural Turn-Taking in a Virtual Human Negotiation Agent. AAAI Spring Symposia.

[60] Valerio Capraro and Joseph Y. Halpern. Translucent Players: Explaining Cooperative Behavior in Social Dilemmas. 2015. doi:10.4204/EPTCS.215.9

[61] Michael Noukhovitch and Travis LaCroix and Angeliki Lazaridou and Aaron C. Courville. Emergent Communication under Competition. 2021. doi:10.5555/3463952.3464066

[62] Vezhnevets, Alexander and Wu, Yuhuai and Eckstein, Maria and Leblond, R. Proceedings of the 37th International Conference on Machine Learning, 2020.

[63] Carbon Market Simulation with Adaptive Mechanism Design. Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), Demo Track.

[64] Rummery, G. and Niranjan, M.

[65] Atanasova, Hristiana and Bernheimer, Liam and Cohen, Guy. 2023. doi:10.1038/s41467-023-39244-4

[66] Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. 2018.

[67] Deep Reinforcement Learning at the Edge of the Statistical Precipice. 2022.

[68] Neural network quantum state with proximal optimization: a ground-state searching scheme based on variational Monte Carlo. 2022.

[69] Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo. 2026.

[70] Kakade, Sham and Langford, John. Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

[71] Yang, Li and Leng, Zhaoqi and Yu, Guangyuan and Patel, Ankit and Hu, Wen-Jun and Pu, Han. Deep learning-enhanced variational Monte Carlo method for quantum many-body physics. Physical Review Research. doi:10.1103/physrevresearch.2.012039

[72] Rende, Riccardo and Viteritti, Luciano Loris and Bardone, Lorenzo and Becca, Federico and Goldt, Sebastian. 2024. doi:10.1038/s42005-024-01732-4

[73] Fermionic neural-network states for ab-initio electronic structure. Nature communications, 2020.

[74] Han, Jiequn and Zhang, Linfeng and others. Solving many-electron Schr... 2019.

[75] Backflow transformations via neural networks for quantum many-body wave functions. Physical review letters, 2019.

[76] Hermann, Jan and Sch... Deep-neural-network solution of the electronic Schr... Nature Chemistry, 2020.

[77] Pfau, David and Spencer, James S and Matthews, Alexander GDG and Foulkes, W Matthew C. Ab initio solution of the many-electron Schr... 2020.

[78] A Self-Attention Ansatz for Ab-initio Quantum Chemistry. The Eleventh International Conference on Learning Representations.

[79] Optimizing neural networks with kronecker-factored approximate curvature. International conference on machine learning, 2015.

[80] Schnet - a deep learning architecture for molecules and materials. The Journal of chemical physics, 2018.