
arxiv: 2605.20348 · v1 · pith:U2KKJD5Unew · submitted 2026-05-19 · 💱 q-fin.CP · cs.AI

Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution

Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3

keywords reinforcement learning · optimal execution · Almgren-Chriss model · supra-competitive outcomes · multi-agent RL · liquidation games · intra-episode memory

The pith

Access to intra-episode memory lets RL agents in a two-agent liquidation game sustain lower implementation shortfalls than the game-theoretic benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether deep reinforcement learning agents can outperform the competitive equilibrium in an Almgren-Chriss optimal execution setting. It compares agents that commit to fixed schedules in advance with agents that receive ongoing feedback from recent prices and their own past trades. When memory is available, supra-competitive results become both more common and more sustained. A reader would care because this points to memory as the ingredient that turns a standard competitive game into one where agents can achieve better collective execution costs through state-contingent behavior.

Core claim

In the two-agent Almgren-Chriss liquidation game, DDQN agents that condition on intra-episode history, especially recent mid-prices and their own past actions, produce supra-competitive outcomes (implementation shortfalls below the relevant game-theoretic benchmark) at substantially higher rates and with greater persistence than agents restricted to ex-ante complete schedules.
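
To make the benchmark comparison concrete, the sketch below computes each agent's implementation shortfall in a noise-free, linear-impact version of a two-agent Almgren-Chriss game and checks whether a pair of outcomes lies below a benchmark pair. It is an illustration only: the function names, the parameter values (`s0`, `perm`, `temp`), the omission of price noise, and the reading of "supra-competitive" as both agents beating the benchmark are assumptions, not details taken from the paper. The `aggregate_temp` switch mirrors the aggregate- versus own-temporary-impact environments named in the figure captions.

```python
from typing import Sequence, Tuple


def implementation_shortfalls(
    sells_a: Sequence[float],
    sells_b: Sequence[float],
    s0: float = 100.0,            # initial mid-price (hypothetical value)
    perm: float = 0.001,          # linear permanent-impact coefficient (hypothetical)
    temp: float = 0.01,           # linear temporary-impact coefficient (hypothetical)
    aggregate_temp: bool = True,  # True: temporary impact depends on total flow
) -> Tuple[float, float]:
    """Return (IS_a, IS_b) for two sellers sharing one mid-price path, without noise."""
    if len(sells_a) != len(sells_b):
        raise ValueError("both schedules must cover the same number of steps")
    mid = s0
    revenue_a = 0.0
    revenue_b = 0.0
    for n_a, n_b in zip(sells_a, sells_b):
        total = n_a + n_b
        # Temporary impact lowers the execution price for this step only.
        pressure_a = total if aggregate_temp else n_a
        pressure_b = total if aggregate_temp else n_b
        revenue_a += n_a * (mid - temp * pressure_a)
        revenue_b += n_b * (mid - temp * pressure_b)
        # Permanent impact moves the shared mid-price for all later steps.
        mid -= perm * total
    # Shortfall = value of the position at the initial price minus realised revenue.
    return s0 * sum(sells_a) - revenue_a, s0 * sum(sells_b) - revenue_b


def is_supra_competitive(
    shortfalls: Tuple[float, float], benchmark: Tuple[float, float]
) -> bool:
    """Assumed reading: both agents achieve a lower shortfall than the benchmark."""
    return shortfalls[0] < benchmark[0] and shortfalls[1] < benchmark[1]


if __name__ == "__main__":
    twap = [25.0, 25.0, 25.0, 25.0]  # equal slices, used here only as an example
    print(implementation_shortfalls(twap, twap))
```

The benchmark pair is passed in rather than computed, because the Nash schedule depends on model details that this summary does not state.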

What carries the argument

The contrast between ex-ante schedule-learning agents and state-contingent DDQN policies that incorporate intra-episode feedback and memory within the Almgren-Chriss two-agent execution environment.
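
For readers unfamiliar with the acronym, DDQN refers to Double DQN, in which the online network selects the next action and the target network evaluates it. The sketch below shows that standard target computation for a single transition. The function name, the discount value, and the use of plain Python lists for Q-values are assumptions for illustration; the paper's specific architectures, rewards, and hyperparameters are not reproduced here.

```python
from typing import Sequence


def double_dqn_target(
    reward: float,
    next_q_online: Sequence[float],  # online-network Q-values at the next state
    next_q_target: Sequence[float],  # target-network Q-values at the next state
    gamma: float = 0.99,             # discount factor (hypothetical value)
    done: bool = False,
) -> float:
    """Bootstrapped target: select the action with the online net, evaluate with the target net."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    print(double_dqn_target(1.0, [0.2, 0.9, 0.5], [1.0, 0.3, 2.0], gamma=0.5))
```

Decoupling selection from evaluation is what distinguishes this from a plain DQN target, which would take the maximum of the target network's values directly.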

If this is right

  • Supra-competitive behavior requires state-contingent interaction along the realized execution path rather than multi-agent learning or current-price observation alone.
  • Ex-ante schedule commitment removes the conditions under which supra-competitive results emerge.
  • Recent prices combined with the agent's own past actions form the most effective memory signals for sustaining outperformance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Market venues that limit real-time data feeds to trading algorithms might reduce the frequency of these memory-driven outcomes.
  • Similar memory effects could appear in other sequential multi-agent games where agents share a common price path.
  • Extending the setup to three or more agents would test whether the same memory channel continues to support supra-competitive execution.

Load-bearing premise

Differences in observed outcomes are caused by the presence or absence of memory and intra-episode feedback rather than by unexamined variations in training stability or hyperparameter choices.

What would settle it

A controlled retraining experiment in which agents receive identical hyperparameters and architectures but are denied access to intra-episode price history and past actions, with outcomes then compared against the original memory-enabled runs.
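
One way to set up such an ablation without changing the network is to keep the observation length fixed and zero out the history slots when memory is disabled. The sketch below assumes a hypothetical `ExecutionState` helper, a short window of recent mid-prices and own trades, and zero padding; none of these names or choices come from the paper.

```python
from collections import deque
from typing import Deque, List


class ExecutionState:
    """Builds a fixed-length observation; history slots are zeroed when memory is off."""

    def __init__(self, window: int = 3, use_memory: bool = True) -> None:
        self.window = window
        self.use_memory = use_memory
        self.prices: Deque[float] = deque(maxlen=window)   # recent mid-prices
        self.actions: Deque[float] = deque(maxlen=window)  # agent's own recent trades

    def record(self, mid_price: float, own_action: float) -> None:
        """Store the latest mid-price and own trade; older entries drop out of the window."""
        self.prices.append(mid_price)
        self.actions.append(own_action)

    def _padded(self, buffer: Deque[float]) -> List[float]:
        # Left-pad with zeros so the vector length never depends on the episode step.
        values = list(buffer) if self.use_memory else []
        return [0.0] * (self.window - len(values)) + values

    def vector(self, time_left: float, inventory: float, mid_price: float) -> List[float]:
        """Current-state features followed by the price and action history slots."""
        current = [time_left, inventory, mid_price]
        return current + self._padded(self.prices) + self._padded(self.actions)


if __name__ == "__main__":
    for flag in (True, False):
        state = ExecutionState(window=2, use_memory=flag)
        state.record(100.0, 5.0)
        state.record(99.5, 3.0)
        print(flag, state.vector(0.5, 10.0, 99.2))
```

Because both settings yield vectors of the same length, the same architecture and hyperparameters can be reused across conditions, so any difference in outcomes is harder to attribute to network capacity.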

Figures

Figures reproduced from arXiv: 2605.20348 by Carlo Campajola, Christos Spyridon Koulouris.

Figure 1: Benchmark inventory paths for the symmetric risk-neutral two-player setting. The
Figure 2: Schedule-learning results in the aggregate-temporary-impact environment.
Figure 3: Schedule-learning results in the own-temporary-impact environment.
Figure 4: Average testing inventory paths in the own-temporary-impact environment.
Figure 5: Own-temp experiment with one player fixed at the aggregate-temporary-impact Nash
Figure 6: Baseline DDQN experiment with intra-episode feedback.
Figure 7: Integrated-gradient diagnostics for the baseline DDQN architecture in an illustrative
Figure 8: Integrated-gradient diagnostics for the price conditioned DDQN architecture in an
Figure 9: Price-conditioned DDQN replication. (a) Final testing centroids relative to the continuous-time and grid-implemented Nash and TWAP benchmarks. (b) Rolling 20-episode share of training episodes in the collusive region under the discrete benchmark.
Figure 10: History-aware DDQN replication. (a) Final testing centroids relative to the Nash and TWAP benchmarks. (b) Rolling 20-episode share of training episodes in the supra-competitive region under the competitive benchmark.
Figure 11: Quadrant occupancies and transition probabilities during the last 500 training
Figure 12: Average testing inventory paths for the history-aware architecture, shown separately
Figure 13: Average Euclidean distance of the final testing centroids to the discrete Nash and
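
Panel (b) of Figures 9 and 10 reports a rolling 20-episode share of training episodes that fall in a given region. A minimal sketch of such a statistic follows, assuming each episode has already been labelled True or False for membership in the region. The labelling rule, the function name, and the choice to emit values only once a full window is available are illustrative assumptions, not details from the paper.

```python
from typing import List, Sequence


def rolling_share(flags: Sequence[bool], window: int = 20) -> List[float]:
    """Share of True flags in each full trailing window of the given length."""
    if window <= 0:
        raise ValueError("window must be positive")
    shares: List[float] = []
    running = 0
    for i, flag in enumerate(flags):
        running += int(flag)
        if i >= window:
            # Drop the episode that just left the trailing window.
            running -= int(flags[i - window])
        if i >= window - 1:
            shares.append(running / window)
    return shares


if __name__ == "__main__":
    # Placeholder labels, not data from the paper.
    print(rolling_share([True, True, True, False, False], window=2))
```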

In this paper, we investigate whether deep reinforcement-learning agents interacting in a shared optimal-execution environment can sustain supra-competitive outcomes, in the sense of achieving lower implementation shortfalls than the relevant game-theoretical competitive benchmark. We study a two-agent Almgren-Chriss liquidation game and examine how learned behavior depends on intra-episode environment feedback, the ability to interpret the mid-price, and the agent's knowledge of the past. We first use ex-ante schedule-learning agents to remove intra-episode feedback and isolate what can arise when agents commit to complete liquidation trajectories before execution begins. We then allow agents to condition on the evolving state using a variety of DDQN architectures. We find that, when agents are given access to intra-episode history, especially recent prices and own past actions, supra-competitive outcomes become substantially more frequent and more persistent. These findings indicate that supra-competitive behavior in this execution game is driven not by multi-agent learning or by current price observation alone, but by feedback, memory, and state-contingent interaction along the realized execution path.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies a two-agent Almgren-Chriss liquidation game and asks whether deep RL agents can produce supra-competitive outcomes (lower implementation shortfalls than the game-theoretic benchmark). It first examines ex-ante schedule-learning agents that commit to full trajectories without intra-episode feedback, then compares these to DDQN agents that receive varying degrees of intra-episode state information, including recent prices and own past actions. The central empirical claim is that access to intra-episode history substantially raises both the frequency and persistence of supra-competitive outcomes.

Significance. If the attribution to memory holds after controlling for training differences, the result would indicate that state-contingent feedback along the execution path, rather than multi-agent learning or price observation alone, drives outperformance of the competitive benchmark. This has potential implications for the design of execution algorithms and for understanding emergent non-competitive behavior in multi-agent financial RL settings. The paper's use of a clearly defined game-theoretic benchmark and forward simulation is a methodological strength.

major comments (2)
  1. [methods / experimental setup] Experimental setup (methods section describing DDQN variants): the manuscript does not state whether the DDQN agents with memory use identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria as the ex-ante schedule-learning baselines. If these differ, the reported increase in supra-competitive frequency could reflect optimization advantages or reduced non-stationarity rather than the memory mechanism itself.
  2. [results] Results on frequency and persistence: the claim that supra-competitive outcomes become 'substantially more frequent and more persistent' with intra-episode history requires quantitative support (number of independent runs, error bars or confidence intervals on the reported frequencies, and statistical tests comparing conditions). Without these, it is impossible to separate the memory effect from training variance.
minor comments (2)
  1. Notation: the distinction between 'ex-ante schedule-learning agents' and the various DDQN state-input configurations should be summarized in a single table for clarity.
  2. [results] The abstract states that the effect is driven by 'feedback, memory, and state-contingent interaction'; the results section should explicitly isolate which component (recent prices vs. own past actions) contributes most.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and rigor of our work. We address each major comment below.

  1. Referee: [methods / experimental setup] Experimental setup (methods section describing DDQN variants): the manuscript does not state whether the DDQN agents with memory use identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria as the ex-ante schedule-learning baselines. If these differ, the reported increase in supra-competitive frequency could reflect optimization advantages or reduced non-stationarity rather than the memory mechanism itself.

    Authors: We confirm that all agents, both the ex-ante schedule-learning baselines and the DDQN variants with varying intra-episode state information, were trained using identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria. This design choice was made explicitly to isolate the contribution of intra-episode memory and state feedback. We will add a dedicated paragraph in the revised Methods section stating these shared hyperparameters and training protocols. revision: yes

  2. Referee: [results] Results on frequency and persistence: the claim that supra-competitive outcomes become 'substantially more frequent and more persistent' with intra-episode history requires quantitative support (number of independent runs, error bars or confidence intervals on the reported frequencies, and statistical tests comparing conditions). Without these, it is impossible to separate the memory effect from training variance.

    Authors: We agree that additional quantitative detail is necessary to support the frequency and persistence claims. Our experiments were performed across multiple independent training runs using different random seeds. In the revision we will report the exact number of runs, include error bars or confidence intervals on the supra-competitive frequencies, and add appropriate statistical comparisons (e.g., two-sample t-tests) between the memory and no-memory conditions. revision: yes
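
The comparison proposed in this response can be sketched as follows. The code computes Welch's unequal-variance form of the two-sample t statistic on per-seed supra-competitive frequencies, together with the Welch-Satterthwaite degrees of freedom. The choice of Welch's variant, the function name, and the numbers in the usage lines are assumptions for illustration; the numbers are placeholders, not results from the paper.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(sample_a: Sequence[float], sample_b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard errors of the two sample means (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


if __name__ == "__main__":
    # Placeholder per-seed frequencies, NOT values reported in the paper.
    memory_runs = [0.42, 0.55, 0.47, 0.60, 0.51]
    no_memory_runs = [0.12, 0.20, 0.09, 0.15, 0.18]
    print(welch_t(memory_runs, no_memory_runs))
```

Turning the statistic into a p-value requires the Student t distribution, which the Python standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.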

Circularity Check

0 steps flagged

No circularity: empirical simulation results against external benchmark


The paper reports outcomes from forward simulation of RL policies (DDQN variants with and without intra-episode state) against the externally defined Almgren-Chriss game-theoretic benchmark. No derivation step reduces a claimed result to a fitted parameter or self-citation by construction; the frequency of supra-competitive outcomes is measured directly from independent rollouts rather than being algebraically entailed by the training objective or prior author work. The central attribution to memory is therefore an empirical observation, not a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the Almgren-Chriss model captures the relevant price-impact dynamics and on the standard RL training assumptions implicit in DDQN; no new entities are postulated and free parameters are limited to typical neural-network hyperparameters.

free parameters (1)
  • DDQN architecture and training hyperparameters
    Various DDQN architectures are tested; their specific layer sizes, learning rates, and replay-buffer settings are fitted or chosen to produce stable learning.
axioms (1)
  • domain assumption: The Almgren-Chriss model is a valid representation of optimal execution price dynamics.
    The entire study is conducted inside a two-agent Almgren-Chriss liquidation game.

pith-pipeline@v0.9.0 · 5723 in / 1377 out tokens · 46570 ms · 2026-05-21T07:01:47.143491+00:00 · methodology

