Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution

Carlo Campajola; Christos Spyridon Koulouris

arXiv: 2605.20348 · v1 · pith:U2KKJD5Unew · submitted 2026-05-19 · 💱 q-fin.CP · cs.AI


Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3


keywords: reinforcement learning, optimal execution, Almgren-Chriss model, supra-competitive outcomes, multi-agent RL, liquidation games, intra-episode memory


The pith

Access to intra-episode memory lets RL agents in a two-agent liquidation game sustain lower implementation shortfalls than the game-theoretic benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether deep reinforcement learning agents can outperform the competitive equilibrium in an Almgren-Chriss optimal execution setting. It compares agents that commit to fixed schedules in advance with agents that receive ongoing feedback from recent prices and their own past trades. When memory is available, supra-competitive results become both more common and more sustained. A reader would care because this points to memory as the ingredient that turns a standard competitive game into one where agents can achieve better collective execution costs through state-contingent behavior.

Core claim

In the two-agent Almgren-Chriss liquidation game, DDQN agents that condition on intra-episode history, especially recent mid-prices and their own past actions, produce supra-competitive outcomes (implementation shortfalls lower than the relevant game-theoretic benchmark) at substantially higher rates and with greater persistence than agents restricted to ex-ante complete schedules.
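To make "lower implementation shortfall than the benchmark" concrete, here is a minimal sketch of a discrete-time, two-agent Almgren-Chriss sell game. It assumes linear permanent impact `gamma` driven by both agents' sales, linear temporary impact `eta` applied either to the aggregate flow or to each agent's own flow (the two environments in Figures 2 and 3), and implementation shortfall defined as initial inventory value minus realized proceeds. The function names, parameter values, and the "both agents below their benchmark" reading of supra-competitive are illustrative assumptions, not the paper's code.

```python
import math
import random
from typing import Optional, Sequence, Tuple


def simulate_shortfalls(
    sched1: Sequence[float],
    sched2: Sequence[float],
    s0: float = 100.0,
    sigma: float = 0.0,
    gamma: float = 0.01,
    eta: float = 0.05,
    tau: float = 1.0,
    aggregate_temp_impact: bool = True,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Return (IS_1, IS_2) for two sell schedules on a shared, hypothetical AC price path."""
    if len(sched1) != len(sched2):
        raise ValueError("schedules must have the same number of periods")
    rng = rng or random.Random(0)
    price = s0
    proceeds = [0.0, 0.0]
    for n1, n2 in zip(sched1, sched2):
        total = n1 + n2
        for i, n in enumerate((n1, n2)):
            # Temporary impact uses aggregate flow or the agent's own flow.
            flow = total if aggregate_temp_impact else n
            exec_price = price - eta * flow / tau
            proceeds[i] += n * exec_price
        # Mid-price moves by diffusion plus permanent impact of both agents' sales.
        price += sigma * math.sqrt(tau) * rng.gauss(0.0, 1.0) - gamma * total
    # Shortfall = value at the arrival price minus what was actually received.
    is1 = sum(sched1) * s0 - proceeds[0]
    is2 = sum(sched2) * s0 - proceeds[1]
    return is1, is2


def is_supra_competitive(shortfalls: Tuple[float, float], benchmark: Tuple[float, float]) -> bool:
    """Assumed reading: both agents beat their benchmark (e.g. Nash) shortfall."""
    return all(s < b for s, b in zip(shortfalls, benchmark))
```

With `sigma=0.0` the path is deterministic, so two agents each selling one unit per period over two periods incur a shortfall of 0.22 under aggregate temporary impact and 0.12 under own temporary impact with the default parameters. In practice the benchmark pair would come from the equilibrium schedules of the game, which this sketch does not derive.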

What carries the argument

The contrast between ex-ante schedule-learning agents and state-contingent DDQN policies that incorporate intra-episode feedback and memory within the Almgren-Chriss two-agent execution environment.

If this is right

Supra-competitive behavior requires state-contingent interaction along the realized execution path rather than multi-agent learning or current-price observation alone.
Ex-ante schedule commitment removes the conditions under which supra-competitive results emerge.
Recent prices combined with the agent's own past actions form the most effective memory signals for sustaining outperformance.
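The contrast above can be sketched as two kinds of decision input. An ex-ante agent's trade depends only on the step index, so it cannot react to the realized path; a history-aware agent sees a fixed-length window of recent mid-prices and its own past trades. The class name `HistoryWindow`, the window length `k`, and the feature layout (step, inventory, normalized prices, own actions) are hypothetical choices for illustration; the paper's architectures, including its Transformer-based variant, may encode history differently.

```python
from collections import deque
from typing import Deque, List, Sequence


def schedule_action(schedule: Sequence[float], step: int) -> float:
    """Ex-ante agent: the trade depends only on the step index, never on the realized path."""
    return schedule[step]


class HistoryWindow:
    """Hypothetical fixed-length memory of recent mid-prices and the agent's own past trades."""

    def __init__(self, k: int, s0: float) -> None:
        self.k = k
        self.s0 = s0
        self.prices: Deque[float] = deque(maxlen=k)
        self.actions: Deque[float] = deque(maxlen=k)

    def push(self, mid_price: float, own_action: float) -> None:
        # Oldest entries drop out automatically once k items are stored.
        self.prices.append(mid_price)
        self.actions.append(own_action)

    def observation(self, step: int, inventory: float) -> List[float]:
        # Left-pad with zeros so the vector length stays 2 + 2k from the first step.
        pad = self.k - len(self.prices)
        price_feats = [0.0] * pad + [p / self.s0 - 1.0 for p in self.prices]
        action_feats = [0.0] * pad + list(self.actions)
        return [float(step), inventory] + price_feats + action_feats
```

Setting `k=1` in this sketch gives a current-price-only agent, and dropping the action features removes own-action memory, which is one way to separate the two memory signals the referee asks about.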

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Market venues that limit real-time data feeds to trading algorithms might reduce the frequency of these memory-driven outcomes.
Similar memory effects could appear in other sequential multi-agent games where agents share a common price path.
Extending the setup to three or more agents would test whether the same memory channel continues to support supra-competitive execution.

Load-bearing premise

Differences in observed outcomes are caused by the presence or absence of memory and intra-episode feedback rather than by unexamined variations in training stability or hyperparameter choices.

What would settle it

A controlled retraining experiment in which agents receive identical hyperparameters and architectures but are denied access to intra-episode price history and past actions, with outcomes then compared against the original memory-enabled runs.
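One way to enforce "identical hyperparameters and architectures" in such an experiment is to derive each no-memory configuration from its memory-enabled twin so that exactly one field differs, and to pair them by seed. The field names and values below are placeholders, not the paper's settings.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Tuple


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical shared settings; values are placeholders, not the paper's.
    hidden_sizes: Tuple[int, ...] = (64, 64)
    learning_rate: float = 1e-3
    replay_buffer_size: int = 50_000
    batch_size: int = 64
    discount: float = 0.99
    target_update_every: int = 500
    train_episodes: int = 5_000
    seed: int = 0
    use_history: bool = True  # the only factor the ablation varies


def paired_ablation(seeds: List[int]) -> List[Tuple[TrainConfig, TrainConfig]]:
    """For each seed, return (memory-enabled, memory-ablated) configs differing in one flag."""
    pairs = []
    for s in seeds:
        base = TrainConfig(seed=s)
        pairs.append((base, replace(base, use_history=False)))
    return pairs


def differing_fields(a: TrainConfig, b: TrainConfig) -> List[str]:
    """Names of fields whose values differ; used to audit that only the ablated flag changed."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]
```

Because the ablated config is built with `dataclasses.replace`, any later change to a shared default propagates to both arms, and `differing_fields` gives a cheap check before launching runs.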

Figures

Figures reproduced from arXiv: 2605.20348 by Carlo Campajola, Christos Spyridon Koulouris.

**Figure 2.** Schedule-learning results in the aggregate-temporary-impact environment.

**Figure 3.** Schedule-learning results in the own-temporary-impact environment.

**Figure 4.** Average testing inventory paths in the own-temporary-impact environment.

**Figure 5.** Own-temp experiment with one player fixed at the aggregate-temporary-impact Nash

**Figure 6.** Baseline DDQN experiment with intra-episode feedback.

**Figure 7.** Integrated-gradient diagnostics for the baseline DDQN architecture in an illustrative

**Figure 8.** Integrated-gradient diagnostics for the price-conditioned DDQN architecture in an

**Figure 9.** Price-conditioned DDQN replication. (a) Final testing centroids relative to the continuous-time and grid-implemented Nash and TWAP benchmarks. (b) Rolling 20-episode share of training episodes in the collusive region under the discrete benchmark.

**Figure 10.** History-aware DDQN replication. (a) Final testing centroids relative to the Nash and TWAP benchmarks. (b) Rolling 20-episode share of training episodes in the supra-competitive region under the competitive benchmark.

**Figure 11.** Quadrant occupancies and transition probabilities during the last 500 training

**Figure 12.** Average testing inventory paths for the history-aware architecture, shown separately

**Figure 13.** Average Euclidean distance of the final testing centroids to the discrete Nash and
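Figures 9 to 11 and 13 summarize training and testing through rolling 20-episode shares, quadrant occupancies with transition probabilities, and centroid distances to benchmark points. The sketch below computes these quantities from per-episode shortfall pairs. It assumes, as our own reading rather than something the captions confirm, that the supra-competitive region is the quadrant where both agents' shortfalls fall below their benchmark values; all function names are hypothetical.

```python
import math
from collections import Counter, deque
from typing import Deque, Dict, List, Sequence, Tuple

Pair = Tuple[float, float]


def quadrant(is_pair: Pair, benchmark: Pair) -> str:
    """Label one episode by which agents beat their benchmark shortfall."""
    below1 = is_pair[0] < benchmark[0]
    below2 = is_pair[1] < benchmark[1]
    if below1 and below2:
        return "both_below"  # treated here as the supra-competitive region
    if below1:
        return "only_1_below"
    if below2:
        return "only_2_below"
    return "neither_below"


def rolling_share(labels: Sequence[str], window: int = 20, target: str = "both_below") -> List[float]:
    """Share of the last `window` episodes (fewer at the start) whose label equals `target`."""
    buf: Deque[bool] = deque(maxlen=window)
    out: List[float] = []
    for lab in labels:
        buf.append(lab == target)
        out.append(sum(buf) / len(buf))
    return out


def transition_probs(labels: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Empirical P(next label | current label) over consecutive episodes."""
    pairs = Counter(zip(labels, labels[1:]))
    totals = Counter(labels[:-1])
    return {(a, b): c / totals[a] for (a, b), c in pairs.items()}


def centroid(points: Sequence[Pair]) -> Pair:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def centroid_distances(points: Sequence[Pair], benchmarks: Dict[str, Pair]) -> Dict[str, float]:
    """Euclidean distance from the centroid of test outcomes to each named benchmark point."""
    c = centroid(points)
    return {name: math.dist(c, ref) for name, ref in benchmarks.items()}
```

The rolling share uses a shorter window for the first 19 episodes; whether the paper does the same, or starts the series at episode 20, is not stated in the captions.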

Abstract

In this paper, we investigate whether deep reinforcement-learning agents interacting in a shared optimal-execution environment can sustain supra-competitive outcomes, in the sense of achieving lower implementation shortfalls than the relevant game-theoretical competitive benchmark. We study a two-agent Almgren-Chriss liquidation game and examine how learned behavior depends on intra-episode environment feedback, the ability to interpret the mid-price and the agent's knowledge of the past. We first use ex-ante schedule-learning agents to remove intra-episode feedback and isolate what can arise when agents commit to complete liquidation trajectories before execution begins. We then allow agents to condition on the evolving state using a variety of DDQN architectures. We find that, when agents are given access to intra-episode history, especially recent prices and own past actions, supra-competitive outcomes become substantially more frequent and more persistent. These findings indicate that supra-competitive behavior in this execution game is driven not by multi-agent learning or by current price observation alone, but by feedback, memory, and state-contingent interaction along the realized execution path.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies a two-agent Almgren-Chriss liquidation game and asks whether deep RL agents can produce supra-competitive outcomes (lower implementation shortfalls than the game-theoretic benchmark). It first examines ex-ante schedule-learning agents that commit to full trajectories without intra-episode feedback, then compares these to DDQN agents that receive varying degrees of intra-episode state information, including recent prices and own past actions. The central empirical claim is that access to intra-episode history substantially raises both the frequency and persistence of supra-competitive outcomes.

Significance. If the attribution to memory holds after controlling for training differences, the result would indicate that state-contingent feedback along the execution path, rather than multi-agent learning or price observation alone, drives outperformance of the competitive benchmark. This has potential implications for the design of execution algorithms and for understanding emergent non-competitive behavior in multi-agent financial RL settings. The paper's use of a clearly defined game-theoretic benchmark and forward simulation is a methodological strength.

major comments (2)

[methods / experimental setup] Experimental setup (methods section describing DDQN variants): the manuscript does not state whether the DDQN agents with memory use identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria as the ex-ante schedule-learning baselines. If these differ, the reported increase in supra-competitive frequency could reflect optimization advantages or reduced non-stationarity rather than the memory mechanism itself.
[results] Results on frequency and persistence: the claim that supra-competitive outcomes become 'substantially more frequent and more persistent' with intra-episode history requires quantitative support (number of independent runs, error bars or confidence intervals on the reported frequencies, and statistical tests comparing conditions). Without these, it is impossible to separate the memory effect from training variance.

minor comments (2)

Notation: the distinction between 'ex-ante schedule-learning agents' and the various DDQN state-input configurations should be summarized in a single table for clarity.
[results] The abstract states that the effect is driven by 'feedback, memory, and state-contingent interaction'; the results section should explicitly isolate which component (recent prices vs. own past actions) contributes most.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and rigor of our work. We address each major comment below.


Referee: [methods / experimental setup] Experimental setup (methods section describing DDQN variants): the manuscript does not state whether the DDQN agents with memory use identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria as the ex-ante schedule-learning baselines. If these differ, the reported increase in supra-competitive frequency could reflect optimization advantages or reduced non-stationarity rather than the memory mechanism itself.

Authors: We confirm that all agents—both the ex-ante schedule-learning baselines and the DDQN variants with varying intra-episode state information—were trained using identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria. This design choice was made explicitly to isolate the contribution of intra-episode memory and state feedback. We will add a dedicated paragraph in the revised Methods section stating these shared hyperparameters and training protocols. revision: yes
Referee: [results] Results on frequency and persistence: the claim that supra-competitive outcomes become 'substantially more frequent and more persistent' with intra-episode history requires quantitative support (number of independent runs, error bars or confidence intervals on the reported frequencies, and statistical tests comparing conditions). Without these, it is impossible to separate the memory effect from training variance.

Authors: We agree that additional quantitative detail is necessary to support the frequency and persistence claims. Our experiments were performed across multiple independent training runs using different random seeds. In the revision we will report the exact number of runs, include error bars or confidence intervals on the supra-competitive frequencies, and add appropriate statistical comparisons (e.g., two-sample t-tests) between the memory and no-memory conditions. revision: yes
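As a sketch of the promised statistics, the natural unit of analysis is one supra-competitive frequency per independent training seed. The code below computes Welch's t statistic and degrees of freedom, a percentile bootstrap confidence interval for one condition's mean, and a two-sided permutation p-value, using only the standard library. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns Welch's test with a p-value directly. The function names and the per-seed framing are assumptions, not the authors' stated procedure.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom (unequal variances)."""
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


def bootstrap_ci(x: Sequence[float], n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-seed frequencies."""
    rng = random.Random(seed)
    means = sorted(statistics.fmean(rng.choices(x, k=len(x))) for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]


def permutation_pvalue(a: Sequence[float], b: Sequence[float], n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for a difference in means under random relabeling of seeds."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):]))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)
```

With only a handful of seeds per condition, the permutation test avoids leaning on normality, which is why it is included alongside the t statistic.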

Circularity Check

0 steps flagged

No circularity: empirical simulation results against external benchmark


The paper reports outcomes from forward simulation of RL policies (DDQN variants with and without intra-episode state) against the externally defined Almgren-Chriss game-theoretic benchmark. No derivation step reduces a claimed result to a fitted parameter or self-citation by construction; the frequency of supra-competitive outcomes is measured directly from independent rollouts rather than being algebraically entailed by the training objective or prior author work. The central attribution to memory is therefore an empirical observation, not a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the Almgren-Chriss model captures the relevant price-impact dynamics and on the standard RL training assumptions implicit in DDQN; no new entities are postulated and free parameters are limited to typical neural-network hyperparameters.

free parameters (1)

DDQN architecture and training hyperparameters
Various DDQN architectures are tested; their specific layer sizes, learning rates, and replay-buffer settings are fitted or chosen to produce stable learning.
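For readers unfamiliar with the "double" in DDQN, the sketch below shows the standard Double DQN bootstrap target: the online network selects the next action and the target network evaluates it, which reduces the overestimation of a plain max over target-network values. The discount, batch layout, and function name are illustrative assumptions; the reward design and network details used in the paper are not given on this page.

```python
from typing import List, Sequence


def double_dqn_targets(
    rewards: Sequence[float],
    next_q_online: Sequence[Sequence[float]],
    next_q_target: Sequence[Sequence[float]],
    dones: Sequence[bool],
    discount: float = 0.99,
) -> List[float]:
    """y = r + discount * Q_target(s', argmax_a Q_online(s', a)); y = r at terminal steps."""
    targets: List[float] = []
    for r, q_on, q_tg, done in zip(rewards, next_q_online, next_q_target, dones):
        if done:
            targets.append(r)  # no bootstrap past the end of the episode
            continue
        a_star = max(range(len(q_on)), key=q_on.__getitem__)  # action selected by online net
        targets.append(r + discount * q_tg[a_star])  # value evaluated by target net
    return targets
```

In the example from the tests, the online network prefers action 1 while the target network's largest value sits at action 2, so the Double DQN target is 1.5 rather than the 3.5 a plain max over target values would give.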

axioms (1)

Domain assumption: the Almgren-Chriss model is a valid representation of optimal-execution price dynamics.
The entire study is conducted inside a two-agent Almgren-Chriss liquidation game.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
Passage: "supra-competitive outcomes become substantially more frequent and more persistent... driven not by multi-agent learning or by current price observation alone, but by feedback, memory, and state-contingent interaction"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Relation between the paper passage and the cited Recognition theorem.
Passage: "history-aware DDQN... Transformer encoder with masked self-attention... recent prices and own past actions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 5 internal anchors

[1] [1]

IEEE Transactions on Neural Networks , volume =

Learning to trade via direct reinforcement , author =. IEEE Transactions on Neural Networks , volume =. 2001 , publisher =

work page 2001

[2] [2]

Expert Systems with Applications , volume =

An automated FX trading system using adaptive reinforcement learning , author =. Expert Systems with Applications , volume =. 2006 , publisher =

work page 2006

[3] [3]

Data , volume=

Reinforcement learning in financial markets , author=. Data , volume=. 2019 , publisher=

work page 2019

[4] [4]

Mathematical Finance , volume=

Recent advances in reinforcement learning in finance , author=. Mathematical Finance , volume=. 2023 , publisher=

work page 2023

[5] [5]

Annual Review of Statistics and Its Application , volume=

A review of reinforcement learning in financial applications , author=. Annual Review of Statistics and Its Application , volume=. 2025 , publisher=

work page 2025

[6] [6]

arXiv preprint arXiv:1911.10107 , year=

Deep reinforcement learning for trading , author=. arXiv preprint arXiv:1911.10107 , year=

work page arXiv 1911

[7] [7]

arXiv preprint arXiv:2101.07107 , year=

Deep reinforcement learning for active high frequency trading , author=. arXiv preprint arXiv:2101.07107 , year=

work page arXiv

[8] [8]

ESANN 2018-Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning , pages=

Reinforcement learning for high-frequency market making , author=. ESANN 2018-Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning , pages=. 2018 , organization=

work page 2018

[9] [9]

Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems , pages=

Performance of deep reinforcement learning for high frequency market making on actual tick data , author=. Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems , pages=

work page

[10] [10]

2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT) , pages=

Automatic Optimization of Trading Strategies Based on Reinforcement Learning , author=. 2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT) , pages=. 2025 , organization=

work page 2025

[11] [11]

Journal of Risk , volume=

Optimal execution of portfolio transactions , author=. Journal of Risk , volume=

work page

[12] [12]

Quantitative Finance , volume=

Limit order books , author=. Quantitative Finance , volume=. 2013 , publisher=

work page 2013

[13] [13]

Proceedings of the 23rd international conference on Machine learning , pages=

Reinforcement learning for optimized trade execution , author=. Proceedings of the 23rd international conference on Machine learning , pages=

work page

[14] [14]

2014 IEEE Conference on computational intelligence for financial engineering & economics (CIFEr) , pages=

A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution , author=. 2014 IEEE Conference on computational intelligence for financial engineering & economics (CIFEr) , pages=. 2014 , organization=

work page 2014

[15] [15]

Applied Mathematical Finance , volume=

Double deep q-learning for optimal execution , author=. Applied Mathematical Finance , volume=. 2021 , publisher=

work page 2021

[16] [16]

Available at SSRN 3374766 , year=

Deep execution-value and policy based reinforcement learning for trading and beating market benchmarks , author=. Available at SSRN 3374766 , year=

work page

[17] [17]

Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence , pages=

An end-to-end optimal trade execution framework based on proximal policy optimization , author=. Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence , pages=

work page

[18] [18]

European Journal of Operational Research , volume=

Deep reinforcement learning for the optimal placement of cryptocurrency limit orders , author=. European Journal of Operational Research , volume=. 2022 , publisher=

work page 2022

[19] Universal trading for order execution with oracle policy distillation. Proceedings of the AAAI Conference on Artificial Intelligence.

[20] Cost-efficient reinforcement learning for optimal trade execution on dynamic market environment. Proceedings of the Third ACM International Conference on AI in Finance.

[21] Learning a functional control for high-frequency finance. Quantitative Finance, 2022.

[22] A reinforcement learning approach to optimal execution. Quantitative Finance.

[23] Practical application of deep reinforcement learning to optimal trade execution. FinTech, 2023.

[24] Reinforcement learning for optimal execution when liquidity is time-varying. Applied Mathematical Finance, 2024.

[25] Deep Reinforcement Learning for Online Optimal Execution Strategies. arXiv preprint arXiv:2410.13493.

[26] Joint Learning of Volume Scheduling and Order Placement Policies for Optimal Order Execution. Mathematics, 2024.

[27] MacMic: Executing iceberg orders via hierarchical reinforcement learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24.

[28] An adaptive dual-level reinforcement learning approach for optimal trade execution. Expert Systems with Applications, 2024.

[29] Learn continuously, act discretely: Hybrid action-space reinforcement learning for optimal execution. arXiv preprint arXiv:2207.11152.

[30] Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037.

[31] Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068.

[32] Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326.

[33] Q-learning agents in a Cournot oligopoly model. Journal of Economic Dynamics and Control, 2008.

[34] Artificial intelligence, algorithmic pricing, and collusion. American Economic Review, 2020.

[35] Autonomous algorithmic collusion: Q-learning under sequential pricing. The RAND Journal of Economics, 2021.

[36] Exploring Competitive and Collusive Behaviors in Algorithmic Pricing with Deep Reinforcement Learning. arXiv preprint arXiv:2503.11270.

[37] On Mechanism Underlying Algorithmic Collusion. arXiv preprint arXiv:2409.01147.

[38] Artificial Collusion: Examining Supracompetitive Pricing by Q-Learning Algorithms. 2022.

[39] Artificial Intelligence: Can Seemingly Collusive Outcomes Be Avoided? Management Science, 2023.

[40] Learning to Mitigate AI Collusion on Economic Platforms. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.

[41] Algorithmic Collusion is Algorithm Orchestration. arXiv preprint arXiv:2508.14766.

[42] Transient impact from the Nash equilibrium of a permanent market impact game. Dynamic Games and Applications, 2024.

[43] A State-Constrained Differential Game Arising in Optimal Portfolio Liquidation. Mathematical Finance.

[44] Dynamics of Market Making Algorithms in Dealer Markets: Learning and Tacit Collusion. Mathematical Finance, 2024.

[45] Cooperation Between Independent Market Makers. Quantitative Finance, 2022.

[46] Deviations from the Nash Equilibrium and Emergence of Tacit Collusion in a Two-Player Optimal Execution Game with Reinforcement Learning. arXiv preprint arXiv:2408.11773.

[47] Algorithmic Collusion in Electronic Markets: The Impact of Tick Size. SSRN Electronic Journal.

[48] Dou, Winston Wei; Goldstein, Itay; Ji, Yan.

[49] The Invisible Handshake: Tacit Collusion Between Adaptive Market Agents. arXiv preprint arXiv:2510.15995.

[50] Reinforcement Learning for Market Making in a Multi-Agent Dealer Market. arXiv preprint arXiv:1911.05892.

[51] Ardon, Leo; Vadori, Nelson; Spooner, Thomas; Xu, Mengda; Vann, Jared; Ganesh, Sumitra. Towards a Fully. 2021.

[52] Towards Multi-Agent Reinforcement Learning-Driven Over-the-Counter Market Simulations. Mathematical Finance.

[53] Reinforcement Learning in High-Frequency Market Making. arXiv preprint arXiv:2407.21025.

[54] Eberhard, Onno; Vernade, Claire; Muehlebach, Michael. Proceedings of the 7th Annual Conference on Learning for Dynamics and Control, 2025.

[55] FiLM: Visual reasoning with a general conditioning layer. Proceedings of the AAAI Conference on Artificial Intelligence.

[56] Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[57] Cheridito, Patrick; Weiss, Moritz. Quantitative Finance, 2026. doi:10.1080/14697688.2026.2631116.