Realistic Market Impact Modeling for Reinforcement Learning Trading Environments
Pith reviewed 2026-05-14 20:58 UTC · model grok-4.3
The pith
Realistic nonlinear market impact costs change both absolute performance and relative rankings of reinforcement learning trading algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The MACE environments integrate pluggable Almgren-Chriss cost models into three trading tasks; when five DRL algorithms are evaluated under both fixed and full impact costs, the realistic model produces dramatically lower turnover and costs while reversing or shifting algorithm rankings in an environment-specific manner.
What carries the argument
Pluggable Almgren-Chriss cost module with square-root temporary impact and exponential-decay permanent impact, embedded inside Gymnasium trading environments that log trade-level execution costs.
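To make "pluggable" concrete, here is a minimal sketch of how a cost module of this kind could be structured. It is not the MACE code: the class names, the `cost` signature, the coefficient `y`, and the decay `rate` are all hypothetical, and the way the paper combines temporary and permanent impact is not stated in this summary.

```python
from dataclasses import dataclass
from math import exp, log, sqrt
from typing import Protocol


class CostModel(Protocol):
    """Hypothetical interface: any object with this method can be swapped in."""

    def cost(self, shares: float, price: float, volume: float, sigma: float) -> float:
        ...


@dataclass
class FixedBpsCost:
    bps: float = 10.0  # the fixed 10 bps baseline

    def cost(self, shares: float, price: float, volume: float, sigma: float) -> float:
        # Cost is a constant fraction of traded notional.
        return abs(shares) * price * self.bps / 10_000.0


@dataclass
class SqrtImpactCost:
    y: float = 1.0  # hypothetical impact coefficient

    def cost(self, shares: float, price: float, volume: float, sigma: float) -> float:
        # Fractional price impact from the square-root law: y * sigma * sqrt(|Q| / V).
        impact = self.y * sigma * sqrt(abs(shares) / volume)
        return impact * abs(shares) * price


@dataclass
class DecayingImpact:
    rate: float = log(2.0)  # hypothetical: carried impact halves every step
    level: float = 0.0

    def step(self, new_impact: float) -> float:
        # Decay what was carried over, then add this step's contribution.
        self.level = self.level * exp(-self.rate) + new_impact
        return self.level


def execution_cost(model: CostModel, shares: float, price: float,
                   volume: float, sigma: float) -> float:
    # The environment only sees the interface, so models are interchangeable.
    return model.cost(shares, price, volume, sigma)
```

Under the fixed model the cost grows linearly with trade size, while under the square-root model it grows as size to the power 3/2, which is one plausible reason large, frequent trades become unattractive to the agent.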
If this is right
- Absolute performance numbers for A2C, PPO, DDPG, SAC, and TD3 all shift when realistic impact is used instead of fixed costs.
- The ordering of which algorithm performs best changes across the three environments once impact is modeled.
- Agents switch from high-turnover policies (19 percent daily) to low-turnover policies (1 percent daily) under the full cost model.
- Hyperparameter tuning becomes necessary to prevent the agent from incurring extreme costs that the fixed-cost baseline hides.
- Algorithm-cost interactions differ by task, with some algorithms improving and others worsening under realistic impact.
Where Pith is reading between the lines
- Published RL trading results that rely only on fixed or zero transaction costs are likely to overstate live performance.
- Any new trading environment or benchmark should include at least one realistic impact variant as a default test case.
- Sensitivity analysis across cost models could become a standard step when selecting an algorithm for production trading.
Load-bearing premise
The Almgren-Chriss framework together with the square-root impact law accurately describes market impact for the NASDAQ-100 stocks and holding periods used in the tests.
What would settle it
A direct comparison of the model's predicted daily execution costs against actual realized slippage on the same NASDAQ-100 trades executed through a live broker at comparable sizes and speeds.
Original abstract
Reinforcement learning (RL) has shown promise for trading, yet most open-source backtesting environments assume negligible or fixed transaction costs, causing agents to learn trading behaviors that fail under realistic execution. We introduce three Gymnasium-compatible trading environments -- MACE (Market-Adjusted Cost Execution) stock trading, margin trading, and portfolio optimization -- that integrate nonlinear market impact models grounded in the Almgren-Chriss framework and the empirically validated square-root impact law. Each environment provides pluggable cost models, permanent impact tracking with exponential decay, and comprehensive trade-level logging. We evaluate five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ-100, comparing a fixed 10 bps baseline against the AC model with Optuna-tuned hyperparameters. Our results show that (i) the cost model materially changes both absolute performance and the relative ranking of algorithms across all three environments; (ii) the AC model produces dramatically different trading behavior, e.g., daily costs dropping from $200k to $8k with turnover falling from 19% to 1%; (iii) hyperparameter optimization is essential for constraining pathological trading, with costs dropping up to 82%; and (iv) algorithm-cost model interactions are strongly environment-specific, e.g., DDPG's OOS Sharpe jumps from -2.1 to 0.3 under AC in margin trading while SAC's drops from -0.5 to -1.2. We release the full suite as an open-source extension to FinRL-Meta.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces three Gymnasium-compatible RL trading environments (MACE for stock trading, margin trading, and portfolio optimization) that embed nonlinear market impact via the Almgren-Chriss framework and the square-root impact law, with pluggable cost models, permanent impact decay, and detailed logging. It evaluates five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on NASDAQ-100 data, contrasting a fixed 10 bps baseline against the AC model under Optuna-tuned hyperparameters, and reports that the cost model alters absolute performance, algorithm rankings, and trading behavior (e.g., turnover dropping from 19% to 1% and costs from $200k to $8k), while stressing that hyperparameter optimization is required to prevent pathological policies.
Significance. If the central claims hold after addressing evaluation confounds, the work provides a concrete demonstration that simplified transaction-cost assumptions in RL trading agents produce unrealistic policies, and supplies reusable environments that can improve the fidelity of future research. The open-source release as a FinRL-Meta extension and the observation of environment-specific algorithm-cost interactions are practical strengths.
major comments (2)
- [Abstract and §4 (Evaluation)] Abstract and evaluation results: the claim that the cost model 'materially changes both absolute performance and the relative ranking of algorithms' is not isolated from hyperparameter tuning. Optuna tuning is applied only to the AC model (explicitly noted as essential to avoid pathological behavior), while the 10 bps baseline remains fixed; consequently, observed shifts (turnover 19%→1%, costs $200k→$8k, Sharpe changes such as DDPG -2.1→0.3) cannot be unambiguously attributed to the nonlinear impact model rather than the extra optimization step.
- [Results section] Results on algorithm-cost interactions: reported out-of-sample Sharpe differences across environments lack accompanying statistical details (number of independent runs, standard errors, or significance tests), so it is unclear whether the claimed ranking reversals are robust or sensitive to random seeds and data splits.
minor comments (2)
- [Abstract] The abstract states that each environment provides 'comprehensive trade-level logging' but does not enumerate the exact logged fields or how they are aggregated into the reported metrics.
- [Methods] Notation for the square-root impact law and the exponential decay of permanent impact should be defined explicitly with equation numbers in the methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive overall assessment of the work. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
-
Referee: [Abstract and §4 (Evaluation)] Abstract and evaluation results: the claim that the cost model 'materially changes both absolute performance and the relative ranking of algorithms' is not isolated from hyperparameter tuning. Optuna tuning is applied only to the AC model (explicitly noted as essential to avoid pathological behavior), while the 10 bps baseline remains fixed; consequently, observed shifts (turnover 19%→1%, costs $200k→$8k, Sharpe changes such as DDPG -2.1→0.3) cannot be unambiguously attributed to the nonlinear impact model rather than the extra optimization step.
Authors: We agree that the experimental design confounds the cost-model effect with the hyperparameter-optimization step. The manuscript already notes that tuning is required for the AC model to prevent pathological behavior, but the referee is correct that this asymmetry prevents unambiguous attribution. In the revision we will run Optuna tuning on the fixed 10 bps baseline as well, re-evaluate all algorithms under both tuned settings, and explicitly compare the two regimes to isolate the contribution of the nonlinear impact model.
Revision: yes
-
Referee: [Results section] Results on algorithm-cost interactions: reported out-of-sample Sharpe differences across environments lack accompanying statistical details (number of independent runs, standard errors, or significance tests), so it is unclear whether the claimed ranking reversals are robust or sensitive to random seeds and data splits.
Authors: We accept this criticism. The current results are based on single runs without reported variability. In the revised manuscript we will repeat all experiments with at least five independent random seeds, report means and standard errors for Sharpe ratios, turnover, and costs, and include paired statistical tests (e.g., t-tests) to assess whether observed ranking changes are statistically significant across environments.
Revision: yes
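The statistics promised in the second response can be sketched with the standard library alone. This is an illustration of the general technique, not the authors' analysis code: the function names are hypothetical, and the pairing assumes that each seed is run under both cost models so the two lists line up index by index.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(values):
    """Return the sample mean and its standard error (needs at least 2 values)."""
    return mean(values), stdev(values) / sqrt(len(values))


def paired_t_statistic(a, b):
    """t statistic for paired samples, e.g. per-seed Sharpe under two cost models.

    Compare the result against a t distribution with len(a) - 1 degrees of freedom.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    m, se = mean_and_se(diffs)
    return m / se
```

With only five seeds the test has four degrees of freedom, so a ranking reversal would need a fairly large and consistent per-seed difference to register as significant.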
Circularity Check
No circularity: claims rest on external models and data without self-referential reductions
The paper introduces environments using the standard Almgren-Chriss framework and square-root impact law (external literature) evaluated on NASDAQ-100 data with standard DRL algorithms and Optuna. No equations, parameters, or claims reduce by construction to the authors' own fitted values or self-citations; performance and ranking shifts are reported from direct simulation rather than tautological redefinitions. The central evaluation chain is independent of the paper's inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- AC model parameters
axioms (1)
- domain assumption: Square-root impact law and Almgren-Chriss framework accurately represent market impact for the tested assets
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We introduce ... nonlinear market impact models grounded in the Almgren–Chriss (AC) framework and the empirically validated square-root impact law ... C_perm = ½ α σ (x/V) |x| P, C_temp = β σ (x/V) |x| P, I(Q) = Y·σ·√(Q/V)
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
hyperparameter optimization is essential for constraining pathological trading ... algorithm-cost model interactions are strongly environment-specific
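The cost expressions quoted in the first entry above can be transcribed directly into code to check magnitudes. The sketch below copies the three formulas as they appear in the excerpt. The symbol meanings (x or Q as trade size in shares, V as daily volume, P as price, σ as daily volatility) are assumptions, since the excerpt does not define them, and the function and parameter names are hypothetical.

```python
from math import sqrt


def permanent_cost(x, volume, price, sigma, alpha):
    """C_perm = 1/2 * alpha * sigma * (x / V) * |x| * P, as quoted."""
    return 0.5 * alpha * sigma * (x / volume) * abs(x) * price


def temporary_cost(x, volume, price, sigma, beta):
    """C_temp = beta * sigma * (x / V) * |x| * P, as quoted."""
    return beta * sigma * (x / volume) * abs(x) * price


def sqrt_impact(q, volume, sigma, y):
    """I(Q) = Y * sigma * sqrt(Q / V), the square-root impact law (Q >= 0)."""
    return y * sigma * sqrt(q / volume)
```

Transcribed literally, the sign of x carries through the (x/V) factor, so a sale (negative x) would yield a negative cost. Whether the paper intends |x|/V there cannot be determined from the excerpt.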
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.