Realistic Market Impact Modeling for Reinforcement Learning Trading Environments

Anna Helena Reali Costa; Lucas Riera Abbade

arXiv: 2603.29086 · v2 · submitted 2026-03-30 · cs.LG · cs.CE

Pith reviewed 2026-05-14 20:58 UTC · model grok-4.3

keywords: reinforcement learning, market impact, trading environments, Almgren-Chriss, transaction costs, DRL algorithms, Gymnasium, portfolio optimization

The pith

Realistic nonlinear market impact costs change both absolute performance and relative rankings of reinforcement learning trading algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds three new Gymnasium environments that embed Almgren-Chriss market impact with a square-root temporary impact law and exponentially decaying permanent impact. Experiments on NASDAQ-100 assets show that replacing a fixed 10 basis point cost with the full model produces sharply lower turnover, lower realized costs, and different algorithm rankings across stock trading, margin trading, and portfolio optimization tasks. A reader would care because most existing RL trading agents learn policies that would incur far higher execution costs in live markets than their backtests suggest. The environments are released as an open extension to FinRL-Meta so that future agents can be trained under more representative cost conditions.

Core claim

The MACE environments integrate pluggable Almgren-Chriss cost models into three trading tasks; when five DRL algorithms are evaluated under both fixed and full impact costs, the realistic model produces dramatically lower turnover and costs while reversing or shifting algorithm rankings in an environment-specific manner.

What carries the argument

Pluggable Almgren-Chriss cost module with square-root temporary impact and exponential-decay permanent impact, embedded inside Gymnasium trading environments that log trade-level execution costs.
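A minimal sketch of what a pluggable cost module could look like. It assumes the square-root law quoted from the paper, I(Q) = Y·σ·√(Q/V), is read as a fractional price move applied to the traded notional. The class names, the `cost` signature, the default `y`, and the half-life parameterization of the decay are illustrative assumptions, not the MACE API.

```python
import math
from dataclasses import dataclass
from typing import Protocol


class CostModel(Protocol):
    """Anything with this method can be plugged into an environment (hypothetical interface)."""

    def cost(self, shares: float, price: float, daily_volume: float, daily_vol: float) -> float:
        ...


@dataclass
class FixedBpsCost:
    """Flat proportional fee, e.g. the 10 bps baseline."""
    bps: float = 10.0

    def cost(self, shares, price, daily_volume, daily_vol):
        return abs(shares) * price * self.bps / 1e4


@dataclass
class SqrtImpactCost:
    """Square-root law: fractional impact I(Q) = Y * sigma * sqrt(Q / V)."""
    y: float = 1.0  # assumed coefficient, not the paper's calibration

    def cost(self, shares, price, daily_volume, daily_vol):
        q = abs(shares)
        if q == 0 or daily_volume <= 0:
            return 0.0
        impact = self.y * daily_vol * math.sqrt(q / daily_volume)
        # Assumption: cost = fractional impact times traded notional.
        return impact * q * price


@dataclass
class PermanentImpactTracker:
    """Carries permanent impact forward and decays it exponentially each step."""
    half_life_steps: float = 5.0  # assumed parameterization
    level: float = 0.0

    def step(self, new_impact: float = 0.0) -> float:
        decay = 0.5 ** (1.0 / self.half_life_steps)
        self.level = self.level * decay + new_impact
        return self.level


# Hypothetical order: 1,000 shares at $100, 1M shares daily volume, 2% daily volatility.
flat = FixedBpsCost().cost(1_000, 100.0, 1_000_000, 0.02)
nonlinear = SqrtImpactCost().cost(1_000, 100.0, 1_000_000, 0.02)
```

Under this reading the nonlinear cost grows with order size to the power 1.5, so an order four times larger costs eight times as much, whereas the flat fee only quadruples.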

If this is right

Absolute performance numbers for A2C, PPO, DDPG, SAC, and TD3 all shift when realistic impact is used instead of fixed costs.
The ordering of which algorithm performs best changes across the three environments once impact is modeled.
In the example the paper reports, agents move from a high-turnover policy (19 percent daily) to a low-turnover policy (1 percent daily) under the full cost model.
Hyperparameter tuning becomes necessary to prevent the agent from incurring extreme costs that the fixed-cost baseline hides.
Algorithm-cost interactions differ by task, with some algorithms improving and others worsening under realistic impact.
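The turnover figures can be made concrete with a small sketch. It assumes turnover means gross traded notional divided by portfolio value, which is one common convention; the paper's exact definition is not given on this page. Function names and the sample numbers are hypothetical.

```python
def daily_turnover(trade_notionals, portfolio_value):
    """Gross traded notional as a fraction of portfolio value (assumed definition)."""
    if portfolio_value <= 0:
        raise ValueError("portfolio_value must be positive")
    return sum(abs(t) for t in trade_notionals) / portfolio_value


def cost_in_bps(total_cost, trade_notionals):
    """Execution cost per unit of traded notional, in basis points."""
    gross = sum(abs(t) for t in trade_notionals)
    return 0.0 if gross == 0 else total_cost / gross * 1e4


# Hypothetical day: buys are positive, sells are negative (USD).
trades = [120_000.0, -70_000.0]
turnover = daily_turnover(trades, portfolio_value=1_000_000.0)

# Under the fixed baseline the fee is 10 bps of gross notional, whatever the order size.
flat_fee = sum(abs(t) for t in trades) * 10 / 1e4
```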

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Published RL trading results that rely only on fixed or zero transaction costs are likely to overstate live performance.
Any new trading environment or benchmark should include at least one realistic impact variant as a default test case.
Sensitivity analysis across cost models could become a standard step when selecting an algorithm for production trading.

Load-bearing premise

The Almgren-Chriss framework together with the square-root impact law accurately describes market impact for the NASDAQ-100 stocks and holding periods used in the tests.

What would settle it

A direct comparison of the model's predicted daily execution costs against actual realized slippage on the same NASDAQ-100 trades executed through a live broker at comparable sizes and speeds.
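Such a comparison could be scored along these lines. This is a sketch under stated assumptions: slippage is measured against the arrival price, buys are +1 and sells are -1, and the function names are hypothetical. No real fills are included.

```python
from statistics import mean


def slippage_bps(side, arrival_price, fill_price):
    """Realized slippage in basis points; positive means the trade cost money.

    side is +1 for a buy and -1 for a sell (assumed convention).
    """
    return side * (fill_price - arrival_price) / arrival_price * 1e4


def compare_costs(predicted_bps, realized_bps):
    """Bias and mean absolute error of model-predicted cost against realized slippage."""
    if len(predicted_bps) != len(realized_bps) or not predicted_bps:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - r for p, r in zip(predicted_bps, realized_bps)]
    return {
        "bias_bps": mean(errors),                      # > 0: model overstates cost
        "mae_bps": mean([abs(e) for e in errors]),
    }
```

A bias near zero with a small mean absolute error across order sizes would support the premise; a bias that grows with order size would point to a mis-specified impact exponent or coefficient.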

Figures

Figures reproduced from arXiv: 2603.29086 by Anna Helena Reali Costa, Lucas Riera Abbade.

**Figure 1.** OOS total return for MACE stock trading, all five agents under baseline vs. AC impact, optimized params. Black line: QQEW benchmark (19%).

**Figure 2.** Non-optimized TD3 trading costs for MACE stock trading, baseline vs.

**Figure 4.** OOS portfolio value for margin trading, A2C/PPO/DDPG/SAC, baseline

**Figure 5.** PPO average order POV per epoch for margin trading.

**Figure 8.** TD3 (optimized) Sharpe per epoch for POE, AC vs. 10 bps baseline

Original abstract

Reinforcement learning (RL) has shown promise for trading, yet most open-source backtesting environments assume negligible or fixed transaction costs, causing agents to learn trading behaviors that fail under realistic execution. We introduce three Gymnasium-compatible trading environments -- MACE (Market-Adjusted Cost Execution) stock trading, margin trading, and portfolio optimization -- that integrate nonlinear market impact models grounded in the Almgren-Chriss framework and the empirically validated square-root impact law. Each environment provides pluggable cost models, permanent impact tracking with exponential decay, and comprehensive trade-level logging. We evaluate five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ-100, comparing a fixed 10 bps baseline against the AC model with Optuna-tuned hyperparameters. Our results show that (i) the cost model materially changes both absolute performance and the relative ranking of algorithms across all three environments; (ii) the AC model produces dramatically different trading behavior, e.g., daily costs dropping from $200k to $8k with turnover falling from 19% to 1%; (iii) hyperparameter optimization is essential for constraining pathological trading, with costs dropping up to 82%; and (iv) algorithm-cost model interactions are strongly environment-specific, e.g., DDPG's OOS Sharpe jumps from -2.1 to 0.3 under AC in margin trading while SAC's drops from -0.5 to -1.2. We release the full suite as an open-source extension to FinRL-Meta.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ships three practical Gymnasium environments with pluggable Almgren-Chriss impact, but the headline comparison mixes the cost model with unequal hyperparameter tuning.

The new piece is three Gymnasium environments (stock, margin, portfolio) that drop Almgren-Chriss nonlinear impact and square-root laws into RL trading loops on top of FinRL-Meta. They add pluggable cost functions, exponential decay on permanent impact, and full trade logs. That is a concrete engineering step beyond prior open environments that mostly used flat fees. The release is open source, which makes it immediately usable for anyone running DRL agents on execution problems.

The reported runs on NASDAQ-100 data show that switching to the AC model changes both absolute metrics and algorithm rankings, with big drops in turnover and costs in some cases and environment-specific flips in Sharpe for DDPG versus SAC. Those patterns are worth seeing even if they are not surprising once realistic costs are present. The main weakness is the experimental contrast. The baseline stays at a fixed 10 bps while the AC version receives Optuna tuning; the text notes tuning is needed to stop pathological behavior. Without the same search on the baseline, the observed shifts in performance and rankings cannot be cleanly attributed to the impact model rather than the extra optimization. The rest of the work uses standard algorithms and public data with no new derivations, so the contribution stays in the implementation and the demonstration rather than in theory or fresh evidence.

This is useful for people already working inside FinRL or building RL execution agents who need better simulators. A referee could usefully check the tuning controls and ask for matched hyperparameter budgets on both sides, but the environments themselves are worth the time to review.
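The matched-budget control asked for in the letter can be sketched as below. Plain random search stands in for Optuna here so the example stays dependency-free; the objectives, the search space, and all function names are hypothetical placeholders for a real train-and-evaluate loop.

```python
import random


def random_search(objective, space, n_trials, seed):
    """Maximize objective over a box-shaped space with a fixed trial budget."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def tune_all_arms(objectives, space, n_trials, seed=0):
    """Give every cost-model arm the same space, the same budget, and the same seed."""
    return {arm: random_search(obj, space, n_trials, seed) for arm, obj in objectives.items()}


# Toy stand-ins: a real objective would train an agent and return validation Sharpe.
SPACE = {"learning_rate": (1e-5, 1e-3), "gamma": (0.90, 0.999)}


def toy_fixed_cost_score(params):
    return -abs(params["learning_rate"] - 3e-4)


def toy_ac_cost_score(params):
    return -abs(params["gamma"] - 0.99)


results = tune_all_arms(
    {"fixed_10bps": toy_fixed_cost_score, "almgren_chriss": toy_ac_cost_score},
    SPACE,
    n_trials=25,
)
```

Because both arms draw the same candidate configurations, any remaining gap between them reflects the cost model and the agent's response to it, not a difference in search effort.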

Referee Report

2 major / 2 minor

Summary. The manuscript introduces three Gymnasium-compatible RL trading environments (MACE for stock trading, margin trading, and portfolio optimization) that embed nonlinear market impact via the Almgren-Chriss framework and the square-root impact law, with pluggable cost models, permanent impact decay, and detailed logging. It evaluates five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on NASDAQ-100 data, contrasting a fixed 10 bps baseline against the AC model under Optuna-tuned hyperparameters, and reports that the cost model alters absolute performance, algorithm rankings, and trading behavior (e.g., turnover dropping from 19% to 1% and costs from $200k to $8k), while stressing that hyperparameter optimization is required to prevent pathological policies.

Significance. If the central claims hold after addressing evaluation confounds, the work provides a concrete demonstration that simplified transaction-cost assumptions in RL trading agents produce unrealistic policies, and supplies reusable environments that can improve the fidelity of future research. The open-source release as a FinRL-Meta extension and the observation of environment-specific algorithm-cost interactions are practical strengths.

major comments (2)

[Abstract and §4 (Evaluation)] Abstract and evaluation results: the claim that the cost model 'materially changes both absolute performance and the relative ranking of algorithms' is not isolated from hyperparameter tuning. Optuna tuning is applied only to the AC model (explicitly noted as essential to avoid pathological behavior), while the 10 bps baseline remains fixed; consequently, observed shifts (turnover 19%→1%, costs $200k→$8k, Sharpe changes such as DDPG -2.1→0.3) cannot be unambiguously attributed to the nonlinear impact model rather than the extra optimization step.
[Results section] Results on algorithm-cost interactions: reported out-of-sample Sharpe differences across environments lack accompanying statistical details (number of independent runs, standard errors, or significance tests), so it is unclear whether the claimed ranking reversals are robust or sensitive to random seeds and data splits.

minor comments (2)

[Abstract] The abstract states that each environment provides 'comprehensive trade-level logging' but does not enumerate the exact logged fields or how they are aggregated into the reported metrics.
[Methods] Notation for the square-root impact law and the exponential decay of permanent impact should be defined explicitly with equation numbers in the methods section for reproducibility.
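For reference, the expressions quoted from the paper elsewhere on this page can be typeset as numbered equations. The first three lines restate that quoted passage term by term; its symbols are not defined on this page, though by the usual convention x and Q would be trade sizes, V a volume, σ a volatility, and P a price. The last line is not quoted from the paper: it is only one conventional way to write an exponentially decaying permanent impact, with the state π_t, the increment Δπ_t, and the decay rate λ all being assumed symbols.

```latex
\begin{align}
C_{\mathrm{perm}} &= \tfrac{1}{2}\,\alpha\,\sigma\,\frac{x}{V}\,|x|\,P, \\
C_{\mathrm{temp}} &= \beta\,\sigma\,\frac{x}{V}\,|x|\,P, \\
I(Q) &= Y\,\sigma\,\sqrt{Q/V}, \\
\pi_t &= e^{-\lambda}\,\pi_{t-1} + \Delta\pi_t \qquad \text{(assumed form, not from the paper)}
\end{align}
```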

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive overall assessment of the work. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Referee: [Abstract and §4 (Evaluation)] Abstract and evaluation results: the claim that the cost model 'materially changes both absolute performance and the relative ranking of algorithms' is not isolated from hyperparameter tuning. Optuna tuning is applied only to the AC model (explicitly noted as essential to avoid pathological behavior), while the 10 bps baseline remains fixed; consequently, observed shifts (turnover 19%→1%, costs $200k→$8k, Sharpe changes such as DDPG -2.1→0.3) cannot be unambiguously attributed to the nonlinear impact model rather than the extra optimization step.

Authors: We agree that the experimental design confounds the cost-model effect with the hyperparameter-optimization step. The manuscript already notes that tuning is required for the AC model to prevent pathological behavior, but the referee is correct that this asymmetry prevents unambiguous attribution. In the revision we will run Optuna tuning on the fixed 10 bps baseline as well, re-evaluate all algorithms under both tuned settings, and explicitly compare the two regimes to isolate the contribution of the nonlinear impact model. revision: yes
Referee: [Results section] Results on algorithm-cost interactions: reported out-of-sample Sharpe differences across environments lack accompanying statistical details (number of independent runs, standard errors, or significance tests), so it is unclear whether the claimed ranking reversals are robust or sensitive to random seeds and data splits.

Authors: We accept this criticism. The current results are based on single runs without reported variability. In the revised manuscript we will repeat all experiments with at least five independent random seeds, report means and standard errors for Sharpe ratios, turnover, and costs, and include paired statistical tests (e.g., t-tests) to assess whether observed ranking changes are statistically significant across environments. revision: yes
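The reporting the authors commit to can be sketched with the standard library alone. The functions below compute a mean with its standard error and a paired t-statistic across seeds; the names are hypothetical, the inputs are expected to be per-seed metrics paired by seed, and the sample values are invented for illustration and are not results from the paper.

```python
import math
from statistics import mean, stdev


def mean_and_se(values):
    """Sample mean and standard error of the mean (needs at least two values)."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def paired_t_statistic(a, b):
    """Paired t-statistic and degrees of freedom for per-seed metrics a[i] vs b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired samples with at least two seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Invented per-seed Sharpe ratios for one algorithm under two cost models (five seeds).
sharpe_ac = [0.3, 0.5, 0.1, 0.4, 0.2]
sharpe_fixed = [0.1, 0.2, 0.0, 0.3, -0.1]

ac_mean, ac_se = mean_and_se(sharpe_ac)
t_stat, dof = paired_t_statistic(sharpe_ac, sharpe_fixed)
```

The t-statistic would then be compared against a Student t distribution with `dof` degrees of freedom; with only five seeds the test has little power, so ranking reversals backed by small differences should be read cautiously.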

Circularity Check

0 steps flagged

No circularity: claims rest on external models and data without self-referential reductions

The paper introduces environments using the standard Almgren-Chriss framework and square-root impact law (external literature) evaluated on NASDAQ-100 data with standard DRL algorithms and Optuna. No equations, parameters, or claims reduce by construction to the authors' own fitted values or self-citations; performance and ranking shifts are reported from direct simulation rather than tautological redefinitions. The central evaluation chain is independent of the paper's inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that the chosen impact models are sufficiently realistic proxies for actual execution costs; no new entities are postulated and free parameters are inherited from the cited frameworks.

free parameters (1)

AC model parameters
Parameters of the Almgren-Chriss model and square-root law are taken from literature or tuned; their specific values affect the reported cost reductions.
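To see how much the impact coefficient matters, the square-root law quoted from the paper, I(Q) = Y·σ·√(Q/V), can be evaluated for a few values of Y. The sketch reads I(Q) as a fractional price move and Q/V as the participation rate; the function names and the sample inputs (2% daily volatility, Y between 0.5 and 1.5) are illustrative assumptions, not the paper's calibration.

```python
import math


def sqrt_impact_bps(participation, daily_vol, y):
    """Square-root impact in basis points for an order that is `participation` of daily volume."""
    if participation < 0:
        raise ValueError("participation must be non-negative")
    return y * daily_vol * math.sqrt(participation) * 1e4


def breakeven_participation(flat_bps, daily_vol, y):
    """Participation rate at which the square-root impact equals a flat fee."""
    return (flat_bps / 1e4 / (y * daily_vol)) ** 2


# Impact in bps at 0.25%, 1%, and 4% of daily volume, for three assumed coefficients.
table = {
    y: [round(sqrt_impact_bps(p, 0.02, y), 2) for p in (0.0025, 0.01, 0.04)]
    for y in (0.5, 1.0, 1.5)
}
breakeven = breakeven_participation(10.0, 0.02, 1.0)
```

With these made-up inputs and Y = 1, a fixed 10 bps charge matches the square-root cost only at 0.25% of daily volume; larger orders are charged more under the square-root law, and the gap scales linearly with Y.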

axioms (1)

domain assumption: Square-root impact law and Almgren-Chriss framework accurately represent market impact for the tested assets
Invoked to justify the cost models used in the environments.

pith-pipeline@v0.9.0 · 5581 in / 1132 out tokens · 43821 ms · 2026-05-14T20:58:46.167424+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We introduce ... nonlinear market impact models grounded in the Almgren–Chriss (AC) framework and the empirically validated square-root impact law ... C_perm = ½ α σ (x/V) |x| P, C_temp = β σ (x/V) |x| P, I(Q) = Y·σ·√(Q/V)
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

hyperparameter optimization is essential for constraining pathological trading ... algorithm-cost model interactions are strongly environment-specific

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] S. Sun, R. Wang, and B. An, “Reinforcement learning for quantitative trading,” ACM Trans. Intell. Syst. Technol., vol. 14, no. 3, Mar. 2023.

[2] R. Almgren and N. Chriss, “Optimal execution of portfolio transactions,” Journal of Risk, vol. 3, no. 2, pp. 5–39, 2001.

[3] A. Detzel, R. Novy-Marx, and M. Velikov, “Model comparison with transaction costs,” The Journal of Finance, vol. 78, no. 3, pp. 1743–1775, 2023.

[4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.

[5] B. Tóth, Y. Lemperiere, C. Deremble, J. de Lataillade, J. Kockelkoren, and J.-P. Bouchaud, “Anomalous price impact and the critical nature of liquidity in financial markets,” Physical Review X, vol. 1, no. 2, p. 021006, 2011.

[6] X.-Y. Liu, Z. Xia, J. Rui, J. Gao, H. Yang, M. Zhu, C. Wang, Z. Wang, and J. Guo, “FinRL-Meta: Market environments and benchmarks for data-driven financial reinforcement learning,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 1835–1849.

[7] A. Obizhaeva and J. Wang, “Optimal trading strategy and supply/demand dynamics,” Journal of Financial Markets, vol. 16, no. 1, pp. 1–32, 2013.

[8] J. Moody, L. Wu, Y. Liao, and M. Saffell, “Performance functions and reinforcement learning for trading systems and portfolios,” Journal of Forecasting, vol. 17, no. 5-6, pp. 441–470, 1998.

[9] J. Gu, W. Du, A. M. M. Rahman, and G. Wang, “Margin trader: A reinforcement learning framework for portfolio management with margin and constraints,” in Proceedings of the Fourth ACM International Conference on AI in Finance, ser. ICAIF ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 610–618.

[10] C. Costa and A. Costa, “POE: A general portfolio optimization environment for FinRL,” in Anais do II Brazilian Workshop on Artificial Intelligence in Finance. Porto Alegre, RS, Brasil: SBC, 2023, pp. 132–143.

[11] A. S. Kyle, “Continuous auctions and insider trading,” Econometrica, vol. 53, no. 6, pp. 1315–1335, 1985.

[12] J.-P. Bouchaud, J. D. Farmer, and F. Lillo, “Chapter 2 - How markets slowly digest changes in supply and demand,” in Handbook of Financial Markets: Dynamics and Evolution, ser. Handbooks in Finance, T. Hens and K. R. Schenk-Hoppé, Eds. San Diego: North-Holland, 2009, pp. 57–160.

[13] J.-P. Bouchaud, J. Bonart, J. Donier, and M. Gould, Trades, quotes and prices: financial markets under the microscope. Cambridge University Press, 2018.

[14] R. Almgren, C. Thum, E. Hauptmann, and H. Li, “Direct estimation of equity market impact,” Risk, vol. 18, no. 7, pp. 58–62, 2005.

[15] E. Bacry, A. Iuga, M. Lasnier, and C.-A. Lehalle, “Market impacts and the life cycle of investors orders,” Market Microstructure and Liquidity, vol. 01, no. 02, p. 1550009, 2015.

[16] J. D. Farmer, A. Gerig, F. Lillo, and H. Waelbroeck, “How efficiency shapes market impact,” Quantitative Finance, vol. 13, no. 11, pp. 1743–1758, 2013.

[17] X. Brokmann, E. Sérié, J. Kockelkoren, and J.-P. Bouchaud, “Slow decay of impact in equity markets,” Market Microstructure and Liquidity, vol. 01, no. 02, p. 1550007, 2015.

[18] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.

[19] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 2623–2631.
