arXiv: 2508.20467 · v2 · submitted 2025-08-28 · 💱 q-fin.PM · cs.LG · q-fin.CP

QTMRL: An Agent for Quantitative Trading Decision-Making Based on Multi-Indicator Guided Reinforcement Learning

Pith reviewed 2026-05-18 21:20 UTC · model grok-4.3

classification: 💱 q-fin.PM · cs.LG · q-fin.CP
keywords: quantitative trading · reinforcement learning · technical indicators · A2C algorithm · portfolio management · stock trading · adaptive policies

The pith

A reinforcement learning agent guided by multiple technical indicators learns adaptive trading policies that improve returns and risk control over traditional models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents QTMRL as a trading agent that fuses a broad set of technical indicators with reinforcement learning to operate in volatile markets. It assembles 23 years of daily data for 16 S&P 500 stocks and augments the raw prices with trend, volatility, and momentum measures. An Advantage Actor-Critic policy then learns to issue buy, sell, or hold actions that respond to changing conditions. Experiments against nine baselines across multiple market periods show gains in profitability, risk-adjusted metrics, and reduced downside exposure. A reader would care because the approach replaces rigid statistical assumptions with learned behavior that can adjust when markets shift unexpectedly.
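As a rough illustration of the enrichment step, the sketch below adds one trend, one volatility, and one momentum feature to a list of closing prices. The specific indicators (simple moving average, rolling standard deviation of returns, rate of change), the window length, and all function names are assumptions chosen for illustration; the summary above does not list the paper's exact indicator set. Each value uses only prices up to and including day t.

```python
from statistics import mean, pstdev

def sma(prices, window):
    """Trend: simple moving average over the trailing `window` closes."""
    return [None if t + 1 < window else mean(prices[t + 1 - window:t + 1])
            for t in range(len(prices))]

def rolling_volatility(prices, window):
    """Volatility: population std of the trailing `window` daily returns."""
    returns = [None] + [prices[t] / prices[t - 1] - 1
                        for t in range(1, len(prices))]
    out = []
    for t in range(len(prices)):
        if t < window:  # need `window` returns, i.e. window + 1 prices
            out.append(None)
        else:
            out.append(pstdev(returns[t + 1 - window:t + 1]))
    return out

def rate_of_change(prices, window):
    """Momentum: relative price change versus `window` days earlier."""
    return [None if t < window else prices[t] / prices[t - window] - 1
            for t in range(len(prices))]

def enrich(prices, window=3):
    """Build one (close, trend, volatility, momentum) row per day.

    Warm-up days without enough history carry None.
    """
    return list(zip(prices,
                    sma(prices, window),
                    rolling_volatility(prices, window),
                    rate_of_change(prices, window)))

closes = [100.0, 102.0, 101.0, 104.0, 108.0]  # hypothetical closing prices
rows = enrich(closes)
```

Rows with None mark the warm-up period; a training pipeline would typically drop them before building the state vector.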

Core claim

The paper claims that constructing a multi-indicator dataset from long-term stock records and feeding it into a lightweight A2C reinforcement learning agent produces trading policies that deliver higher profitability, stronger risk adjustment, and better downside protection than statistical models such as ARIMA, neural networks such as LSTM, or simple moving-average rules, with the advantage holding across varied market regimes.

What carries the argument

The QTMRL trading agent, which represents market state through an enriched vector of technical indicators and uses an Advantage Actor-Critic policy to select discrete trading actions.
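To make the Advantage Actor-Critic step concrete, here is a minimal sketch of the quantities a one-step update works with: a softmax policy over the three discrete actions, a bootstrapped advantage, and the actor and critic loss terms. It uses plain numbers instead of neural networks, omits any entropy bonus, and the function names and input values are hypothetical; it is not the authors' implementation.

```python
import math

ACTIONS = ("buy", "sell", "hold")

def softmax(logits):
    """Turn actor logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_step_a2c_terms(logits, action, reward, value, next_value,
                       gamma=0.99, done=False):
    """Return (advantage, actor_loss, critic_loss) for a single transition."""
    probs = softmax(logits)
    # Bootstrapped target; no bootstrap past the end of an episode.
    target = reward + (0.0 if done else gamma * next_value)
    advantage = target - value
    # Actor term: raise the log-probability of the taken action in
    # proportion to the advantage (treated as a constant here).
    actor_loss = -math.log(probs[ACTIONS.index(action)]) * advantage
    # Critic term: squared error between V(s) and the bootstrapped target.
    critic_loss = advantage ** 2
    return advantage, actor_loss, critic_loss

# Hypothetical transition: uniform policy, reward 1.0, V(s) = 0.5, V(s') = 0.
adv, actor_loss, critic_loss = one_step_a2c_terms(
    logits=[0.0, 0.0, 0.0], action="buy",
    reward=1.0, value=0.5, next_value=0.0)
```

A positive advantage means the action did better than the critic expected, so minimizing the actor term makes that action more likely in similar states.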

If this is right

  • Trading decisions can respond to current indicator values rather than fixed statistical assumptions.
  • Portfolio risk can be shaped directly by the reward signal optimized during policy learning.
  • Performance advantages appear across different sectors and market regimes in the tested window.
  • The framework supplies concrete buy, sell, or hold outputs that can be executed without additional rule layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same indicator-enriched state could be tested with higher-frequency data to improve entry and exit timing.
  • Adding macroeconomic or news-derived features to the state vector might further stabilize performance during regime shifts.
  • The lightweight A2C design invites live deployment trials that measure slippage and execution costs beyond historical backtests.

Load-bearing premise

The multi-indicator patterns found in the 2000-2022 data for these 16 stocks will keep supplying useful signals for profitable and risk-aware decisions under the A2C policy in later periods.

What would settle it

Train the agent on the 2000-2022 data, then measure its profitability and risk metrics on post-2022 out-of-sample prices for the same stocks. Performance below the best baseline on that hold-out would count directly against the claim.
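The risk metrics such a hold-out comparison would report can be sketched as follows. The annualization factor of 252 trading days, the zero risk-free rate, and the function names are assumptions for illustration, not values taken from the paper.

```python
import math
from statistics import mean, stdev

def sharpe_ratio(daily_returns, periods_per_year=252, risk_free_daily=0.0):
    """Annualized Sharpe ratio from simple daily returns."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = stdev(excess)  # sample standard deviation
    if sd == 0:
        return float("nan")  # undefined when returns never vary
    return mean(excess) / sd * math.sqrt(periods_per_year)

def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the compounded equity curve.

    Returned as a positive fraction, e.g. 0.5 for a 50% drawdown.
    """
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```

Applying both functions to the agent's and each baseline's post-2022 daily returns gives the side-by-side comparison described above.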

Original abstract

In the highly volatile and uncertain global financial markets, traditional quantitative trading models relying on statistical modeling or empirical rules often fail to adapt to dynamic market changes and black swan events due to rigid assumptions and limited generalization. To address these issues, this paper proposes QTMRL (Quantitative Trading Multi-Indicator Reinforcement Learning), an intelligent trading agent combining multi-dimensional technical indicators with reinforcement learning (RL) for adaptive and stable portfolio management. We first construct a comprehensive multi-indicator dataset using 23 years of S&P 500 daily OHLCV data (2000-2022) for 16 representative stocks across 5 sectors, enriching raw data with trend, volatility, and momentum indicators to capture holistic market dynamics. Then we design a lightweight RL framework based on the Advantage Actor-Critic (A2C) algorithm, including data processing, A2C algorithm, and trading agent modules to support policy learning and actionable trading decisions. Extensive experiments compare QTMRL with 9 baselines (e.g., ARIMA, LSTM, moving average strategies) across diverse market regimes, verifying its superiority in profitability, risk adjustment, and downside risk control. The code of QTMRL is publicly available at https://github.com/ChenJiahaoJNU/QTMRL.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes QTMRL, a quantitative trading agent that integrates a multi-indicator state representation (trend, volatility, and momentum features derived from 2000-2022 daily OHLCV data on 16 S&P 500 stocks) with an Advantage Actor-Critic (A2C) reinforcement learning policy. It claims that the resulting agent outperforms nine baselines (including ARIMA, LSTM, and moving-average strategies) in profitability, risk-adjusted returns, and downside-risk control across diverse market regimes within the studied period, with code released publicly.

Significance. If the reported superiority were shown to generalize beyond the 2000-2022 training distribution, the work would offer a concrete, reproducible example of lightweight RL for adaptive multi-indicator portfolio management in volatile equity markets. The public code release is a clear strength that supports verification and extension. At present, however, the absence of explicit out-of-sample controls limits the strength of any claim that the learned policy captures persistent rather than transient structure.

major comments (2)
  1. [Experiments section] The profitability, Sharpe-ratio, and maximum-drawdown metrics are obtained by rolling the trained A2C policy over the identical 2000-2022 window used to construct the 23 technical indicators and to optimize the policy parameters. Because no walk-forward split, post-2022 hold-out set, or live-trading forward test is described, the reported gains are in-sample fitted quantities rather than independent evidence of generalization across regime shifts.
  2. [Abstract and §3 (Data & Method)] The manuscript states that the multi-indicator dataset captures “holistic market dynamics” and that experiments cover “diverse market regimes,” yet provides no description of how look-ahead bias was avoided during indicator calculation, how the state vector was normalized across the full period, or whether any form of purged cross-validation was applied. These omissions directly affect the load-bearing claim that the A2C policy delivers stable out-of-sample decision-making.
minor comments (2)
  1. [Abstract] The abstract lists nine baselines but does not name them all or indicate whether they were re-implemented with the same multi-indicator state or run on raw prices only; a short table would resolve both points.
  2. [Method] Hyperparameter values for the A2C agent (learning rate, entropy coefficient, discount factor, network architecture) are not reported in the main text or appendix, hindering exact reproduction despite the public code link.
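The walk-forward split requested in the first major comment can be sketched as a generator of time-ordered train and test index ranges. The window sizes, the optional embargo gap, and the function name are hypothetical; the manuscript describes no such procedure.

```python
def walk_forward_splits(n_days, train_size, test_size, embargo=0):
    """Yield (train_indices, test_indices) for rolling, time-ordered folds.

    Each test block starts `embargo` days after its training block ends,
    and folds advance by `test_size` so test blocks never overlap.
    """
    start = 0
    while start + train_size + embargo + test_size <= n_days:
        train_end = start + train_size
        test_start = train_end + embargo
        yield (range(start, train_end),
               range(test_start, test_start + test_size))
        start += test_size

# Hypothetical toy calendar of 10 trading days.
folds = list(walk_forward_splits(n_days=10, train_size=4, test_size=2, embargo=1))
```

Within every fold the training days strictly precede the test days, so a policy fitted on the training range is always scored on data it has not seen. The embargo leaves a gap so that indicators with trailing windows do not straddle the boundary.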

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address the major concerns point by point below, acknowledging limitations in the current experimental design while outlining planned revisions to strengthen the manuscript.

  1. Referee: [Experiments section] The profitability, Sharpe-ratio, and maximum-drawdown metrics are obtained by rolling the trained A2C policy over the identical 2000-2022 window used to construct the 23 technical indicators and to optimize the policy parameters. Because no walk-forward split, post-2022 hold-out set, or live-trading forward test is described, the reported gains are in-sample fitted quantities rather than independent evidence of generalization across regime shifts.

    Authors: We agree that evaluating the trained policy by rolling it over the same 2000-2022 window used for training and indicator construction means the reported metrics are in-sample. Although the A2C framework is intended to produce an adaptive policy and we segmented results across sub-periods with distinct market conditions, this does not constitute independent out-of-sample validation. In the revised manuscript we will add a walk-forward optimization procedure and, where feasible, extend the dataset with a post-2022 hold-out period to provide clearer evidence of generalization.
    revision: yes

  2. Referee: [Abstract and §3 (Data & Method)] The manuscript states that the multi-indicator dataset captures “holistic market dynamics” and that experiments cover “diverse market regimes,” yet provides no description of how look-ahead bias was avoided during indicator calculation, how the state vector was normalized across the full period, or whether any form of purged cross-validation was applied. These omissions directly affect the load-bearing claim that the A2C policy delivers stable out-of-sample decision-making.

    Authors: We accept that the original submission omitted explicit details on these preprocessing steps. In the revised §3 we will document that all 23 indicators are computed using only information available at each time step, that state normalization is performed with parameters estimated from prior data only, and that time-series-aware validation (including purged cross-validation) was used during hyper-parameter selection. These practices are present in the released code but were not described in the initial text.
    revision: yes
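One way to read "parameters estimated from prior data only" is an expanding-window z-score, where the mean and standard deviation applied on day t come strictly from days before t. The sketch below illustrates that idea under this assumption; the function name and the `min_history` parameter are hypothetical, and nothing here is taken from the released code.

```python
from statistics import mean, pstdev

def causal_zscore(values, min_history=2):
    """Z-score each value with the mean/std of strictly earlier values only."""
    out = []
    for t, x in enumerate(values):
        history = values[:t]  # excludes day t and everything after it
        if len(history) < min_history:
            out.append(None)  # not enough past data yet
            continue
        mu, sigma = mean(history), pstdev(history)
        out.append(0.0 if sigma == 0 else (x - mu) / sigma)
    return out

feature = [1.0, 3.0, 2.0, 5.0]  # hypothetical indicator values
scaled = causal_zscore(feature)
```

Because each output depends only on earlier inputs, changing a later value leaves earlier outputs unchanged. Normalizing with statistics from the full 2000-2022 period would break exactly this property. The loop recomputes statistics from scratch at each step for clarity; running sums would be faster on long series.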

Circularity Check

0 steps flagged

No significant circularity in claimed derivation or results

The paper constructs a multi-indicator dataset from 2000-2022 OHLCV data, trains an A2C policy on it, and reports empirical performance comparisons against baselines on the same historical window. No mathematical derivation chain, first-principles result, or self-definitional loop is present in the abstract or described method. The central claim of superiority rests on reported backtest metrics rather than any quantity that reduces by construction to its own inputs via equations or self-citation. While generalization concerns exist (as the referee report notes), they do not constitute circularity under the specified patterns; the work is self-contained as an empirical RL application without load-bearing self-citations or renamed known results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that daily OHLCV plus standard indicators from 2000-2022 are sufficient to learn a policy that generalizes, plus standard RL convergence assumptions for A2C.

free parameters (1)
  • A2C hyperparameters (learning rate, discount factor, entropy coefficient)
    Chosen to train the policy; values not stated in abstract but required for reproduction.
axioms (1)
  • domain assumption: Market dynamics are stationary enough within the 2000-2022 window that patterns learned by A2C remain useful after training.
    Invoked when claiming superiority on historical test periods.
