
arxiv: 2410.14927 · v2 · submitted 2024-10-19 · 💱 q-fin.TR · cs.CE · cs.LG

Hierarchical Reinforced Trader (HRT): A Bi-Level Approach for Optimizing Stock Selection and Execution

Pith reviewed 2026-05-23 19:24 UTC · model grok-4.3

classification 💱 q-fin.TR · cs.CE · cs.LG
keywords hierarchical reinforcement learning · portfolio management · stock selection · text-aware trading · bi-level optimization · reinforcement learning in finance · Nasdaq equity trading

The pith

A bi-level reinforcement learning system improves equity trading by separating directional stock selection from risk-aware execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HRT, a hierarchical reinforcement learning framework that splits trading into two levels to handle market and text signals in multi-asset portfolios. A high-level controller chooses sparse increase/reduce/hold directions for individual stocks, and a low-level controller translates those directions into weight changes subject to turnover, drawdown, and text-risk penalties. Evaluated on a fixed 89-stock Nasdaq universe with training on 2013-2018, validation on 2019, and testing on 2020-2023, the full model reaches a Sharpe ratio of 1.24 and daily turnover of 0.090, versus 1.06 and 0.112 for HRT-Base, and is reported as the strongest learning-based method among the flat-RL and other baselines. This separation avoids enumerating the full joint action space and keeps decisions easier to inspect.
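To give a sense of scale for the joint action space, the short calculation below counts direction combinations for 89 stocks with three choices each. It is plain arithmetic on the figures quoted above. How the paper's controller actually parameterizes its outputs is not described in this summary, so the "factorized" count is an assumption of one three-way choice per asset.

```python
N_ASSETS = 89      # universe size reported in the paper
N_DIRECTIONS = 3   # increase, reduce, hold

# Flat enumeration: every combination of per-asset directions is one joint action.
joint_actions = N_DIRECTIONS ** N_ASSETS

# Assumed factorization: one independent three-way choice per asset.
factorized_outputs = N_DIRECTIONS * N_ASSETS

print(f"joint direction combinations: {joint_actions:.3e}")
print(f"factorized per-asset outputs: {factorized_outputs}")
```

The joint count is on the order of 10^42, while the factorized form needs only 267 per-asset outputs.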

Core claim

HRT uses a factorized sparse High-Level Controller to select asset-level increase, reduce, or hold directions from compact market and text-derived signals, while a risk-aware Low-Level Controller converts these directions into feasible portfolio weight adjustments under turnover, drawdown, and text-risk penalties. This decomposition yields the strongest learning-based return-risk-cost trade-off across market-proxy, same-universe portfolio, alpha-only, flat-RL, and hierarchical ablation baselines. Relative to HRT-Base, the full model improves Sharpe from 1.06 to 1.24 and reduces daily turnover from 0.112 to 0.090 on the 2020-2023 out-of-sample period.

What carries the argument

The bi-level reinforcement learning structure with a High-Level Controller for sparse directional selection and a Low-Level Controller for constrained weight execution.
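A minimal sketch of one bi-level decision step may make the division of labor concrete. It is not the authors' implementation: the top-k selection rule, the fixed `step` size, the long-only constraint, and the names `hlc_directions` and `llc_adjust` are all assumptions made for illustration, and the learned policies are replaced by simple rules.

```python
import numpy as np

def hlc_directions(scores, k=5):
    """Hypothetical sparse selector: +1 for the k highest scores,
    -1 for the k lowest, 0 (hold) for everything else.
    Assumes len(scores) >= 2 * k."""
    scores = np.asarray(scores, dtype=float)
    directions = np.zeros(scores.shape[0], dtype=int)
    order = np.argsort(scores)
    directions[order[-k:]] = 1
    directions[order[:k]] = -1
    return directions

def llc_adjust(weights, directions, step=0.02, max_turnover=0.10):
    """Hypothetical executor: nudge weights along the directions, keep the
    portfolio long-only and fully invested, then cap one-way turnover."""
    weights = np.asarray(weights, dtype=float)
    target = np.clip(weights + step * directions, 0.0, None)
    target = target / target.sum()
    delta = target - weights
    turnover = 0.5 * np.abs(delta).sum()
    if turnover > max_turnover:
        # Shrink the trade proportionally so turnover equals the cap.
        delta = delta * (max_turnover / turnover)
    return weights + delta

rng = np.random.default_rng(0)
scores = rng.normal(size=89)          # stand-in for market and text signals
w0 = np.full(89, 1.0 / 89)            # start from equal weights
d = hlc_directions(scores)
w1 = llc_adjust(w0, d, max_turnover=0.05)
```

Because the capped result is a convex combination of the old and target weights, it stays non-negative and sums to one. In this sketch the selection output `d` can be inspected on its own before any weights change, which is the inspectability argument in miniature.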

If this is right

  • Higher Sharpe ratio: 1.24 for the full model versus 1.06 for HRT-Base
  • Lower daily turnover: 0.090 versus 0.112 for HRT-Base
  • Robust performance when transaction costs increase
  • More inspectable decisions because selection and execution are handled separately
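For readers who want to check figures like these against their own backtests, the two headline metrics can be sketched as follows. The paper's exact conventions are not stated in this summary, so the 252-day annualization, the zero default risk-free rate, and the one-way (half of absolute weight change) turnover definition are assumptions.

```python
import numpy as np

def annualized_sharpe(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from daily returns (sample standard deviation)."""
    excess = np.asarray(daily_returns, dtype=float) - risk_free_daily
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

def mean_daily_turnover(weights):
    """Average one-way turnover; `weights` has shape (T, N), one row per day."""
    weights = np.asarray(weights, dtype=float)
    daily = 0.5 * np.abs(np.diff(weights, axis=0)).sum(axis=1)
    return daily.mean()
```

A two-way turnover convention would drop the factor of 0.5 and double the reported number, so the definition matters when comparing against the 0.090 and 0.112 values.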

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of selection from execution may reduce the sample complexity needed to learn stable policies in live trading environments
  • Text-risk penalties could be swapped for other sentiment sources without retraining the entire hierarchy
  • Factorized controllers might allow scaling the approach to universes larger than 89 stocks while keeping action spaces manageable

Load-bearing premise

The results on the fixed 89-stock Nasdaq universe, under the 2013-2018 training, 2019 validation, and 2020-2023 test split, reflect generalizable performance without look-ahead bias or excessive tuning on the test horizon.

What would settle it

Retraining and testing the same HRT architecture on a different stock universe or on data after 2023 and finding no improvement in Sharpe or turnover over the flat-RL baseline.

Figures

Figures reproduced from arXiv: 2410.14927 by Roy E. Welsch, Zijie Zhao.

Figure 1: Trading operations heatmap on DJIA 30 stocks for 2021 and 2022.
Figure 2: Overview of the Hierarchical Reinforced Trader (HRT) architecture.
Figure 3: Cumulative return curves of different investment strategies and S&P 500.
Figure 4: Comparison of Trading Volume Proportions: DDPG versus HRT.
Original abstract

Automated equity trading requires converting noisy market and news signals into executable portfolio decisions under risk, turnover, and transaction costs. We propose Hierarchical Reinforced Trader (HRT), a bi-level reinforcement learning framework for text-aware portfolio management in multi-asset equity markets. HRT separates trading into two coordinated decisions: a factorized sparse High-Level Controller (HLC) selects asset-level increase, reduce, or hold directions from compact market and text-derived signals, while a risk-aware Low-Level Controller (LLC) converts these directions into feasible portfolio weight adjustments under turnover, drawdown, and text-risk penalties. This decomposition avoids enumerating the full joint action space and makes selection and execution easier to inspect. We evaluate HRT on an open stock-news benchmark with a fixed 89-stock Nasdaq universe, using 2013-2018 for training, 2019 for validation, and 2020-2023 for final out-of-sample testing; the test horizon is restricted to 2020-2023 due to public benchmark data availability under the same timestamp-clean text-aware protocol. Across market-proxy, same-universe portfolio, alpha-only, flat-RL, and hierarchical ablation baselines, HRT delivers the strongest learning-based return-risk-cost trade-off. The full model improves Sharpe from 1.06 for HRT-Base to 1.24, reduces daily turnover from 0.112 to 0.090, and remains robust under transaction-cost stress. These results suggest that separating sparse directional selection from risk-aware execution is an effective way to incorporate market forecasts and text-derived risk signals into portfolio management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Hierarchical Reinforced Trader (HRT), a bi-level RL framework with a High-Level Controller (HLC) for sparse directional asset selection from market/text signals and a Low-Level Controller (LLC) for risk-aware execution under turnover/drawdown/text-risk penalties. It evaluates on a fixed 89-stock Nasdaq universe (2013-2018 train, 2019 val, 2020-2023 test) and claims HRT achieves the best return-risk-cost trade-off among market-proxy, same-universe, alpha-only, flat-RL, and ablation baselines, improving Sharpe from 1.06 (HRT-Base) to 1.24 and reducing daily turnover from 0.112 to 0.090 while remaining robust to transaction costs.

Significance. If the reported gains prove robust, the bi-level decomposition provides a practical method for scaling RL to high-dimensional portfolio actions while incorporating text-derived risk signals, addressing a key challenge in automated equity trading.

major comments (3)
  1. [Abstract] Abstract and results section: The central performance claims (Sharpe 1.24 vs. 1.06, turnover 0.090 vs. 0.112) are single point estimates with no reported standard errors, multi-seed averages, or statistical tests. In RL settings, policy-gradient variance routinely produces Sharpe fluctuations of 0.1-0.3 across seeds on identical data, so the 0.18 delta cannot be confidently attributed to the hierarchical design rather than training stochasticity.
  2. [Abstract] Evaluation protocol (described in abstract): Reliance on one fixed 2013-2018/2019/2020-2023 split on a static 89-stock universe without rolling windows, cross-validation, or regime-stratified tests leaves open whether gains are regime-specific to 2020-2023 or generalizable; this directly affects the claim of the 'strongest' trade-off.
  3. [Abstract] Abstract: The comparison to flat-RL and other baselines lacks detail on whether those baselines received identical reward-coefficient tuning, network architectures, or data-exclusion rules as HRT; without this, the superiority claim cannot be isolated to the bi-level structure.
minor comments (2)
  1. Clarify the precise functional form and coefficient values of the text-risk, drawdown, and turnover penalties in the LLC objective function.
  2. Specify the RL algorithm (e.g., PPO, SAC) and network dimensions used for HLC and LLC.
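The multi-seed reporting requested in major comment 1 could be sketched as below. This is an illustration of the statistics only: the function names are hypothetical, and no per-seed results from the paper are available to plug in.

```python
import numpy as np

def mean_and_standard_error(values):
    """Mean and standard error of a metric measured once per random seed."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    se = values.std(ddof=1) / np.sqrt(values.size)
    return mean, se

def paired_difference(metric_a, metric_b):
    """Mean and standard error of per-seed differences between two models.
    Assumes both models were trained with the same list of seeds."""
    diff = np.asarray(metric_a, dtype=float) - np.asarray(metric_b, dtype=float)
    return mean_and_standard_error(diff)
```

Passing one Sharpe value per seed for the full model and for HRT-Base to `paired_difference` would show whether the 0.18 gap is large relative to its standard error. Pairing by seed is a design choice that removes seed-level noise shared by both runs.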

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions to strengthen the paper.

  1. Referee: [Abstract] Abstract and results section: The central performance claims (Sharpe 1.24 vs. 1.06, turnover 0.090 vs. 0.112) are single point estimates with no reported standard errors, multi-seed averages, or statistical tests. In RL settings, policy-gradient variance routinely produces Sharpe fluctuations of 0.1-0.3 across seeds on identical data, so the 0.18 delta cannot be confidently attributed to the hierarchical design rather than training stochasticity.

    Authors: We agree that single-point estimates limit assessment of variability in RL training. In the revised manuscript we will report all key metrics as averages over multiple random seeds together with standard errors, allowing readers to evaluate whether the observed improvements exceed typical training stochasticity. revision: yes

  2. Referee: [Abstract] Evaluation protocol (described in abstract): Reliance on one fixed 2013-2018/2019/2020-2023 split on a static 89-stock universe without rolling windows, cross-validation, or regime-stratified tests leaves open whether gains are regime-specific to 2020-2023 or generalizable; this directly affects the claim of the 'strongest' trade-off.

    Authors: The fixed temporal split follows the public benchmark protocol whose timestamp-clean text data are only available for the 2020-2023 test window. We will add an explicit limitations paragraph discussing the absence of rolling-window or regime-stratified tests and the consequent possibility that results are influenced by 2020-2023 market conditions. Additional regime-stratified breakdowns will be included where they can be computed without violating the benchmark constraints. revision: partial

  3. Referee: [Abstract] Abstract: The comparison to flat-RL and other baselines lacks detail on whether those baselines received identical reward-coefficient tuning, network architectures, or data-exclusion rules as HRT; without this, the superiority claim cannot be isolated to the bi-level structure.

    Authors: We will expand the experimental-details section to document the hyperparameter search procedure, network architectures, reward-coefficient grids, and data-exclusion rules applied uniformly to HRT and all baselines. This will make the fairness of the comparison explicit and allow isolation of the bi-level contribution. revision: yes
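The rolling-window evaluation raised in comment 2 can be sketched as a generator of year-based splits. This is a hypothetical protocol, not one the authors describe, and whether the benchmark's text data permit test years other than 2020-2023 is not established here.

```python
def walk_forward_splits(first_year, last_year, train_years, val_years=1, test_years=1):
    """Yield (train, val, test) tuples of inclusive (start_year, end_year) ranges,
    sliding the whole window forward by `test_years` each time."""
    start = first_year
    while start + train_years + val_years + test_years - 1 <= last_year:
        train = (start, start + train_years - 1)
        val = (train[1] + 1, train[1] + val_years)
        test = (val[1] + 1, val[1] + test_years)
        yield train, val, test
        start += test_years

splits = list(walk_forward_splits(2013, 2023, train_years=6))
for train, val, test in splits:
    print(train, val, test)
```

With a six-year training window, the first split trains on 2013-2018 and validates on 2019, matching the paper's training and validation years, and each later split tests on one further year through 2023.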

Circularity Check

0 steps flagged

No significant circularity; empirical RL evaluation on held-out test window


The paper describes a bi-level RL agent trained on 2013-2018 data, validated on 2019, and evaluated on the separate 2020-2023 window using a fixed 89-stock universe. Reported Sharpe (1.24) and turnover (0.090) are out-of-sample metrics produced by running the trained policy on unseen future data; they are not obtained by re-fitting or re-using the training objective on the test set itself. No equations, self-citations, or uniqueness theorems are invoked in the abstract or description that would reduce the central claim to a definitional loop or fitted-input renaming. Standard train/test separation in RL portfolio papers is externally falsifiable and does not meet any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard RL assumptions plus several free parameters in the reward function and the invented controllers; no machine-checked proofs or external benchmarks beyond the reported backtest are supplied.

free parameters (3)
  • turnover penalty coefficient
    Tuned to balance trading frequency against returns in the low-level controller.
  • drawdown and text-risk penalty weights
    Chosen to incorporate risk signals; directly affect the reported Sharpe and turnover.
  • RL learning rates and discount factors
    Standard hyperparameters fitted during training on 2013-2018 data.
axioms (2)
  • [domain assumption] A bi-level policy decomposition yields near-optimal joint actions without enumerating the full space
    Invoked to justify separating HLC and LLC rather than training a flat RL agent.
  • [domain assumption] Text-derived signals can be processed without look-ahead bias under the benchmark protocol
    Required for the out-of-sample claims on 2020-2023 data.
invented entities (2)
  • High-Level Controller (HLC) · no independent evidence
    purpose: Produces sparse asset-level increase/reduce/hold decisions from compact signals
    New component introduced to factorize the action space.
  • Low-Level Controller (LLC) · no independent evidence
    purpose: Converts directions into feasible weight adjustments under turnover and risk penalties
    New component introduced to handle execution constraints.
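
One way the penalty coefficients listed under free parameters could enter the low-level objective is a linearly penalized reward. The form below is a hypothetical sketch, not taken from the paper, whose functional form and coefficient values are not given in this summary. All symbols are assumptions: $R_t$ is the portfolio return at step $t$, $w_t$ the weight vector, $\mathrm{DD}_t$ a drawdown measure, $\rho_t$ a text-derived risk score, and $\lambda_{\mathrm{to}}, \lambda_{\mathrm{dd}}, \lambda_{\mathrm{text}} \ge 0$ the penalty coefficients.

```latex
r_t \;=\; R_t
  \;-\; \lambda_{\mathrm{to}} \cdot \tfrac{1}{2}\,\lVert w_t - w_{t-1} \rVert_1
  \;-\; \lambda_{\mathrm{dd}} \cdot \mathrm{DD}_t
  \;-\; \lambda_{\mathrm{text}} \cdot \rho_t
```

Under this form, raising $\lambda_{\mathrm{to}}$ trades return for lower turnover. That is why the ledger treats the coefficients as directly affecting the reported Sharpe and turnover.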

pith-pipeline@v0.9.0 · 5828 in / 1725 out tokens · 63875 ms · 2026-05-23T19:24:06.800349+00:00 · methodology


