Hierarchical Reinforced Trader (HRT): A Bi-Level Approach for Optimizing Stock Selection and Execution

Roy E. Welsch; Zijie Zhao

arXiv: 2410.14927 · v2 · submitted 2024-10-19 · 💱 q-fin.TR · cs.CE · cs.LG


Pith reviewed 2026-05-23 19:24 UTC · model grok-4.3


keywords: hierarchical reinforcement learning · portfolio management · stock selection · text-aware trading · bi-level optimization · reinforcement learning in finance · Nasdaq equity trading


The pith

A bi-level reinforcement learning system improves equity trading by separating directional stock selection from risk-aware execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HRT, a hierarchical reinforcement learning framework that splits trading into two levels to handle market and text signals in multi-asset portfolios. A high-level controller chooses sparse increase/reduce/hold directions for individual stocks, while a low-level controller translates those into weight changes while respecting turnover, drawdown, and text-risk penalties. Evaluated on a fixed 89-stock Nasdaq universe with training through 2018 and testing on 2020-2023, the full model reaches a Sharpe ratio of 1.24 and daily turnover of 0.090, beating flat RL and other baselines. This separation avoids the full joint action space and keeps decisions easier to inspect.
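
To make the action-space point concrete, here is a quick count. It assumes each of the 89 stocks receives exactly one of three directions (increase, reduce, hold) and that "factorized" means one independent three-way choice per stock. That reading is an illustrative assumption, not a detail confirmed by the abstract.

```python
# Illustrative count only. Assumes 3 directions per stock and one
# independent 3-way choice per stock (an assumed reading of "factorized").
N_STOCKS = 89        # fixed Nasdaq universe size reported in the paper
N_DIRECTIONS = 3     # increase, reduce, hold

joint_actions = N_DIRECTIONS ** N_STOCKS        # every combination enumerated
factorized_outputs = N_DIRECTIONS * N_STOCKS    # one 3-way choice per stock

print(f"joint action combinations: {joint_actions:.3e}")
print(f"factorized outputs:        {factorized_outputs}")
```

Under these assumptions the joint space has 43 decimal digits' worth of combinations, while the factorized controller only has to emit 267 logits or scores.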

Core claim

HRT uses a factorized sparse High-Level Controller to select asset-level increase, reduce, or hold directions from compact market and text-derived signals, while a risk-aware Low-Level Controller converts these directions into feasible portfolio weight adjustments under turnover, drawdown, and text-risk penalties. This decomposition yields the strongest learning-based return-risk-cost trade-off across market-proxy, same-universe portfolio, alpha-only, flat-RL, and hierarchical ablation baselines, improving Sharpe from 1.06 to 1.24 and reducing turnover from 0.112 to 0.090 on the 2020-2023 out-of-sample period.

What carries the argument

The bi-level reinforcement learning structure with a High-Level Controller for sparse directional selection and a Low-Level Controller for constrained weight execution.
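
The division of labor can be sketched in a few lines. Everything below is a hypothetical stand-in: the function names, the top-k rule for sparsity, the fixed step size, the long-only normalization, and the definition of turnover as the sum of absolute weight changes are all assumptions for illustration, not the authors' learned controllers.

```python
import numpy as np

def high_level_directions(scores, k):
    """Hypothetical HLC stand-in: act on the k largest |score| assets, hold the rest."""
    directions = np.zeros(len(scores), dtype=int)   # 0 = hold
    top = np.argsort(-np.abs(scores))[:k]           # indices of the k strongest signals
    directions[top] = np.sign(scores[top]).astype(int)  # +1 increase, -1 reduce
    return directions

def low_level_execute(weights, directions, step, max_turnover):
    """Hypothetical LLC stand-in: nudge weights, stay long-only, cap turnover."""
    target = np.clip(weights + step * directions, 0.0, None)
    total = target.sum()
    if total == 0.0:
        return weights.copy()                       # degenerate case: do not trade
    target = target / total                         # renormalize to sum to 1
    turnover = np.abs(target - weights).sum()       # assumed turnover definition
    if turnover > max_turnover:
        # Shrink the trade toward current weights so turnover equals the cap.
        target = weights + (target - weights) * (max_turnover / turnover)
    return target

rng = np.random.default_rng(0)
scores = rng.normal(size=89)        # synthetic signals, not market data
w0 = np.full(89, 1 / 89)            # start from equal weights
d = high_level_directions(scores, k=10)
w1 = low_level_execute(w0, d, step=0.01, max_turnover=0.09)
```

Because the capped result is a convex combination of two valid weight vectors, it stays non-negative and sums to one, which is what makes the selection step inspectable separately from the execution step.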

If this is right

Higher Sharpe ratio of 1.24 versus 1.06 for the base hierarchical model
Lower daily turnover of 0.090 versus 0.112
Robust performance when transaction costs increase
More inspectable decisions because selection and execution are handled separately
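
For readers who want to check numbers of this kind on their own backtests, a minimal sketch of the three quantities follows. The page does not say whether the reported Sharpe is annualized, how turnover is normalized, or how costs are charged, so the 252-day annualization, the sum-of-absolute-changes turnover, and the proportional cost model are all assumptions.

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor

def annualized_sharpe(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio from daily returns (sample standard deviation)."""
    excess = np.asarray(daily_returns, dtype=float) - risk_free_daily
    return np.sqrt(TRADING_DAYS) * excess.mean() / excess.std(ddof=1)

def daily_turnover(weights):
    """Mean over days of sum_i |w[t, i] - w[t-1, i]|; weights has shape (days, assets)."""
    w = np.asarray(weights, dtype=float)
    return np.abs(np.diff(w, axis=0)).sum(axis=1).mean()

def net_returns(gross_returns, turnover_per_day, cost_rate):
    """Subtract proportional costs: cost_rate charged per unit of turnover."""
    return np.asarray(gross_returns, dtype=float) - cost_rate * np.asarray(turnover_per_day, dtype=float)
```

A transaction-cost stress test then amounts to recomputing `annualized_sharpe(net_returns(...))` over a grid of `cost_rate` values and checking how quickly the ratio decays.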

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of selection from execution may reduce the sample complexity needed to learn stable policies in live trading environments
Text-risk penalties could be swapped for other sentiment sources without retraining the entire hierarchy
Factorized controllers might allow scaling the approach to universes larger than 89 stocks while keeping action spaces manageable

Load-bearing premise

That results on the fixed 89-stock Nasdaq universe and the 2013-2018 training plus 2020-2023 test split reflect generalizable performance without look-ahead bias or excessive tuning on the test horizon.

What would settle it

Retraining and testing the same HRT architecture on a different stock universe or on data after 2023 and finding no improvement in Sharpe or turnover over the flat-RL baseline.

Figures

Figures reproduced from arXiv: 2410.14927 by Roy E. Welsch, Zijie Zhao.

Figure 2: Overview of the Hierarchical Reinforced Trader (HRT) architecture.

Figure 3: Cumulative return curves of different investment strategies and S&P 500.

Figure 4: Comparison of Trading Volume Proportions: DDPG versus HRT.

Original abstract

Automated equity trading requires converting noisy market and news signals into executable portfolio decisions under risk, turnover, and transaction costs. We propose Hierarchical Reinforced Trader (HRT), a bi-level reinforcement learning framework for text-aware portfolio management in multi-asset equity markets. HRT separates trading into two coordinated decisions: a factorized sparse High-Level Controller (HLC) selects asset-level increase, reduce, or hold directions from compact market and text-derived signals, while a risk-aware Low-Level Controller (LLC) converts these directions into feasible portfolio weight adjustments under turnover, drawdown, and text-risk penalties. This decomposition avoids enumerating the full joint action space and makes selection and execution easier to inspect. We evaluate HRT on an open stock-news benchmark with a fixed 89-stock Nasdaq universe, using 2013-2018 for training, 2019 for validation, and 2020-2023 for final out-of-sample testing; the test horizon is restricted to 2020-2023 due to public benchmark data availability under the same timestamp-clean text-aware protocol. Across market-proxy, same-universe portfolio, alpha-only, flat-RL, and hierarchical ablation baselines, HRT delivers the strongest learning-based return-risk-cost trade-off. The full model improves Sharpe from 1.06 for HRT-Base to 1.24, reduces daily turnover from 0.112 to 0.090, and remains robust under transaction-cost stress. These results suggest that separating sparse directional selection from risk-aware execution is an effective way to incorporate market forecasts and text-derived risk signals into portfolio management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HRT factors trading into a sparse high-level selector and constrained low-level executor with text penalties, which is a practical split, but the reported Sharpe edge rests on one fixed test window with point estimates only.


The paper's core move is splitting the RL action space so a high-level controller picks directional signals from market and text features while a low-level one maps those into weight changes under turnover, drawdown, and text-risk penalties. That decomposition is the actual novelty here; it keeps the joint space manageable and makes the policy easier to read than a flat RL agent on 89 assets. They run it on the public stock-news benchmark with a clean timestamp split and show the full model beating their own base and several ablations on Sharpe and turnover, plus some cost-stress checks. That is useful engineering for anyone already working on text-augmented portfolio RL.

The numbers themselves are harder to lean on. Everything is a single point estimate from one 2020-2023 window on a static 89-stock universe, with no seed averages, standard errors, or rolling-window results. In RL portfolio work those deltas often sit inside run-to-run noise, so the 0.18 Sharpe lift could be real or it could be optimization luck on the chosen penalties. The free parameters (turnover weight, text-risk coefficient, learning rates) are tuned on the validation slice, which is standard but still leaves the test performance tied to that specific choice. No post-hoc data exclusions are described, but the protocol is thin on baseline implementation details.

This is the kind of paper that belongs in a quant-finance reading group or a specialized conference track. A serious referee could usefully press for variance numbers and clearer ablation code, but the idea is coherent enough on its own terms that it deserves that look rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes Hierarchical Reinforced Trader (HRT), a bi-level RL framework with a High-Level Controller (HLC) for sparse directional asset selection from market/text signals and a Low-Level Controller (LLC) for risk-aware execution under turnover/drawdown/text-risk penalties. It evaluates on a fixed 89-stock Nasdaq universe (2013-2018 train, 2019 val, 2020-2023 test) and claims HRT achieves the best return-risk-cost trade-off among market-proxy, same-universe, alpha-only, flat-RL, and ablation baselines, improving Sharpe from 1.06 (HRT-Base) to 1.24 and reducing daily turnover from 0.112 to 0.090 while remaining robust to transaction costs.

Significance. If the reported gains prove robust, the bi-level decomposition provides a practical method for scaling RL to high-dimensional portfolio actions while incorporating text-derived risk signals, addressing a key challenge in automated equity trading.

major comments (3)

[Abstract] Abstract and results section: The central performance claims (Sharpe 1.24 vs. 1.06, turnover 0.090 vs. 0.112) are single point estimates with no reported standard errors, multi-seed averages, or statistical tests. In RL settings, policy-gradient variance routinely produces Sharpe fluctuations of 0.1-0.3 across seeds on identical data, so the 0.18 delta cannot be confidently attributed to the hierarchical design rather than training stochasticity.
[Abstract] Evaluation protocol (described in abstract): Reliance on one fixed 2013-2018/2019/2020-2023 split on a static 89-stock universe without rolling windows, cross-validation, or regime-stratified tests leaves open whether gains are regime-specific to 2020-2023 or generalizable; this directly affects the claim of the 'strongest' trade-off.
[Abstract] Abstract: The comparison to flat-RL and other baselines lacks detail on whether those baselines received identical reward-coefficient tuning, network architectures, or data-exclusion rules as HRT; without this, the superiority claim cannot be isolated to the bi-level structure.
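
The first comment asks, in effect, for per-seed statistics. A minimal sketch of that reporting is below, using a mean with standard error and Welch's t statistic as one reasonable choice of test. The per-seed values are made-up placeholders that exist only to exercise the functions; they are not results from the paper.

```python
import numpy as np

def mean_and_se(values):
    """Mean and standard error of the mean across independent seeds."""
    v = np.asarray(values, dtype=float)
    return v.mean(), v.std(ddof=1) / np.sqrt(len(v))

def welch_t(a, b):
    """Welch's t statistic for two independent groups of per-seed metrics."""
    mean_a, se_a = mean_and_se(a)
    mean_b, se_b = mean_and_se(b)
    return (mean_a - mean_b) / np.sqrt(se_a ** 2 + se_b ** 2)

# PLACEHOLDER per-seed Sharpe values, invented for illustration only.
placeholder_full = [1.0, 1.2, 1.4]
placeholder_base = [0.9, 1.0, 1.1]

m_full, se_full = mean_and_se(placeholder_full)
t_stat = welch_t(placeholder_full, placeholder_base)
print(f"full: {m_full:.2f} +/- {se_full:.2f}, Welch t = {t_stat:.2f}")
```

With only a handful of seeds the t statistic has few degrees of freedom, so a gap that looks large as a point estimate can still be statistically weak; that is the substance of the referee's concern.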

minor comments (2)

Clarify the precise functional form and coefficient values of the text-risk, drawdown, and turnover penalties in the LLC objective function.
Specify the RL algorithm (e.g., PPO, SAC) and network dimensions used for HLC and LLC.
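
To show what the first minor comment is asking the authors to pin down, here is one common additive form for a penalized execution reward. The functional form, the coefficient defaults, the drawdown definition (drop from running peak), and the text-risk exposure (weights dotted with per-asset risk scores) are all hypothetical; the paper's actual objective may differ.

```python
import numpy as np

def llc_reward(portfolio_return, w_prev, w_new, equity_curve, text_risk,
               lam_turnover=0.1, lam_drawdown=0.1, lam_text=0.1):
    """Hypothetical penalized reward: return minus three weighted penalties."""
    w_prev = np.asarray(w_prev, dtype=float)
    w_new = np.asarray(w_new, dtype=float)
    equity = np.asarray(equity_curve, dtype=float)

    turnover = np.abs(w_new - w_prev).sum()           # trading intensity this step
    peak = equity.max()
    drawdown = (peak - equity[-1]) / peak             # current drop from running peak
    text_exposure = float(np.dot(w_new, text_risk))   # weight placed on text-flagged assets

    return (portfolio_return
            - lam_turnover * turnover
            - lam_drawdown * drawdown
            - lam_text * text_exposure)

example = llc_reward(0.01, [0.5, 0.5], [0.6, 0.4], [1.0, 1.1, 0.99], [0.0, 1.0])
```

Each `lam_*` coefficient here corresponds to one of the free parameters the ledger lists, which is why their reported values matter for reproducing the Sharpe and turnover figures.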

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions to strengthen the paper.


Referee: [Abstract] Abstract and results section: The central performance claims (Sharpe 1.24 vs. 1.06, turnover 0.090 vs. 0.112) are single point estimates with no reported standard errors, multi-seed averages, or statistical tests. In RL settings, policy-gradient variance routinely produces Sharpe fluctuations of 0.1-0.3 across seeds on identical data, so the 0.18 delta cannot be confidently attributed to the hierarchical design rather than training stochasticity.

Authors: We agree that single-point estimates limit assessment of variability in RL training. In the revised manuscript we will report all key metrics as averages over multiple random seeds together with standard errors, allowing readers to evaluate whether the observed improvements exceed typical training stochasticity.
Revision: yes

Referee: [Abstract] Evaluation protocol (described in abstract): Reliance on one fixed 2013-2018/2019/2020-2023 split on a static 89-stock universe without rolling windows, cross-validation, or regime-stratified tests leaves open whether gains are regime-specific to 2020-2023 or generalizable; this directly affects the claim of the 'strongest' trade-off.

Authors: The fixed temporal split follows the public benchmark protocol whose timestamp-clean text data are only available for the 2020-2023 test window. We will add an explicit limitations paragraph discussing the absence of rolling-window or regime-stratified tests and the consequent possibility that results are influenced by 2020-2023 market conditions. Additional regime-stratified breakdowns will be included where they can be computed without violating the benchmark constraints.
Revision: partial

Referee: [Abstract] Abstract: The comparison to flat-RL and other baselines lacks detail on whether those baselines received identical reward-coefficient tuning, network architectures, or data-exclusion rules as HRT; without this, the superiority claim cannot be isolated to the bi-level structure.

Authors: We will expand the experimental-details section to document the hyperparameter search procedure, network architectures, reward-coefficient grids, and data-exclusion rules applied uniformly to HRT and all baselines. This will make the fairness of the comparison explicit and allow isolation of the bi-level contribution.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical RL evaluation on held-out test window


The paper describes a bi-level RL agent trained on 2013-2018 data, validated on 2019, and evaluated on the separate 2020-2023 window using a fixed 89-stock universe. Reported Sharpe (1.24) and turnover (0.090) are out-of-sample metrics produced by running the trained policy on unseen future data; they are not obtained by re-fitting or re-using the training objective on the test set itself. No equations, self-citations, or uniqueness theorems are invoked in the abstract or description that would reduce the central claim to a definitional loop or fitted-input renaming. Standard train/test separation in RL portfolio papers is externally falsifiable and does not meet any of the enumerated circularity patterns.
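
The train/validation/test separation described here can be expressed as a small date-based check. The source gives only years, so the exact calendar boundaries below are assumptions, and the helper names are hypothetical.

```python
from datetime import date

# Year ranges come from the paper; the exact day boundaries are assumed.
SPLITS = {
    "train": (date(2013, 1, 1), date(2018, 12, 31)),
    "validation": (date(2019, 1, 1), date(2019, 12, 31)),
    "test": (date(2020, 1, 1), date(2023, 12, 31)),
}

def assign_split(d):
    """Return the split name containing date d, or None if d is outside all splits."""
    for name, (start, end) in SPLITS.items():
        if start <= d <= end:
            return name
    return None

def splits_are_ordered(splits):
    """True if each split ends strictly before the next one starts (no overlap)."""
    bounds = list(splits.values())
    return all(prev_end < next_start
               for (_, prev_end), (next_start, _) in zip(bounds, bounds[1:]))

print(assign_split(date(2019, 6, 3)), splits_are_ordered(SPLITS))
```

A check like this only guards against overlapping windows; it says nothing about whether text features were timestamped before the prices they are used to predict, which is the separate look-ahead question raised in the axiom ledger.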

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard RL assumptions plus several free parameters in the reward function and the invented controllers; no machine-checked proofs or external benchmarks beyond the reported backtest are supplied.

free parameters (3)

turnover penalty coefficient: Tuned to balance trading frequency against returns in the low-level controller.
drawdown and text-risk penalty weights: Chosen to incorporate risk signals; directly affect the reported Sharpe and turnover.
RL learning rates and discount factors: Standard hyperparameters fitted during training on 2013-2018 data.

axioms (2)

Domain assumption: A bi-level policy decomposition yields near-optimal joint actions without enumerating the full space. Invoked to justify separating HLC and LLC rather than training a flat RL agent.
Domain assumption: Text-derived signals can be processed without look-ahead bias under the benchmark protocol. Required for the out-of-sample claims on 2020-2023 data.

invented entities (2)

High-Level Controller (HLC) · no independent evidence
Purpose: Produces sparse asset-level increase/reduce/hold decisions from compact signals. New component introduced to factorize the action space.
Low-Level Controller (LLC) · no independent evidence
Purpose: Converts directions into feasible weight adjustments under turnover and risk penalties. New component introduced to handle execution constraints.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: HRT separates trading into two coordinated decisions: a factorized sparse High-Level Controller (HLC) selects asset-level increase, reduce, or hold directions ... while a risk-aware Low-Level Controller (LLC) converts these directions into feasible portfolio weight adjustments

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: The HLC’s reward ... merges the real-world price movement alignment reward with the downstream feedback from the Low-Level Controller

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

