AlphaQuanter: An End-to-End Tool-Augmented Agentic Reinforcement Learning Framework for Stock Trading

Changlong Yu; Jiashu Wang; Weixiang Yan; Zheye Deng

arxiv: 2510.14264 · v2 · submitted 2025-10-16 · 💻 cs.CE

Pith reviewed 2026-05-18 06:52 UTC · model grok-4.3

classification 💻 cs.CE

keywords: stock trading, reinforcement learning, LLM agent, tool augmentation, automated trading, interpretable reasoning, end-to-end optimization

The pith

A single reinforcement learning agent learns coherent stock trading strategies by dynamically orchestrating tools from market feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AlphaQuanter to fix problems in multi-agent LLM trading systems, such as inefficiency and inconsistent signals. It does this by training one agent with reinforcement learning to control a transparent workflow where the agent calls tools to gather information and make decisions as needed. This setup allows the entire process to optimize end-to-end based on how trades perform in the market. A sympathetic reader would care because it points to simpler agent designs that could produce more reliable automated trading while also generating readable explanations of the strategies used. The experiments show this leads to strong results on standard financial measures and surfaces trading patterns that humans might find useful.

Core claim

AlphaQuanter shows that reinforcement learning applied to a single agent can train a dynamic policy over a tool-augmented decision workflow. This lets the agent autonomously decide when to acquire data and when to act, creating a consistent trading approach directly from market outcomes rather than relying on separate agents that must coordinate.

What carries the argument

The tool-augmented decision workflow governed by a reinforcement learning policy, which lets the agent choose tool calls and actions to build and execute trading plans based on ongoing market results.
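
The workflow described above can be pictured as a loop in which one policy repeatedly chooses between calling a tool and committing to a trade. The following is a minimal sketch under stated assumptions, not the paper's implementation: the tool names (`get_prices`, `get_news`), their return strings, the tool-call budget, and the random placeholder policy are all hypothetical and stand in for the learned LLM policy.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tool registry: names and return values are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "get_prices": lambda ticker: f"{ticker}: last close 101.2, 5-day return +1.4%",
    "get_news": lambda ticker: f"{ticker}: no major headlines",
}
TRADE_ACTIONS = ("BUY", "HOLD", "SELL")


@dataclass
class Episode:
    ticker: str
    observations: List[str] = field(default_factory=list)  # tool outputs so far
    steps: List[str] = field(default_factory=list)          # log of every choice


def random_policy(episode: Episode, rng: random.Random) -> str:
    """Placeholder for the learned policy: picks a tool call or a trade."""
    return rng.choice(list(TOOLS) + list(TRADE_ACTIONS))


def run_episode(
    ticker: str,
    policy: Callable[[Episode, random.Random], str],
    rng: random.Random,
    max_tool_calls: int = 4,
) -> Episode:
    """Run one decision episode: gather information on demand, then trade."""
    episode = Episode(ticker)
    tool_calls = 0
    while True:
        choice = policy(episode, rng)
        if choice in TOOLS and tool_calls >= max_tool_calls:
            choice = "HOLD"  # assumed fallback once the tool budget is spent
        episode.steps.append(choice)
        if choice in TRADE_ACTIONS:
            return episode  # a trade decision ends the episode
        episode.observations.append(TOOLS[choice](ticker))
        tool_calls += 1
```

Because every choice is appended to `episode.steps`, the full sequence of tool calls leading to a trade can be inspected afterwards, which is the sense in which such a workflow is transparent.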

If this is right

Trading decisions become more consistent because one policy learns the full sequence of tool use and actions from direct feedback.
The agent's step-by-step reasoning supplies concrete examples of strategy logic that human traders can examine and adapt.
End-to-end training removes the need for separate modules to align on signals, which reduces wasted computation during operation.
The same workflow structure could support decisions in other sequential financial tasks where information must be gathered on demand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This single-agent design might scale to live trading environments with lower latency than systems that require multiple agents to negotiate.
The transparent tool calls could be logged and replayed by human traders to test similar information-gathering habits in their own processes.
Extending the framework to include transaction costs and slippage in the reward signal would provide a stricter test of real-world viability.
The approach connects to broader questions about whether tool use in agents benefits from unified policy training rather than modular decomposition.
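
The suggestion above about transaction costs and slippage can be made concrete with a small sketch. It assumes a position expressed as a fraction of capital in [-1, 1] and charges a cost proportional to turnover; the function name and the basis-point defaults are hypothetical and are not taken from the paper.

```python
def net_return(
    position_before: float,
    position_after: float,
    price_return: float,
    cost_bps: float = 5.0,      # assumed commission, in basis points
    slippage_bps: float = 2.0,  # assumed slippage, in basis points
) -> float:
    """One-period return of the new position minus trading frictions.

    Frictions are charged on turnover, the absolute change in position,
    so holding an unchanged position costs nothing.
    """
    turnover = abs(position_after - position_before)
    friction = turnover * (cost_bps + slippage_bps) * 1e-4
    return position_after * price_return - friction
```

Under these assumed defaults, moving from flat to fully long on a day with a 1% price gain yields 0.0093 rather than 0.01, and a policy that flips positions frequently pays the friction term on every flip.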

Load-bearing premise

Market feedback can train a reinforcement learning policy that produces a coherent and consistent trading approach in a single-agent tool workflow without the coordination problems seen in multi-agent systems.

What would settle it

Backtests or forward tests on the same market data where the single-agent system generates lower returns, worse risk-adjusted metrics, or more inconsistent signals than comparable multi-agent trading setups would indicate the approach does not deliver the claimed coherence.
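
For readers who want to run such a comparison, the two most common risk-adjusted measures can be computed from a series of per-period returns as sketched below. This is a generic illustration, not the paper's evaluation code; the function names and the 252-periods-per-year convention are assumptions.

```python
import math
import statistics
from typing import Sequence


def annualized_sharpe(
    returns: Sequence[float],
    periods_per_year: int = 252,
    risk_free_per_period: float = 0.0,
) -> float:
    """Mean excess return over its sample standard deviation, annualized.

    Requires at least two observations; returns 0.0 for a constant series.
    """
    excess = [r - risk_free_per_period for r in returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        return 0.0
    return statistics.mean(excess) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```

Applying both functions to the return series of a single-agent and a multi-agent system over the same test window gives the kind of side-by-side evidence the paragraph above calls for.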

Figures

Figures reproduced from arXiv: 2510.14264 by Changlong Yu, Jiashu Wang, Weixiang Yan, Zheye Deng.

**Figure 2.** Comparison of training dynamics for the AlphaQuanter-3B and -7B models.

**Figure 3.** Comparison of key backtesting metrics for the AlphaQuanter-3B and -7B models on the ...

**Figure 4.** Evolution of the tool selection strategies

**Figure 5.** The effect of different decision threshold (θ) values on the agent’s action distribution during training.

**Figure 6.** Full prompt for the AlphaQuanter agent.

**Figure 7.** A comparative analysis of the training dynamics for the AlphaQuanter-3B and -7B models, ...

Original abstract

While Large Language Model (LLM) agents show promise in automated trading, they still face critical limitations. Prominent multi-agent frameworks often suffer from inefficiency, produce inconsistent signals, and lack the end-to-end optimization required to learn a coherent strategy from market feedback. To address this, we introduce AlphaQuanter, a single-agent framework that uses reinforcement learning (RL) to learn a dynamic policy over a transparent, tool-augmented decision workflow, which empowers a single agent to autonomously orchestrate tools and proactively acquire information on demand, establishing a transparent reasoning process. Extensive experiments demonstrate that AlphaQuanter achieves state-of-the-art performance on key financial metrics. Moreover, its interpretable reasoning reveals sophisticated strategies, offering novel and valuable insights for human traders. Our code and data can be found at https://github.com/horizon-llm/AlphaQuanter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AlphaQuanter shifts to single-agent RL over tool use for trading, but the abstract leaves the performance claims and credit-assignment details unexamined.

The key point with this paper is that it replaces the usual multi-agent LLM trading setups with a single agent that learns a policy over both tool calls and trading decisions using reinforcement learning. This end-to-end training is meant to produce more consistent behavior than separate agents arguing with each other. The paper does a decent job laying out the workflow: the agent can call tools to get market data or news on demand and then decide on positions, all under one RL loop. Releasing the code and data is helpful and lets others reproduce or extend it. The claim that the reasoning becomes interpretable and shows sophisticated strategies is interesting if it holds.

On the downside, the abstract asserts SOTA performance but skips any concrete comparison details, statistical tests, or ablation results. That makes it tough to evaluate whether the single-agent approach actually solves the problems they identify or if the results are sensitive to the specific market conditions tested.

The concern in the stress-test note about RL credit assignment is worth taking seriously here. Market rewards are sparse and non-stationary, so it's not obvious that standard policy gradients will properly credit useful information-gathering tool uses versus just profitable trades that happened to occur. If the full paper has no analysis of this, the central advantage over multi-agent methods remains unproven.

Readers who work on LLM agents for quantitative finance would find this relevant. It gives a concrete alternative architecture that they could implement and test on their own data. For someone outside that niche, the lack of detailed evidence limits how much they can take away.

Overall, I would send this to peer review. The framework is described clearly enough and the open-source aspect makes it feasible for referees to verify the implementation, even if the experimental section needs strengthening.
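
The credit-assignment concern in this letter can be illustrated with a toy example. The sketch below assumes one scalar reward per trajectory, normalized within a group of sampled trajectories and then shared by every step, in the style of the group-relative scheme from DeepSeekMath (listed in the reference graph). Whether AlphaQuanter assigns credit this way is not stated on this page; the function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def group_relative_advantages(trajectory_rewards: Sequence[float]) -> List[float]:
    """Normalize each trajectory's reward by the group's mean and std."""
    mean = statistics.mean(trajectory_rewards)
    sd = statistics.pstdev(trajectory_rewards)
    if sd == 0:
        return [0.0] * len(trajectory_rewards)
    return [(r - mean) / sd for r in trajectory_rewards]


def per_step_credit(
    trajectories: Sequence[Sequence[str]],
    rewards: Sequence[float],
) -> List[List[Tuple[str, float]]]:
    """Give every step in a trajectory the same trajectory-level advantage."""
    advantages = group_relative_advantages(rewards)
    return [
        [(step, adv) for step in steps]
        for steps, adv in zip(trajectories, advantages)
    ]


# Two sampled trajectories for the same day: one reads the news before buying
# and happens to profit, the other buys immediately and happens to lose.
credit = per_step_credit([["get_news", "BUY"], ["BUY"]], [1.0, -1.0])
# credit[0] == [("get_news", 1.0), ("BUY", 1.0)]
```

In this toy setup the `get_news` call receives exactly the same positive credit as the profitable `BUY`, whether or not the news actually informed the decision. Only averaging over many trajectories can separate the two, which is why sparse and non-stationary market rewards make the question non-trivial.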

Referee Report

2 major / 2 minor

Summary. The paper introduces AlphaQuanter, a single-agent tool-augmented agentic reinforcement learning framework for stock trading. It argues that multi-agent LLM frameworks suffer from inefficiency and inconsistent signals, and proposes instead an end-to-end RL policy over a transparent workflow that lets a single agent dynamically select tools and acquire information on demand. The central claims are that this yields state-of-the-art performance on standard financial metrics and produces interpretable reasoning that reveals sophisticated trading strategies useful to human traders. Code and data are released.

Significance. If the empirical claims hold after proper controls, the work would be significant for automated trading research: it offers a concrete alternative to multi-agent LLM systems by demonstrating that end-to-end RL can produce coherent, interpretable policies from market feedback alone. The open-source release and emphasis on transparency are positive contributions that could enable follow-up studies on credit assignment in financial RL.

major comments (2)

[§4] Experiments: the claim of SOTA performance is not supported by the reported results. No table or section provides the full set of baselines (including recent multi-agent LLM trading agents and standard RL benchmarks), the exact train/test split dates, or statistical significance tests (e.g., t-tests or bootstrap confidence intervals) on Sharpe ratio or cumulative return differences. Without these, the superiority over multi-agent frameworks cannot be verified.
[§3.2] RL formulation: the reward signal and credit-assignment mechanism across tool-selection actions and final trade decisions are not specified in sufficient detail. In non-stationary markets with sparse, delayed returns, standard policy-gradient updates risk conflating informative tool use with lucky trades; the manuscript does not show ablations that isolate the contribution of the end-to-end RL objective versus simpler supervised or heuristic baselines.
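
The bootstrap confidence interval requested in the first major comment could be computed roughly as follows. This is a minimal sketch assuming paired per-period return series for two strategies over the same test window; the function names, the resample count, and the use of a per-period (non-annualized) Sharpe ratio are illustrative choices. It resamples periods independently, which ignores autocorrelation in returns, so a block bootstrap may be more appropriate in practice.

```python
import math
import random
from typing import Sequence, Tuple


def sharpe(returns: Sequence[float]) -> float:
    """Per-period Sharpe ratio with zero risk-free rate (not annualized)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0


def bootstrap_sharpe_diff(
    returns_a: Sequence[float],
    returns_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval for Sharpe(A) - Sharpe(B).

    The same resampled period indices are used for both strategies, so the
    comparison stays paired on market conditions.
    """
    if len(returns_a) != len(returns_b):
        raise ValueError("return series must cover the same periods")
    rng = random.Random(seed)
    n = len(returns_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = [returns_a[i] for i in idx]
        sample_b = [returns_b[i] for i in idx]
        diffs.append(sharpe(sample_a) - sharpe(sample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the Sharpe difference is unlikely to be a resampling artifact at the chosen level; an interval that straddles zero would undercut a superiority claim.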

minor comments (2)

[Figure 3] Figure 3 and the accompanying text use inconsistent notation for the policy network output (sometimes π(a|s), sometimes π_θ). Standardize throughout.
[§4.1] The abstract states 'extensive experiments' but the main text does not report the number of independent runs or random seeds used for each method; add this to §4.1.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions planned for the next version.

Referee: [§4] Experiments: the claim of SOTA performance is not supported by the reported results. No table or section provides the full set of baselines (including recent multi-agent LLM trading agents and standard RL benchmarks), the exact train/test split dates, or statistical significance tests (e.g., t-tests or bootstrap confidence intervals) on Sharpe ratio or cumulative return differences. Without these, the superiority over multi-agent frameworks cannot be verified.

Authors: We agree that the experimental reporting can be strengthened to better substantiate the SOTA claims. In the revised manuscript we will add a comprehensive results table that includes all relevant baselines (recent multi-agent LLM trading agents and standard RL benchmarks), explicitly state the precise train/test split dates, and report statistical significance tests (t-tests and bootstrap confidence intervals) on differences in Sharpe ratio and cumulative return. These additions will enable direct verification of the claimed performance advantages.

Revision: yes

Referee: [§3.2] RL formulation: the reward signal and credit-assignment mechanism across tool-selection actions and final trade decisions are not specified in sufficient detail. In non-stationary markets with sparse, delayed returns, standard policy-gradient updates risk conflating informative tool use with lucky trades; the manuscript does not show ablations that isolate the contribution of the end-to-end RL objective versus simpler supervised or heuristic baselines.

Authors: We acknowledge the need for greater detail on the reward signal and credit assignment. We will expand §3.2 to specify the reward function and the mechanism for assigning credit across tool-selection actions and final trade decisions. We will also add ablation studies that compare the full end-to-end RL objective against supervised and heuristic baselines, helping to isolate its contribution under non-stationary conditions and sparse rewards.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; RL policy derives from external market feedback

The paper introduces AlphaQuanter as a single-agent RL framework that learns a dynamic policy over a tool-augmented workflow by optimizing against market feedback. No load-bearing equations, self-definitional reductions, fitted inputs renamed as predictions, or self-citation chains appear in the provided abstract or described derivation. The central claims rest on empirical SOTA results from external financial metrics and interpretable reasoning traces, which are independent of the framework's own outputs. This is the standard case of a self-contained empirical RL setup without circular collapse.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes standard RL convergence properties and market data availability.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "single-agent framework that uses reinforcement learning (RL) to learn a dynamic policy over a transparent, tool-augmented decision workflow"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Reward Function R ... exponentially weighted forward return rt ... discrete rewards by action"
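
The quoted passage on the reward function is heavily elided, so the exact definition cannot be recovered from this page. As a purely illustrative reading of "exponentially weighted forward return" and "discrete rewards by action", the sketch below assumes geometrically decaying weights over a short forward horizon and a symmetric threshold θ that maps the return to a preferred action. The horizon, decay rate, threshold, and the ±1 reward values are all assumptions, not the paper's settings.

```python
from typing import Sequence


def weighted_forward_return(
    prices: Sequence[float],
    t: int,
    horizon: int = 5,    # assumed look-ahead in periods
    decay: float = 0.5,  # assumed geometric decay of the weights
) -> float:
    """Weighted average of k-step forward returns from time t, weights decay**(k-1)."""
    total, weight_sum = 0.0, 0.0
    for k in range(1, horizon + 1):
        if t + k >= len(prices):
            break  # not enough future data for this step
        weight = decay ** (k - 1)
        total += weight * (prices[t + k] / prices[t] - 1.0)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def discrete_reward(action: str, forward_return: float, theta: float = 0.01) -> float:
    """Reward +1 if the action matches the side implied by the return, else -1."""
    if forward_return > theta:
        target = "BUY"
    elif forward_return < -theta:
        target = "SELL"
    else:
        target = "HOLD"
    return 1.0 if action == target else -1.0
```

In this reading, a larger θ widens the band in which `HOLD` is the rewarded action, which is one plausible way a decision threshold could shift the agent's action distribution during training.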

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

[1] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[2] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[3] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[4] John Moody and Matthew Saffell. Reinforcement learning for trading. In M. Kearns, S. Solla, and D. Cohn, editors, Advances in Neural Information Processing Systems, volume 11. MIT Press, 1998. URL https://proceedings.neurips.cc/paper_files/paper/1998/file/4e6cd95227cb0c280e99a195be5f6615-Paper.pdf

[5] Zhicheng Wang, Biwei Huang, Shikui Tu, Kun Zhang, and Lei Xu. Deeptrader: a deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 643–650, 2021.

[6] Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. Tradingagents: Multi-agents llm financial trading framework. arXiv preprint arXiv:2412.20138, 2024.

[7] Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, Longtao Zheng, Xinrun Wang, and Bo An. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Ricardo Baeza-Yates and Francesco Bonchi, editors, Proceedings of the 30th ACM SIGKDD Conferenc... 2024. doi: 10.1145/3637528.3671801

[8] Saizhuo Wang, Hang Yuan, Leon Zhou, Lionel M Ni, Heung-Yeung Shum, and Jian Guo. Alpha-gpt: Human-ai interactive alpha mining for quantitative investment. arXiv preprint arXiv:2308.00016, 2023.

[9] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.

[10] OpenReview.net, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X (continuation of [9])

[11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[12] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

[13] Lin Zhong. Advancements and applications of artificial intelligence in stock market prediction. 2025.

[14] Yang Liu, Qi Liu, Hongke Zhao, Zhen Pan, and Chuanren Liu. Adaptive quantitative trading: An imitative deep reinforcement learning approach. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 2128–2135, 2020.

[16] Guojun Xiong, Zhiyang Deng, Keyi Wang, Yupeng Cao, Haohang Li, Yangyang Yu, Xueqing Peng, Mingquan Lin, Kaleb E Smith, Xiao-Yang Liu, et al. Flag-trader: Fusion llm-agent with gradient-based reinforcement learning for financial trading. arXiv preprint arXiv:2502.11433, 2025.

[17] Yijia Xiao, Edward Sun, Tong Chen, Fang Wu, Di Luo, and Wei Wang. Trading-r1: Financial trading with llm reasoning via reinforcement learning. arXiv preprint arXiv:2509.11420, 2025.

[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308

[19] Xiao-Yang Liu, Hongyang Yang, Qian Chen, Runjia Zhang, Liuqing Yang, Bowen Xiao, and Christina Dan Wang. Finrl: A deep reinforcement learning library for automated stock trading in quantitative finance. CoRR, abs/2011.09607, 2020. URL https://arxiv.org/abs/2011.09607

[20] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952. ISSN 00221082, 15406261. URL http://www.jstor.org/stable/2975974

[21] Molei Qin, Shuo Sun, Wentao Zhang, Haochong Xia, Xinrun Wang, and Bo An. Earnhft: Efficient hierarchical reinforcement learning for high frequency trading. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Art... 2024. doi: 10.1609/aaai.v38i13.29384

[22] Guojun Xiong, Zhiyang Deng, Keyi Wang, Yupeng Cao, Haohang Li, Yangyang Yu, Xueqing Peng, Mingquan Lin, Kaleb E. Smith, Xiao-Yang Liu, Jimin Huang, Sophia Ananiadou, and Qianqian Xie. FLAG-TRADER: fusion llm-agent with gradient-based reinforcement learning for financial trading. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pi... 2025.

[27] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conn... 2024.

[28] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297. ACM, 2025. doi: 10.1145/3689031.3696075

[29] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
