Recognition: no theorem link
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents
Pith reviewed 2026-05-11 00:47 UTC · model grok-4.3
The pith
SHARP confines LLM trading agents to an explicit rubric of condition-action rules and uses cross-sample attribution to isolate and fix rule failures, yielding 10-20 percentage-point gains for compact models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SHARP replaces unbounded free-form prompt optimization with structured symbolic policy optimization that confines the agent's reasoning to a bounded, human-readable rubric of explicit condition-action rules. When sub-optimal trades occur, an attribution agent reasons across multiple sampled trajectories to isolate specific rule failures, enabling targeted atomic policy edits that are then regularized through strict walk-forward validation. Evaluated across three diverse equity sectors and four LLM backbones, this process consistently transforms generic initial heuristics into robust strategies.
What carries the argument
The SHARP neuro-symbolic framework: a bounded rubric of explicit condition-action rules combined with an attribution agent that performs cross-sample reasoning to isolate individual rule failures for atomic edits.
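The paper does not publish its rubric schema; as a hedged illustration only, a bounded rubric of condition-action rules with atomic edits might be represented like this (all names and the `max_rules` bound are hypothetical, not SHARP's actual interface):

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    """One auditable condition-action pair (illustrative schema)."""
    rule_id: str
    condition: str  # human-readable predicate, e.g. "3-day sector momentum < -2%"
    action: str     # bounded action, e.g. "halve long exposure"

@dataclass
class Rubric:
    """A bounded, human-readable policy: the agent may only apply these rules."""
    rules: list[Rule] = field(default_factory=list)
    max_rules: int = 20  # hypothetical cap keeping the policy auditable

    def edit(self, rule_id: str, new_condition: str, new_action: str) -> None:
        """Atomic edit: rewrite exactly one rule, leaving all others untouched."""
        for r in self.rules:
            if r.rule_id == rule_id:
                r.condition, r.action = new_condition, new_action
                return
        raise KeyError(rule_id)

rubric = Rubric([Rule("R1", "news sentiment strongly negative", "exit long positions")])
rubric.edit("R1", "news sentiment negative for 2+ consecutive days", "halve long exposure")
print(rubric.rules[0].action)  # -> halve long exposure
```

Because every change is a named edit to a named rule, the full policy history stays diffable and human-auditable, which is the property the paper leans on.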
If this is right
- Compact LLMs such as GPT-4o-mini achieve 10-20 percentage point average gains in empirical trading performance.
- Policies remain structurally transparent and human-auditable, meeting institutional finance requirements.
- Targeted edits reduce policy drift compared with unstructured optimization in non-stationary markets.
- The framework operates consistently across multiple equity sectors and LLM backbones.
Where Pith is reading between the lines
- The cross-sample attribution step could be tested in other delayed-reward domains such as robotic control or sequential game strategies to check transferability.
- Combining the rubric structure with existing symbolic verification tools might further strengthen guarantees against unintended rule interactions.
- The emphasis on walk-forward validation suggests a general template for safe self-modification in any agent that receives noisy scalar feedback.
Load-bearing premise
That an attribution agent using cross-sample reasoning can reliably isolate specific rule failures in low signal-to-noise market data without introducing new selection biases or missing interactions among rules.
What would settle it
A controlled test in which known rule defects are injected into simulated trading histories and the attribution agent is measured on whether it correctly identifies and isolates only those defective rules without false positives across varying noise levels.
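The settling experiment can be sketched in miniature. The snippet below injects a known defect into one rule of a simulated history and checks whether a toy cross-sample attributor (a simple loss co-occurrence vote, standing in for the paper's LLM agent) recovers it; every function here is an assumption for illustration, not SHARP's code.

```python
import random

def attribute(trades, rules):
    """Toy cross-sample attribution: blame the rule whose firings co-occur
    most often with losing trades, pooled across many samples."""
    score = {r: 0 for r in rules}
    for fired, pnl in trades:
        if pnl < 0:
            for r in fired:
                score[r] += 1
    return max(score, key=score.get)

def run_trial(defective_rule, rules, n_trades=500, noise=0.3, seed=0):
    """Simulate trades: the defective rule adds a systematic negative P&L
    drift; all other rules are neutral. Return True if attribution finds it."""
    rng = random.Random(seed)
    trades = []
    for _ in range(n_trades):
        fired = [r for r in rules if rng.random() < 0.5]  # rules fire independently
        pnl = rng.gauss(0.0, noise)                       # market noise
        if defective_rule in fired:
            pnl -= 0.3                                    # injected rule defect
        trades.append((fired, pnl))
    return attribute(trades, rules) == defective_rule

rules = ["R1", "R2", "R3", "R4"]
hits = sum(run_trial("R3", rules, seed=s) for s in range(20))
print(f"attribution accuracy: {hits}/20")
```

Sweeping `noise` upward and tracking precision/recall against the injected defect is exactly the controlled measurement the review asks for.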
read the original abstract
Large language models (LLMs) are increasingly deployed for autonomous financial trading, a domain requiring continuous adaptation to noisy, non-stationary markets. Existing self-improving agents typically address this through unbounded free-form prompt optimization. However, in low signal-to-noise environments with delayed scalar rewards (P&L), this unstructured approach exacerbates the fundamental credit assignment problem: optimizers cannot reliably distinguish systematic logic flaws from stochastic market variance, inevitably leading to policy drift. To overcome this bottleneck, we introduce the Self-Evolving Human-Auditable Rubric Policy (SHARP), a neuro-symbolic framework that replaces unconstrained text mutation with structured, symbolic policy optimization. SHARP confines the agent's reasoning to a bounded, human-readable rubric of explicit condition-action rules. When sub-optimal trades occur, an attribution agent employs cross-sample reasoning across multiple samples to isolate specific rule failures. This enables targeted, atomic policy edits that are subsequently regularized through strict walk-forward validation. Evaluated across three diverse equity sectors and four LLM backbones, SHARP consistently transforms generic initial heuristics into highly robust strategies, lifting the empirical performance of compact models by 10 to 20 percentage points on average (e.g., GPT-4o-mini). Ultimately, SHARP demonstrates that LLMs can achieve dynamic and efficient adaptation while significantly enhancing the structural transparency and auditability demanded by institutional finance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SHARP, a neuro-symbolic framework for self-evolving LLM-based financial trading agents. It replaces unconstrained prompt optimization with a bounded human-auditable rubric of explicit condition-action rules. An attribution agent applies cross-sample reasoning over multiple trajectories to isolate specific rule failures, enabling targeted atomic edits that are regularized via strict walk-forward validation. Evaluated on three equity sectors and four LLM backbones, the approach is claimed to convert generic initial heuristics into robust strategies, yielding average performance lifts of 10-20 percentage points for compact models such as GPT-4o-mini.
Significance. If the attribution mechanism can be shown to reliably isolate rule failures amid market noise, SHARP would provide a concrete advance in transparent, auditable self-improving trading systems. The structured symbolic policy and walk-forward regularization directly target the credit-assignment problem in delayed, low-SNR reward settings, offering a more controllable alternative to free-form LLM optimization while satisfying institutional demands for human oversight.
major comments (2)
- [§3.2] §3.2 (Attribution Agent): The central empirical claim of 10-20 pp gains rests on the attribution step correctly diagnosing which condition-action rule caused suboptimal trades. No controlled test of attribution precision (e.g., synthetic failure injection or precision/recall metrics on known rule defects) is reported, leaving open the possibility that observed improvements arise from spurious correlations rather than accurate edits.
- [§4] §4 (Evaluation): Walk-forward validation is presented as sufficient regularization, yet it only checks downstream performance and does not audit whether the preceding attribution correctly identified the responsible rule. An incorrect edit can still pass validation if it improves the subsequent window by chance, undermining the causal interpretation of the reported lifts.
minor comments (2)
- [Abstract and §4] The abstract and §4 should explicitly state the primary performance metric (e.g., total P&L, Sharpe ratio, or win rate) and the exact baseline definitions used for the 10-20 pp comparison.
- [§3] Notation for the rubric (condition-action pairs) and the cross-sample attribution procedure could be formalized with a short pseudocode block or equation to improve reproducibility.
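The formalization the last comment requests might look like the following non-executable pseudocode sketch of one evolution step (all helper names are illustrative assumptions, not the paper's actual interfaces):

```python
def sharp_step(rubric, data, llm, n_samples=8):
    # Pseudocode sketch of one SHARP evolution step (hypothetical names).
    # 1. Act: trade the training window, constrained to the rubric's rules.
    trades = execute_with_rubric(llm, rubric, data.train_window)
    # 2. Attribute: cross-sample reasoning over several sub-optimal trades,
    #    so one noisy outcome cannot implicate a rule on its own.
    losses = sample(trades.suboptimal, n_samples)
    suspect_rule = llm.attribute(losses, rubric)
    # 3. Edit: a single atomic change to the suspect rule only.
    candidate = rubric.with_edit(suspect_rule, llm.propose_fix(suspect_rule, losses))
    # 4. Regularize: accept the edit only if it survives walk-forward
    #    validation on data strictly after the attribution window.
    if score(candidate, data.next_window) > score(rubric, data.next_window):
        return candidate  # edit accepted
    return rubric         # edit rejected; policy unchanged
```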
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which highlight important aspects of validating the attribution mechanism. We address each major comment below and will incorporate revisions to provide stronger empirical support for the causal role of the attribution step.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Attribution Agent): The central empirical claim of 10-20 pp gains rests on the attribution step correctly diagnosing which condition-action rule caused suboptimal trades. No controlled test of attribution precision (e.g., synthetic failure injection or precision/recall metrics on known rule defects) is reported, leaving open the possibility that observed improvements arise from spurious correlations rather than accurate edits.
Authors: We acknowledge that the manuscript does not include a direct controlled test of attribution precision, such as synthetic failure injection with precision/recall metrics. The current evidence relies on consistent end-to-end performance lifts across sectors and backbones, combined with the design of cross-sample reasoning to reduce noise sensitivity. To strengthen the causal interpretation, we will add a new controlled experiment in the revised §3.2. This will inject known rule defects into synthetic trajectories and evaluate the attribution agent's ability to correctly identify them, reporting precision, recall, and accuracy metrics across varying noise levels. revision: yes
-
Referee: [§4] §4 (Evaluation): Walk-forward validation is presented as sufficient regularization, yet it only checks downstream performance and does not audit whether the preceding attribution correctly identified the responsible rule. An incorrect edit can still pass validation if it improves the subsequent window by chance, undermining the causal interpretation of the reported lifts.
Authors: We agree that walk-forward validation alone does not directly audit attribution correctness and that chance improvements remain possible. The validation serves to regularize against non-generalizing edits but leaves the attribution step's accuracy as an implicit assumption. In the revision, we will extend §4 to incorporate the synthetic attribution test described above and add a qualitative audit of a random sample of real attribution decisions, documenting whether the isolated rule aligns with observed trade failures. This combined approach will provide direct evidence that performance gains stem from accurate edits rather than spurious correlations. revision: yes
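Walk-forward validation as invoked throughout this exchange can be sketched generically. The window generator below is an illustrative scheme under the usual assumptions (fixed-length rolling windows, no overlap between fit and test data), not the paper's exact protocol:

```python
def walk_forward_windows(n_days, train_len, test_len):
    """Yield (train, test) index windows that strictly respect time order:
    each test window begins where its train window ends, and windows roll
    forward so no edit is ever validated on data it was fitted to."""
    start = 0
    while start + train_len + test_len <= n_days:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len  # roll forward by one test window

windows = list(walk_forward_windows(n_days=10, train_len=4, test_len=2))
for train, test in windows:
    print(list(train), "->", list(test))
```

As the referee notes, passing such a check is necessary but not sufficient: a mis-attributed edit can still clear a test window by chance, which is why the synthetic attribution test is the stronger probe.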
Circularity Check
No circularity: the SHARP framework is an independent engineering construction with empirical claims.
full rationale
The paper introduces SHARP as a neuro-symbolic method that structures LLM trading policies into explicit rubrics, uses an attribution agent for targeted edits, and applies walk-forward validation. No derivation chain, equations, or self-citations are present that reduce the claimed 10-20 pp performance lift to a fitted parameter, self-definition, or renamed input. The performance gains are presented as outcomes of the full pipeline evaluated on external equity data and multiple LLM backbones, without any step that is tautological by construction or that imports uniqueness from prior author work. The attribution mechanism is described as an independent cross-sample reasoning process rather than a reparameterization of the final metric.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Financial markets are noisy, non-stationary environments with delayed scalar rewards (P&L).
- domain assumption Unbounded free-form prompt optimization exacerbates policy drift in low signal-to-noise regimes.
invented entities (2)
-
Attribution agent
no independent evidence
-
Human-auditable rubric of condition-action rules
no independent evidence
Reference graph
Works this paper leans on
- [1] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [3] Alejandro Lopez-Lira and Yuehua Tang. Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv preprint arXiv:2304.07619, 2023.
- [4] George Fatouros, Kostas Metaxas, John Soldatos, and Manos Karathanassis. MarketSenseAI 2.0: Enhancing stock analysis through LLM agents. arXiv preprint arXiv:2502.00415, 2025.
- [5] Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Jordan W. Suchow, Denghui Zhang, and Khaldoun Khashanah. FinMem: A performance-enhanced LLM trading agent with layered memory and character design. IEEE Transactions on Big Data, 2025.
- [6] Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Zhenyu Cui, Rong Liu, et al. FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Advances in Neural Information Processing Systems, 37:137010–137045, 2024.
- [7] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023.
- [8] Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
- [9] Charidimos Papadakis, Angeliki Dimitriou, Giorgos Filandrianos, Maria Lymperaiou, Konstantinos Thomas, and Giorgos Stamou. ATLAS: Adaptive trading with LLM agents through dynamic prompt optimization and multi-agent coordination. arXiv preprint arXiv:2510.15949, 2025.
- [10] Andrew Barto and Richard S. Sutton. Reinforcement Learning: An Introduction. The MIT Press, 2018.
- [11] Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, et al. Auto-Rubric: Learning from implicit weights to explicit rubrics for reward modeling. arXiv preprint arXiv:2510.17314, 2025.
- [12] Yang Li, Yangyang Yu, Haohang Li, Zhi Chen, and Khaldoun Khashanah. TradingGPT: Multi-agent system with layered memory and distinct characters for enhanced financial trading performance. arXiv preprint arXiv:2309.03736, 2023.
- [13] Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4314–4325, 2024.
- [14] Siyi Wu, Junqiao Wang, Zhaoyang Guan, Leyi Zhao, Xinyuan Song, Xinyu Ying, Dexu Yu, Jinhao Wang, Hanlin Zhang, Michele Pak, et al. MountainLion: A multi-modal LLM-based agent system for interpretable and adaptive financial trading. arXiv preprint arXiv:2507.20474, 2025.
- [15] Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, 2024.
- [16] Jun Han, Shuo Zhang, Wei Li, Zhi Yang, Yifan Dong, Tu Hu, Jialuo Yuan, Xiaomin Yu, Yumo Zhu, Fangqi Lou, et al. QuantaAlpha: An evolutionary framework for LLM-driven alpha mining. arXiv preprint arXiv:2602.07085, 2026.
- [17] Ziyi Tang, Zechuan Chen, Jiarui Yang, Jiayao Mai, Yongsen Zheng, Keze Wang, Jinrui Chen, and Liang Lin. AlphaAgent: LLM-driven alpha mining with regularized exploration to counteract alpha decay. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 2813–2822, 2025.
- [18] Yuante Li, Xu Yang, Xiao Yang, Minrui Xu, Xisen Wang, Weiqing Liu, and Jiang Bian. R&D-Agent-Quant: A multi-agent framework for data-centric factors and model joint optimization. arXiv preprint arXiv:2505.15155, 2025.
- [19] Xu Yang, Xiao Yang, Shikai Fang, Bowen Xian, Yuante Li, Jian Wang, Minrui Xu, Haoran Pan, Xinpeng Hong, Weiqing Liu, et al. R&D-Agent: Automating data-driven AI solution building through LLM-powered automated research, development, and evolution. arXiv e-prints, arXiv–2505, 2025.
- [20] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138, 2025.
- [21] Brendan R. Hogan, Xiwen Chen, James T. Wilson, Kashif Rasul, Adel Boyarsky, Thomas Kamei, Anderson Schneider, and Yuriy Nevmyvaka. AlphaLab: Autonomous multi-agent research across optimization domains with frontier LLMs. arXiv preprint arXiv:2604.08590, 2026.
- [22] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [23] Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025.
- [24] William F. Shen, Xinchi Qiu, Chenxi Whitehouse, Lisa Alazraki, Shashwat Goel, Francesco Barbieri, Timon Willi, Akhil Mathur, and Ilias Leontiadis. Rethinking rubric generation for improving LLM judge and reward modeling for open-ended tasks. arXiv preprint arXiv:2602.05125, 2026.
- [25] Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025.
- [26] Jiaxiang Chen, Mingxi Zou, Zhuo Wang, Qifan Wang, Dongning Sun, Chi Zhang, and Zenglin Xu. FinHEAR: Human expertise and adaptive risk-aware temporal reasoning for financial decision-making. arXiv preprint arXiv:2506.09080, 2025.
- [27] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, ... Qwen2.5 technical report, 2025.
- [28] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [29] Robert Almgren, Chee Thum, Emmanuel Hauptmann, and Hong Li. Direct estimation of equity market impact. Risk, 18(7):58–62, 2005.
- [30] Paul C. Tetlock. Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3):1139–1168, 2007.
- [31] Steven L. Heston and Nitish Ranjan Sinha. News versus sentiment: Predicting stock returns from news stories. Technical report, 2015.
- [32] Zheng Tracy Ke, Bryan T. Kelly, and Dacheng Xiu. Predicting returns with text data. Technical report, National Bureau of Economic Research, 2019.
- Broader Impacts: The strongest case for SHARP is not full automation, but a shorter loop between observing a failure mode and editing the policy that caused it. Because adaptation is expressed as explicit rule edits, ...
- [33] Decision time: day T close. The LLM has access to all prices up to the T close and all news up to T 23:59 UTC. 2. Execution: a market-on-open (MOO) order fires at day T+1 open (09:30 ET). 3. Return: r_i = Open_{i,T+2} / Open_{i,T+1} − 1 for each held position i. This avoids the common pitfall of using close-to-close returns with same-day signals, which implicitly assumes exe...
- [34] Random L/S (no real signal): 1,000 Monte Carlo trials. Each trial randomly selects 5 long and 5 short positions from the same 16-stock universe, rebalanced daily with the same 5 bps cost. Transaction costs account for directional flips (e.g., a stock moving from the long to the short leg counts as two trades: closing the old position and opening the new one). Although ...
- [35] Momentum (tuned): rank tickers by k-day return; long top-5, short bottom-5. The lookback k is selected from {1, 2, 3, 5, 10, 20} days by maximizing Sharpe on the combined train+validation data before evaluating on test. This respects the train/test boundary.
- [36] Mean Reversion (tuned): rank tickers by negative k-day return (buy losers, short winners). Same lookback selection procedure as momentum.
- [37] Static rule (no evolution): the same LLM backbone receives the same news, price context, and macro data, and uses the initial rubric R(0) with no evolution applied. The LLM produces (r̂_i, c_i) signals guided only by the initial rules, without any sector-specific adaptation from P&L feedback. This is the key ablation: Static and Evo share the same initializ...