
arxiv: 2605.19337 · v1 · pith:UFP2QEATnew · submitted 2026-05-19 · 💻 cs.AI

Agentic Trading: When LLM Agents Meet Financial Markets

Pith reviewed 2026-05-20 05:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords: LLM agents · financial trading · evaluation protocols · reproducibility · agentic systems · market simulation · survey · closed-loop evaluation

The pith

Studies on LLM trading agents use evaluation protocols so inconsistent that direct comparison and reproduction are nearly impossible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper audits 77 studies that embed large language models as agents in trading systems, where the models perceive market data, reason, output trades, and adapt to feedback. It isolates a primary group of 19 studies that at minimum output actions and close the evaluation loop with market data. The audit shows that only two of these 19 provide extractable time-consistent data splits, only one models transaction costs explicitly, only one handles universe or survivorship bias, and none reach the highest level of reproducibility. The paper therefore treats the rapid growth in agent architectures as background and instead contributes an evidence ledger, a reproducibility audit, and a reporting checklist.

Core claim

Within the primary subset of 19 studies, only 2/19 report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. Architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.
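To put the headline counts on a common scale, the short sketch below converts them to percentages of the primary subset. The counts come from the core claim above; the dictionary keys and the `share` helper are illustrative names, not terminology from the paper.

```python
# Counts reported for the primary subset (n = 19); key names are illustrative.
PRIMARY_SUBSET_SIZE = 19

counts = {
    "time_consistent_split": 2,
    "transaction_cost_model": 1,
    "universe_or_survivorship": 1,
    "execution_timing": 11,
    "coded_R0": 15,
    "reaches_R3": 0,
}


def share(count, total=PRIMARY_SUBSET_SIZE):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(count) for name, count in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]}/{PRIMARY_SUBSET_SIZE} = {pct}%")
```

Read this way, roughly one study in ten reports an extractable split protocol, and about one in twenty reports a cost model or universe handling.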

What carries the argument

The protocol-coded snapshot and reproducibility audit applied to the 77 studies, using Architecture-Capability-Adaptation as a working analytical lens.

If this is right

  • Without time-consistent splits, performance numbers across different agent designs cannot be compared fairly.
  • Absence of explicit transaction-cost models means reported returns are gross figures that may overstate realistic net performance.
  • Lack of documented universe or survivorship handling leaves possible selection bias undetectable in the reported results.
  • The proposed reporting checklist would allow future studies to reach higher reproducibility tiers.
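The transaction-cost point can be made concrete with a minimal sketch. It assumes a single asset, a position weight chosen each period, and a proportional cost charged on turnover; the cost model, the 10 bps default, and all names are illustrative assumptions and are not taken from the paper or from any audited study.

```python
def net_returns(asset_returns, weights, cost_bps=10.0):
    """Per-period net returns under a proportional cost on turnover.

    asset_returns[t] is the asset's simple return in period t and
    weights[t] is the position held during period t (0.0 = flat,
    1.0 = fully invested). The position before the first period is
    assumed to be flat. cost_bps is a hypothetical cost in basis points.
    """
    cost_rate = cost_bps / 10_000
    previous_weight = 0.0
    result = []
    for asset_return, weight in zip(asset_returns, weights):
        turnover = abs(weight - previous_weight)
        result.append(weight * asset_return - cost_rate * turnover)
        previous_weight = weight
    return result


# Hypothetical three-period example: enter, hold, exit.
asset_returns = [0.01, -0.02, 0.03]
weights = [1.0, 1.0, 0.0]

gross = [w * r for w, r in zip(weights, asset_returns)]
net = net_returns(asset_returns, weights, cost_bps=10.0)
print("gross:", gross)
print("net:  ", net)
```

Even in this toy case the entry and exit trades each subtract the cost rate from the period's return, which is why results reported without a stated cost model are hard to compare with results that include one.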

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardized split and cost protocols would turn the current set of incomparable experiments into a cumulative body of evidence.
  • Adoption of the audit framework could serve as a template for evaluating agentic systems in other high-stakes domains such as portfolio allocation or risk management.

Load-bearing premise

The screening and coding protocol applied to the 77 studies captures the evaluation practices of the broader literature on LLM trading agents accurately and without selection bias.

What would settle it

A re-screening and re-coding of the same or updated literature that finds substantially more than 2/19 studies with extractable time-consistent splits or at least one study reaching R3 reproducibility.

Figures

Figures reproduced from arXiv: 2605.19337 by Fang Liu, Han Qi, Panpan You, Shengli Zhang, Taotao Wang, Xiaoxiao Wu, Yihan Xia.

Figure 1: The Agency Spectrum of Trading Systems. From left to right: Prediction Models (e.g., FinBERT, StockBERT) perceive market information but lack decision-making capabilities; Signal Generators (e.g., AlphaGen, FactorMiner) add decision layers but remain non-executing; Partial Agents (e.g., FinVis-GPT without execution) incorporate memory but lack action modules; Trading Agents (e.g., FinAgent, TradingAgents) …
Figure 2: Reasoning flow diagram. The agent receives input from perception and memory, applies reasoning mechanisms to generate candidate actions, evaluates these actions through forward planning or reflection, selects the best action, and executes it through the action module. Feedback loops enable learning and adaptation over time. This figure is schematic and is not used for evidence-mapping statistics or protoco…
Figure 3: Three perceptual modalities in agentic trading. Text-based perception processes linguistic information from news and reports. Time-series perception analyzes numerical patterns in price and volume data. Multimodal perception integrates heterogeneous data sources through cross-modal fusion mechanisms. This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 4: Multimodal fusion flow in agentic trading. The diagram illustrates how text encoders and time-series encoders process their respective inputs before a cross-modal fusion layer integrates the representations into a unified market understanding. This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 5: Three memory systems in agentic trading. Working memory maintains immediate context (capacity: limited, duration: seconds-minutes). Episodic memory stores trading episodes (capacity: large, duration: long-term). Semantic memory encodes financial knowledge (capacity: very large, duration: permanent). This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 6: Memory retrieval mechanism. The agent formulates a query based on current context, searches the episodic memory database using similarity-based retrieval, and selects relevant episodes to inform the current decision. This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 7: Three reasoning paradigms compared by time-scale. Reactive reasoning prioritizes speed over accuracy, executing in milliseconds. Reflective reasoning balances speed and accuracy through multi-step reasoning. Strategic reasoning prioritizes accuracy through extensive planning, executing over minutes to hours. This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 8: From decision to execution. Agent outputs map into order parameters and are executed with explicit cost and impact modeling under market microstructure constraints. This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 9: Three alpha discovery paradigms. Code-based discovery translates natural language hypotheses into executable code. Retrieval-based discovery finds and adapts existing factors from a library. Evolutionary discovery searches the factor space through mutation and selection. This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 10: Chain-of-Alpha workflow. The generation chain (top) translates natural language into factor code. The validation chain (bottom) backtests the factor, computes performance metrics, and provides feedback for refinement. The two chains interact iteratively until a satisfactory factor is produced. This figure is schematic and is not used for evidence-mapping statistics or protocol comparison.
Figure 11: Risk control timeline across the trading lifecycle. Pre-trade risk control sets limits and checks before orders are submitted. Real-time risk control monitors positions and markets during trading, intervening when thresholds are breached. Post-trade risk analysis analyzes outcomes to improve future risk management. This figure is schematic and is not used for evidence-mapping statistics or protocol compar…
Figure 12: Schematic comparison card for three learning paradigms. The panels summarize what each loop updates, the main trade-offs emphasized in the design literature, and illustrative use cases. The figure is conceptual only and is not used for empirical comparison.
Figure 13: Three coordination patterns in multi-agent trading. Role-based collaboration assigns specialized roles (analyst, trader, risk manager) to different agents. Hierarchical organization layers agents from strategic (asset allocation) to tactical (position sizing) to execution (individual trades). Market ecology models agents interacting through markets with feedback and emergence. The structure/example/illust…
Figure 14: Market ecology interaction network. Agents interact both with one another and through the market environment, where prices, spreads, quotes, and fills feed back into subsequent behavior. The figure highlights reflexive feedback and competition channels that can support exploratory study of crowding, instability, and adaptation. This figure is schematic and is not used for evidence-mapping statistics or pr…
Figure 15: Conceptual governed update loop. Agents accumulate experience through trading, consolidate important patterns into memory, apply meta-learning-like update policies, and use reflection to critique reasoning under logging and rollback constraints. The figure is schematic, representing a protocol-constrained loop rather than an empirical proof of autonomous improvement, and is not used for evidence-mapping s…
Original abstract

A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM-based trading agents as expert-system decision pipelines and presents an audit-oriented evidence map of 77 included studies in a protocol-coded snapshot screened through 2026-03-09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed-Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript audits 77 studies on LLM agents in trading systems, screened through 2026-03-09. It isolates a primary subset of 19 studies that meet the criteria of Action Output plus Closed-Loop Evaluation and reports protocol incomparability within that subset: 2/19 report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and none reaches R3 reproducibility. The paper positions Architecture-Capability-Adaptation as an analytical lens and contributes an evidence ledger, a reproducibility audit, and a reporting checklist.

Significance. If the screening and coding hold, the audit usefully documents evaluation gaps in an emerging area where architectural work is expanding quickly. The reproducibility audit and checklist offer concrete tools that could raise standards for comparability and artifact sharing, which are currently the field's main bottlenecks.

major comments (2)
  1. [Methods / Screening Protocol] Methods (screening and coding protocol): The exact operational definition of 'Action Output plus Closed-Loop Evaluation' used to select the n=19 primary subset from 77 studies, together with the full coding rubric for time-consistent splits, transaction-cost models, universe handling, and R-level reproducibility, must be supplied (including any inter-rater checks). These details are load-bearing for the headline fractions (2/19, 1/19, 0/19 for R3) that support the central claim of protocol incomparability.
  2. [Results / Primary Subset Analysis] Results (primary-subset counts): The term 'extractable' in the statement that only 2/19 studies report extractable time-consistent split protocols requires an explicit definition and at least one worked example from the coded studies. Without it, the 2/19 figure cannot be independently verified and the incomparability conclusion rests on an opaque threshold.
minor comments (2)
  1. [Abstract] Abstract: the screening date '2026-03-09' appears to be in the future; confirm whether this is a typographical error or a planned cutoff.
  2. [Figure 1] Figure 1 (publication timeline): add axis labels and a legend distinguishing the primary subset from the background studies for immediate readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and valuable suggestions. We respond to each major comment in turn and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods / Screening Protocol] Methods (screening and coding protocol): The exact operational definition of 'Action Output plus Closed-Loop Evaluation' used to select the n=19 primary subset from 77 studies, together with the full coding rubric for time-consistent splits, transaction-cost models, universe handling, and R-level reproducibility, must be supplied (including any inter-rater checks). These details are load-bearing for the headline fractions (2/19, 1/19, 0/19 for R3) that support the central claim of protocol incomparability.

    Authors: We agree that these methodological details are essential for allowing readers to verify our audit process and conclusions. In the revised manuscript, we will expand the Methods section to provide the precise operational definition of 'Action Output plus Closed-Loop Evaluation', which we used to filter the primary subset of 19 studies from the 77 included ones. We will also include the complete coding rubric as an appendix, detailing the criteria applied for assessing time-consistent splits, transaction-cost models, universe and survivorship handling, execution timing, and the R-level reproducibility scale. For inter-rater reliability, the coding was performed primarily by the first author with review by co-authors for ambiguous cases, and we will document this process explicitly in the revision. These changes will make the headline statistics fully traceable. revision: yes

  2. Referee: [Results / Primary Subset Analysis] Results (primary-subset counts): The term 'extractable' in the statement that only 2/19 studies report extractable time-consistent split protocols requires an explicit definition and at least one worked example from the coded studies. Without it, the 2/19 figure cannot be independently verified and the incomparability conclusion rests on an opaque threshold.

    Authors: We concur that an explicit definition of 'extractable' is necessary to support independent verification. We define 'extractable time-consistent split protocols' as reporting practices that include enough specific information—such as exact date ranges, period lengths, or references to code that implements the split—to enable a reader to recreate the same temporal division without introducing lookahead bias or ambiguity. In the revised manuscript, we will add this definition to the Results section. Additionally, we will provide a worked example from one of the two studies that met this standard, illustrating how their reported split protocol satisfies the criteria. This will clarify the basis for the 2/19 count and reinforce the protocol incomparability finding. revision: yes
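The definition of an extractable time-consistent split given in the response above can be illustrated with a small sketch: a split stated as explicit cutoff dates, plus a check that no earlier segment overlaps a later one. The cutoff dates, function names, and the ten-day calendar are hypothetical and do not come from any of the audited studies.

```python
from datetime import date, timedelta


def chronological_split(dates, train_end, validation_end):
    """Split dates into train/validation/test using explicit cutoff dates."""
    train = [d for d in dates if d <= train_end]
    validation = [d for d in dates if train_end < d <= validation_end]
    test = [d for d in dates if d > validation_end]
    return train, validation, test


def is_time_consistent(train, validation, test):
    """True if every date in an earlier segment precedes every date in a later one."""
    segments = [s for s in (train, validation, test) if s]
    return all(max(a) < min(b) for a, b in zip(segments, segments[1:]))


# Hypothetical calendar: 2020-01-01 through 2020-01-10.
start = date(2020, 1, 1)
dates = [start + timedelta(days=i) for i in range(10)]

train, validation, test = chronological_split(
    dates, train_end=date(2020, 1, 6), validation_end=date(2020, 1, 8)
)
print(len(train), len(validation), len(test))
print(is_time_consistent(train, validation, test))

# An interleaved split mixes later dates into training, so the check fails.
interleaved_ok = is_time_consistent(dates[::2], [], dates[1::2])
print(interleaved_ok)
```

A report that states the two cutoff dates, or links to code equivalent to `chronological_split`, would give a reader enough to recreate the same temporal division, which is the sense of "extractable" described in the response.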

Circularity Check

0 steps flagged

Observational audit of external literature with no self-referential derivation

Full rationale

The paper conducts a protocol-coded audit of 77 external studies on LLM trading agents, identifying a primary subset of 19 and reporting direct counts on their evaluation practices (e.g., 2/19 with time-consistent splits). No equations, parameter fits, predictions, or self-citations are invoked to derive these counts; findings are presented as empirical observations from screened literature. The methodology relies on external benchmarks (published papers) rather than reducing to the paper's own inputs by construction. This matches the default case of a self-contained observational study against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The audit relies on standard systematic-review methods for screening and coding; no free parameters, new entities, or ad-hoc axioms beyond domain assumptions about literature-review reliability are introduced.

axioms (1)
  • domain assumption: A protocol-coded snapshot screened through a fixed date can reliably classify studies into primary empirical and background categories without material selection bias.
    The paper uses this classification to isolate the n=19 subset whose reporting practices are quantified.

pith-pipeline@v0.9.0 · 5782 in / 1233 out tokens · 47315 ms · 2026-05-20T05:59:44.499950+00:00 · methodology

