arxiv: 2605.15412 · v1 · pith:KATRKKFLnew · submitted 2026-05-14 · 💻 cs.CE · cs.AI · cs.CL

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

Pith reviewed 2026-05-19 14:39 UTC · model grok-4.3

classification 💻 cs.CE · cs.AI · cs.CL
keywords alpha factor discovery · reinforcement fine-tuning · large language models · quantitative trading · Factor DSL · Regime Backtest · Diversity-Complementarity Reward

The pith

Reinforcement fine-tuning converts quantitative evaluations into policy updates so an LLM internalizes alpha factor optimization experience instead of accumulating prompt feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing LLM methods for alpha factor discovery rely on generation-evaluation-feedback loops that grow unwieldy, causing context explosion, higher costs, information dilution, and search stagnation. QuantEvolver replaces loop accumulation with reinforcement fine-tuning: it turns executable evaluation results into direct policy updates for a Miner LLM. The framework builds seed factors and diverse training tasks, generates Factor DSL expressions, scores them via Regime Backtest, and applies a Diversity-Complementarity Reward during optimization. Factors are stored in a growing Mined Factor Database. Experiments on three market benchmarks show consistent gains in primary metrics and more complementary factor pools compared with prior LLM baselines.

Core claim

QuantEvolver is a self-evolving framework that constructs high-quality seed factors, builds diverse seed-time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. High-quality factors are continuously accumulated in a Mined Factor Database that serves as the final discovered factor library. By converting quantitative evaluation results into reinforcement policy updates rather than appending feedback to prompts, the Miner LLM internalizes historical optimization experience through parameter learning.

What carries the argument

Reinforcement fine-tuning that converts executable quantitative evaluation results into policy updates for the Miner LLM.
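
To make the contrast with prompt-level feedback concrete, here is a minimal toy sketch. It is not the paper's training procedure: a softmax policy over three fixed candidate expressions is updated with the exact policy gradient, using made-up backtest scores as rewards. The candidate strings, the scores, the learning rate, and the function names are all illustrative assumptions; the real Miner LLM is a language model whose update rule is not specified in the material above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact policy-gradient step for a softmax policy.

    For a softmax policy, d E[reward] / d logit_k = pi_k * (r_k - E[reward]),
    so the expected reward acts as the baseline.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [x + lr * p * (r - expected) for x, p, r in zip(logits, probs, rewards)]

# Hypothetical candidate expressions with made-up evaluation scores.
candidates = ["rank(close)", "rank(volume)", "rank(close) - rank(open)"]
scores = [0.1, 0.5, 0.9]

logits = [0.0, 0.0, 0.0]  # start from a uniform policy
for _ in range(200):
    logits = policy_gradient_step(logits, scores)

probs = softmax(logits)
best = candidates[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

In this toy setting the evaluation history ends up stored in the policy parameters (the logits) rather than in a growing prompt, which is the idea the section attributes to reinforcement fine-tuning.
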

If this is right

  • Consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines.
  • Produces higher-quality and more complementary factor pools.
  • Avoids context explosion, increased inference cost, and feedback drift that arise from long prompt-level loops.
  • Enables continuous accumulation of usable factors in the Mined Factor Database during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smaller LLMs may become viable for factor discovery once they learn stable preferences through reinforcement updates rather than depending on the generation stability of very large models.
  • The same conversion of quantifiable feedback into policy updates could apply to other automated discovery problems where evaluation metrics exist.
  • Diverse regime-based training tasks may improve robustness when deployed on market conditions that differ from those seen in backtests.
  • The mined factor library could serve as a reusable asset for downstream portfolio construction or risk modeling.

Load-bearing premise

Converting executable quantitative evaluation results into reinforcement policy updates allows the Miner LLM to internalize historical optimization experience without introducing new biases or failing to generalize beyond the regime backtests used during training.

What would settle it

Evaluate the trained Miner LLM on out-of-sample market data from regimes absent from the seed-time-window training tasks and check whether alpha factor quality or complementarity falls below prompt-based baselines.

Figures

Figures reproduced from arXiv: 2605.15412 by Chiming Duan, Lingzhe Zhang, Minghua He, Philip S. Yu, Tong Jia, Ying Li, Yunpeng Zhai, Zixuan Xie.

Figure 1
LLM-Based Alpha Factor Discovery: From Feedback Loops to Policy
Figure 2
Overall Framework of QUANTEVOLVER. Third, QUANTEVOLVER removes redundant candidates to avoid constructing a seed pool dominated by duplicated or near-identical expressions. In implementation, candidates are ranked by their empirical scores and greedily selected using canonical expression signatures, such as normalized AST hashes. A candidate is retained only if it passes the quality filter and its canonica…
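
The deduplication step described in the Figure 2 excerpt (rank by score, then greedily keep candidates with unseen canonical signatures) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes factor expressions happen to be parseable as Python expressions, treats `+` and `*` as commutative, and uses a hypothetical `min_score` threshold as the quality filter. The names `signature` and `select_seeds` are invented for this example.

```python
import ast
import hashlib

COMMUTATIVE = (ast.Add, ast.Mult)  # assumption: element-wise + and * commute

class _Canonicalize(ast.NodeTransformer):
    """Order the operands of commutative operators so a+b and b+a match."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # canonicalize children first
        if isinstance(node.op, COMMUTATIVE):
            node.left, node.right = sorted((node.left, node.right), key=ast.dump)
        return node

def signature(expr: str) -> str:
    """Hash of the normalized AST; whitespace and extra parentheses vanish."""
    tree = _Canonicalize().visit(ast.parse(expr, mode="eval"))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

def select_seeds(candidates, min_score):
    """Greedy selection over (expression, score) pairs, best score first."""
    kept, seen = [], set()
    for expr, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score < min_score:
            continue  # fails the (hypothetical) quality filter
        sig = signature(expr)
        if sig in seen:
            continue  # structurally identical to a better-scored candidate
        seen.add(sig)
        kept.append((expr, score))
    return kept

# Made-up candidates and scores, for illustration only.
candidates = [
    ("rank(close) + rank(volume)", 0.05),
    ("rank(volume)+rank(close)", 0.04),        # operand order swapped
    ("( rank(close) + rank(volume) )", 0.03),  # redundant parentheses
    ("delta(close, 5)", 0.02),
    ("rank(open)", 0.001),                     # below the threshold
]
kept = select_seeds(candidates, min_score=0.01)
print([expr for expr, _ in kept])
```

This normalization only reorders the two operands of each commutative node; it does not flatten associative chains, so `(a + b) + c` and `(c + b) + a` would still receive different signatures.
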
Figure 4
Figure 4 provides a closer look at the mining dynamics of QUANTEVOLVER and its ablated variants on Dataset B.
Figure 5
Figure 5 presents a profitability case study on Benchmark B by converting discovered cross-sectional factors into a simple long–short portfolio. At each rebalancing timestamp, assets are ranked by the factor signal, and the portfolio takes long positions in the top-ranked assets and short positions in the bottom-ranked assets. This experiment is not intended to be a fully optimized trading system; instead, it serve…
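
The long–short construction in the Figure 5 excerpt can be sketched for a single rebalancing timestamp as below. Equal weighting within each leg, the parameter `k` (number of assets per leg), and the sample data are assumptions made for illustration; the excerpt does not state how positions are sized or how many assets enter each leg.

```python
def long_short_return(signal, next_returns, k):
    """Return of an equal-weight long-short portfolio for one rebalance.

    signal:       dict mapping asset -> factor value at the rebalance time
    next_returns: dict mapping asset -> return over the following period
    k:            number of assets in each leg (hypothetical parameter)
    """
    assets = [a for a in signal if a in next_returns]
    if k <= 0 or len(assets) < 2 * k:
        raise ValueError("need at least 2*k assets with both signal and return")
    ranked = sorted(assets, key=lambda a: signal[a], reverse=True)
    longs, shorts = ranked[:k], ranked[-k:]
    long_leg = sum(next_returns[a] for a in longs) / k
    short_leg = sum(next_returns[a] for a in shorts) / k
    return long_leg - short_leg

# Made-up cross-section of four assets.
signal = {"A": 0.9, "B": 0.4, "C": -0.1, "D": -0.8}
next_returns = {"A": 0.02, "B": 0.01, "C": 0.00, "D": -0.01}

pnl_k1 = long_short_return(signal, next_returns, k=1)  # long A, short D
pnl_k2 = long_short_return(signal, next_returns, k=2)  # long A,B; short C,D
print(pnl_k1, pnl_k2)
```

The sketch ignores transaction costs, borrowing constraints, and position limits, consistent with the excerpt's remark that the case study is not a fully optimized trading system.
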
Original abstract

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation–evaluation–feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed–time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of QuantEvolver, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes QuantEvolver, a self-evolving alpha factor discovery framework that replaces prompt-based feedback loops with reinforcement fine-tuning of a Miner LLM. Executable regime backtest results are converted into policy updates using a Diversity-Complementarity Reward on seed–time-window training tasks; high-quality factors are accumulated in a Mined Factor Database that serves as the final library. The central empirical claim is that this approach yields consistent improvements in the primary evaluation metric of each task over existing LLM-based baselines on three realistic market benchmarks while producing higher-quality and more complementary factor pools.

Significance. If the empirical results prove robust, the work could meaningfully advance automated quantitative factor discovery by mitigating context explosion, inference cost, and search stagnation that arise in long prompt-based loops. The shift from accumulating historical feedback in context to parameter-level internalization via RL is a conceptually clean idea that, if validated, would improve scalability and diversity in LLM-driven alpha generation.

major comments (2)
  1. [Method (training task construction and reward optimization)] The central claim that policy updates from regime backtests enable the Miner LLM to internalize transferable optimization experience (rather than memorizing historical patterns) is load-bearing yet lacks any described safeguard such as adversarial regime construction, causal regularization, or strict forward-chaining validation. When training tasks are built from specific seed–time-window pairs on historical data, overlap or statistical similarity with the three evaluation benchmarks could produce the reported metric gains through distribution matching rather than genuine discovery.
  2. [Experiments and results] The experimental claim of consistent primary-metric improvements and higher-quality complementary pools is unsupported by visible details on the exact metrics, chosen baselines, statistical significance tests, data-split protocols, or explicit overfitting controls. Without these, it is impossible to determine whether observed gains exceed what would be expected from database exploitation or regime-specific fitting.
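
The "strict forward-chaining validation" named in major comment 1 can be illustrated with a small rolling-window splitter. This is a generic sketch of the technique, not a protocol taken from the manuscript; the window sizes and the `gap` (an embargo between training and evaluation periods, useful when labels are forward returns) are hypothetical parameters.

```python
def forward_chaining_splits(n_periods, train_size, test_size, gap=0):
    """Yield (train_indices, test_indices) over a time-ordered index 0..n-1.

    Within each split, every test index comes strictly after every train
    index, separated by `gap` unused periods.
    """
    start = 0
    while start + train_size + gap + test_size <= n_periods:
        train = list(range(start, start + train_size))
        test_start = start + train_size + gap
        test = list(range(test_start, test_start + test_size))
        yield train, test
        start += test_size  # roll the window forward by one test block

splits = list(forward_chaining_splits(n_periods=10, train_size=4, test_size=2, gap=1))
for train, test in splits:
    print("train", train, "test", test)
```

A later split may train on periods that an earlier split used for testing; what the scheme rules out is training on data from the future of the window being evaluated.
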
minor comments (2)
  1. [Abstract] The abstract refers to “three realistic market benchmarks” and “the primary evaluation metric of each task” without naming either; adding these specifics would immediately improve readability and allow readers to assess relevance.
  2. [Method] Notation for the Factor DSL and the precise definition of the Diversity-Complementarity Reward would benefit from an explicit equation or pseudocode block to avoid ambiguity when readers attempt to reproduce the training objective.
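
As an indication of the kind of pseudocode minor comment 2 asks for, here is one hypothetical shape a reward balancing predictiveness against redundancy could take. It is explicitly not the paper's Diversity-Complementarity Reward, whose definition is not given in the material above: the use of absolute Pearson correlation, the max-over-pool redundancy term, and the `penalty` weight are all assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def toy_reward(candidate, forward_returns, pool, penalty=0.5):
    """Hypothetical reward: predictive strength minus a redundancy penalty.

    candidate:       factor values across assets
    forward_returns: realized next-period returns for the same assets
    pool:            factor value lists already accepted (may be empty)
    """
    predictive = abs(pearson(candidate, forward_returns))
    redundancy = max((abs(pearson(candidate, f)) for f in pool), default=0.0)
    return predictive - penalty * redundancy

# Made-up values: the candidate predicts returns perfectly.
candidate = [1.0, 2.0, 3.0, 4.0]
returns = [1.0, 2.0, 3.0, 4.0]

r_empty = toy_reward(candidate, returns, pool=[])           # nothing to overlap with
r_dup = toy_reward(candidate, returns, pool=[candidate])    # duplicates a pool factor
print(r_empty, r_dup)
```

Writing the objective out at this level of detail makes the trade-off weight and the similarity measure visible, which is the ambiguity the referee's comment points to.
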

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below in a point-by-point manner and indicate the revisions we will make to improve clarity and address the raised concerns.

  1. Referee: [Method (training task construction and reward optimization)] The central claim that policy updates from regime backtests enable the Miner LLM to internalize transferable optimization experience (rather than memorizing historical patterns) is load-bearing yet lacks any described safeguard such as adversarial regime construction, causal regularization, or strict forward-chaining validation. When training tasks are built from specific seed–time-window pairs on historical data, overlap or statistical similarity with the three evaluation benchmarks could produce the reported metric gains through distribution matching rather than genuine discovery.

    Authors: We appreciate the referee's emphasis on ensuring that observed improvements reflect genuine policy learning rather than data leakage. The manuscript constructs training tasks from diverse seed–time-window pairs explicitly chosen to span distinct market regimes, with the Regime Backtest evaluating executable expressions on forward periods. The Diversity-Complementarity Reward is designed to promote exploration of novel factor structures. However, we acknowledge that the current description of safeguards could be more explicit. In the revised manuscript, we will add a subsection in the Method section detailing the temporal partitioning protocol, including how seed–time-window pairs are selected to avoid overlap with evaluation benchmarks, along with forward-chaining validation steps and regime diversity metrics used during task construction.

    Revision: yes

  2. Referee: [Experiments and results] The experimental claim of consistent primary-metric improvements and higher-quality complementary pools is unsupported by visible details on the exact metrics, chosen baselines, statistical significance tests, data-split protocols, or explicit overfitting controls. Without these, it is impossible to determine whether observed gains exceed what would be expected from database exploitation or regime-specific fitting.

    Authors: We agree that greater transparency on experimental protocols is essential for validating the claims. The primary metrics are the Information Coefficient (IC) and Sharpe ratio, with baselines consisting of prompt-based LLM methods (e.g., AlphaGen-style loops) and non-LLM approaches such as genetic programming. Statistical significance is evaluated using paired t-tests and bootstrap resampling across multiple random seeds. Data splits follow a strict temporal protocol with training tasks drawn from earlier periods and evaluation on later out-of-sample windows across the three benchmarks, and overfitting is mitigated via validation-set monitoring of the diversity reward and factor novelty. We will expand the Experiments section with these details, including explicit tables for p-values, ablation studies on the reward function, and descriptions of the data-split and control procedures.

    Revision: yes
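
Two of the quantities named in the second response, the Information Coefficient and the paired t-test, can be sketched in a few lines. The rank (Spearman) variant of IC is used here as an assumption, since the response does not say whether IC is rank-based; the sample numbers are made up and the function names are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_ic(factor_values, forward_returns):
    """Spearman correlation between factor values and next-period returns."""
    return pearson(average_ranks(factor_values), average_ranks(forward_returns))

def paired_t_statistic(a, b):
    """t statistic for paired samples, e.g. per-seed metrics of two methods."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n)

ic_up = rank_ic([0.1, 0.5, 0.3], [0.01, 0.09, 0.02])  # same ordering
ic_down = rank_ic([1, 2, 3], [3, 2, 1])               # reversed ordering
t_stat = paired_t_statistic([0.05, 0.06, 0.07], [0.04, 0.04, 0.04])
print(ic_up, ic_down, t_stat)
```

Turning the t statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom, which is omitted here to keep the sketch dependency-free.
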

Circularity Check

0 steps flagged

No circularity: claims rest on external benchmark experiments

The paper's central claims are framed as empirical outcomes from experiments on three realistic market benchmarks, where QuantEvolver improves primary metrics over LLM-based baselines and yields higher-quality complementary factors. The abstract describes converting evaluation results into policy updates for the Miner LLM and using a Diversity-Complementarity Reward, but presents no equations, derivations, or self-referential definitions that reduce these improvements to fitted parameters or inputs by construction. Training tasks are built from seed-time-window pairs and factors are accumulated in a database, yet the reported gains are positioned as results of external comparative evaluation rather than tautological renaming or self-citation chains. This structure keeps the derivation self-contained against benchmarks, consistent with a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on standard assumptions about RL fine-tuning effectiveness for code-like generation tasks and the representativeness of regime backtests for real market performance; no explicit free parameters or invented entities beyond the framework name are detailed in the abstract.

axioms (1)
  • Domain assumption: Regime backtests provide reliable signals for training an LLM to generate generalizable alpha factors.
    Invoked when the method evaluates generated factors and uses results for policy updates.
invented entities (1)
  • Miner LLM (no independent evidence)
    Purpose: Generates executable Factor DSL expressions optimized via RL.
    Core component of the proposed framework.

pith-pipeline@v0.9.0 · 5842 in / 1216 out tokens · 64626 ms · 2026-05-19T14:39:02.202456+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel

    Relation between the paper passage and the cited Recognition theorem: unclear

    Instead of accumulating feedback in the prompt, QUANTEVOLVER converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning... optimizes the Miner LLM with Diversity-Complementarity Reward.

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection

    Relation between the paper passage and the cited Recognition theorem: unclear

    DiCo Reward... encourages the policy to generate factors that are not only predictive, but also structurally diverse, behaviorally distinct, and complementary to existing candidates.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 10 internal anchors

  1. [1]

    AutoAlpha : an efficient hierarchical evolutionary algorithm for mining alpha factors in quantitative investment, 2020

    T. Zhang, Y . Li, Y . Jin, and J. Li, “Autoalpha: an efficient hierarchical evolutionary algorithm for mining alpha factors in quantitative invest- ment,” arXiv preprint arXiv:2002.08245, 2020

  2. [2]

    Alpha mining and enhancing via warm start genetic programming for quantitative investment,

    W. Ren, Y . Qin, and Y . Li, “Alpha mining and enhancing via warm start genetic programming for quantitative investment,” arXiv preprint arXiv:2412.00896, 2024

  3. [3]

    101 formulaic alphas,

    Z. Kakushadze, “101 formulaic alphas,” Wilmott, vol. 2016, no. 84, pp. 72–81, 2016

  4. [4]

    Multiple regression genetic programming,

    I. Arnaldo, K. Krawiec, and U.-M. O’Reilly, “Multiple regression genetic programming,” in Proceedings of the 2014 annual conference on genetic and evolutionary computation, 2014, pp. 879–886

  5. [5]

    Alpha discovery via grammar-guided learning and search,

    H. Yang, D. Hao, Z. Wang, Q. Shi, and X. Li, “Alpha discovery via grammar-guided learning and search,” arXiv preprint arXiv:2601.22119, 2026

  6. [6]

    Riskminer: Discovering formulaic alphas via risk seeking monte carlo tree search,

    T. Ren, R. Zhou, J. Jiang, J. Liang, Q. Wang, and Y . Peng, “Riskminer: Discovering formulaic alphas via risk seeking monte carlo tree search,” in Proceedings of the 5th ACM International Conference on AI in Finance, 2024, pp. 752–760

  7. [7]

    Generating synergistic formulaic alpha collections via reinforcement learning,

    S. Yu, H. Xue, X. Ao, F. Pan, J. He, D. Tu, and Q. He, “Generating synergistic formulaic alpha collections via reinforcement learning,” in Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, 2023, pp. 5476–5486

  8. [8]

    \ text\ Alpha \ 2\ : Discovering Logical Formulaic Alphas using Deep Reinforcement Learning , June 2024

    F. Xu, Y . Yin, X. Zhang, T. Liu, S. Jiang, and Z. Zhang, “Alpha2: Discovering logical formulaic alphas using deep reinforcement learning,” arXiv preprint arXiv:2406.16505, 2024

  9. [9]

    Alphaqcm: Alpha discovery in finance with distribu- tional reinforcement learning,

    Z. Zhu and K. Zhu, “Alphaqcm: Alpha discovery in finance with distribu- tional reinforcement learning,” in Forty-second International Conference on Machine Learning, 2025

  10. [10]

    Alphaforge: A framework to mine and dynamically combine formulaic alpha factors,

    H. Shi, W. Song, X. Zhang, J. Shi, C. Luo, X. Ao, H. Arian, and L. A. Seco, “Alphaforge: A framework to mine and dynamically combine formulaic alpha factors,” in Proceedings of the AAAI conference on artificial intelligence, vol. 39, no. 12, 2025, pp. 12 524–12 532

  11. [11]

    Alphasage: Structure-aware alpha mining via gflownets for robust exploration,

    B. Chen, H. Ding, N. Shen, J. Huang, T. Guo, L. Liu, and M. Zhang, “Alphasage: Structure-aware alpha mining via gflownets for robust exploration,” arXiv preprint arXiv:2509.25055, 2025

  12. [12]

    A survey of aiops in the era of large language models,

    L. Zhang, T. Jia, M. Jia, Y . Wu, A. Liu, Y . Yang, Z. Wu, X. Hu, P. Yu, and Y . Li, “A survey of aiops in the era of large language models,”ACM Computing Surveys, 2025

  13. [13]

    E-log: Fine-grained elastic log-based anomaly detection and diagnosis for databases,

    L. Zhang, T. Jia, X. Tan, X. Huang, M. Jia, H. Liu, Z. Wu, and Y . Li, “E-log: Fine-grained elastic log-based anomaly detection and diagnosis for databases,” IEEE Transactions on Services Computing, 2025

  14. [14]

    Towards close-to-zero runtime collection overhead: Raft-based anomaly diagnosis on system faults for distributed storage system,

    L. Zhang, T. Jia, M. Jia, H. Liu, Y . Yang, Z. Wu, and Y . Li, “Towards close-to-zero runtime collection overhead: Raft-based anomaly diagnosis on system faults for distributed storage system,” IEEE Transactions on Services Computing, 2024

  15. [15]

    Multivariate log- based anomaly detection for distributed database,

    L. Zhang, T. Jia, M. Jia, Y . Li, Y . Yang, and Z. Wu, “Multivariate log- based anomaly detection for distributed database,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 4256–4267

  16. [16]

    Reducing events to augment log-based anomaly detection models: An empirical study,

    L. Zhang, T. Jia, K. Wang, M. Jia, Y . Yang, and Y . Li, “Reducing events to augment log-based anomaly detection models: An empirical study,” in Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2024, pp. 538– 548

  17. [17]

    Scalalog: Scalable log-based failure diagnosis using llm,

    L. Zhang, T. Jia, M. Jia, Y . Wu, H. Liu, and Y . Li, “Scalalog: Scalable log-based failure diagnosis using llm,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  18. [18]

    Agentfm: Role-aware failure management for distributed databases with llm-driven multi-agents,

    L. Zhang, Y . Zhai, T. Jia, X. Huang, C. Duan, and Y . Li, “Agentfm: Role-aware failure management for distributed databases with llm-driven multi-agents,” arXiv preprint arXiv:2504.06614, 2025

  19. [19]

    Thinkfl: Self-refining failure localization for microservice sys- tems via reinforcement fine-tuning,

    L. Zhang, Y . Zhai, T. Jia, C. Duan, S. Yu, J. Gao, B. Ding, Z. Wu, and Y . Li, “Thinkfl: Self-refining failure localization for microservice sys- tems via reinforcement fine-tuning,” arXiv preprint arXiv:2504.18776, 2025

  20. [20]

    Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices,

    L. Zhang, T. Jia, Y . Zhai, L. Pan, C. Duan, M. He, M. Jia, and Y . Li, “Agentic memory enhanced recursive reasoning for root cause localization in microservices,” arXiv preprint arXiv:2601.02732, 2026

  21. [21]

    Logdb: Multivariate log-based failure diagnosis for distributed databases (extended from multilog),

    L. Zhang, T. Jia, M. Jia, and Y . Li, “Logdb: Multivariate log-based failure diagnosis for distributed databases (extended from multilog),” arXiv preprint arXiv:2505.01676, 2025

  22. [22]

    Xraglog: A resource- efficient and context-aware log-based anomaly detection method using retrieval-augmented generation,

    L. Zhang, T. Jia, M. Jia, Y . Wu, H. Liu, and Y . Li, “Xraglog: A resource- efficient and context-aware log-based anomaly detection method using retrieval-augmented generation,” inAAAI 2025 Workshop on Preventing and Detecting LLM Misinformation (PDLM), 2025

  23. [23]

    A survey on parallel text generation: From parallel decoding to diffusion language models.arXiv preprint arXiv:2508.08712, 2025

    L. Zhang, L. Fang, C. Duan, M. He, L. Pan, P. Xiao, S. Huang, Y . Zhai, X. Hu, P. S. Yu et al., “A survey on parallel text generation: From parallel decoding to diffusion language models,” arXiv preprint arXiv:2508.08712, 2025

  24. [24]

    Time-tired compaction: An elastic compaction scheme for lsm-tree based time-series database,

    L.-Z. Zhang, X.-D. Huang, Y .-K. Wang, J.-L. Qiao, S.-X. Song, and J.- M. Wang, “Time-tired compaction: An elastic compaction scheme for lsm-tree based time-series database,”Advanced Engineering Informatics, vol. 59, p. 102224, 2024

  25. [25]

    Separation or not: On handing out-of-order time-series data in leveled lsm-tree,

    Y . Kang, X. Huang, S. Song, L. Zhang, J. Qiao, C. Wang, J. Wang, and J. Feinauer, “Separation or not: On handing out-of-order time-series data in leveled lsm-tree,” in2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2022, pp. 3340–3352

  26. [26]

    Adaptive root cause localization for microservice systems with multi-agent recursion-of-thought.arXiv preprint arXiv:2508.20370, 2025

    L. Zhang, T. Jia, K. Wang, W. Hong, C. Duan, M. He, and Y . Li, “Adaptive root cause localization for microservice systems with multi- agent recursion-of-thought,” arXiv preprint arXiv:2508.20370, 2025

  27. [27]

    Ora: Job runtime prediction for high-performance computing platforms using the online retrieval-augmented language model,

    H. Liu, Y . Ma, X. Huang, L. Zhang, T. Jia, and Y . Li, “Ora: Job runtime prediction for high-performance computing platforms using the online retrieval-augmented language model,” in Proceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 884–894

  28. [28]

    Microremed: Benchmarking llms in microservices remediation,

    L. Zhang, Y . Zhai, T. Jia, C. Duan, M. He, L. Pan, Z. Liu, B. Ding, and Y . Li, “Microremed: Benchmarking llms in microservices remediation,” arXiv preprint arXiv:2511.01166, 2025

  29. [29]

    arXiv preprint arXiv:2508.07173 , year=

    L. Pan, Z. Fu, Y . Zhai, S. Tao, S. Guan, S. Huang, L. Zhang, Z. Liu, B. Ding, F. Henry et al., “Omni-safetybench: A benchmark for safety evaluation of audio-visual large language models,” arXiv preprint arXiv:2508.07173, 2025

  30. [30]

    Walk the talk: Is your log-based software reliability maintenance system really reliable?

    M. He, T. Jia, C. Duan, P. Xiao, L. Zhang, K. Wang, Y . Wu, Y . Li, and G. Huang, “Walk the talk: Is your log-based software reliability maintenance system really reliable?” arXiv preprint arXiv:2509.24352, 2025

  31. [31]

    d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

    L. Pan, S. Tao, Y . Zhai, Z. Fu, L. Fang, M. He, L. Zhang, Z. Liu, B. Ding, A. Liu et al., “d-treerpo: Towards more reliable policy optimization for diffusion language models,” arXiv preprint arXiv:2512.09675, 2025

  32. [32]

    Cslparser: A collaborative framework using small and large language models for log parsing,

    W. Hong, Y . Wu, L. Zhang, C. Duan, P. Xiao, M. He, X. Yang, and Y . Li, “Cslparser: A collaborative framework using small and large language models for log parsing,” in 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2025, pp. 61–72

  33. [33]

    United we stand: Towards end-to-end log- based fault diagnosis via interactive multi-task learning,

    M. He, C. Duan, P. Xiao, T. Jia, S. Yu, L. Zhang, W. Hong, J. Han, Y . Wu, Y . Li et al., “United we stand: Towards end-to-end log- based fault diagnosis via interactive multi-task learning,” arXiv preprint arXiv:2509.24364, 2025. 13

  34. [34]

    Hypothesize-then-verify: Speculative root cause analysis for microser- vices with pathwise parallelism,

    L. Zhang, T. Jia, Y . Zhai, L. Pan, C. Duan, M. He, P. Xiao, and Y . Li, “Hypothesize-then-verify: Speculative root cause analysis for microser- vices with pathwise parallelism,” arXiv preprint arXiv:2601.02736, 2026

  35. [35]

    Uda-rcl: Unsupervised domain adaptation for microservice root cause localization utilizing multimodal data,

    X. Huang, H. Liu, Y . Wu, L. Zhang, T. Jia, Y . Li, and Z. Wu, “Uda-rcl: Unsupervised domain adaptation for microservice root cause localization utilizing multimodal data,” IEEE Transactions on Services Computing, 2025

  36. [36]

    Aaad: Asynchronous inter-variable relationship-aware anomaly detection for multivariate time series,

    H. Liu, X. Huang, M. Jia, L. Zhang, T. Jia, Z. Wu, and Y . Li, “Aaad: Asynchronous inter-variable relationship-aware anomaly detection for multivariate time series,” in 2025 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2025, pp. 1–6

  37. [37]

    Logaction: Consistent cross-system anomaly detection through logs via active domain adaptation,

    C. Duan, M. He, P. Xiao, T. Jia, X. Zhang, Z. Zhong, X. Luo, Y . Niu, L. Zhang, S. Yu et al., “Logaction: Consistent cross-system anomaly detection through logs via active domain adaptation,” in 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2025, pp. 700–712

  38. [38]

    Runtimeslicer: Towards generalizable unified runtime state representation for failure management,

    L. Zhang, T. Jia, W. Hong, M. Wang, C. Duan, M. He, R. Wang, X. Peng, M. Wang, G. Zhang et al., “Runtimeslicer: Towards generalizable unified runtime state representation for failure management,” arXiv preprint arXiv:2603.21495, 2026

  39. [39]

    Efficient failure management for multi-agent systems with reasoning trace representation,

    L. Zhang, T. Jia, M. Wang, W. Hong, C. Duan, M. He, R. Wang, X. Peng, M. Wang, G. Zhang et al., “Efficient failure management for multi-agent systems with reasoning trace representation,” arXiv preprint arXiv:2603.21522, 2026

  40. [40]

    E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

    L. Zhang, Y . Zhai, T. Jia, M. He, C. Duan, Z. Liu, B. Ding, and Y . Li, “E2e-reme: Towards end-to-end microservices auto-remediation via experience-simulation reinforcement fine-tuning,” arXiv preprint arXiv:2604.11094, 2026

  41. [41]

    Coorlog: Efficient-generalizable log anomaly detection via adaptive coordinator in software evolution,

    P. Xiao, C. Duan, M. He, T. Jia, Y . Wu, J. Xu, G. Gao, L. Zhang, W. Hong, Y . Li et al., “Coorlog: Efficient-generalizable log anomaly detection via adaptive coordinator in software evolution,” in 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2025, pp. 1119–1131

  42. [42]

    Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

    L. Zhang, T. Jia, Y . Zhai, L. Fang, K. Zheng, H. Liu, X. Huang, P. S. Yu, and Y . Li, “Towards robust llm post-training: Automatic failure manage- ment for reinforcement fine-tuning,” arXiv preprint arXiv:2605.04431, 2026

  43. [43]

    Alpha- gpt: Human-ai interactive alpha mining for quantitative investment,

    S. Wang, H. Yuan, L. Zhou, L. Ni, H. Y . Shum, and J. Guo, “Alpha- gpt: Human-ai interactive alpha mining for quantitative investment,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2025, pp. 196–206

  44. [44]

    Can large language models mine interpretable financial factors more effectively? a neural-symbolic factor mining agent model,

    Z. Li, R. Song, C. Sun, W. Xu, Z. Yu, and J.-R. Wen, “Can large language models mine interpretable financial factors more effectively? a neural-symbolic factor mining agent model,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3891– 3902

  45. [45]

    Quantagent: Seeking holy grail in trading by self-improving large language model

    S. Wang, H. Yuan, L. M. Ni, and J. Guo, “Quantagent: Seeking holy grail in trading by self-improving large language model,” arXiv preprint arXiv:2402.03755, 2024

  46. [46]

    Alphabench: Benchmarking large language models in formulaic alpha factor mining

    H. Luo, H. T. Ko, J. Chen, D. Sun, Y. Zhang, and C. Liu, “Alphabench: Benchmarking large language models in formulaic alpha factor mining,” in The Fourteenth International Conference on Learning Representations

  47. [47]

    Alphaagent: Llm-driven alpha mining with regularized exploration to counteract alpha decay

    Z. Tang, Z. Chen, J. Yang, J. Mai, Y. Zheng, K. Wang, J. Chen, and L. Lin, “Alphaagent: Llm-driven alpha mining with regularized exploration to counteract alpha decay,” in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025, pp. 2813–2822

  48. [48]

    Navigating the alpha jungle: An llm-powered mcts framework for formulaic alpha factor mining

    Y. Shi, Y. Duan, and J. Li, “Navigating the alpha jungle: An llm-powered mcts framework for formulaic alpha factor mining,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 2, 2026, pp. 997–1005

  49. [49]

    R&d-agent-quant: a multi-agent framework for data-centric factors and model joint optimization

    Y. Li, X. Yang, X. Yang, X. Wang, W. Liu, and J. Bian, “R&d-agent-quant: a multi-agent framework for data-centric factors and model joint optimization,” Advances in Neural Information Processing Systems, vol. 38, 2026

  50. [50]

    QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining

    J. Han, S. Zhang, W. Li, Z. Yang, Y. Dong, T. Hu, J. Yuan, X. Yu, Y. Zhu, F. Lou et al., “Quantaalpha: An evolutionary framework for llm-driven alpha mining,” arXiv preprint arXiv:2602.07085, 2026

  51. [51]

    Deep reinforcement learning from human preferences

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017

  52. [52]

    Fine-Tuning Language Models from Human Preferences

    D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” arXiv preprint arXiv:1909.08593, 2019

  53. [53]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  54. [54]

    Direct preference optimization: Your language model is secretly a reward model

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, pp. 53 728–53 741, 2023

  55. [55]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024

  56. [56]

    Large language model agents in finance: A survey bridging research, practice, and real-world deployment

    Y. Dong, F. Wu, K. Zhang, Y. Dai, S. Zhang, W. Ye, S. Chen, and Z.-Q. Cheng, “Large language model agents in finance: A survey bridging research, practice, and real-world deployment,” Findings of the Association for Computational Linguistics: EMNLP, vol. 2025, pp. 17 889–17 907, 2025

  57. [57]

    Ectsum: A new benchmark dataset for bullet point summarization of long earnings call transcripts

    R. Mukherjee, A. Bohra, A. Banerjee, S. Sharma, M. Hegde, A. Shaikh, S. Shrivastava, K. Dasgupta, N. Ganguly, S. Ghosh et al., “Ectsum: A new benchmark dataset for bullet point summarization of long earnings call transcripts,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10 893–10 906

  58. [58]

    Abstractive financial news summarization via transformer-bilstm encoder and graph attention-based decoder

    H. Li, Q. Peng, X. Mou, Y. Wang, Z. Zeng, and M. F. Bashir, “Abstractive financial news summarization via transformer-bilstm encoder and graph attention-based decoder,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3190–3205, 2023

  59. [59]

    Finred: A dataset for relation extraction in financial domain

    S. Sharma, T. Nayak, A. Bose, A. K. Meena, K. Dasgupta, N. Ganguly, and P. Goyal, “Finred: A dataset for relation extraction in financial domain,” in Companion Proceedings of the Web Conference 2022, 2022, pp. 595–597

  60. [60]

    Finbert: A pre-trained financial language representation model for financial text mining

    Z. Liu, D. Huang, K. Huang, Z. Li, and J. Zhao, “Finbert: A pre-trained financial language representation model for financial text mining,” in Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, 2021, pp. 4513–4519

  61. [61]

    BloombergGPT: A Large Language Model for Finance

    S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large language model for finance,” arXiv preprint arXiv:2303.17564, 2023

  62. [62]

    Fingpt: Open-source financial large language models

    H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open-source financial large language models,” arXiv preprint arXiv:2306.06031, 2023

  63. [63]

    Pixiu: a large language model, instruction data and evaluation benchmark for finance

    Q. Xie, W. Han, X. Zhang, Y. Lai, M. Peng, A. Lopez-Lira, and J. Huang, “Pixiu: a large language model, instruction data and evaluation benchmark for finance,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 33 469–33 484

  64. [64]

    Investlm: A large language model for investment using financial domain instruction tuning

    Y. Yang, Y. Tang, and K. Y. Tam, “Investlm: A large language model for investment using financial domain instruction tuning,” arXiv preprint arXiv:2309.13064, 2023

  65. [65]

    Fintral: A family of gpt-4 level multimodal financial large language models

    G. Bhatia, H. Cavusoglu, M. Abdul-Mageed et al., “Fintral: A family of gpt-4 level multimodal financial large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 13 064–13 087

  66. [66]

    No language is an island: Unifying chinese and english in financial large language models, instruction data, and benchmarks

    G. Hu, K. Qin, C. Yuan, M. Peng, A. Lopez-Lira, B. Wang, S. Ananiadou, J. Huang, and Q. Xie, “No language is an island: Unifying chinese and english in financial large language models, instruction data, and benchmarks,” arXiv preprint arXiv:2403.06249, 2024

  67. [67]

    Fednlp: an interpretable nlp system to decode federal reserve communications

    J. Lee, H. L. Youn, N. Stevens, J. Poon, and S. C. Han, “Fednlp: an interpretable nlp system to decode federal reserve communications,” in Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, 2021, pp. 2560–2564

  68. [68]

    Trillion dollar words: A new financial dataset, task & market analysis

    A. Shah, S. Paturi, and S. Chava, “Trillion dollar words: A new financial dataset, task & market analysis,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 6664–6679

  69. [69]

    Impact of news on the commodity market: Dataset and results

    A. Sinha and T. Khandait, “Impact of news on the commodity market: Dataset and results,” in Future of Information and Communication Conference. Springer, 2021, pp. 589–601

  70. [70]

    Harnessing llms for temporal data-a study on explainable financial time series forecasting

    X. Yu, Z. Chen, and Y. Lu, “Harnessing llms for temporal data-a study on explainable financial time series forecasting,” in Proceedings of the 2023 conference on empirical methods in natural language processing: industry track, 2023, pp. 739–753

  71. [71]

    FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting

    Y. Hu, Y. Li, P. Liu, Y. Zhu, N. Li, T. Dai, S.-t. Xia, D. Cheng, and C. Jiang, “Fintsb: A comprehensive and practical benchmark for financial time series forecasting,” arXiv preprint arXiv:2502.18834, 2025

  72. [72]

    Gpt-investar: Enhancing stock investment strategies through annual report analysis with large language models

    U. Gupta, “Gpt-investar: Enhancing stock investment strategies through annual report analysis with large language models,” arXiv preprint arXiv:2309.03079, 2023

  73. [73]

    Finben: A holistic financial benchmark for large language models

    Q. Xie, W. Han, Z. Chen, R. Xiang, X. Zhang, Y. He, M. Xiao, D. Li, Y. Dai, D. Feng et al., “Finben: A holistic financial benchmark for large language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 95 716–95 743, 2024

  74. [74]

    Investorbench: A benchmark for financial decision-making tasks with llm-based agent

    H. Li, Y. Cao, Y. Yu, S. R. Javaji, Z. Deng, Y. He, Y. Jiang, Z. Zhu, K. Subbalakshmi, J. Huang et al., “Investorbench: A benchmark for financial decision-making tasks with llm-based agent,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 2509–2525

  75. [75]

    Strux: An llm for decision-making with structured explanations

    Y. Lu, Y. Hu, H. Foroosh, W. Jin, and F. Liu, “Strux: An llm for decision-making with structured explanations,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 2025, pp. 131–141

  76. [76]

    Finmem: A performance-enhanced llm trading agent with layered memory and character design

    Y. Yu, H. Li, Z. Chen, Y. Jiang, Y. Li, J. W. Suchow, D. Zhang, and K. Khashanah, “Finmem: A performance-enhanced llm trading agent with layered memory and character design,” IEEE Transactions on Big Data, 2025

  77. [77]

    Cfgpt: Chinese financial assistant with large language model

    J. Li, Y. Bian, G. Wang, Y. Lei, D. Cheng, Z. Ding, and C. Jiang, “Cfgpt: Chinese financial assistant with large language model,” arXiv preprint arXiv:2309.10654, 2023

  78. [78]

    When ai meets finance (stockagent): Large language model-based stock trading in simulated real-world environments

    C. Zhang, X. Liu, Z. Zhang, M. Jin, L. Li, Z. Wang, W. Hua, D. Shu, S. Zhu, X. Jin et al., “When ai meets finance (stockagent): Large language model-based stock trading in simulated real-world environments,” arXiv preprint arXiv:2407.18957, 2024

  79. [79]

    Tradingagents: Multi-agents llm financial trading framework

    Y. Xiao, E. Sun, D. Luo, and W. Wang, “Tradingagents: Multi-agents llm financial trading framework,” in The First MARW: Multi-Agent AI in the Real World Workshop at AAAI 2025

  80. [80]

    Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering

    Z. Chen, S. Li, C. Smiley, Z. Ma, S. Shah, and W. Y. Wang, “Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering,” in Proceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 6279–6292
