pith. machine review for the scientific record.

arxiv: 2603.16365 · v2 · submitted 2026-03-17 · 💻 cs.AI

Recognition: no theorem link

FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords alpha factor mining · quantitative investment · LLM-guided search · program-level factors · knowledge-infused bootstrapping · predictive stability · portfolio performance · backtesting

The pith

FactorEngine mines executable alpha factors by representing them as Turing-complete code and bootstrapping from financial reports using LLM-guided search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Alpha factor mining seeks predictive signals from noisy, shifting market data that can be run directly in trading systems. Symbolic methods often cannot express complex relationships, while neural approaches lose interpretability and break under new market conditions. FactorEngine treats each factor as a full program, splits logic changes from number tuning, lets language models steer the search and turn reports into code through a multi-agent pipeline, and stores past attempts to guide future tries. Backtests on real price and volume data show the resulting factors deliver higher information coefficients with better risk-adjusted portfolio returns than earlier techniques.

Core claim

FactorEngine casts factors as Turing-complete programs and improves discovery through three separations: logic revision from parameter optimization, LLM-guided directional search from Bayesian tuning, and LLM usage from local computation. It adds a knowledge-infused module that converts unstructured financial reports into executable code via a closed-loop multi-agent extraction-verification-generation pipeline, plus an experience base that supports refinement from past failures. Extensive backtests on real-world OHLCV data produce factors with higher IC, ICIR, Rank IC, Rank ICIR, annualized return, and Sharpe ratio than baseline methods.
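As a concrete, hypothetical illustration of the first separation, a program-level factor can expose its logic as a function body and its tunable numbers as arguments. Everything below (function name, windows, the combination rule) is invented for illustration and does not appear in the paper:

```python
import numpy as np

# Hypothetical sketch of a "program-level" factor in the paper's sense: the
# factor is ordinary executable code, with its logic (the function body) kept
# separate from its tunable numeric parameters (the keyword arguments).

def momentum_reversal_factor(close: np.ndarray, volume: np.ndarray,
                             lookback: int = 20, vol_window: int = 5) -> np.ndarray:
    """Return a per-day factor value; NaN where history is insufficient."""
    n = len(close)
    out = np.full(n, np.nan)
    for t in range(max(lookback, vol_window), n):
        ret = close[t] / close[t - lookback] - 1.0               # logic: momentum leg
        v_ratio = volume[t] / volume[t - vol_window:t].mean()    # logic: volume leg
        out[t] = ret / v_ratio                                   # logic: combination
    return out
```

In this framing, an LLM "macro" revision edits the function body (say, swapping the combination rule), while Bayesian "micro" search only tunes `lookback` and `vol_window`, which is what keeps the two search problems separate.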

What carries the argument

The knowledge-infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed-loop multi-agent extraction-verification-code-generation pipeline, supported by an experience knowledge base for trajectory-aware refinement.
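The closed loop can be sketched as retry-until-verified control flow. The agent functions below are stubs standing in for the paper's LLM agents; every name and return value here is illustrative, not the paper's implementation:

```python
from typing import Optional

def extract(report: str) -> dict:
    # stub: an extraction agent would distill a factor idea into structured form
    return {"idea": report.strip(),
            "pseudocode": "rank(close / delay(close, 5))"}

def verify(candidate: dict) -> bool:
    # stub: a verification agent would check the idea/pseudocode for consistency
    return bool(candidate.get("pseudocode"))

def generate_code(candidate: dict) -> str:
    # stub: a code-generation agent would emit executable Python
    return (f"# factor idea: {candidate['idea']}\n"
            f"# implements: {candidate['pseudocode']}")

def bootstrap(report: str, max_rounds: int = 3) -> Optional[str]:
    """Closed loop: re-extract until verification passes, then generate code."""
    for _ in range(max_rounds):
        candidate = extract(report)
        if verify(candidate):
            return generate_code(candidate)
    return None
```

The point of the loop shape is that verification gates code generation, so only candidates that survive the check are ever turned into executable programs.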

If this is right

  • Higher IC and ICIR values translate into more stable predictive signals for portfolio construction.
  • Improved annualized returns and Sharpe ratios follow directly from the stronger factors in the backtests.
  • Factors remain directly executable and auditable because they are stored as complete programs.
  • The separation of search steps makes large-scale discovery computationally feasible.
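Taking IC as the cross-sectional rank correlation between factor values and next-period returns, and ICIR as its time-series mean over its standard deviation (the usual definitions; the paper's exact conventions may differ), the first bullet can be made concrete:

```python
import numpy as np

# Hedged sketch of the standard IC/ICIR definitions; assumes no ties in the
# cross-section (ranks are computed by double argsort).

def _rank(x: np.ndarray) -> np.ndarray:
    r = np.empty(len(x))
    r[x.argsort()] = np.arange(len(x))
    return r

def daily_rank_ic(factor: np.ndarray, fwd_returns: np.ndarray) -> np.ndarray:
    """Per-day cross-sectional rank correlation; rows = days, cols = stocks."""
    return np.array([np.corrcoef(_rank(f), _rank(r))[0, 1]
                     for f, r in zip(factor, fwd_returns)])

def icir(ic_series: np.ndarray) -> float:
    """Mean IC over its standard deviation: higher means a steadier signal."""
    return float(ic_series.mean() / ic_series.std(ddof=1))
```

A factor family with the same mean IC but lower day-to-day variance scores a higher ICIR, which is the "stability" the bullet refers to.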

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Maintaining the experience base across multiple market cycles could reduce the need for full restarts after regime changes.
  • The same report-to-program pipeline could be tested on other text-heavy prediction tasks outside finance.
  • Live updating of the knowledge base from new reports might allow factors to evolve without periodic full re-mining.

Load-bearing premise

Factors produced from historical reports and backtests will continue to predict and perform well when markets move into new regimes.

What would settle it

Apply the mined factors to market data from a later period than any used in the original backtests and check whether IC/ICIR and Sharpe remain higher than those of baseline factors.
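One hedged way to operationalize this check: freeze the mined factors, compute their daily IC series, and compare its mean before and after a strict temporal cutoff that postdates all discovery data. The helper below assumes a precomputed daily IC series; the names are illustrative:

```python
import numpy as np

def holdout_ic_comparison(ic_daily: np.ndarray, cutoff: int) -> dict:
    """Split a daily IC series at a strict temporal cutoff (no day at or after
    `cutoff` was visible to the mining process) and compare mean IC."""
    pre, post = np.asarray(ic_daily[:cutoff]), np.asarray(ic_daily[cutoff:])
    return {
        "in_sample_ic": float(pre.mean()),
        "out_of_sample_ic": float(post.mean()),
        "decay": float(pre.mean() - post.mean()),  # large positive = likely overfit
    }
```

The settling question is then whether `out_of_sample_ic` (and the analogous Sharpe comparison) still beats the baselines' figures, not merely whether it stays positive.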

Figures

Figures reproduced from arXiv: 2603.16365 by Binjie Fei, Jiaqi Liu, Linna Zhou, Qinhong Lin, Ruitao Feng, Yinglun Feng, Yukun Chen, Yu Li, Zhenxin Huang, Zhongliang Yang.

Figure 1
Figure 1. Overview of FactorEngine (FE). Left: Bootstrapping extracts factor ideas and converts pseudocode into executable Python to seed a knowledge-infused pool. Center: Evolution performs macro–micro co-evolution: LLM agents propose macro mutations guided by chains of experience, and Bayesian search conducts micro-level parameter tuning with fast validation and feedback updates. Right: Integration selects elite f… view at source ↗
Figure 2
Figure 2. Overview of the Bootstrapping module. … knowledge, yielding reliable inputs for downstream factor extraction. (2) Factor Extraction: Implements a two-step understanding-to-generation workflow with iterative reflection and verification to distill core financial ideas from research reports into structured JSON representations accompanied by LaTeX-formatted pseudocode. (3) Code Generation: Transforms verif… view at source ↗
Figure 3
Figure 3. (Left) Cumulative excess return comparison in the CSI300 market. (Middle) Cumulative excess return comparison in the CSI500 market. (Right) Visualization of factor correlation structure of three agent-based methods based on MDS. view at source ↗
Figure 4
Figure 4. Yearly IC and Rank IC comparisons in the CSI300 (Left) and CSI500 markets (Middle). Mean IC and Rank IC between the top 10% factors and future returns at T+N on the CSI300 market across three experimental settings (Right). view at source ↗
Figure 5
Figure 5. Left: Effect of Bayesian micro-search. Bayesian parameter search (bay_avg) yields higher final performance and a faster improvement trajectory than the run without Bayesian tuning (no_bay_avg). Right: Comparison of three methods evolved using the GPT-4o and Gemini-2.5-flash-lite models as backbone agents. view at source ↗
read the original abstract

We study alpha factor mining, the automated discovery of predictive signals from noisy, non-stationary market data, under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability for performance and remain vulnerable to regime shifts and overfitting. We introduce FactorEngine (FE), a program-level factor discovery framework that casts factors as Turing-complete code and improves both effectiveness and efficiency via three separations: (i) logic revision vs. parameter optimization, (ii) LLM-guided directional search vs. Bayesian hyperparameter search, and (iii) LLM usage vs. local computation. FE further incorporates a knowledge-infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed-loop multi-agent extraction-verification-code-generation pipeline, and an experience knowledge base that supports trajectory-aware refinement (including learning from failures). Across extensive backtests on real-world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact, for example higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe, than baseline methods, achieving state-of-the-art predictive and portfolio performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FactorEngine (FE), a program-level knowledge-infused factor mining framework for quantitative investment. Factors are represented as Turing-complete executable code. The approach uses three separations (logic revision vs. parameter optimization, LLM-guided directional search vs. Bayesian hyperparameter search, and LLM usage vs. local computation), a closed-loop multi-agent pipeline to bootstrap executable factors from unstructured financial reports, and an experience knowledge base for trajectory-aware refinement. Across backtests on real-world OHLCV data, the paper claims FE produces factors with substantially higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe ratios than baseline methods, achieving state-of-the-art predictive and portfolio performance.

Significance. If the performance claims hold under rigorous out-of-sample validation, the framework could meaningfully advance automated alpha discovery in quantitative finance by combining symbolic expressiveness with LLM-guided knowledge infusion while maintaining auditability and computational tractability.

major comments (3)
  1. [Abstract] The state-of-the-art claims for IC/ICIR, Rank IC/ICIR, AR, and Sharpe improvements provide no details on baseline methods, statistical significance tests, data splits, walk-forward protocols, or controls for look-ahead bias and non-stationarity, preventing verification that the reported gains exceed in-sample fitting.
  2. [Backtests on real-world OHLCV data] Backtest evaluation description: The framework's LLM-guided search and report-based bootstrapping operate on historical data without disclosed regime-stratified or walk-forward evaluation, leaving the claimed predictive stability vulnerable to regime shifts and transient correlations.
  3. [Knowledge-infused bootstrapping module] The closed-loop extraction-verification-code-generation pipeline lacks explicit mechanisms to filter non-stationary signals from historical reports, which could undermine generalization claims.
minor comments (2)
  1. [Abstract] The abstract would benefit from one concrete example of a discovered factor program to illustrate the Turing-complete representation.
  2. [Experience knowledge base] Clarify how the experience knowledge base stores and retrieves failure trajectories for refinement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. These have helped us strengthen the clarity, rigor, and verifiability of the manuscript. We address each major comment point by point below, indicating the specific revisions made.

read point-by-point responses
  1. Referee: [Abstract] The state-of-the-art claims for IC/ICIR, Rank IC/ICIR, AR, and Sharpe improvements provide no details on baseline methods, statistical significance tests, data splits, walk-forward protocols, or controls for look-ahead bias and non-stationarity, preventing verification that the reported gains exceed in-sample fitting.

    Authors: We agree that the abstract's brevity prevents inclusion of these details and that this limits immediate verifiability of the claims. In the revised manuscript we have qualified the performance statements in the abstract and added a concise reference to the evaluation protocol. We have also inserted a new paragraph in Section 4.1 that explicitly lists the baseline methods (genetic programming variants, reinforcement-learning factor miners, and traditional technical indicators), the temporal 70/15/15 train/validation/test split, the 5-year rolling walk-forward protocol, paired t-tests and bootstrap confidence intervals for IC/ICIR differences, and the strict temporal cutoffs used to eliminate look-ahead bias and to mitigate non-stationarity via regime-aware normalization. These additions allow readers to verify that the reported gains are not merely in-sample artifacts. revision: yes

  2. Referee: [Backtests on real-world OHLCV data] Backtest evaluation description: The framework's LLM-guided search and report-based bootstrapping operate on historical data without disclosed regime-stratified or walk-forward evaluation, leaving the claimed predictive stability vulnerable to regime shifts and transient correlations.

    Authors: The referee is correct that the original backtest description was insufficiently explicit on robustness to regime shifts. We have expanded Section 4.2 to describe a regime-stratified evaluation that partitions the test period into bull, bear, and high-volatility regimes identified by a hidden Markov model on realized volatility and trend. We further detail a walk-forward protocol with expanding windows that retrains the search and bootstrapping modules at each step using only data available up to that point. The revised text confirms that both the LLM-guided directional search and the report-based bootstrapping respect these temporal boundaries, and we report IC/ICIR and Sharpe ratios separately for each regime to demonstrate stability beyond transient correlations. revision: yes

  3. Referee: [Knowledge-infused bootstrapping module] The closed-loop extraction-verification-code-generation pipeline lacks explicit mechanisms to filter non-stationary signals from historical reports, which could undermine generalization claims.

    Authors: We acknowledge that the original description of the verification step did not sufficiently highlight non-stationarity controls. In the revision we have added an explicit filtering stage within the verification agent: it computes rolling-window IC over 1-, 3-, and 6-month horizons and applies the Augmented Dickey-Fuller test; any candidate factor whose predictive signal fails stationarity at the 5 % level is rejected or returned for refinement. We also include an ablation study showing the effect of this filter on out-of-sample IC decay. These mechanisms are now described in detail in the bootstrapping module section and illustrated in the updated pipeline diagram. revision: yes
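The rolling-window part of the filter described in this response can be sketched as follows. The ADF test itself (statsmodels' `adfuller` in the stated revision) is elided here; a simpler sign-consistency criterion stands in for it, so the windows and threshold below are illustrative, not the paper's:

```python
import numpy as np

# Sketch of a verification-stage stability filter: compute trailing-window
# mean IC at roughly 1-, 3-, and 6-month horizons (21/63/126 trading days)
# and reject candidates whose rolling IC flips sign too often. The ADF
# stationarity test from the described revision is NOT implemented here.

def rolling_ic(ic_daily: np.ndarray, window: int) -> np.ndarray:
    """Trailing-window mean of a daily IC series (via cumulative sums)."""
    c = np.cumsum(np.insert(ic_daily, 0, 0.0))
    return (c[window:] - c[:-window]) / window

def passes_filter(ic_daily: np.ndarray, windows=(21, 63, 126),
                  min_consistency: float = 0.6) -> bool:
    """Accept only if, at every horizon, the rolling IC keeps a single sign
    in at least `min_consistency` of the windows."""
    for w in windows:
        r = rolling_ic(ic_daily, w)
        if len(r) == 0:
            return False  # not enough history for this horizon
        frac = max((r > 0).mean(), (r < 0).mean())
        if frac < min_consistency:
            return False
    return True
```

A steadily positive (or steadily negative) signal passes; one whose predictive direction churns from window to window is sent back for refinement.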

Circularity Check

1 step flagged

Backtest metrics computed on discovery data without explicit separation

specific steps
  1. fitted input called prediction [Abstract]
    "Across extensive backtests on real-world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact—for example, higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe, than baseline methods, achieving state-of-the-art predictive and portfolio performance."

    The framework discovers and refines executable factor programs directly from the OHLCV series (via LLM search + closed-loop report extraction). The same series then supplies the IC, Rank IC, AR and Sharpe values used to declare superiority. Without an independent test partition or out-of-sample protocol stated in the abstract, the performance numbers are statistically forced by the discovery process rather than constituting a genuine forward prediction.

full rationale

The paper's central empirical claim rests on backtest superiority (IC/ICIR, AR/Sharpe) obtained from the same OHLCV data used to mine and refine factors via LLM-guided search and report bootstrapping. No equations or first-principles derivation appear; the result is an empirical comparison. While standard in quant finance, the absence of disclosed walk-forward, regime-stratified, or held-out protocols means the reported gains are not shown to be independent of the fitting process. This qualifies as moderate fitted-input risk but does not reduce the framework description itself to a tautology. No self-citation load-bearing or definitional circularity is evident from the provided text.
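A regime-stratified report of the kind this audit asks for can be approximated without an HMM by labeling each day from trailing volatility and trend, then averaging IC per label. This is a hedged stand-in for the protocol discussed above; the window, tercile cutoff, and label names are chosen purely for illustration:

```python
import numpy as np

# Label days 'bull' / 'bear' by the sign of trailing mean return, with
# 'high_vol' overriding when trailing volatility sits in the top tercile.
# Days without a full trailing window are left unlabeled ("").

def label_regimes(returns: np.ndarray, window: int = 21) -> np.ndarray:
    n = len(returns)
    labels = np.array([""] * n, dtype=object)
    vol = np.array([returns[t - window:t].std() if t >= window else 0.0
                    for t in range(n)])
    hi = np.quantile(vol[window:], 2 / 3)  # top-tercile volatility cutoff
    for t in range(window, n):
        if vol[t] >= hi:
            labels[t] = "high_vol"
        else:
            labels[t] = "bull" if returns[t - window:t].mean() >= 0 else "bear"
    return labels

def ic_by_regime(ic_daily: np.ndarray, labels: np.ndarray) -> dict:
    """Mean daily IC within each labeled regime that actually occurs."""
    return {reg: float(ic_daily[labels == reg].mean())
            for reg in ("bull", "bear", "high_vol") if (labels == reg).any()}
```

If the mined factors' per-regime means stay close to the overall mean IC, the fitted-input concern weakens; a collapse in one regime is exactly the transient-correlation failure mode flagged above.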

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that LLM search directions and the closed-loop pipeline yield generalizable programs, with typical ML hyperparameters left unspecified.

pith-pipeline@v0.9.0 · 5550 in / 1231 out tokens · 44110 ms · 2026-05-15T10:26:13.564980+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

  1. [1] Brown, T.B., et al.: Language models are few-shot learners. NeurIPS (2020)

  2. [2] Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning (1995)

  3. [3] Duan, Y., Wang, L., Zhang, Q., Li, J.: FactorVAE: A probabilistic dynamic factor model based on variational autoencoder for predicting cross-sectional stock returns. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 4468–4476 (2022)

  4. [4] Fama, E.F., French, K.R.: The cross-section of expected stock returns. The Journal of Finance 47(2), 427–465 (1992)

  5. [5] Fan, X., et al.: Modeling the momentum and mean reversion of stock prices via multi-scale representation learning. KDD (2022)

  6. [6] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735

  7. [7] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. (1997)

  8. [8] Hou, K., Xue, C., Zhang, L.: Replicating anomalies. The Review of Financial Studies 33(5), 2019–2133 (2020). https://doi.org/10.1093/rfs/hhy131

  9. [9] Li, Y., Xu, Y., Xiao, Y., Xu, M., Wang, X., Liu, W., Bian, J.: R&D-Agent-Quant: A multi-agent framework for data-centric factors and model joint optimization. arXiv preprint arXiv:2505.15155 (2025)

  10. [10] Li, Z., Song, R., Sun, C., Xu, W., Yu, Z., Wen, J.R.: Can large language models mine interpretable financial factors more effectively? A neural-symbolic factor mining agent model. In: Findings of the Association for Computational Linguistics ACL 2024. pp. 3891–3902 (2024)

  11. [11] Lin, H., Zhou, D., Liu, W., Bian, J.: Learning multiple stock trading patterns with temporal routing adaptor and optimal transport. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. pp. 1017–1026 (2021)

  12. [12] Novikov, A., Vũ, N., Eisenberger, M., et al.: AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131 (2025)

  13. [13] Shi, H., Song, W., Zhang, X., Shi, J., Luo, C., Ao, X., Arian, H., Seco, L.A.: AlphaForge: A framework to mine and dynamically combine formulaic alpha factors. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 12524–12532 (2025)

  14. [14] Shi, Y., Duan, Y., Li, J.: Navigating the alpha jungle: An LLM-powered MCTS framework for formulaic factor mining. arXiv preprint arXiv:2505.11122 (2025)

  15. [15] Stephens, T.: gplearn: Genetic programming in Python. https://github.com/trevorstephens/gplearn (2016)

  16. [16] Tang, Z., Chen, Z., Yang, J., et al.: AlphaAgent: LLM-driven alpha mining with regularized exploration to counteract alpha decay. In: Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD). pp. 2813–2822 (2025)

  17. [17] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems. vol. 30 (2017)

  18. [18] Wei, J., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)

  19. [19] Xu, W., Liu, W., Wang, L., Xia, Y., Bian, J., Yin, J., Liu, T.Y.: HIST: A graph-based framework for stock trend forecasting via mining concept-oriented shared information. arXiv preprint arXiv:2110.13716 (2021)

  20. [20] Xu, W., Liu, W., Xu, C., Bian, J., Yin, J., Liu, T.Y.: REST: Relational event-driven stock trend forecasting. In: Proceedings of The Web Conference 2021. pp. 1–10 (2021)

  21. [21] Yu, S., Xue, H., Ao, X., et al.: Generating synergistic formulaic alpha collections via reinforcement learning. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 5476–5486 (2023)

  22. [22] Zhang, T., Li, Y., Jin, Y., Li, J.: AutoAlpha: An efficient hierarchical evolutionary algorithm for mining alpha factors in quantitative investment. arXiv preprint arXiv:2002.08245 (2020)

  23. [23] Zhang, X., Li, P., Zhu, J., Tang, J.: Temporal routing adaptor for deep time series forecasting. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 2447–2457 (2022)

    Zhang, X., Li, P., Zhu, J., Tang, J.: Temporal routing adaptor for deep time series forecasting. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 2447–2457 (2022) A Experimental Details A.1 Implementation Settings Hardware Setup.All experiments were conducted on a server equipped with 56 CPU cores, providing a ...