
arxiv: 2604.09601 · v2 · submitted 2026-03-09 · 💻 cs.AI · cs.CE

Hubble: An LLM-Driven Agentic Framework for Safe, Diverse, and Reproducible Alpha Factor Discovery

Pith reviewed 2026-05-15 15:21 UTC · model grok-4.3

classification 💻 cs.AI cs.CE
keywords alpha factor discovery · LLM agents · quantitative finance · factor mining · retrieval augmented generation · equity alpha factors · automated research workflow

The pith

LLM agent with operator trees and feedback discovers range and volatility alpha factors that stay positive out of sample on U.S. equities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hubble, an agentic system that directs large language models to generate alpha factors while confining their output to a domain-specific operator language executed in an abstract syntax tree (AST) sandbox. Positive and negative retrieval feedback plus family-aware penalties steer the search away from repetitive volume motifs toward range, volatility, and trend families. On a universe of roughly 500 U.S. stocks, the system evaluated 104 valid candidates across three rounds with no runtime failures. When the top five factors were frozen and tested on a 2025-2026 holdout, the range and volatility factors remained positive, and several reached HAC-significant Pearson IC and long-short evidence. The results suggest that LLM-driven search can be turned into a controlled, reproducible workflow that jointly enforces validity, diversity, and measurable generalization.

Core claim

Hubble restricts LLM generation to interpretable operator trees, runs every candidate through a deterministic cross-sectional evaluation pipeline, and returns both top formulas and structured family-level diagnostics to the next round via dual-channel positive/negative RAG and similarity penalties. On a U.S. equity universe of roughly 500 stocks, the main run evaluated 104 valid candidates across three rounds with zero crashes and produced a top set dominated by range, volatility, and trend families. When these top-5 factors were fixed and tested on the 2025-06-01 to 2026-03-13 holdout, the two range and two volatility factors stayed positive, and several reached HAC-significant Pearson IC.

What carries the argument

The iterative agentic loop that restricts generation to AST-executable operator trees, scores them with standardized multi-metric RankIC and Pearson IC, and feeds back family diagnostics plus positive/negative RAG examples to steer subsequent proposals.
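To make the idea of "AST-executable operator trees" concrete, here is a minimal sketch of a whitelist validator built on Python's standard `ast` module. It is an illustration only: the operator names (`rank`, `ts_mean`, `ts_std`, `delta`), the field names, and the function `is_valid_formula` are assumptions, since this page does not list the paper's actual operator language or sandbox rules.

```python
import ast

# Hypothetical whitelists; the paper's real operator set is not given here.
ALLOWED_OPS = {"rank", "ts_mean", "ts_std", "delta"}
ALLOWED_FIELDS = {"open", "high", "low", "close", "volume"}
ALLOWED_BINOPS = (ast.Add, ast.Sub, ast.Mult, ast.Div)


def _check(node: ast.AST) -> bool:
    """Recursively accept only whitelisted node types."""
    if isinstance(node, ast.Constant):
        # Numeric literals only (bool is a subclass of int, so exclude it).
        return isinstance(node.value, (int, float)) and not isinstance(node.value, bool)
    if isinstance(node, ast.Name):
        return node.id in ALLOWED_FIELDS
    if isinstance(node, ast.BinOp):
        return (
            isinstance(node.op, ALLOWED_BINOPS)
            and _check(node.left)
            and _check(node.right)
        )
    if isinstance(node, ast.UnaryOp):
        return isinstance(node.op, ast.USub) and _check(node.operand)
    if isinstance(node, ast.Call):
        return (
            isinstance(node.func, ast.Name)
            and node.func.id in ALLOWED_OPS
            and not node.keywords
            and all(_check(arg) for arg in node.args)
        )
    # Attributes, subscripts, lambdas, comprehensions, etc. are rejected.
    return False


def is_valid_formula(source: str) -> bool:
    """Return True if `source` parses to a tree made only of whitelisted nodes."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    return _check(tree.body)
```

Under these assumed whitelists, a range-style candidate such as `rank(ts_std(high - low, 20))` passes, while anything that touches attributes or unlisted names, for example `__import__('os').system('ls')`, is rejected before it can run. Validating the tree rather than the raw string is what lets a system of this kind report zero runtime crashes without trusting the generator.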

If this is right

  • Range and volatility family factors remain positive in the held-out period while trend factors decay.
  • Several discovered factors reach HAC-significant Pearson IC and long-short evidence on the 2025-2026 window.
  • The search systematically avoids crowded volume-only motifs in favor of range, volatility, and trend families.
  • Persistent diagnostics artifacts allow post-hoc inspection of why each factor was retained or rejected.
  • The same constrained pipeline can be rerun with different seeds or universes while preserving reproducibility.
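The two correlation metrics the paper reports side by side, Pearson IC and RankIC, can be sketched for a single cross-section as follows. This is a dependency-free illustration, not the paper's pipeline: the function names are hypothetical, and the inputs are assumed to be one day's factor values and the matching forward returns for the same stocks.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_ic(factor, fwd_returns):
    """Spearman-style RankIC: Pearson correlation of the average ranks."""
    return pearson(average_ranks(factor), average_ranks(fwd_returns))
```

Reporting both is informative because they can disagree: a factor that orders stocks perfectly but is dominated by one outlier return has a RankIC of 1 and a noticeably lower Pearson IC.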

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The operator-language plus sandbox approach could be ported to other asset classes once the primitive set is extended.
  • Repeated runs with fresh holdouts would clarify whether the observed OOS performance persists across regimes.
  • The family-penalty mechanism offers a concrete way to measure and control diversity that could be adopted in non-LLM factor searches.
  • If the diagnostics artifacts are made public they would let independent researchers audit the selection path for hidden overfitting.
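As a rough illustration of how a similarity penalty can control diversity during selection, the sketch below greedily picks formulas while subtracting a penalty proportional to each candidate's overlap with what has already been chosen. Everything here is an assumption for illustration: the paper mentions formula-similarity penalties but this page does not specify the metric, so token-level Jaccard similarity, the `penalty` weight, and the function names are stand-ins.

```python
import re


def tokens(formula: str) -> set:
    """Identifier tokens (operators and fields) appearing in a formula string."""
    return set(re.findall(r"[A-Za-z_]\w*", formula))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two formulas' token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def greedy_select(candidates, k, penalty=0.5):
    """Pick up to k formulas from (formula, raw_score) pairs.

    Each step chooses the candidate with the best raw score minus
    `penalty` times its maximum similarity to anything already chosen.
    """
    chosen = []
    pool = list(candidates)
    while pool and len(chosen) < k:
        def adjusted(item):
            formula, score = item
            overlap = max((jaccard(formula, c) for c, _ in chosen), default=0.0)
            return score - penalty * overlap

        best = max(pool, key=adjusted)
        chosen.append(best)
        pool.remove(best)
    return [formula for formula, _ in chosen]
```

With two near-duplicate volume formulas scoring 1.00 and 0.95 and a range formula scoring 0.70, the default penalty selects the best volume formula and then the range formula; setting `penalty=0.0` falls back to a plain top-k that returns both volume variants.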

Load-bearing premise

That positive performance on the chosen 2025-2026 holdout reflects genuine generalization rather than residual data snooping or period-specific market regimes.

What would settle it

Re-testing the same top-5 factors on any later out-of-sample window after March 2026 and finding that their Pearson IC or long-short returns turn negative or lose statistical significance would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2604.09601 by Chengxi Lv, Runze Shi, Shengyu Yan, Yuecheng Cai.

Figure 1: Hubble agentic search loop. Positive and negative RAG references guide the generator; …
Figure 2: Hubble infrastructure and safety stack, integrating market data ingestion, AST-based …
Figure 3: Throughput and best-score progression across rounds for the main run.
Figure 4: In-sample statistical and trading diagnostics for the top- …
Figure 5: Out-of-sample statistical and trading diagnostics for the top- …
Figure 6: In-sample bucket-return profile of the best factor.
Figure 7: Out-of-sample bucket-return profile of the best factor.
Figure 8: HAC significance of the top-5 factors in-sample.
Figure 9: HAC significance of the top-5 factors out-of-sample.
Figure 10: Cumulative IC for the top-3 factors in-sample.
Figure 11: Cumulative IC for the top-3 factors out-of-sample.
Figure 12: Cumulative long-short spread of the top- …
Figure 13: Cumulative long-short spread of the top- …
Figure 14: Family composition of the main run: candidate pool versus final top- …
Original abstract

Automated alpha discovery is difficult because the search space of formulaic factors is combinatorial, the signal-to-noise ratio in daily equity data is low, and unconstrained program generation is operationally unsafe. We present Hubble, an agentic factor mining framework that combines large language models (LLMs) with a domain-specific operator language, an abstract syntax tree (AST) execution sandbox, a dual-channel retrieval-augmented generation (RAG) module, and a family-aware selection mechanism. Instead of treating the LLM as an unconstrained code generator, Hubble restricts generation to interpretable operator trees, evaluates every candidate through a deterministic cross-sectional pipeline, and feeds back both top formulas and structured family-level diagnostics to subsequent rounds. The current system additionally introduces positive/negative RAG, formula-similarity penalties, standardized multi-metric scoring, dual reporting of RankIC and Pearson IC, and persistent diagnostics artifacts for post-hoc research analysis. On a U.S. equity universe of roughly 500 stocks, our main run evaluates 104 valid candidates across three rounds with zero runtime crashes and discovers a top set dominated by range, volatility, and trend families rather than crowded volume-only motifs. We then fix the resulting top-5 factors and validate them on a held-out period from 2025-06-01 to 2026-03-13. In this out-of-sample window, the two range factors and two volatility factors remain positive and several achieve HAC-significant Pearson IC and long-short evidence, whereas the weakest in-sample trend factor decays materially. These results suggest that safe LLM-guided search can be upgraded from a syntax-compliant generator into a reproducible alpha-research workflow that jointly optimizes validity, diversity, interpretability, and family-level generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Hubble, an LLM-driven agentic framework for alpha factor discovery that restricts generation to interpretable operator trees executed in an AST sandbox, incorporates dual-channel positive/negative RAG, formula-similarity penalties, family-aware selection, and multi-metric scoring. On a U.S. equity universe of roughly 500 stocks, the system evaluates 104 valid candidates across three rounds with zero runtime crashes, identifies a top set dominated by range, volatility, and trend families, and validates the top-5 factors on a held-out period (2025-06-01 to 2026-03-13), where four remain positive with some achieving HAC-significant Pearson IC and long-short evidence while the trend factor decays.

Significance. If the central claims hold under stronger validation, Hubble would advance automated alpha discovery by demonstrating a reproducible, safe workflow that jointly optimizes validity, diversity, and family-level generalization rather than unconstrained code generation. The deterministic cross-sectional pipeline, persistent diagnostics artifacts, and dual RankIC/Pearson IC reporting address practical operational issues in quant research. The reported zero-crash execution of 104 candidates and shift away from volume-only motifs toward range/volatility families are concrete strengths, though the single short holdout limits immediate impact.

major comments (3)
  1. Out-of-sample validation: The held-out evaluation uses only a single ~9-month interval (2025-06-01 to 2026-03-13). No rolling-window tests, additional disjoint periods, or regime-stratified breakdowns are described, so the persistence of the two range and two volatility factors (and decay of the trend factor) could be driven by period-specific dynamics rather than robust family properties.
  2. Framework ablations: No ablation results are provided for the dual-channel RAG module or the family-aware penalties and similarity penalties. These components are central to the claims of safe, diverse discovery, yet their incremental contribution to the reported top-set composition and OOS performance cannot be assessed.
  3. Results reporting and reproducibility: The abstract and results state positive OOS IC for four of five factors but supply no error bars, exact universe construction details (liquidity filters, delisting rules, or stock selection criteria), or full multi-metric scoring weights. This weakens the reproducibility claims and leaves selection-bias concerns from post-hoc family diagnostics unaddressed.
minor comments (2)
  1. The operator language and AST execution details would benefit from one or two concrete formula examples to illustrate valid trees and sandbox constraints.
  2. Clarify in the methods how HAC standard errors are computed for the Pearson IC values, as this is referenced in the abstract but not fully detailed.
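On the HAC question raised in the second minor comment, one common construction is a Newey-West estimator with Bartlett weights applied to the time series of daily IC values. The sketch below assumes that construction; the paper cites Newey and West, but the estimator variant and lag length actually used are not stated on this page, so `newey_west_tstat` and its `lags` argument are illustrative.

```python
from math import sqrt


def newey_west_tstat(series, lags):
    """t-statistic for the mean of `series` using a Newey-West (HAC) variance.

    Long-run variance = gamma_0 + 2 * sum_{l=1..lags} w_l * gamma_l,
    with Bartlett weights w_l = 1 - l / (lags + 1).
    Raises ZeroDivisionError if the series is constant.
    """
    n = len(series)
    mu = sum(series) / n
    dev = [x - mu for x in series]

    def autocov(lag):
        # Biased (divide-by-n) sample autocovariance at the given lag.
        return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n

    long_run_var = autocov(0)
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1.0)
        long_run_var += 2.0 * weight * autocov(lag)
    return mu / sqrt(long_run_var / n)
```

With `lags=0` this reduces to the ordinary t-statistic with a divide-by-n variance. When daily ICs are positively autocorrelated, adding lags inflates the variance estimate and lowers the t-statistic, which is why HAC significance is a stricter bar than a naive test.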

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each of the major comments below and will incorporate revisions to strengthen the manuscript's validation, ablation analysis, and reproducibility.

Point-by-point responses
  1. Referee: Out-of-sample validation: The held-out evaluation uses only a single ~9-month interval (2025-06-01 to 2026-03-13). No rolling-window tests, additional disjoint periods, or regime-stratified breakdowns are described, so the persistence of the two range and two volatility factors (and decay of the trend factor) could be driven by period-specific dynamics rather than robust family properties.

    Authors: We agree that relying on a single hold-out period limits the strength of the robustness claims. In the revised version, we will add rolling-window out-of-sample tests using multiple disjoint periods from the available data and include regime-stratified breakdowns (e.g., by market volatility or trend regimes) to better assess whether the range and volatility families exhibit consistent performance across conditions. This will be presented in an expanded results section. revision: yes

  2. Referee: Framework ablations: No ablation results are provided for the dual-channel RAG module or the family-aware penalties and similarity penalties. These components are central to the claims of safe, diverse discovery, yet their incremental contribution to the reported top-set composition and OOS performance cannot be assessed.

    Authors: We concur that ablation experiments are necessary to quantify the impact of these key components. We will add a dedicated ablation study in the revised manuscript, comparing the full framework against variants without dual-channel RAG and without the family-aware and similarity penalties. Metrics will include diversity (e.g., formula similarity), validity rate, and OOS IC performance to demonstrate their contributions. revision: yes

  3. Referee: Results reporting and reproducibility: The abstract and results state positive OOS IC for four of five factors but supply no error bars, exact universe construction details (liquidity filters, delisting rules, or stock selection criteria), or full multi-metric scoring weights. This weakens the reproducibility claims and leaves selection-bias concerns from post-hoc family diagnostics unaddressed.

    Authors: We will revise the manuscript to include precise details on the equity universe construction, such as liquidity filters, delisting handling, and stock selection criteria. Error bars or confidence intervals will be added to the reported IC values using appropriate statistical methods (e.g., HAC standard errors). The full multi-metric scoring weights will be explicitly stated. To address potential selection bias, we will clarify the selection process and add checks showing that the top factors are not overly sensitive to the post-hoc family analysis. revision: yes
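The disjoint-window check promised in the first response can be sketched very simply: split a factor's daily IC series into consecutive, non-overlapping blocks and look at how often the block mean is positive. This is a hypothetical illustration of the idea, not the authors' planned procedure; the function names and the choice to drop a trailing partial window are assumptions.

```python
def window_means(daily_ic, window):
    """Mean IC over consecutive, non-overlapping windows of `window` observations.

    A trailing partial window is dropped so every mean uses the same sample size.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    full = len(daily_ic) - len(daily_ic) % window
    return [
        sum(daily_ic[i:i + window]) / window
        for i in range(0, full, window)
    ]


def sign_consistency(means):
    """Fraction of windows whose mean IC is strictly positive (NaN if no windows)."""
    return sum(m > 0 for m in means) / len(means) if means else float("nan")
```

A factor whose holdout result rests on one strong stretch will show a high overall mean but low sign consistency, which is the period-specific pattern the referee's first major comment is concerned about.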

Circularity Check

0 steps flagged

No circularity; claims rest on held-out empirical evaluation

full rationale

The paper describes an empirical workflow: LLM-guided generation of operator trees, deterministic cross-sectional evaluation of 104 candidates, family-aware selection, and validation of fixed top-5 factors on a disjoint 2025-06-01 to 2026-03-13 holdout. No equations, definitions, or self-citations reduce the reported IC or long-short metrics to in-sample fits by construction. Performance numbers are computed directly on the held-out window rather than being tautological with the search inputs or fitted parameters. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework assumes equity data exhibits low signal-to-noise and that LLM outputs can be reliably constrained to valid operator trees; no new physical entities are postulated.

free parameters (1)
  • multi-metric scoring weights
    Standardized multi-metric scoring and formula-similarity penalties require chosen thresholds or weights that are not derived from first principles.
axioms (1)
  • domain assumption: Daily equity returns have a low signal-to-noise ratio, and searching the combinatorial formula space is unsafe without constraints
    Explicitly stated as core difficulties motivating the framework.
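To show why the scoring weights count as a free parameter, here is a minimal sketch of a standardized multi-metric score: each metric is z-scored across candidates and the results are combined with hand-chosen weights. The metric names, the weights, and the functions are all hypothetical; the paper's actual metrics and weights are not given on this page.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values across candidates (all zeros if there is no spread)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_scores(metrics, weights):
    """Weighted sum of z-scored metrics.

    metrics: {metric_name: [value for each candidate]}
    weights: {metric_name: weight}; a negative weight penalizes the metric.
    """
    n = len(next(iter(metrics.values())))
    total = [0.0] * n
    for name, weight in weights.items():
        for i, z in enumerate(zscores(metrics[name])):
            total[i] += weight * z
    return total
```

Changing the weights can reorder the candidates without any change in the underlying data, so a reproducibility claim depends on the weights being published alongside the results.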

pith-pipeline@v0.9.0 · 5623 in / 1284 out tokens · 38243 ms · 2026-05-15T15:21:24.967865+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. Richard C Grinold and Ronald N Kahn. Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk. McGraw-Hill, 2nd edition, 2000.
  2. Edward E Qian, Ronald H Hua, and Eric H Sorensen. Quantitative Equity Portfolio Management: Modern Techniques and Applications. Chapman and Hall/CRC, 2007.
  3. John R Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
  4. Zura Kakushadze. 101 formulaic alphas. Wilmott, 2016(84):72–81, 2016.
  5. Saizhuo Wang, Hang Yuan, Leon Zhou, Lionel M. Ni, Heung-Yeung Shum, and Jian Guo. Alpha-GPT: Human-AI interactive alpha mining for quantitative investment. arXiv preprint arXiv:2308.00016, 2023.
  6. Ziyi Tang, Zechuan Chen, Jiarui Yang, Jiayao Mai, Yongsen Zheng, Keze Wang, Jinrui Chen, and Liang Lin. AlphaAgent: LLM-driven alpha mining with regularized exploration to counteract alpha decay. arXiv preprint arXiv:2502.16789, 2025.
  7. Yanlong Wang, Jian Xu, Hongkang Zhang, Shao-Lun Huang, Danny Dongning Sun, and Xiao-Ping Zhang. FactorMiner: A self-evolving agent with skills and experience memory for financial alpha discovery. arXiv preprint arXiv:2602.14670, 2026.
  8. Xiao Yang, Weiqing Liu, Dong Zhou, Jiang Bian, and Tie-Yan Liu. Qlib: An AI-oriented quantitative investment platform. arXiv preprint arXiv:2009.11189, 2020.
  9. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  10. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  11. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, 2023.
  12. Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco J R Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625:468–475, 2024.
  13. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 2020.
  14. Whitney K. Newey and Kenneth D. West. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3):703–708, 1987.
  15. R. David McLean and Jeffrey Pontiff. Does academic research destroy stock return predictability? The Journal of Finance, 71(1):5–32, 2016.
  16. Kewei Hou, Chen Xue, and Lu Zhang. Replicating anomalies. The Review of Financial Studies, 33(5):2019–2133, 2020.
  17. Campbell R Harvey, Yan Liu, and Heqing Zhu. ... and the cross-section of expected returns. The Review of Financial Studies, 29(1):5–68, 2016.