Hubble: An LLM-Driven Agentic Framework for Safe, Diverse, and Reproducible Alpha Factor Discovery
Pith reviewed 2026-05-15 15:21 UTC · model grok-4.3
The pith
LLM agent with operator trees and feedback discovers range and volatility alpha factors that stay positive out of sample on U.S. equities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hubble restricts LLM generation to interpretable operator trees, runs every candidate through a deterministic cross-sectional evaluation pipeline, and returns both top formulas and structured family-level diagnostics to the next round via dual-channel positive/negative RAG and similarity penalties. On a U.S. equity universe of roughly 500 stocks, the main run evaluated 104 valid candidates across three rounds with zero crashes and produced a top set dominated by range, volatility, and trend families. When these top-5 factors were fixed and tested on the 2025-06-01 to 2026-03-13 holdout, the two range and two volatility factors stayed positive and several reached HAC-significant Pearson IC, whereas the weakest in-sample trend factor decayed materially.
What carries the argument
The iterative agentic loop that restricts generation to AST-executable operator trees, scores them with standardized multi-metric RankIC and Pearson IC, and feeds back family diagnostics plus positive/negative RAG examples to steer subsequent proposals.
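The two scoring metrics named above can be illustrated on a single cross-section. The following is a minimal sketch, assuming Pearson IC is the plain correlation between factor values and forward returns and RankIC is the same correlation computed on average ranks; the toy numbers and function names are hypothetical and not taken from the paper.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def rank_ic(factor, fwd_ret):
    """RankIC: Pearson correlation of the ranks (Spearman correlation)."""
    return pearson(ranks(factor), ranks(fwd_ret))

# Hypothetical one-day cross-section of five stocks.
factor = [0.1, 0.4, 0.2, 0.9, 0.5]
fwd_ret = [0.00, 0.01, 0.005, 0.10, 0.02]

pearson_ic = pearson(factor, fwd_ret)
spearman_ic = rank_ic(factor, fwd_ret)
```

In this toy example the ordering of the factor matches the ordering of the returns exactly, so RankIC is 1 while Pearson IC is lower because one return is an outlier. That gap is one reason reporting both metrics can be informative.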
If this is right
- Range and volatility family factors remain positive in the held-out period while trend factors decay.
- Several discovered factors achieve HAC-significant Pearson IC, with supporting long-short evidence, on the 2025-2026 window.
- The search systematically avoids crowded volume-only motifs in favor of range, volatility, and trend families.
- Persistent diagnostics artifacts allow post-hoc inspection of why each factor was retained or rejected.
- The same constrained pipeline can be rerun with different seeds or universes while preserving reproducibility.
Where Pith is reading between the lines
- The operator-language plus sandbox approach could be ported to other asset classes once the primitive set is extended.
- Repeated runs with fresh holdouts would clarify whether the observed OOS performance persists across regimes.
- The family-penalty mechanism offers a concrete way to measure and control diversity that could be adopted in non-LLM factor searches.
- If the diagnostics artifacts are made public they would let independent researchers audit the selection path for hidden overfitting.
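One way a similarity penalty could be made measurable is sketched below. It is an illustration only: the token-level Jaccard measure, the `weight` parameter, and all function names are assumptions, since the summary does not say how Hubble computes formula similarity or family penalties.

```python
import re

def tokens(formula):
    """Set of identifiers and numeric literals appearing in a formula string."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?", formula))

def jaccard(a, b):
    """Jaccard similarity of two formulas' token sets (0 = disjoint, 1 = identical sets)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def penalized_score(raw_score, formula, accepted, weight=0.5):
    """Subtract a penalty proportional to the closest match among accepted formulas."""
    max_sim = max((jaccard(formula, f) for f in accepted), default=0.0)
    return raw_score - weight * max_sim

# Hypothetical formulas in a made-up operator syntax.
kept = "div(sub(high, low), close)"
candidate = "div(sub(high, low), open)"

similarity = jaccard(kept, candidate)
score = penalized_score(1.0, candidate, [kept])
```

A token-set measure ignores tree structure, so a production system would more plausibly compare the operator trees themselves; the sketch only shows how a penalty term turns diversity into a number that can be logged and audited.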
Load-bearing premise
That positive performance on the chosen 2025-2026 holdout reflects genuine generalization rather than residual data snooping or period-specific market regimes.
What would settle it
Re-testing the same top-5 factors on any later out-of-sample window after March 2026 and finding that their Pearson IC or long-short returns turn negative or lose statistical significance would falsify the generalization claim.
Original abstract
Automated alpha discovery is difficult because the search space of formulaic factors is combinatorial, the signal-to-noise ratio in daily equity data is low, and unconstrained program generation is operationally unsafe. We present Hubble, an agentic factor mining framework that combines large language models (LLMs) with a domain-specific operator language, an abstract syntax tree (AST) execution sandbox, a dual-channel retrieval-augmented generation (RAG) module, and a family-aware selection mechanism. Instead of treating the LLM as an unconstrained code generator, Hubble restricts generation to interpretable operator trees, evaluates every candidate through a deterministic cross-sectional pipeline, and feeds back both top formulas and structured family-level diagnostics to subsequent rounds. The current system additionally introduces positive/negative RAG, formula-similarity penalties, standardized multi-metric scoring, dual reporting of RankIC and Pearson IC, and persistent diagnostics artifacts for post-hoc research analysis. On a U.S. equity universe of roughly 500 stocks, our main run evaluates 104 valid candidates across three rounds with zero runtime crashes and discovers a top set dominated by range, volatility, and trend families rather than crowded volume-only motifs. We then fix the resulting top-5 factors and validate them on a held-out period from 2025-06-01 to 2026-03-13. In this out-of-sample window, the two range factors and two volatility factors remain positive and several achieve HAC-significant Pearson IC and long-short evidence, whereas the weakest in-sample trend factor decays materially. These results suggest that safe LLM-guided search can be upgraded from a syntax-compliant generator into a reproducible alpha-research workflow that jointly optimizes validity, diversity, interpretability, and family-level generalization.
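The idea of restricting generation to operator trees and executing them in an AST sandbox can be sketched with Python's standard `ast` module. This is a minimal illustration under stated assumptions: the operator names, their arities, the field names, and the `evaluate` helper are hypothetical, because the abstract does not list Hubble's actual primitive set.

```python
import ast

# Hypothetical whitelist: operator name -> (arity, implementation).
OPERATORS = {
    "add": (2, lambda a, b: a + b),
    "sub": (2, lambda a, b: a - b),
    "div": (2, lambda a, b: a / b if b != 0 else float("nan")),
    "neg": (1, lambda a: -a),
}
# Hypothetical input fields available to a formula.
FIELDS = {"open", "high", "low", "close", "volume"}

def evaluate(formula, row):
    """Evaluate a whitelisted operator tree on one row of field values."""
    tree = ast.parse(formula, mode="eval")

    def walk(node):
        if isinstance(node, ast.Call):
            # Only plain calls to whitelisted operator names are accepted.
            if not isinstance(node.func, ast.Name) or node.func.id not in OPERATORS:
                raise ValueError("operator not allowed")
            arity, fn = OPERATORS[node.func.id]
            if node.keywords or len(node.args) != arity:
                raise ValueError("wrong arguments for " + node.func.id)
            return fn(*[walk(arg) for arg in node.args])
        if isinstance(node, ast.Name):
            if node.id not in FIELDS:
                raise ValueError("unknown field: " + node.id)
            return row[node.id]
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return float(node.value)
        # Attributes, subscripts, lambdas, comprehensions, etc. are all rejected.
        raise ValueError("node type not allowed: " + type(node).__name__)

    return walk(tree.body)

row = {"open": 9.5, "high": 11.0, "low": 9.0, "close": 10.0, "volume": 1e6}
value = evaluate("div(sub(high, low), close)", row)  # a range-style factor
```

Because the walker never calls `eval` or `exec` and rejects every node type it does not recognize, an expression such as `__import__('os').getcwd()` raises `ValueError` instead of running.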
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Hubble, an LLM-driven agentic framework for alpha factor discovery that restricts generation to interpretable operator trees executed in an AST sandbox, incorporates dual-channel positive/negative RAG, formula-similarity penalties, family-aware selection, and multi-metric scoring. On a U.S. equity universe of roughly 500 stocks, the system evaluates 104 valid candidates across three rounds with zero runtime crashes, identifies a top set dominated by range, volatility, and trend families, and validates the top-5 factors on a held-out period (2025-06-01 to 2026-03-13), where four remain positive with some achieving HAC-significant Pearson IC and long-short evidence while the trend factor decays.
Significance. If the central claims hold under stronger validation, Hubble would advance automated alpha discovery by demonstrating a reproducible, safe workflow that jointly optimizes validity, diversity, and family-level generalization rather than unconstrained code generation. The deterministic cross-sectional pipeline, persistent diagnostics artifacts, and dual RankIC/Pearson IC reporting address practical operational issues in quant research. The reported zero-crash execution of 104 candidates and shift away from volume-only motifs toward range/volatility families are concrete strengths, though the single short holdout limits immediate impact.
major comments (3)
- Out-of-sample validation: The held-out evaluation uses only a single ~9-month interval (2025-06-01 to 2026-03-13). No rolling-window tests, additional disjoint periods, or regime-stratified breakdowns are described, so the persistence of the two range and two volatility factors (and decay of the trend factor) could be driven by period-specific dynamics rather than robust family properties.
- Framework ablations: No ablation results are provided for the dual-channel RAG module or the family-aware penalties and similarity penalties. These components are central to the claims of safe, diverse discovery, yet their incremental contribution to the reported top-set composition and OOS performance cannot be assessed.
- Results reporting and reproducibility: The abstract and results state positive OOS IC for four of five factors but supply no error bars, exact universe construction details (liquidity filters, delisting rules, or stock selection criteria), or full multi-metric scoring weights. This weakens the reproducibility claims and leaves selection-bias concerns from post-hoc family diagnostics unaddressed.
minor comments (2)
- The operator language and AST execution details would benefit from one or two concrete formula examples to illustrate valid trees and sandbox constraints.
- Clarify in the methods how HAC standard errors are computed for the Pearson IC values, as this is referenced in the abstract but not fully detailed.
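For the HAC question in the last comment, one common choice is the Newey-West estimator with a Bartlett kernel applied to the daily IC series. The sketch below assumes that estimator; the lag length, the sample values, and the function name are hypothetical, and the paper's own procedure may differ.

```python
import math

def hac_tstat(series, lags):
    """Mean, Newey-West standard error, and t-statistic of a time series.

    Long-run variance = gamma_0 + 2 * sum_{k=1..lags} (1 - k/(lags+1)) * gamma_k,
    where gamma_k is the lag-k autocovariance of the demeaned series.
    """
    T = len(series)
    m = sum(series) / T
    e = [x - m for x in series]

    def gamma(k):
        return sum(e[t] * e[t - k] for t in range(k, T)) / T

    lrv = gamma(0)
    for k in range(1, lags + 1):
        lrv += 2.0 * (1.0 - k / (lags + 1)) * gamma(k)

    se = math.sqrt(lrv / T)
    if se == 0.0:
        raise ValueError("zero standard error; t-statistic undefined")
    return m, se, m / se

# Hypothetical daily IC values, for illustration only.
daily_ic = [0.02, 0.03, 0.01, 0.04, 0.00, 0.02]

mean0, se0, t0 = hac_tstat(daily_ic, lags=0)  # no autocorrelation correction
mean1, se1, t1 = hac_tstat(daily_ic, lags=1)  # one-lag Bartlett correction
```

With `lags=0` the result reduces to the usual standard error of the mean. In this toy series consecutive values are negatively autocorrelated, so the one-lag correction shrinks the standard error; with positively autocorrelated ICs it would widen it.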
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each of the major comments below and will incorporate revisions to strengthen the manuscript's validation, ablation analysis, and reproducibility.
- Referee: Out-of-sample validation: The held-out evaluation uses only a single ~9-month interval (2025-06-01 to 2026-03-13). No rolling-window tests, additional disjoint periods, or regime-stratified breakdowns are described, so the persistence of the two range and two volatility factors (and decay of the trend factor) could be driven by period-specific dynamics rather than robust family properties.
  Authors: We agree that relying on a single hold-out period limits the strength of the robustness claims. In the revised version, we will add rolling-window out-of-sample tests using multiple disjoint periods from the available data and include regime-stratified breakdowns (e.g., by market volatility or trend regimes) to better assess whether the range and volatility families exhibit consistent performance across conditions. This will be presented in an expanded results section.
  Revision: yes
- Referee: Framework ablations: No ablation results are provided for the dual-channel RAG module or the family-aware penalties and similarity penalties. These components are central to the claims of safe, diverse discovery, yet their incremental contribution to the reported top-set composition and OOS performance cannot be assessed.
  Authors: We concur that ablation experiments are necessary to quantify the impact of these key components. We will add a dedicated ablation study in the revised manuscript, comparing the full framework against variants without dual-channel RAG and without the family-aware and similarity penalties. Metrics will include diversity (e.g., formula similarity), validity rate, and OOS IC performance to demonstrate their contributions.
  Revision: yes
- Referee: Results reporting and reproducibility: The abstract and results state positive OOS IC for four of five factors but supply no error bars, exact universe construction details (liquidity filters, delisting rules, or stock selection criteria), or full multi-metric scoring weights. This weakens the reproducibility claims and leaves selection-bias concerns from post-hoc family diagnostics unaddressed.
  Authors: We will revise the manuscript to include precise details on the equity universe construction, such as liquidity filters, delisting handling, and stock selection criteria. Error bars or confidence intervals will be added to the reported IC values using appropriate statistical methods (e.g., HAC standard errors). The full multi-metric scoring weights will be explicitly stated. To address potential selection bias, we will clarify the selection process and add checks showing that the top factors are not overly sensitive to the post-hoc family analysis.
  Revision: yes
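The multi-period validation promised in the first response could start from a helper like the one below, which cuts a date range into contiguous, non-overlapping windows and averages a daily IC series within each. It is a sketch only: the function names, the choice of three windows, and the constant placeholder IC values are assumptions, and the date range simply reuses the holdout interval quoted above.

```python
from datetime import date, timedelta

def disjoint_windows(start, end, n_windows):
    """Split [start, end] (inclusive) into contiguous, non-overlapping date windows."""
    total_days = (end - start).days + 1
    if n_windows < 1 or n_windows > total_days:
        raise ValueError("n_windows must be between 1 and the number of days")
    base, extra = divmod(total_days, n_windows)
    windows, cursor = [], start
    for i in range(n_windows):
        length = base + (1 if i < extra else 0)  # spread the remainder over early windows
        last = cursor + timedelta(days=length - 1)
        windows.append((cursor, last))
        cursor = last + timedelta(days=1)
    return windows

def mean_ic_by_window(daily_ic, windows):
    """Average the daily IC inside each window; None when a window has no observations."""
    means = []
    for first, last in windows:
        vals = [ic for day, ic in daily_ic.items() if first <= day <= last]
        means.append(sum(vals) / len(vals) if vals else None)
    return means

start, end = date(2025, 6, 1), date(2026, 3, 13)
windows = disjoint_windows(start, end, 3)

# Placeholder series: a constant IC on every calendar day, purely for illustration.
daily_ic = {start + timedelta(days=d): 0.01 for d in range((end - start).days + 1)}
window_means = mean_ic_by_window(daily_ic, windows)
```

Comparing the per-window means (and their signs) is a simple first check on whether a factor's holdout performance is concentrated in one sub-period. Regime-stratified breakdowns would group days by a regime label instead of by calendar position.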
Circularity Check
No circularity; claims rest on held-out empirical evaluation
The paper describes an empirical workflow: LLM-guided generation of operator trees, deterministic cross-sectional evaluation of 104 candidates, family-aware selection, and validation of fixed top-5 factors on a disjoint 2025-06-01 to 2026-03-13 holdout. No equations, definitions, or self-citations reduce the reported IC or long-short metrics to in-sample fits by construction. Performance numbers are computed directly on the held-out window rather than being tautological with the search inputs or fitted parameters. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-metric scoring weights
axioms (1)
- domain assumption: Daily equity data has a low signal-to-noise ratio, and unconstrained program generation over the combinatorial formula space is operationally unsafe.