AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading
Pith reviewed 2026-05-08 11:54 UTC · model grok-4.3
The pith
A three-agent system automates factor discovery, regime detection, and risk-constrained trading to produce consistent outperformance in quantitative strategies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AlphaCrafter introduces a full-stack multi-agent framework consisting of a Miner agent that expands the factor pool through LLM-guided search, a Screener agent that evaluates current market regimes to select appropriate factor ensembles, and a Trader agent that implements these into strategies with explicit risk constraints. This closed-loop system allows the entire pipeline to adapt holistically to evolving market dynamics, resulting in superior risk-adjusted returns and minimal variance across multiple trials on CSI 300 and S&P 500 datasets compared to existing baselines.
What carries the argument
The closed-loop coordination among the Miner, Screener, and Trader agents that unifies factor discovery with regime-adaptive selection and risk-managed execution.
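A minimal sketch of how such a closed loop could be wired is given here. Every name in it (`LoopState`, `run_cycle`, the toy agents) is a hypothetical illustration and not the paper's API; the agents are plain callables standing in for the LLM-driven Miner, Screener, and Trader.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical stand-ins, not taken from the paper.
Factor = str
Ensemble = Dict[Factor, float]  # factor name -> ensemble weight


@dataclass
class LoopState:
    factor_pool: List[Factor] = field(default_factory=list)
    history: List[Ensemble] = field(default_factory=list)


def run_cycle(
    state: LoopState,
    market_window: List[float],
    miner: Callable[[List[Factor], List[float]], List[Factor]],
    screener: Callable[[List[Factor], List[float]], Ensemble],
    trader: Callable[[Ensemble], Dict[str, float]],
) -> Dict[str, float]:
    """Run one mine -> screen -> trade pass and return target positions."""
    # Miner proposes factors; only unseen ones are added to the pool.
    for candidate in miner(state.factor_pool, market_window):
        if candidate not in state.factor_pool:
            state.factor_pool.append(candidate)
    # Screener builds a weighted ensemble for the prevailing conditions.
    ensemble = screener(state.factor_pool, market_window)
    state.history.append(ensemble)
    # Trader acts only when an ensemble was actually produced.
    return trader(ensemble) if ensemble else {}


# Toy agents, assumed purely for demonstration.
def toy_miner(pool: List[Factor], window: List[float]) -> List[Factor]:
    return [f"mom_{len(window)}"]


def toy_screener(pool: List[Factor], window: List[float]) -> Ensemble:
    return {name: 1.0 / len(pool) for name in pool}


def toy_trader(ensemble: Ensemble) -> Dict[str, float]:
    return {"AAA": sum(ensemble.values())}
```

Calling `run_cycle` repeatedly with a fresh `market_window` is what makes the loop adaptive in this sketch: the factor pool persists in `state` while the ensemble is rebuilt on every cycle.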
If this is right
- Quantitative strategies can be generated and updated continuously without manual intervention as market regimes shift.
- Performance exhibits lower variance across different runs, indicating greater reliability.
- The integrated design yields better risk-adjusted returns than methods that treat components in isolation.
Where Pith is reading between the lines
- Similar multi-agent structures might apply to other dynamic optimization problems like supply chain management or energy trading.
- The reduction in behavioral noise gained by avoiding role-playing agents could generalize to other decision-making AI systems.
- Testing on additional asset classes or higher frequency data could reveal the limits of the current agent specialization.
Load-bearing premise
The assumption that LLM-guided factor searches combined with agent-based regime detection will generate profitable strategies that generalize beyond the specific market conditions observed in the CSI 300 and S&P 500 experiments.
What would settle it
A decline in risk-adjusted returns or an increase in performance variance when the framework is applied to data from a subsequent time period not used in the original experiments.
Original abstract
Financial markets are inherently non-stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor discovery, regime-adaptive selection, and risk-constrained execution. Prevailing approaches, however, optimize these components under static or isolated assumptions. Factor mining frameworks typically treat alpha discovery as a one-time search process, implicitly assuming that factor efficacy persists across market regimes. Execution-oriented systems often adopt role-playing agent architectures that simulate anthropomorphic trading committees, introducing behavioral noise rather than systematic rationality. Consequently, a fully automated, rationality-driven framework unifying a coherent quantitative pipeline remains absent. We introduce AlphaCrafter, a full-stack multi-agent framework that closes this gap through a continuously adaptive factor-to-execution pipeline, designed to track and respond to evolving market conditions without manual intervention. AlphaCrafter operates via three specialized agents: a Miner that continuously expands the factor pool via LLM-guided search, a Screener that assesses prevailing market conditions to construct regime-conditioned factor ensembles, and a Trader that translates these ensembles into quantitative strategies under explicit risk constraints. Together, these three agents form a closed-loop cross-sectional trading system that adapts holistically to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 demonstrate that AlphaCrafter consistently outperforms state-of-the-art baselines in risk-adjusted returns while exhibiting the lowest cross-trial variance, confirming that integrated and adaptive factor-to-execution design yields robust trading performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. AlphaCrafter is introduced as a full-stack multi-agent framework for cross-sectional quantitative trading, comprising a Miner agent that uses LLM-guided search to expand the factor pool, a Screener agent that assesses market conditions to build regime-conditioned factor ensembles, and a Trader agent that executes strategies under risk constraints. The framework forms a closed-loop system claimed to adapt to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 are reported to show consistent outperformance over state-of-the-art baselines in risk-adjusted returns with the lowest cross-trial variance.
Significance. If the results are substantiated with proper controls for data leakage and statistical validation, the work could significantly contribute to the development of adaptive, automated trading systems by demonstrating the value of integrated multi-agent designs in handling non-stationary financial markets. The emphasis on a rationality-driven rather than anthropomorphic agent architecture is a positive distinction from prior work. However, the current presentation leaves the central empirical claims difficult to assess independently.
major comments (3)
- §5 (Experiments): The reported outperformance and low variance lack supporting details such as exact performance metrics (e.g., Sharpe ratios, maximum drawdown), number of trials, data split methodology (train/validation/test periods), and any hypothesis testing. This makes it impossible to evaluate whether the gains are statistically significant or merely due to favorable period selection, directly impacting the validity of the main claim.
- §3.2 (Screener): The mechanism for regime identification and ensemble construction is described at a high level without specifying the temporal scope of data used for regime detection. To support the claim of genuine adaptation to evolving conditions, it must be demonstrated that no future information is used; otherwise, the results on CSI 300 and S&P 500 may reflect overfitting rather than robust performance.
- §3.1 (Miner): The LLM-guided factor search process does not detail how factors are evaluated and selected in an online manner. If factor scores or selections incorporate information from the full evaluation period, the closed-loop adaptation would be circular, contradicting the assertion of continuous response to market dynamics without manual intervention.
minor comments (2)
- Abstract: Consider adding one sentence on the specific risk constraints employed by the Trader agent to better contextualize the framework's safety features.
- The manuscript would benefit from a table summarizing the agent roles, inputs, and outputs for quick reference.
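The quantities requested in the first major comment can be sketched with the standard library alone. This is an illustrative sketch, not the paper's evaluation code: the function names are hypothetical, the 252-day annualization factor is an assumption for daily data, and the paired test returns only the t statistic (a p-value would need a t-distribution CDF from a statistics package).

```python
import math
import statistics
from typing import Sequence

TRADING_DAYS = 252  # assumed annualization factor for daily returns


def sharpe_ratio(daily_returns: Sequence[float], risk_free_daily: float = 0.0) -> float:
    """Annualized Sharpe ratio from daily simple returns."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = statistics.stdev(excess)  # sample standard deviation
    if sd == 0:
        return float("nan")
    return statistics.mean(excess) / sd * math.sqrt(TRADING_DAYS)


def max_drawdown(daily_returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded NAV, as a fraction."""
    nav, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        nav *= 1.0 + r
        peak = max(peak, nav)
        worst = max(worst, 1.0 - nav / peak)
    return worst


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of paired differences a[i] - b[i], e.g. per-trial Sharpe ratios."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

Cross-trial variance then follows from applying `statistics.stdev` to the per-trial Sharpe ratios of each method, and `paired_t_statistic` compares two methods evaluated on the same trials or seeds.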
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify important areas where greater specificity will strengthen the manuscript's clarity and allow independent verification of the claims. We address each major comment below and will incorporate the necessary revisions.
- Referee: §5 (Experiments): The reported outperformance and low variance lack supporting details such as exact performance metrics (e.g., Sharpe ratios, maximum drawdown), number of trials, data split methodology (train/validation/test periods), and any hypothesis testing. This makes it impossible to evaluate whether the gains are statistically significant or merely due to favorable period selection, directly impacting the validity of the main claim.
  Authors: We agree that the current presentation of results in §5 is insufficiently detailed for rigorous evaluation. In the revised manuscript we will add the exact numerical values for Sharpe ratios, maximum drawdowns, and other metrics for AlphaCrafter and all baselines on both CSI 300 and S&P 500. We will state that all figures are means over 10 independent trials with distinct random seeds, describe the chronological data splits (training through 2018, walk-forward validation 2019-2020, test 2021 onward), and report p-values from paired t-tests together with Diebold-Mariano tests to establish statistical significance. These additions will directly substantiate the claims of outperformance and low variance.
  Revision: yes
- Referee: §3.2 (Screener): The mechanism for regime identification and ensemble construction is described at a high level without specifying the temporal scope of data used for regime detection. To support the claim of genuine adaptation to evolving conditions, it must be demonstrated that no future information is used; otherwise, the results on CSI 300 and S&P 500 may reflect overfitting rather than robust performance.
  Authors: We acknowledge that the temporal constraints on regime detection must be made explicit. We will expand §3.2 to specify that regime identification operates on a rolling historical window (e.g., the preceding 252 trading days) using only data available at the rebalancing date. Regime labels and ensemble weights are computed causally with no access to future returns or market conditions. We will also add pseudocode and a short appendix verifying that the reported CSI 300 and S&P 500 results respect this forward-only information flow.
  Revision: yes
- Referee: §3.1 (Miner): The LLM-guided factor search process does not detail how factors are evaluated and selected in an online manner. If factor scores or selections incorporate information from the full evaluation period, the closed-loop adaptation would be circular, contradicting the assertion of continuous response to market dynamics without manual intervention.
  Authors: We agree that the online character of the Miner requires explicit description to preclude any appearance of circularity. The revised §3.1 will detail that candidate factors are scored exclusively on rolling historical windows up to the current decision point (e.g., information coefficient computed on the prior 60-252 days) and that selections are updated incrementally at each rebalancing. We will include a temporal information-flow diagram and confirm that all experimental results adhere to strict no-future-data protocols.
  Revision: yes
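The forward-only scoring described in the responses above can be illustrated with a trailing information coefficient. This is a sketch under stated assumptions, not the authors' procedure: `trailing_ic` and `pearson` are hypothetical names, the IC is taken as a plain Pearson correlation between factor values at date s and returns realized from s to s+1, and a pair is used only if its return was fully realized by the decision date t.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention for degenerate cross-sections
    return cov / math.sqrt(vx * vy)


def trailing_ic(
    factor: Sequence[Sequence[float]],
    next_returns: Sequence[Sequence[float]],
    t: int,
    window: int,
) -> float:
    """Mean cross-sectional IC over the last `window` dates known at date t.

    factor[s][i]       factor value of asset i observed at date s
    next_returns[s][i] return of asset i from date s to date s + 1
    """
    if t < 1 or window < 1:
        raise ValueError("need t >= 1 and window >= 1")
    last = t - 1  # the return from s to s + 1 is realized only if s + 1 <= t
    first = max(0, last - window + 1)
    ics = [pearson(factor[s], next_returns[s]) for s in range(first, last + 1)]
    return sum(ics) / len(ics)
```

Because the loop never indexes `next_returns[t]` or later, changing any data after the decision date leaves the score unchanged, which is the property a no-future-data audit would check. The same trailing-window discipline would apply to a regime label computed from the preceding 252 trading days.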
Circularity Check
No derivation chain present; empirical framework claims lack mathematical reductions
full rationale
The paper describes a multi-agent system (Miner, Screener, Trader) for factor discovery and regime-adaptive trading but advances no equations, derivations, or first-principles results. Claims of outperformance rest solely on reported experiments on CSI 300 and S&P 500 rather than any closed-form prediction or fitted quantity renamed as a result. No self-citations, ansatzes, or uniqueness theorems are invoked to justify core components, and no step reduces an output to its inputs by construction. The analysis therefore detects no circularity.
Reference graph
Works this paper leans on
sort U by ϕ_{i,t} descending
I_long ← first N_long assets, I_short ← last N_short assets
for each i ∈ {i | h_{i,t−1} ≠ 0} do    // liquidate positions no longer in the list
    if i ∉ I_long ∪ I_short then
        submit_order(i, −h_{i,t−1})
    end if
end for
V_long ← β · NAV_t · (1 + γ) / 2
V_short ← β · NAV_t · (1 − γ) / 2
for each i ∈ I_long do
    h^target_{i,t} ← V_long / (N_long · P_{i,t})
end for
for each i ∈ I_short do
    h^target_{i,t} ← −V_short / (N_short · P_{i,t})
end for
for each i ∈ I_long ∪ I_short do
    submit_order(i, h^target_{i,t} − h_{i,t−1})
end for

B Experimental Details
B.1 Dataset Details
B.1.1 Data Sources
The daily OHLCV data for CSI 300 constituents is collected from Baostock (BaoStock, 2026). For the S&P 500 constituents, daily price-volume data is obtained from Yahoo Finance (Aroussi, 2026). Fundamental indicators (PE, PS, PB, DYR), financial statem...
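The rebalancing pseudocode excerpted above can be rendered in Python roughly as follows. This is a sketch, not the paper's implementation: `rebalance_orders` is a hypothetical name, orders are returned as a list instead of being sent through `submit_order`, and the readings of β as the fraction of NAV deployed and γ as the net long tilt are assumptions, since the excerpt does not define them.

```python
from typing import Dict, List, Tuple


def rebalance_orders(
    scores: Dict[str, float],    # phi_{i,t}: composite factor score per asset
    prices: Dict[str, float],    # P_{i,t}
    holdings: Dict[str, float],  # h_{i,t-1}: shares currently held
    nav: float,                  # NAV_t
    n_long: int,
    n_short: int,
    beta: float,                 # assumed: fraction of NAV deployed
    gamma: float,                # assumed: net long tilt in [-1, 1]
) -> List[Tuple[str, float]]:
    """Return (asset, share delta) orders for a long-short rebalance."""
    if n_long < 1 or n_short < 1 or n_long + n_short > len(scores):
        raise ValueError("long and short lists must be non-empty and disjoint")
    ranked = sorted(scores, key=scores.get, reverse=True)
    long_ids, short_ids = ranked[:n_long], ranked[-n_short:]
    selected = set(long_ids) | set(short_ids)

    orders: List[Tuple[str, float]] = []
    # Liquidate positions that are no longer in either list.
    for asset, qty in holdings.items():
        if qty != 0 and asset not in selected:
            orders.append((asset, -qty))

    # Split the deployed capital between the long and short books.
    v_long = beta * nav * (1 + gamma) / 2
    v_short = beta * nav * (1 - gamma) / 2

    # Equal-dollar targets within each book, expressed in shares.
    targets: Dict[str, float] = {}
    for asset in long_ids:
        targets[asset] = v_long / (n_long * prices[asset])
    for asset in short_ids:
        targets[asset] = -v_short / (n_short * prices[asset])

    # Trade the difference between target and current holdings.
    for asset, target in targets.items():
        orders.append((asset, target - holdings.get(asset, 0.0)))
    return orders
```

With γ = 0 the two books are dollar-neutral; a positive γ shifts capital toward the long side while total gross exposure stays at β · NAV_t.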
- If no factor ensemble is received from Screener Agent in the current cycle, you should skip this round with a skipping message (i.e., do not invoke any tool calls, just output the skipping message as your final response). Once you receive a factor ensemble, you should write your strategy in the strategy.py file. Never write a strategy that is too complex
- You should always use backtesting tool for validation, but do not rely on backtest results. Overfitting to backtest results will lead to poor live performance. But for badly performing strategy in backtesting, you should update the strategy immediately
- Call the step tool only once per trading cycle. Do not call it multiple times within the same cycle
- If no orders are executed during backtesting or live trading, you must systematically relax the strategy’s constraints until trades are generated. After each relaxation step, re-run the backtest to verify that trades are now being executed
- When encountering bugs (e.g., version issues, nonexistent methods), attempt to use alternative equivalent approaches rather than stubbornly persisting with the problematic method.