AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading

Jiaheng Liu; Jiaqi Wang; Jiayi Sheng; Sirui Zeng; Yishuo Yuan

arxiv: 2605.05580 · v1 · submitted 2026-05-07 · 💻 cs.AI

AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading

Yishuo Yuan , Jiayi Sheng , Sirui Zeng , Jiaqi Wang , Jiaheng Liu This is my paper

Pith reviewed 2026-05-08 11:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent frameworkquantitative tradingfactor discoverymarket regimesrisk constraintsadaptive strategiescross-sectional tradingLLM-guided search

0 comments

The pith

A three-agent system automates factor discovery, regime detection, and risk-constrained trading to produce consistent outperformance in quantitative strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that integrating factor discovery, regime assessment, and execution into one continuously running multi-agent system produces more reliable trading performance than approaches that handle these tasks separately. This matters because financial markets shift over time due to economic changes and other factors, causing static strategies to degrade. The authors argue that their design avoids the pitfalls of one-time factor mining or anthropomorphic agent setups by using rational, specialized roles that respond to current conditions. A sympathetic reader would see this as a step toward fully automated quantitative trading that does not require ongoing human adjustments.

Core claim

AlphaCrafter introduces a full-stack multi-agent framework consisting of a Miner agent that expands the factor pool through LLM-guided search, a Screener agent that evaluates current market regimes to select appropriate factor ensembles, and a Trader agent that implements these into strategies with explicit risk constraints. This closed-loop system allows the entire pipeline to adapt holistically to evolving market dynamics, resulting in superior risk-adjusted returns and minimal variance across multiple trials on CSI 300 and S&P 500 datasets compared to existing baselines.

What carries the argument

The closed-loop coordination among the Miner, Screener, and Trader agents that unifies factor discovery with regime-adaptive selection and risk-managed execution.

If this is right

Quantitative strategies can be generated and updated continuously without manual intervention as market regimes shift.
Performance exhibits lower variance across different runs, indicating greater reliability.
The integrated design yields better risk-adjusted returns than methods that treat components in isolation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar multi-agent structures might apply to other dynamic optimization problems like supply chain management or energy trading.
The reduction in behavioral noise from role-playing could generalize to other decision-making AI systems.
Testing on additional asset classes or higher frequency data could reveal the limits of the current agent specialization.

Load-bearing premise

The assumption that LLM-guided factor searches combined with agent-based regime detection will generate profitable strategies that generalize beyond the specific market conditions observed in the CSI 300 and S&P 500 experiments.

What would settle it

A decline in risk-adjusted returns or an increase in performance variance when the framework is applied to data from a subsequent time period not used in the original experiments.

Figures

Figures reproduced from arXiv: 2605.05580 by Jiaheng Liu, Jiaqi Wang, Jiayi Sheng, Sirui Zeng, Yishuo Yuan.

**Figure 1.** Figure 1: The architecture of AlphaCrafter: The Miner expands alpha diversity, the Screener enforces view at source ↗

**Figure 2.** Figure 2: Performance distributions of agent methods across independent trials on backtesting. view at source ↗

**Figure 3.** Figure 3: Backtesting performance comparison of AlphaCrafter instantiated with different backbone view at source ↗

**Figure 4.** Figure 4: IC comparison of different methods across time periods on CSI 300 and S&P 500 markets. view at source ↗

**Figure 5.** Figure 5: Semantic diversity and novelty metrics across three LLM backbones. view at source ↗

**Figure 6.** Figure 6: Regime coherence heatmaps for CSI 300 market. view at source ↗

**Figure 7.** Figure 7: Regime coherence heatmaps for S&P 500 market. view at source ↗

**Figure 8.** Figure 8: Relationship between market volatility and net position exposure for a representative Claude view at source ↗

read the original abstract

Financial markets are inherently non-stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor discovery, regime-adaptive selection, and risk-constrained execution. Prevailing approaches, however, optimize these components under static or isolated assumptions. Factor mining frameworks typically treat alpha discovery as a one-time search process, implicitly assuming that factor efficacy persists across market regimes. Execution-oriented systems often adopt role-playing agent architectures that simulate anthropomorphic trading committees, introducing behavioral noise rather than systematic rationality. Consequently, a fully automated, rationality-driven framework unifying a coherent quantitative pipeline remains absent. We introduce AlphaCrafter, a full-stack multi-agent framework that closes this gap through a continuously adaptive factor-to-execution pipeline, designed to track and respond to evolving market conditions without manual intervention. AlphaCrafter operates via three specialized agents: a Miner that continuously expands the factor pool via LLM-guided search, a Screener that assesses prevailing market conditions to construct regime-conditioned factor ensembles, and a Trader that translates these ensembles into quantitative strategies under explicit risk constraints. Together, these three agents form a closed-loop cross-sectional trading system that adapts holistically to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 demonstrate that AlphaCrafter consistently outperforms state-of-the-art baselines in risk-adjusted returns while exhibiting the lowest cross-trial variance, confirming that integrated and adaptive factor-to-execution design yields robust trading performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AlphaCrafter puts forward a Miner-Screener-Trader loop for continuous factor expansion and regime adaptation in quant trading, but the outperformance claims rest on unevidenced assertions that invite leakage concerns.

read the letter

The main thing to know is that this paper describes a three-agent system meant to run factor discovery, regime detection, and execution as one closed loop that keeps updating without manual resets. It reports stronger risk-adjusted returns and lower variance than baselines on CSI 300 and S&P 500 data, framing the setup as a fix for non-stationary markets where static factor pools or role-playing agents fall short. The specific Miner-Screener-Trader split with LLM-guided factor growth and regime-conditioned ensembles is the clearest new element here, as it tries to make the whole pipeline adaptive rather than treating pieces in isolation. The paper does a straightforward job spelling out why one-time factor mining or anthropomorphic agent committees add noise instead of systematic response to regime shifts. That framing is useful for anyone thinking about automation in trading systems. The soft spots sit in the results and validation. The abstract states the gains and low cross-trial variance, yet supplies no equations, explicit data splits, or statistical tests. The stress-test concern about regime identification or factor scoring possibly drawing on non-causal or full-sample information is not ruled out in the high-level description, so the adaptation could still reflect in-sample fitting rather than genuine online response. Without those details the central empirical claim stays hard to assess. This is for readers already working on LLM or agent tools inside quantitative finance who want a concrete architecture to build from or critique. It deserves a serious referee because the framework is specific enough to evaluate and the domain rewards practical checks on robustness. I would send it for peer review, with the main questions focused on temporal causality, parameter fixing, and reproducibility of the reported variance reduction.

Referee Report

3 major / 2 minor

Summary. AlphaCrafter is introduced as a full-stack multi-agent framework for cross-sectional quantitative trading, comprising a Miner agent that uses LLM-guided search to expand the factor pool, a Screener agent that assesses market conditions to build regime-conditioned factor ensembles, and a Trader agent that executes strategies under risk constraints. The framework forms a closed-loop system claimed to adapt to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 are reported to show consistent outperformance over state-of-the-art baselines in risk-adjusted returns with the lowest cross-trial variance.

Significance. If the results are substantiated with proper controls for data leakage and statistical validation, the work could significantly contribute to the development of adaptive, automated trading systems by demonstrating the value of integrated multi-agent designs in handling non-stationary financial markets. The emphasis on a rationality-driven rather than anthropomorphic agent architecture is a positive distinction from prior work. However, the current presentation leaves the central empirical claims difficult to assess independently.

major comments (3)

§5 (Experiments): The reported outperformance and low variance lack supporting details such as exact performance metrics (e.g., Sharpe ratios, maximum drawdown), number of trials, data split methodology (train/validation/test periods), and any hypothesis testing. This makes it impossible to evaluate whether the gains are statistically significant or merely due to favorable period selection, directly impacting the validity of the main claim.
§3.2 (Screener): The mechanism for regime identification and ensemble construction is described at a high level without specifying the temporal scope of data used for regime detection. To support the claim of genuine adaptation to evolving conditions, it must be demonstrated that no future information is used; otherwise, the results on CSI 300 and S&P 500 may reflect overfitting rather than robust performance.
§3.1 (Miner): The LLM-guided factor search process does not detail how factors are evaluated and selected in an online manner. If factor scores or selections incorporate information from the full evaluation period, the closed-loop adaptation would be circular, contradicting the assertion of continuous response to market dynamics without manual intervention.

minor comments (2)

Abstract: Consider adding one sentence on the specific risk constraints employed by the Trader agent to better contextualize the framework's safety features.
The manuscript would benefit from a table summarizing the agent roles, inputs, and outputs for quick reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important areas where greater specificity will strengthen the manuscript's clarity and allow independent verification of the claims. We address each major comment below and will incorporate the necessary revisions.

read point-by-point responses

Referee: §5 (Experiments): The reported outperformance and low variance lack supporting details such as exact performance metrics (e.g., Sharpe ratios, maximum drawdown), number of trials, data split methodology (train/validation/test periods), and any hypothesis testing. This makes it impossible to evaluate whether the gains are statistically significant or merely due to favorable period selection, directly impacting the validity of the main claim.

Authors: We agree that the current presentation of results in §5 is insufficiently detailed for rigorous evaluation. In the revised manuscript we will add the exact numerical values for Sharpe ratios, maximum drawdowns, and other metrics for AlphaCrafter and all baselines on both CSI 300 and S&P 500. We will state that all figures are means over 10 independent trials with distinct random seeds, describe the chronological data splits (training through 2018, walk-forward validation 2019-2020, test 2021 onward), and report p-values from paired t-tests together with Diebold-Mariano tests to establish statistical significance. These additions will directly substantiate the claims of outperformance and low variance. revision: yes
Referee: §3.2 (Screener): The mechanism for regime identification and ensemble construction is described at a high level without specifying the temporal scope of data used for regime detection. To support the claim of genuine adaptation to evolving conditions, it must be demonstrated that no future information is used; otherwise, the results on CSI 300 and S&P 500 may reflect overfitting rather than robust performance.

Authors: We acknowledge that the temporal constraints on regime detection must be made explicit. We will expand §3.2 to specify that regime identification operates on a rolling historical window (e.g., the preceding 252 trading days) using only data available at the rebalancing date. Regime labels and ensemble weights are computed causally with no access to future returns or market conditions. We will also add pseudocode and a short appendix verifying that the reported CSI 300 and S&P 500 results respect this forward-only information flow. revision: yes
Referee: §3.1 (Miner): The LLM-guided factor search process does not detail how factors are evaluated and selected in an online manner. If factor scores or selections incorporate information from the full evaluation period, the closed-loop adaptation would be circular, contradicting the assertion of continuous response to market dynamics without manual intervention.

Authors: We agree that the online character of the Miner requires explicit description to preclude any appearance of circularity. The revised §3.1 will detail that candidate factors are scored exclusively on rolling historical windows up to the current decision point (e.g., information coefficient computed on the prior 60-252 days) and that selections are updated incrementally at each rebalancing. We will include a temporal information-flow diagram and confirm that all experimental results adhere to strict no-future-data protocols. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; empirical framework claims lack mathematical reductions

full rationale

The paper describes a multi-agent system (Miner, Screener, Trader) for factor discovery and regime-adaptive trading but advances no equations, derivations, or first-principles results. Claims of outperformance rest solely on reported experiments on CSI 300 and S&P 500 rather than any closed-form prediction or fitted quantity renamed as a result. No self-citations, ansatzes, or uniqueness theorems are invoked to justify core components, and no step reduces an output to its inputs by construction. The analysis therefore detects no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework implicitly assumes LLM search produces useful factors and that agent coordination remains stable without external validation.

pith-pipeline@v0.9.0 · 5583 in / 1191 out tokens · 59732 ms · 2026-05-08T11:54:00.853944+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

URLhttp://www.jstor.org/stable/2171879

ISSN 00129682, 14680262. URLhttp://www.jstor.org/stable/2171879. William A. Brock and Cars H. Hommes. Heterogeneous beliefs and routes to chaos in a simple asset pricing model.Journal of Economic Dynamics and Control, 22(8):1235–1274, 1998. ISSN 0165-1889. doi: https://doi.org/10.1016/S0165-1889(98)00011-6. URL https://www.sciencedirect.com/science/ artic...

work page doi:10.1016/s0165-1889(98)00011-6 1998
[2]

William F

URLhttps://arxiv.org/abs/2502.16789. William F. Sharpe. The sharpe ratio.The Journal of Portfolio Management, 21(1):49–58, 1994. doi: 10.3905/ jpm.1994.409501. Richard C. Grinold and Ronald N. Kahn.Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk. McGraw-Hill, New York, 2 edition, 1999. Gerald Appel....

work page doi:10.1145/3447548.3467358 1994
[3]

Google DeepMind

Accessed: 2026-04-27. Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, February 2026. Accessed: 2026-04-27. Sergey Isaenko. Transaction costs, frequent trading, and stock prices.Journal of Financial Markets, 64:100775, 2023. ISSN 1386-4181. doi: https://doi.org/10.1016/j.finmar.2022.100775. URL https...

work page doi:10.1016/j.finmar.2022.100775 2026
[4]

URLhttps://arxiv.org/abs/2402.18679. Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Xin Xu, Tongtong Wu, Kun Wang, Yang Liu, Zhen Bi, Jungang Lou, Yuchen Eleanor Jiang, Hangcheng Zhu, Gang Yu, Haiwen Hong, Longtao Huang, Hui Xue, Chenxi Wang, Yijun Wang, Zifei Shan,...

work page doi:10.1145/3688399 2026
[5]

sortUbyϕ i,t descending 5.I long ←firstN long assets,I short ←lastN short assets 6.for eachi∈ {i|h i,t−1 ̸=0}do// liquidate positions no longer in the list 7.ifi/∈ I long ∪ Ishort then

work page
[6]

submit_order(i,−h i,t−1 ) 9.end if 10.end for 11.V long ←β·NAV t ·(1+γ)/2 12.V short ←β·NAV t ·(1−γ)/2 13.for eachi∈ I long do 14.h target i,t ←V long/(Nlong ·P i,t) 15.end for 16.for eachi∈ I short do 17.h target i,t ← −Vshort/(Nshort ·P i,t) 18.end for 19.for eachi∈ I long ∪ Ishort do

work page
[7]

strong downtrend

submit_order(i,h target i,t −h i,t−1 ) 21.end for B Experimental Details B.1 Dataset Details B.1.1 Data Sources The daily OHLCV data for CSI 300 constituents is collected fromBaostockBaoStock (2026). For the S&P 500 constituents, daily price-volume data is obtained fromYahoo FinanceAroussi (2026). Fundamental indicators (PE, PS, PB, DYR), financial statem...

work page 2026
[8]

Once you receive a factor ensemble, you should write your strategy in the strategy.py file

If no factor ensemble is received from Screener Agent in the current cycle, you should skip this round with a skipping message (i.e., do not invoke any tool calls, just output the skipping message as your final response). Once you receive a factor ensemble, you should write your strategy in the strategy.py file. Never write a strategy that is too complex

work page
[9]

Overfitting to backtest results will lead to poor live performance

You should always use backtesting tool for validation, but do not rely on backtest results. Overfitting to backtest results will lead to poor live performance. But for badly performing strategy in backtesting, you should update the strategy immediately

work page
[10]

Do not call it multiple times within the same cycle

Call the step tool only once per trading cycle. Do not call it multiple times within the same cycle

work page
[11]

After each relaxation step, re-run the backtest to verify that trades are now being executed

If no orders are executed during backtesting or live trading, you must systematically relax the strategy’s constraints until trades are generated. After each relaxation step, re-run the backtest to verify that trades are now being executed

work page
[12]

When encountering bugs (e.g., version issues, nonexistent methods), attempt to use alternative equivalent approaches rather than stubbornly persisting with the problematic method. 26

work page

[1] [1]

URLhttp://www.jstor.org/stable/2171879

ISSN 00129682, 14680262. URLhttp://www.jstor.org/stable/2171879. William A. Brock and Cars H. Hommes. Heterogeneous beliefs and routes to chaos in a simple asset pricing model.Journal of Economic Dynamics and Control, 22(8):1235–1274, 1998. ISSN 0165-1889. doi: https://doi.org/10.1016/S0165-1889(98)00011-6. URL https://www.sciencedirect.com/science/ artic...

work page doi:10.1016/s0165-1889(98)00011-6 1998

[2] [2]

William F

URLhttps://arxiv.org/abs/2502.16789. William F. Sharpe. The sharpe ratio.The Journal of Portfolio Management, 21(1):49–58, 1994. doi: 10.3905/ jpm.1994.409501. Richard C. Grinold and Ronald N. Kahn.Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk. McGraw-Hill, New York, 2 edition, 1999. Gerald Appel....

work page doi:10.1145/3447548.3467358 1994

[3] [3]

Google DeepMind

Accessed: 2026-04-27. Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, February 2026. Accessed: 2026-04-27. Sergey Isaenko. Transaction costs, frequent trading, and stock prices.Journal of Financial Markets, 64:100775, 2023. ISSN 1386-4181. doi: https://doi.org/10.1016/j.finmar.2022.100775. URL https...

work page doi:10.1016/j.finmar.2022.100775 2026

[4] [4]

URLhttps://arxiv.org/abs/2402.18679. Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Xin Xu, Tongtong Wu, Kun Wang, Yang Liu, Zhen Bi, Jungang Lou, Yuchen Eleanor Jiang, Hangcheng Zhu, Gang Yu, Haiwen Hong, Longtao Huang, Hui Xue, Chenxi Wang, Yijun Wang, Zifei Shan,...

work page doi:10.1145/3688399 2026

[5] [5]

sortUbyϕ i,t descending 5.I long ←firstN long assets,I short ←lastN short assets 6.for eachi∈ {i|h i,t−1 ̸=0}do// liquidate positions no longer in the list 7.ifi/∈ I long ∪ Ishort then

work page

[6] [6]

submit_order(i,−h i,t−1 ) 9.end if 10.end for 11.V long ←β·NAV t ·(1+γ)/2 12.V short ←β·NAV t ·(1−γ)/2 13.for eachi∈ I long do 14.h target i,t ←V long/(Nlong ·P i,t) 15.end for 16.for eachi∈ I short do 17.h target i,t ← −Vshort/(Nshort ·P i,t) 18.end for 19.for eachi∈ I long ∪ Ishort do

work page

[7] [7]

strong downtrend

submit_order(i,h target i,t −h i,t−1 ) 21.end for B Experimental Details B.1 Dataset Details B.1.1 Data Sources The daily OHLCV data for CSI 300 constituents is collected fromBaostockBaoStock (2026). For the S&P 500 constituents, daily price-volume data is obtained fromYahoo FinanceAroussi (2026). Fundamental indicators (PE, PS, PB, DYR), financial statem...

work page 2026

[8] [8]

Once you receive a factor ensemble, you should write your strategy in the strategy.py file

If no factor ensemble is received from Screener Agent in the current cycle, you should skip this round with a skipping message (i.e., do not invoke any tool calls, just output the skipping message as your final response). Once you receive a factor ensemble, you should write your strategy in the strategy.py file. Never write a strategy that is too complex

work page

[9] [9]

Overfitting to backtest results will lead to poor live performance

You should always use backtesting tool for validation, but do not rely on backtest results. Overfitting to backtest results will lead to poor live performance. But for badly performing strategy in backtesting, you should update the strategy immediately

work page

[10] [10]

Do not call it multiple times within the same cycle

Call the step tool only once per trading cycle. Do not call it multiple times within the same cycle

work page

[11] [11]

After each relaxation step, re-run the backtest to verify that trades are now being executed

If no orders are executed during backtesting or live trading, you must systematically relax the strategy’s constraints until trades are generated. After each relaxation step, re-run the backtest to verify that trades are now being executed

work page

[12] [12]

When encountering bugs (e.g., version issues, nonexistent methods), attempt to use alternative equivalent approaches rather than stubbornly persisting with the problematic method. 26

work page