pith. machine review for the scientific record.

arxiv: 2604.17327 · v1 · submitted 2026-04-19 · 💱 q-fin.PM · cs.AI · q-fin.ST

Recognition: unknown

Signal or Noise in Multi-Agent LLM-based Stock Recommendations?

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:13 UTC · model grok-4.3

classification 💱 q-fin.PM · cs.AI · q-fin.ST
keywords multi-agent LLM · stock recommendations · portfolio alpha · market regime adaptation · LLM validation · S&P 500 performance · agent synthesis · information coefficient

The pith

Multi-agent LLM equity recommendations generate significant outperformance on S&P 500 stocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper validates a multi-agent LLM system by generating live stock recommendations and testing their portfolio-level performance against passive benchmarks and random selections. Over 19 months on S&P 500 stocks, the strong-buy equal-weight portfolio returned 2.18 percent per month against 1.15 percent for a passive equal-weight benchmark, a 25.2 percent compound excess that ranks at the 99.7th percentile of 10,000 Monte Carlo random portfolios. The internal analysis shows that the four specialist agents contribute differently depending on market conditions, and that their integration shifts in step with sector choices and macro-calendar events. A sympathetic reader would care because the results suggest that such systems can find alpha that classical factor models do not capture, which could offer a new way to filter investment universes.
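As a rough sanity check on how the monthly means relate to the compound excess, the sketch below compounds the two quoted monthly figures over 19 months. It rests on two assumptions that the source does not state: that returns are constant every month, and that "compound excess" means the difference between the two cumulative returns. Under those assumptions the gap comes out at roughly 26 percentage points, close to the reported 25.2. An exact match is not expected, because realised returns vary month to month and compound growth depends on the whole path, not only on the mean.

```python
def compound_growth(monthly_returns):
    """Cumulative return from a sequence of simple monthly returns."""
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    return growth - 1.0


# Assumption: constant monthly returns equal to the reported means.
months = 19
portfolio = compound_growth([0.0218] * months)
benchmark = compound_growth([0.0115] * months)

# Assumption: "compound excess" is read as a difference of cumulative returns.
excess = portfolio - benchmark
print(f"portfolio {portfolio:.1%}, benchmark {benchmark:.1%}, gap {excess:.1%}")
```

The same helper applied to a volatile path such as `[0.5, -0.5]` returns a loss even though the mean monthly return is zero, which is why a mean-based approximation can only bracket the reported figure.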

Core claim

The MarketSenseAI system issues a monthly equity thesis and recommendation for each stock by routing the outputs of four specialist agents (News, Fundamentals, Dynamics, and Macro) through a synthesis agent. On the S&P 500 cohort, the equal-weighted portfolio of strong-buy recommendations returns +2.18%/month against +1.15% for the passive equal-weight benchmark, a +25.2% compound excess with p=0.003 against random portfolios. Agent contributions rotate with market regime, as shown by projecting thesis embeddings onto agent embeddings, and the recommendation's cross-sectional Information Coefficient is statistically significant (ICIR=+0.489, p=0.024).
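The Information Coefficient statistic quoted above (the abstract labels the +0.489 figure an ICIR) can be illustrated with a small sketch. It assumes the IC on each date is the Spearman rank correlation between the ordinal recommendation score and the one-month forward return across stocks, and that the ICIR is the mean of the date-level ICs divided by their sample standard deviation. Both are common conventions, not details confirmed by the source, and all function names are hypothetical.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_ic(scores, forward_returns):
    """Cross-sectional IC for one date: rank correlation of score and forward return."""
    return pearson(average_ranks(scores), average_ranks(forward_returns))


def icir(ic_series):
    """Mean of date-level ICs divided by their sample standard deviation."""
    return mean(ic_series) / stdev(ic_series)
```

Average ranks matter here because an ordinal recommendation scale produces many ties within each date's cross-section.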

What carries the argument

The adaptive-integration mechanism, where non-negative least-squares projection of thesis embeddings onto agent embeddings reveals rotating dominance among the four specialist agents in response to market conditions.
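A minimal sketch of what such a projection could look like: find non-negative weights, one per agent, so that the weighted sum of agent embeddings best reconstructs the thesis embedding in the least-squares sense. To stay dependency-free, the solver below is projected gradient descent, a swapped-in technique and not necessarily the solver the authors used; the function name and the toy vectors are hypothetical.

```python
def nnls_projected_gradient(agents, thesis, n_iter=5000):
    """Minimise ||sum_k w_k * agents[k] - thesis||^2 subject to w_k >= 0.

    agents: list of equal-length embedding vectors, one per agent.
    thesis: embedding vector of the synthesised thesis.
    """
    k, dim = len(agents), len(thesis)
    # Step size 1/L, where L is bounded above by the squared Frobenius
    # norm of the agent matrix (the trace of its Gram matrix).
    step = 1.0 / sum(x * x for a in agents for x in a)
    w = [0.0] * k
    for _ in range(n_iter):
        # Residual of the current reconstruction.
        resid = [
            sum(w[j] * agents[j][i] for j in range(k)) - thesis[i]
            for i in range(dim)
        ]
        # Gradient step, then clip negatives to zero (projection onto w >= 0).
        w = [
            max(0.0, w[j] - step * sum(agents[j][i] * resid[i] for i in range(dim)))
            for j in range(k)
        ]
    return w


# Toy 5-dimensional "embeddings" (hypothetical values, for illustration only).
news = [1.0, 0.1, 0.0, 0.0, 0.2]
fundamentals = [0.1, 1.0, 0.1, 0.0, 0.0]
dynamics = [0.0, 0.1, 1.0, 0.1, 0.0]
macro = [0.0, 0.0, 0.1, 1.0, 0.3]

# A thesis built as 0.6 * fundamentals + 0.4 * macro.
thesis = [0.6 * f + 0.4 * m for f, m in zip(fundamentals, macro)]
weights = nnls_projected_gradient([news, fundamentals, dynamics, macro], thesis)
```

On real embeddings a library routine such as `scipy.optimize.nnls` would be the more practical choice. Tracking which agent holds the largest weight over time is one way to read "rotating dominance".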

If this is right

  • The buy signal serves as an effective universe-filter that can precede any portfolio construction process.
  • Agent contributions adapt to market regimes, with Fundamentals leading on S&P 500 and Macro on S&P 100.
  • Dynamics agent acts as an episodic momentum signal.
  • The system identifies sources of alpha beyond classical factor models.
  • Performance on the S&P 100 cohort (35 months, +30.5% compound excess over EQWL) is consistent in direction but does not reach formal significance, limited by the small average selection of about 10 stocks per month.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the regime-adaptive integration generalizes, multi-agent LLM systems could be deployed across varied market environments with dynamic weighting.
  • These findings imply that LLM-based recommendations might complement rather than replace traditional quantitative strategies by providing a high-level filter.
  • Extending the live validation to longer periods or additional indices would test the durability of the observed outperformance.
  • Correlating the agent rotation with specific macro events suggests opportunities for incorporating calendar-based adjustments in similar systems.

Load-bearing premise

That all signals are generated live at each observation date with no look-ahead bias and that the Monte Carlo random portfolios accurately represent the null distribution of no skill under the same selection constraints and universe.

What would settle it

A follow-up live period in which the strong-buy equal-weight portfolio fails to outperform the passive benchmark or the observed agent contribution rotation no longer aligns with market regimes and macro events.

Figures

Figures reproduced from arXiv: 2604.17327 by George Fatouros, Kostas Metaxas.

Figure 1: MarketSenseAI pipeline. Stage 1 (Generation): four specialist agents independently analyse market data and produce focused text analyses; the synthesis agent reads all four summaries and generates a free-text equity thesis together with an ordinal recommendation, its explicit sentiment assessment of the stock. Stage 2 (Attribution): the thesis text is encoded with text-embedding-3-small into a vector t_{i,d} ∈ …
Figure 2: Cosine similarity heatmap (S&P 500 cohort): thesis–agent cosines …
Figure 3: Sector composition of the equal-weight strong-buy basket per month (S&P 500 cohort, …
Figure 4: Monte Carlo null distributions of mean monthly equal-weight portfolio returns (10,000 simu…
Figure 5: Compound growth of the strong-buy equal-weight portfolio (red solid line) versus the equal …
Figure 6: Empirical CDF comparison of one-month forward returns for strong-buy (blue) and hold (grey) …
Figure 7: Date-level cross-sectional IC for the ordinal score computed on the …
Figure 8: Best-agent timeline (S&P 500 cohort, buy + strong-buy universe only): which agent achieves …
Figure 9: Sector mean NNLS agent weights over time (S&P 500 cohort). Each panel shows one sector; …
Figure 10: Market beta robustness checks (S&P 500 cohort). Neither the below-unity portfolio beta …
original abstract

We present the first portfolio-level validation of MarketSenseAI, a deployed multi-agent LLM equity system. All signals are generated live at each observation date, eliminating look-ahead bias. The system routes four specialist agents (News, Fundamentals, Dynamics, and Macro) through a synthesis agent that issues a monthly equity thesis and recommendation for each stock in its coverage universe, and we ask two questions: do its buy recommendations add value over both passive benchmarks and random selection, and what does the internal agent structure reveal about the source of the edge? On the S&P 500 cohort (19 months) the strong-buy equal-weight portfolio earns +2.18%/month against a passive equal-weight benchmark of +1.15% (approximating RSP), a +25.2% compound excess, and ranks at the 99.7th percentile of 10,000 Monte Carlo portfolios (p=0.003). The S&P 100 cohort (35 months) delivers a +30.5% compound excess over EQWL with consistent direction but formal significance not reached, limited by the small average selection of ~10 stocks per month. Non-negative least-squares projection of thesis embeddings onto agent embeddings reveals an adaptive-integration mechanism. Agent contributions rotate with market regime (Fundamentals leads on S&P 500, Macro on S&P 100, Dynamics acts as an episodic momentum signal) and this agent rotation moves in lockstep with both the sector composition of strong-buy selections and identifiable macro-calendar events, three independent views of the same underlying adaptation. The recommendation's cross-sectional Information Coefficient is statistically significant on S&P 500 (ICIR=+0.489, p=0.024). These results suggest that multi-agent LLM equity systems can identify sources of alpha beyond what classical factor models capture, and that the buy signal functions as an effective universe-filter that can sit upstream of any portfolio-construction process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents the first portfolio-level validation of MarketSenseAI, a deployed multi-agent LLM equity recommendation system. All signals are generated live without look-ahead bias. On the S&P 500 cohort over 19 months, the strong-buy equal-weighted portfolio returns +2.18%/month versus +1.15% for a passive equal-weight benchmark, delivering +25.2% compound excess and ranking at the 99.7th percentile of 10,000 Monte Carlo portfolios (p=0.003). The S&P 100 cohort (35 months) shows +30.5% compound excess but does not reach formal significance due to small selection size. Non-negative least-squares analysis of thesis embeddings onto agent embeddings reveals adaptive integration, with agent contributions rotating by market regime and aligning with sector composition and macro events. The cross-sectional ICIR is significant on S&P 500 (ICIR=+0.489, p=0.024). The authors conclude that multi-agent LLM systems can identify alpha beyond classical factors and serve as an effective universe filter.

Significance. If the performance and significance claims hold after verification of the Monte Carlo construction, this would constitute a notable contribution as the first live, portfolio-level test of a multi-agent LLM system in equity markets. Strengths include the explicit use of live signal generation to eliminate look-ahead bias, the multi-cohort design, and the triangulation of results via ICIR, agent-rotation analysis, and sector/macro alignment. These elements provide falsifiable, reproducible evidence that could inform both academic understanding of LLM adaptation and practical deployment of AI-driven signals upstream of portfolio construction.

major comments (1)
  1. [Monte Carlo simulation and results] The p=0.003 claim for the S&P 500 cohort rests on the strong-buy portfolio ranking at the 99.7th percentile of 10,000 Monte Carlo portfolios. The manuscript must specify (in the Monte Carlo methods subsection) whether each simulation month draws exactly the same number of stocks as the LLM actually selected that month, from the precise live universe at that date (including any liquidity or coverage constraints), and applies identical equal-weighted rebalancing. Any deviation in cardinality, static universe, or independent draws would produce an incorrect null distribution whose tails do not reflect the actual selection process, rendering the percentile and p-value uninterpretable.
minor comments (1)
  1. [Abstract] The abstract reports performance numbers and p-values but omits the exact average selection size for the S&P 500 cohort and any mention of transaction costs or turnover handling; adding these details would improve interpretability without altering the central claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The single major comment concerns the documentation of the Monte Carlo null distribution; we address it directly below, confirm the construction used, and commit to expanding the methods section for full transparency and reproducibility.

point-by-point responses
  1. Referee: [Monte Carlo simulation and results] The p=0.003 claim for the S&P 500 cohort rests on the strong-buy portfolio ranking at the 99.7th percentile of 10,000 Monte Carlo portfolios. The manuscript must specify (in the Monte Carlo methods subsection) whether each simulation month draws exactly the same number of stocks as the LLM actually selected that month, from the precise live universe at that date (including any liquidity or coverage constraints), and applies identical equal-weighted rebalancing. Any deviation in cardinality, static universe, or independent draws would produce an incorrect null distribution whose tails do not reflect the actual selection process, rendering the percentile and p-value uninterpretable.

    Authors: We agree that the precise construction of the Monte Carlo null must be documented to make the reported percentile and p-value interpretable. The simulations were executed exactly as the referee requires: for each month, we drew exactly the same number of stocks as the live LLM strong-buy selection for that month, sampling without replacement from the precise live universe available on the observation date (including all liquidity filters, coverage constraints, and data-availability restrictions present in the real-time feed). Each simulated portfolio was then equal-weighted and rebalanced on the identical schedule used for the actual strong-buy portfolio. No static universe or independent monthly draws were employed. We will revise the Monte Carlo methods subsection to include an explicit, step-by-step description of this procedure together with pseudocode, thereby eliminating any ambiguity and directly satisfying the referee's request. revision: yes
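The matched-cardinality null described in this exchange can be sketched as follows. This is an editorial illustration under stated assumptions, not the authors' code: each month is redrawn afresh from that month's universe, which is one reading of the procedure (how draws are linked across months is not specified here), and the input layout and function name are hypothetical. The p-value is the share of random portfolios whose mean monthly return is at least as high as the observed one, so a 99.7th-percentile rank corresponds to p = 0.003.

```python
import random
from statistics import mean


def monte_carlo_p_value(universe_returns, picks, n_sims=10_000, seed=0):
    """One-sided p-value of the observed mean monthly equal-weight return.

    universe_returns: one dict per month, ticker -> forward return, holding
        only the stocks that were in the live universe on that date.
    picks: one list per month with the tickers actually selected.
    """
    rng = random.Random(seed)
    months = list(zip(universe_returns, picks))

    # Observed statistic: mean over months of the equal-weight basket return.
    observed = mean(mean(rets[t] for t in chosen) for rets, chosen in months)

    null_means = []
    for _ in range(n_sims):
        monthly = []
        for rets, chosen in months:
            # Same number of names as the real selection that month,
            # sampled without replacement from that month's universe.
            draw = rng.sample(sorted(rets), len(chosen))
            monthly.append(mean(rets[t] for t in draw))
        null_means.append(mean(monthly))

    p_value = sum(m >= observed for m in null_means) / n_sims
    return observed, p_value
```

Matching the basket size month by month matters because smaller random baskets have wider return dispersion, which would otherwise distort the tails of the null distribution.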

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on live signals and external benchmarks

full rationale

The paper's central results are empirical portfolio performance metrics (strong-buy returns vs. RSP/EQWL benchmarks) and statistical tests (Monte Carlo percentile ranking, ICIR) computed from live-generated signals at each date. These do not reduce by construction to any fitted parameter inside the paper's equations, nor to self-citations. The non-negative least-squares projection onto agent embeddings and the reported agent-rotation observations are independent computations on the same signals but do not tautologically reproduce the performance claim. No uniqueness theorem, ansatz smuggling, or renaming of known results is invoked as load-bearing. The Monte Carlo null is presented as an external randomization procedure; even if its exact cardinality/universe matching is debatable on validity grounds, that is a methodological concern rather than a circular reduction of the reported p-value to the input data by definition. The derivation chain is therefore self-contained against external benchmarks and standard statistical procedures.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central performance and adaptation claims rest on the unverified assumption that live monthly signal generation fully eliminates look-ahead bias and that the Monte Carlo simulation matches the exact selection constraints of the deployed system.

axioms (2)
  • domain assumption: Live generation at each observation date eliminates look-ahead bias
    Stated in the abstract as the basis for validity but no implementation details provided.
  • domain assumption: Monte Carlo random portfolios form a valid null distribution under identical universe and selection-size constraints
    Used to compute the 99.7th percentile ranking and p=0.003.

pith-pipeline@v0.9.0 · 5651 in / 1611 out tokens · 41000 ms · 2026-05-10T06:13:28.643433+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CyberAId: AI-Driven Cybersecurity for Financial Service Providers

    cs.AI · 2026-05 · unverdicted · novelty 4.0

    CyberAId is a proposed on-premise multi-agent system that coordinates LLM subagents with classical security tools to improve threat response and regulatory alignment in financial services.
