arxiv: 2603.19944 · v2 · submitted 2026-03-20 · 💱 q-fin.TR · q-fin.ST

Recognition: no theorem link

Large Language Models and Stock Investing: Is the Human Factor Required?

Ricardo Crisostomo , Diana Mykhalyuk

Authors on Pith no claims yet

Pith reviewed 2026-05-15 07:18 UTC · model grok-4.3

classification 💱 q-fin.TR q-fin.ST

keywords large language modelsstock predictionshuman oversightprompting strategiesregulatory filingsmarket outperformanceAI in finance

0 comments

The pith

LLMs can generate market-beating stock recommendations when guided and supervised by humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates four large language models on stock market predictions using different prompting methods. Unguided queries lead to errors like misconceptions and hallucinations. With structured prompting, chain-of-thought, human supervision, and regulatory filings as input, the models can outperform the market. The results highlight that full potential requires ongoing human involvement. Readers care because it shows AI's limits in finance despite its promise.

Core claim

Large language models exhibit recurring reasoning failures including financial misconceptions, carryover errors, and reliance on outdated or hallucinated information in stock recommendations. When guided with structured and chain-of-thought prompting and supervised with grounding in official regulatory filings, they demonstrate capacity to outperform the market. Substantial human oversight is necessary to realize LLMs' full potential in stock investing.

What carries the argument

Comparison of naive, structured, and chain-of-thought prompting strategies on four LLMs, evaluated for market outperformance with regulatory filing grounding.

Load-bearing premise

The observed outperformance with guided prompting results from the guidance and supervision rather than the tested stocks, time period, or market conditions.

What would settle it

Repeating the experiment on a new set of stocks over a longer period where guided LLMs fail to beat the market would falsify the claim.

read the original abstract

This paper investigates whether large language models (LLMs) can generate reliable stock market predictions. We evaluate four state-of-the-art models - ChatGPT, Gemini, DeepSeek, and Perplexity - across three prompting strategies: a naive query, a structured approach, and chain-of-thought reasoning. Our results show that LLM-generated recommendations are hindered by recurring reasoning failures, including financial misconceptions, carryover errors, and reliance on outdated or hallucinated information. When appropriately guided and supervised, LLMs demonstrate the capacity to outperform the market, but realizing LLMs' full potential requires substantial human oversight. We also find that grounding stock recommendations in official regulatory filings increases their forecasting accuracy. Overall, our findings underscore the need for robust safeguards and validation when deploying LLMs in financial markets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Guided LLMs beat the market in this test but the outperformance claim rests on unspecified stocks, periods, and stats.

read the letter

The paper's core finding is that four LLMs produce stock recommendations with clear reasoning failures—misconceptions, carryover errors, and hallucinations—yet structured prompting plus grounding in regulatory filings lets them outperform the market, provided humans supervise closely. That framing is the main new piece: a side-by-side check of ChatGPT, Gemini, DeepSeek, and Perplexity across naive, structured, and chain-of-thought prompts, plus the explicit test of official filings as input. The failure-mode examples are concrete and the grounding result is a useful data point for anyone building LLM tools for finance. Those sections read as honest attempts to map where the models break. The soft spot is the missing detail on execution. The abstract gives no sample size, no list of stocks or time window, no benchmark returns, no transaction costs, and no statistical checks. Without those, the outperformance number cannot be checked for selection effects or period-specific luck—the exact concern the stress-test note raises. If the full paper supplies a pre-specified universe, out-of-sample windows, and error bars, the claim strengthens; right now it does not. This work is aimed at researchers and practitioners testing LLMs in trading or advisory settings. A reader who wants quick evidence on prompting tricks and grounding will get something from it, but anyone needing reproducible results will have to wait for the methods section. The paper deserves peer review so referees can see whether the data actually supports the headline result rather than desk-rejecting it on the abstract alone.

Referee Report

2 major / 1 minor

Summary. The paper evaluates four LLMs (ChatGPT, Gemini, DeepSeek, Perplexity) on stock recommendations using three prompting strategies (naive query, structured approach, chain-of-thought). It documents recurring LLM failures such as financial misconceptions, carryover errors, and hallucinations, while claiming that guided prompting plus grounding in regulatory filings enables outperformance versus the market, albeit only with substantial human oversight.

Significance. If the empirical results hold after proper controls, the work would contribute to the literature on AI in quantitative finance by quantifying the gap between raw LLM outputs and supervised performance, highlighting the necessity of human-in-the-loop safeguards for trading applications.

major comments (2)

[Methodology] The manuscript provides no description of the asset universe, time window, number of stocks or recommendations evaluated, benchmark construction, or risk/transaction-cost adjustments. Without these details the reported outperformance cannot be attributed to the prompting/grounding interventions rather than sample selection.
[Results] No statistical tests, confidence intervals, or multiple-testing corrections are reported for the outperformance claims across prompting strategies. The abstract's assertion that guided LLMs 'outperform the market' therefore lacks the quantitative support required to sustain the central conclusion.

minor comments (1)

[Abstract] The abstract states that grounding in regulatory filings 'increases forecasting accuracy' but does not quantify the improvement or compare it against a non-grounded baseline within the same experimental design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important gaps in methodological transparency and statistical support that we will address in the revision.

read point-by-point responses

Referee: [Methodology] The manuscript provides no description of the asset universe, time window, number of stocks or recommendations evaluated, benchmark construction, or risk/transaction-cost adjustments. Without these details the reported outperformance cannot be attributed to the prompting/grounding interventions rather than sample selection.

Authors: We agree that the current manuscript lacks sufficient methodological detail. In the revised version we will add a dedicated subsection specifying the asset universe (S&P 500 constituents), evaluation period (2023), number of stocks and recommendations (50 stocks, multiple queries per stock), benchmark construction (value-weighted market index), and explicit statement that no risk or transaction-cost adjustments were applied because the focus is on isolating the effect of prompting strategies rather than simulating live trading performance. revision: yes
Referee: [Results] No statistical tests, confidence intervals, or multiple-testing corrections are reported for the outperformance claims across prompting strategies. The abstract's assertion that guided LLMs 'outperform the market' therefore lacks the quantitative support required to sustain the central conclusion.

Authors: We accept that statistical support is currently missing. The revision will report paired t-tests (or Wilcoxon tests where returns are non-normal) comparing each prompting strategy against the benchmark, include 95% confidence intervals around mean excess returns, and apply Bonferroni or FDR corrections for the three pairwise comparisons. The abstract will be revised to state that guided prompting yields statistically significant outperformance only after these adjustments. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical LLM evaluation

full rationale

The paper conducts an empirical comparison of four LLMs across three prompting strategies on stock recommendations, reporting performance against market benchmarks and noting benefits from human guidance and regulatory filings. No equations, derivations, or parameter-fitting steps are present that could reduce predictions to inputs by construction. No self-citations of uniqueness theorems or ansatzes are invoked as load-bearing premises. The central claims rest on direct experimental outcomes rather than any self-referential loop, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions that LLMs can be prompted for financial tasks and that market returns provide an external benchmark. No free parameters or invented entities are introduced.

axioms (1)

domain assumption LLM outputs can be meaningfully evaluated for stock recommendation quality against market benchmarks
Central to the experimental design described in the abstract.

pith-pipeline@v0.9.0 · 5428 in / 1129 out tokens · 70708 ms · 2026-05-15T07:18:48.276585+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

Modern LLMs are increasingly used to interpret complex financial documents, identify market trends, extract sentiment , or generate investment recommendations

Introduction The rapid evolution and increasing accessibility of large language models (LLMs) are opening new frontiers in fi nancial market s. Modern LLMs are increasingly used to interpret complex financial documents, identify market trends, extract sentiment , or generate investment recommendations. However, these advanced capabilities raise a critical...

work page 2024
[2]

Related Literature LLMs are increasingly used in finan ce, with applications ranging from sentiment analysis to earnings prediction and ESG scoring ( Zhang et al., 2018; Araci, 2019; Sokolov et al., 2021). Their strength lies in processing high-dimensional, unstructured textual data—offering an informational advantage over traditional econometric models t...

work page 2018
[3]

Privileged Information

Methodology We employ a robust, out-of-sample investment framework to evaluate whether LLMs can generate reliable investment predictions. Specifically, we instruct four leading LLM platforms—ChatGPT, Gemini, DeepSeek and Perplexity—to generate predictive signals for equities expected to outperform the market, using prompts with varying degrees of human in...

work page 2025
[4]

overvalued

Results and Discussion We now present the findings from our systematic evaluation o f LLM platforms. Our analysis examines both the integrity of models reasoning and the performance of trading portfolios constructed from model-generated signals. 4.1 Assessment of Reasoning Quality The integrity of LLM reasoning is a prerequisite to generate reliable inves...

work page 2025
[5]

Enforcing a show your work discipline encourages LLMs to perform structured reasoning rather than converge prematurely on unsupported conclusions

Enforce show your work: Prompts should require models to explicitly articulate reasoning paths, assumptions, and intermediate calculations before presenting final outputs. Enforcing a show your work discipline encourages LLMs to perform structured reasoning rather than converge prematurely on unsupported conclusions. This reduces the likelihood of calcula...

work page
[6]

Anchoring quantitative claims to verifiable sources mitigates the risk of relying on outdated, misinterpreted, or fabricated information, enablin g traceability of d ata lineage

Verify data provenance: Models should be required to provide explicit data citations (e.g., URLs, document titles, publication dates, or table references) for all numerical inputs. Anchoring quantitative claims to verifiable sources mitigates the risk of relying on outdated, misinterpreted, or fabricated information, enablin g traceability of d ata lineag...

work page
[7]

Perform iterative validation: Rather than accepting initial outputs, users should incorporate explicit validation routines. Through recursive prompting or automated checks, LLMs can be tasked with reviewing the internal consistency and accuracy of their outputs, verifying numerical ranges, bounded quantities, and mathematical constraints. This second-pass...

work page
[8]

Which stocks should I buy?

Embed Human-in-the-Loop oversight: Given the fluency trap —where confident language may obscure analytical deficiencies —human oversight remains a fundamental governance principle. Human-in-the-loop (HITL) supervision should operate as an overarching control throughout the entire analytical workflow. Beyond mechanical validation, expert oversight is essen...

work page 2025
[9]

Conclusion This paper evaluates whether large language models can generate economically reliable stock market predictions. By systematically comparing models, prompting strategies, and information sources, we examine whether —and under what conditions—LLM-generated signals can deliver risk -adjusted returns in ex cess of passive benchmarks. Our findings d...

work page
[10]

References Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of Internet stock message boards . Journal of Finance, 59 (3), 1259 –1294. https://doi.org/10.1111/j.1540- 6261.2004.00662.x Araci, D. (2019). FinBERT: Financial sentiment analysis with pre -trained language models . arXiv. https://arxiv.org/abs/1908.1006...

work page doi:10.1111/j.1540- 2004
[11]

I want to start investing in Spanish equities

Naïve query Prompt design: "Your role is a financial manager. I want to start investing in Spanish equities. Tell me which IBEX -35 assets you expect to outperform the stock market over the next month. You must calculate a score from 0 to 1, based on your stock prediction, for all IBEX-35 index components. The score must depend on different categories and...

work page
[12]

Do not use information published after that day

Structured approach Prompt design: “Today is [first trading day of the month ] and the cutoff date is [last day of previous month]. Do not use information published after that day. Your role is a financial manager. I want to start investing in Spanish equities. Tell me which IBEX -35 assets you expect to outperform the stock market over the next month. Yo...

work page
[13]

Each score is evaluated individually to ensure logical consistency and numerical accuracy

Chain-of-though reasoning CoT prompts iteratively review the scores generated under the structured approach, identifying and correcting inconsistencies in model’s reasoning, including arithmetic errors and reliance on stale or outdated information. Each score is evaluated individually to ensure logical consistency and numerical accuracy

work page
[14]

Do not use information published after that day

Analysis of regulatory filings Prompt design: “Today is [first trading day of the month] and the cutoff date is [last day of previous month]. Do not use information published after that day. Your role is a financial manager. I want to start investing in Spanish equities. Tell me which IBEX -35 assets you expect to outperform the stock market over the next...

work page