Learning to Aggregate Zero-Shot LLM Agents for Corporate Disclosure Classification
Pith reviewed 2026-05-15 06:33 UTC · model grok-4.3
The pith
A lightweight supervised aggregator improves next-day stock return predictions by combining outputs from multiple zero-shot LLMs on corporate disclosures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A logistic meta-classifier aggregates sentiment labels, confidence scores, and rationales produced by three fixed zero-shot LLM classifiers prompted from distinct financial perspectives. When evaluated on 9,860 U.S. corporate disclosures issued after the base models' release dates, this aggregator delivers a balanced accuracy of 0.606 for next-day stock return direction, exceeding the 0.566 of the strongest single classifier as well as majority vote, confidence-weighted vote, a zero-shot judge, and FinBERT. The improvement is greatest on disclosures eliciting disagreement among the base classifiers.
What carries the argument
Logistic regression meta-classifier trained on the combined outputs (labels, confidences, and rationales) of three zero-shot LLM agents each using a different prompt perspective.
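As a concrete illustration, the aggregation step could be sketched as below. The feature encoding (labels, confidences, and their products standing in for rationale-derived features) and the simulated data are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a logistic meta-classifier over three zero-shot classifiers'
# outputs. Simulated data; feature choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n = 1000
# Three classifiers' sentiment labels (+1/-1) and confidence scores.
labels = rng.choice([-1, 1], size=(n, 3))
conf = rng.uniform(0.5, 1.0, size=(n, 3))
# Simulated next-day direction, loosely tied to a confidence-weighted vote.
y = ((labels * conf).sum(axis=1) + rng.normal(0.0, 1.0, n) > 0).astype(int)

# Meta-features: raw labels, confidences, and their interactions.
X = np.hstack([labels, conf, labels * conf])
meta = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
pred = meta.predict(X[800:])
print(f"held-out balanced accuracy: {balanced_accuracy_score(y[800:], pred):.3f}")
```

On this toy data the meta-classifier comfortably beats chance because the interaction features let it weight each classifier's label by its confidence.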
If this is right
- The aggregator outperforms all compared methods including voting and baselines.
- The largest accuracy gains occur in mixed-signal disclosures where the zero-shot classifiers disagree.
- Zero-shot LLM outputs contain complementary financial signals that supervised aggregation can exploit.
- Supervised combination yields stronger results than zero-shot voting alone.
Where Pith is reading between the lines
- This aggregation strategy might extend to other financial prediction tasks that rely on textual data without requiring fine-tuning of the underlying LLMs.
- Future work could test whether adding more prompt perspectives or different model families further boosts performance on the same task.
- If the complementary signals persist across languages or smaller firms, the method could support broader market monitoring applications.
Load-bearing premise
The three prompt perspectives must produce genuinely complementary signals rather than redundant ones, and the post-release evaluation sample must fully eliminate any contamination from the base LLMs' pretraining data.
What would settle it
Evaluating the aggregator on a new set of disclosures and observing no improvement in balanced accuracy over the best individual zero-shot classifier would indicate that the aggregation benefit does not hold.
Original abstract
This paper studies whether a lightweight supervised aggregator can combine diverse zero-shot large language model outputs into a stronger downstream signal for corporate disclosure classification. Zero-shot LLMs can read disclosures without task-specific fine-tuning, but their predictions often vary across prompt perspectives, model families, and confidence levels. I examine this problem with a multi-prompt framework in which three fixed zero-shot LLM classifiers read each disclosure from different financial perspectives and output a sentiment label, a confidence score, and a short rationale. A logistic meta-classifier then aggregates these outputs to predict next-day stock return direction. To reduce pretrained-model contamination, I restrict evaluation to a post-release sample of 9,860 U.S. corporate disclosures issued by large publicly traded firms between January 2025 and March 2026, after the release of the frozen base LLMs used in the experiment. Results show that the trained aggregator outperforms single classifiers, majority vote, confidence-weighted voting, a zero-shot LLM judge, and a FinBERT baseline. Balanced accuracy rises from 0.566 for the best single classifier to 0.606 for the trained aggregator. The gain is largest in mixed-signal disclosures where classifiers disagree. The results suggest that zero-shot LLM outputs contain complementary financial signals, while also showing that the strongest gains come from supervised aggregation rather than from zero-shot voting alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-prompt framework in which three fixed zero-shot LLM classifiers analyze each corporate disclosure from distinct financial perspectives and output a sentiment label, confidence score, and rationale. These outputs are aggregated by a logistic meta-classifier to predict next-day stock-return direction. Evaluation is restricted to a post-release sample of 9,860 U.S. disclosures issued January 2025–March 2026. The trained aggregator is reported to outperform single classifiers, majority vote, confidence-weighted voting, a zero-shot LLM judge, and a FinBERT baseline, raising balanced accuracy from 0.566 to 0.606, with the largest gains on mixed-signal cases.
Significance. If verified, the result would show that supervised aggregation can usefully combine zero-shot LLM outputs that contain modestly complementary financial signals, offering a lightweight alternative to fine-tuning for disclosure classification tasks. The post-release sample restriction is a clear strength that reduces contamination risk. The modest absolute gain, however, requires statistical confirmation and explicit evidence that the three perspectives are non-redundant before the practical significance for trading or disclosure analysis can be assessed.
Major comments (2)
- [Abstract] The headline improvement (balanced accuracy 0.566 → 0.606) is presented without any information on training/validation splits, the cross-validation procedure, the statistical significance of the difference, or robustness to random seeds and hyperparameters, so the central empirical claim cannot be evaluated from the manuscript.
- [Results] The claim that gains are largest on mixed-signal disclosures is unsupported by any quantitative measure of classifier disagreement (pairwise agreement rates, output correlations, or disagreement frequency), leaving open the possibility that the meta-classifier's improvement comes from fitting minor calibration differences rather than from genuinely complementary signals.
Minor comments (1)
- [Abstract] The sample is described as 9,860 disclosures, but no class-balance statistics or firm-size breakdown is supplied; these would help interpret the balanced-accuracy numbers.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We agree that additional methodological details and quantitative support are needed to strengthen the central claims. We will revise the manuscript accordingly and address each major comment below.
Point-by-point responses
Referee: [Abstract] The headline improvement (balanced accuracy 0.566 → 0.606) is presented without any information on training/validation splits, the cross-validation procedure, the statistical significance of the difference, or robustness to random seeds and hyperparameters, so the central empirical claim cannot be evaluated from the manuscript.
Authors: We agree that the abstract and main text must provide these details for reproducibility and evaluation. In the revision we will explicitly state the train/validation/test split ratios (with temporal blocking to respect the post-release sample), describe the 5-fold cross-validation procedure used for hyperparameter tuning of the logistic aggregator, report bootstrap confidence intervals and a paired t-test p-value for the 0.040 balanced-accuracy difference, and include a robustness table showing results across five random seeds and two alternative regularization strengths. These additions will allow readers to assess the stability of the reported improvement. Revision: yes.
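A paired-bootstrap check of the kind promised here could look like the following sketch. The two prediction streams (`single`, `agg`) and their accuracy levels are simulated placeholders, not the paper's data; the real analysis would resample the actual test disclosures.

```python
# Paired bootstrap over test disclosures: confidence interval for the
# aggregator's balanced-accuracy gain over the best single classifier.
# Simulated predictions; accuracy levels chosen to mimic the reported gap.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(1)
n = 2000
y = rng.integers(0, 2, n)                           # true next-day direction
single = np.where(rng.random(n) < 0.57, y, 1 - y)   # ~0.57-accurate baseline
agg = np.where(rng.random(n) < 0.61, y, 1 - y)      # ~0.61-accurate aggregator

diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                     # resample disclosures
    diffs.append(balanced_accuracy_score(y[idx], agg[idx])
                 - balanced_accuracy_score(y[idx], single[idx]))
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for balanced-accuracy gain: [{lo:.3f}, {hi:.3f}]")
```

Resampling disclosures jointly (the same `idx` for both models) keeps the comparison paired, which is what makes the interval informative for the difference rather than for each accuracy separately.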
Referee: [Results] The claim that gains are largest on mixed-signal disclosures is unsupported by any quantitative measure of classifier disagreement (pairwise agreement rates, output correlations, or disagreement frequency), leaving open the possibility that the meta-classifier's improvement comes from fitting minor calibration differences rather than from genuinely complementary signals.
Authors: We accept that the current draft lacks explicit disagreement metrics. We will add a new results subsection and accompanying table reporting (i) pairwise label agreement rates and Krippendorff's alpha among the three zero-shot classifiers, (ii) Pearson correlations of their confidence scores, and (iii) the fraction of disclosures on which at least two classifiers disagree. We will further stratify balanced accuracy by disagreement level (all three agree vs. exactly two agree vs. all disagree) and show that the meta-classifier's incremental gain increases monotonically with disagreement, which supports complementarity rather than simple recalibration. Revision: yes.
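The disagreement metrics promised here can be sketched on simulated three-class sentiment labels (-1/0/1); the real analysis would use the paper's classifier outputs, and Krippendorff's alpha (omitted here) would need a dedicated implementation.

```python
# Pairwise agreement rates and disagreement strata for three classifiers'
# sentiment labels. Simulated uniform labels; illustrative only.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n = 3000
labels = rng.choice([-1, 0, 1], size=(n, 3))     # three classifiers' labels

# (i) pairwise label agreement rates
for i, j in combinations(range(3), 2):
    rate = np.mean(labels[:, i] == labels[:, j])
    print(f"agreement({i},{j}) = {rate:.3f}")

# (iii) disagreement strata: all agree / exactly two agree / all differ
n_unique = np.array([len(set(row)) for row in labels])
for k, name in [(1, "all agree"), (2, "two agree"), (3, "all differ")]:
    print(f"{name}: {(n_unique == k).mean():.3f}")
```

Stratified balanced accuracy would then be computed within each of the three `n_unique` groups; complementarity predicts the aggregator's edge is largest in the "two agree" and "all differ" strata.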
Circularity Check
No significant circularity; empirical accuracy gain on held-out data
Full rationale
The paper's core chain trains a logistic meta-classifier on the three zero-shot LLM outputs (sentiment, confidence, rationale) to predict next-day return direction, then evaluates balanced accuracy on a post-release held-out sample of 9,860 disclosures. This is standard supervised aggregation with independent test performance; the reported lift from 0.566 to 0.606 is measured on data not used in fitting. No equations, self-citations, or ansatzes reduce the result to a definitional identity or fitted parameter by construction. The restriction to post-release disclosures addresses contamination but does not create circularity in the reported metric.
Axiom & Free-Parameter Ledger
Free parameters (1)
- logistic regression coefficients
Axioms (2)
- Domain assumption: Zero-shot LLM outputs from the different prompt perspectives contain complementary rather than redundant financial signals.
- Domain assumption: The post-release sample construction fully removes pretrained-model contamination.