Learning to Aggregate Zero-Shot LLM Agents for Corporate Disclosure Classification
Pith reviewed 2026-05-15 06:33 UTC · model grok-4.3
The pith
A lightweight supervised aggregator improves next-day stock return predictions by combining outputs from multiple zero-shot LLMs on corporate disclosures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A logistic meta-classifier aggregates sentiment labels, confidence scores, and rationales produced by three fixed zero-shot LLM classifiers prompted from distinct financial perspectives. When evaluated on 9,860 U.S. corporate disclosures issued after the base models' release dates, this aggregator delivers a balanced accuracy of 0.606 for next-day stock return direction, exceeding the 0.566 of the strongest single classifier as well as majority vote, confidence-weighted vote, a zero-shot judge, and FinBERT. The improvement is greatest on disclosures eliciting disagreement among the base classifiers.
What carries the argument
Logistic regression meta-classifier trained on the combined outputs (labels, confidences, and rationales) of three zero-shot LLM agents each using a different prompt perspective.
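As a concrete illustration, the aggregation step could be sketched as below. The feature encoding (labels, confidences, and their products standing in for rationale-derived features) and the simulated data are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a logistic meta-classifier over three zero-shot classifiers'
# outputs. Simulated data; feature choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n = 1000
# Three classifiers' sentiment labels (+1/-1) and confidence scores.
labels = rng.choice([-1, 1], size=(n, 3))
conf = rng.uniform(0.5, 1.0, size=(n, 3))
# Simulated next-day direction, loosely tied to a confidence-weighted vote.
y = ((labels * conf).sum(axis=1) + rng.normal(0.0, 1.0, n) > 0).astype(int)

# Meta-features: raw labels, confidences, and their interactions.
X = np.hstack([labels, conf, labels * conf])
meta = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
pred = meta.predict(X[800:])
print(f"held-out balanced accuracy: {balanced_accuracy_score(y[800:], pred):.3f}")
```

On this toy data the meta-classifier comfortably beats chance because the interaction features let it weight each classifier's label by its confidence.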
If this is right
- The aggregator outperforms all compared methods including voting and baselines.
- The largest accuracy gains occur in mixed-signal disclosures where the zero-shot classifiers disagree.
- Zero-shot LLM outputs contain complementary financial signals that supervised aggregation can exploit.
- Supervised combination yields stronger results than zero-shot voting alone.
Where Pith is reading between the lines
- This aggregation strategy might extend to other financial prediction tasks that rely on textual data without requiring fine-tuning of the underlying LLMs.
- Future work could test whether adding more prompt perspectives or different model families further boosts performance on the same task.
- If the complementary signals persist across languages or smaller firms, the method could support broader market monitoring applications.
Load-bearing premise
The three prompt perspectives must produce genuinely complementary signals rather than redundant ones, and the post-release evaluation sample must fully eliminate any contamination from the base LLMs' pretraining data.
What would settle it
Evaluating the aggregator on a new set of disclosures and observing no improvement in balanced accuracy over the best individual zero-shot classifier would indicate that the aggregation benefit does not hold.
Original abstract
This paper studies whether a lightweight supervised aggregator can combine diverse zero-shot large language model outputs into a stronger downstream signal for corporate disclosure classification. Zero-shot LLMs can read disclosures without task-specific fine-tuning, but their predictions often vary across prompt perspectives, model families, and confidence levels. I examine this problem with a multi-prompt framework in which three fixed zero-shot LLM classifiers read each disclosure from different financial perspectives and output a sentiment label, a confidence score, and a short rationale. A logistic meta-classifier then aggregates these outputs to predict next-day stock return direction. To reduce pretrained-model contamination, I restrict evaluation to a post-release sample of 9,860 U.S. corporate disclosures issued by large publicly traded firms between January 2025 and March 2026, after the release of the frozen base LLMs used in the experiment. Results show that the trained aggregator outperforms single classifiers, majority vote, confidence-weighted voting, a zero-shot LLM judge, and a FinBERT baseline. Balanced accuracy rises from 0.566 for the best single classifier to 0.606 for the trained aggregator. The gain is largest in mixed-signal disclosures where classifiers disagree. The results suggest that zero-shot LLM outputs contain complementary financial signals, while also showing that the strongest gains come from supervised aggregation rather than from zero-shot voting alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-prompt framework in which three fixed zero-shot LLM classifiers analyze each corporate disclosure from distinct financial perspectives and output a sentiment label, confidence score, and rationale. These outputs are aggregated by a logistic meta-classifier to predict next-day stock-return direction. Evaluation is restricted to a post-release sample of 9,860 U.S. disclosures issued January 2025–March 2026. The trained aggregator is reported to outperform single classifiers, majority vote, confidence-weighted voting, a zero-shot LLM judge, and a FinBERT baseline, raising balanced accuracy from 0.566 to 0.606, with the largest gains on mixed-signal cases.
Significance. If verified, the result would show that supervised aggregation can usefully combine zero-shot LLM outputs that contain modestly complementary financial signals, offering a lightweight alternative to fine-tuning for disclosure classification tasks. The post-release sample restriction is a clear strength that reduces contamination risk. The modest absolute gain, however, requires statistical confirmation and explicit evidence that the three perspectives are non-redundant before the practical significance for trading or disclosure analysis can be assessed.
Major comments (2)
- [Abstract] The headline improvement (balanced accuracy 0.566 → 0.606) is presented without any information on training/validation splits, the cross-validation procedure, the statistical significance of the difference, or robustness to random seeds and hyperparameters, so the central empirical claim cannot be evaluated from the manuscript.
- [Results] The claim that gains are largest on mixed-signal disclosures is unsupported by any quantitative measure of classifier disagreement (pairwise agreement rates, output correlations, or disagreement frequency), leaving open the possibility that the meta-classifier's improvement comes from fitting minor calibration differences rather than from genuinely complementary signals.
Minor comments (1)
- [Abstract] The sample is described as 9,860 disclosures, but no class-balance statistics or firm-size breakdown is supplied; these would help interpret the balanced-accuracy numbers.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We agree that additional methodological details and quantitative support are needed to strengthen the central claims. We will revise the manuscript accordingly and address each major comment below.
Point-by-point responses
Referee: [Abstract] The headline improvement (balanced accuracy 0.566 → 0.606) is presented without any information on training/validation splits, the cross-validation procedure, the statistical significance of the difference, or robustness to random seeds and hyperparameters, so the central empirical claim cannot be evaluated from the manuscript.
Authors: We agree that the abstract and main text must provide these details for reproducibility and evaluation. In the revision we will explicitly state the train/validation/test split ratios (with temporal blocking to respect the post-release sample), describe the 5-fold cross-validation procedure used for hyperparameter tuning of the logistic aggregator, report bootstrap confidence intervals and a paired t-test p-value for the 0.040 balanced-accuracy difference, and include a robustness table showing results across five random seeds and two alternative regularization strengths. These additions will allow readers to assess the stability of the reported improvement. Revision: yes.
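A paired-bootstrap check of the kind promised here could look like the following sketch. The two prediction streams (`single`, `agg`) and their accuracy levels are simulated placeholders, not the paper's data; the real analysis would resample the actual test disclosures.

```python
# Paired bootstrap over test disclosures: confidence interval for the
# aggregator's balanced-accuracy gain over the best single classifier.
# Simulated predictions; accuracy levels chosen to mimic the reported gap.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(1)
n = 2000
y = rng.integers(0, 2, n)                           # true next-day direction
single = np.where(rng.random(n) < 0.57, y, 1 - y)   # ~0.57-accurate baseline
agg = np.where(rng.random(n) < 0.61, y, 1 - y)      # ~0.61-accurate aggregator

diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                     # resample disclosures
    diffs.append(balanced_accuracy_score(y[idx], agg[idx])
                 - balanced_accuracy_score(y[idx], single[idx]))
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for balanced-accuracy gain: [{lo:.3f}, {hi:.3f}]")
```

Resampling disclosures jointly (the same `idx` for both models) keeps the comparison paired, which is what makes the interval informative for the difference rather than for each accuracy separately.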
Referee: [Results] The claim that gains are largest on mixed-signal disclosures is unsupported by any quantitative measure of classifier disagreement (pairwise agreement rates, output correlations, or disagreement frequency), leaving open the possibility that the meta-classifier's improvement comes from fitting minor calibration differences rather than from genuinely complementary signals.
Authors: We accept that the current draft lacks explicit disagreement metrics. We will add a new results subsection and accompanying table reporting (i) pairwise label agreement rates and Krippendorff's alpha among the three zero-shot classifiers, (ii) Pearson correlations of their confidence scores, and (iii) the fraction of disclosures on which at least two classifiers disagree. We will further stratify balanced accuracy by disagreement level (all three agree vs. exactly two agree vs. all disagree) and show that the meta-classifier's incremental gain increases monotonically with disagreement, which supports complementarity rather than simple recalibration. Revision: yes.
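The disagreement metrics promised here can be sketched on simulated three-class sentiment labels (-1/0/1); the real analysis would use the paper's classifier outputs, and Krippendorff's alpha (omitted here) would need a dedicated implementation.

```python
# Pairwise agreement rates and disagreement strata for three classifiers'
# sentiment labels. Simulated uniform labels; illustrative only.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n = 3000
labels = rng.choice([-1, 0, 1], size=(n, 3))     # three classifiers' labels

# (i) pairwise label agreement rates
for i, j in combinations(range(3), 2):
    rate = np.mean(labels[:, i] == labels[:, j])
    print(f"agreement({i},{j}) = {rate:.3f}")

# (iii) disagreement strata: all agree / exactly two agree / all differ
n_unique = np.array([len(set(row)) for row in labels])
for k, name in [(1, "all agree"), (2, "two agree"), (3, "all differ")]:
    print(f"{name}: {(n_unique == k).mean():.3f}")
```

Stratified balanced accuracy would then be computed within each of the three `n_unique` groups; complementarity predicts the aggregator's edge is largest in the "two agree" and "all differ" strata.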
Circularity Check
No significant circularity; empirical accuracy gain on held-out data
Full rationale
The paper's core chain trains a logistic meta-classifier on the three zero-shot LLM outputs (sentiment, confidence, rationale) to predict next-day return direction, then evaluates balanced accuracy on a post-release held-out sample of 9,860 disclosures. This is standard supervised aggregation with independent test performance; the reported lift from 0.566 to 0.606 is measured on data not used in fitting. No equations, self-citations, or ansatzes reduce the result to a definitional identity or fitted parameter by construction. The restriction to post-release disclosures addresses contamination but does not create circularity in the reported metric.
Axiom & Free-Parameter Ledger
Free parameters (1)
- logistic regression coefficients
Axioms (2)
- Domain assumption: Zero-shot LLM outputs from the different prompt perspectives contain complementary rather than redundant financial signals.
- Domain assumption: The post-release sample construction fully removes pretrained-model contamination.