Researchers waste 80% of LLM annotation costs by classifying one text at a time
Pith reviewed 2026-05-13 17:15 UTC · model grok-4.3
The pith
Batching texts and stacking variables cuts LLM classification costs by over 80% with little accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Batching 25 items and stacking all four variables into a single prompt reduces API calls from 400,000 to 4,000 for 100,000 texts. Six of the eight models tested stay within 2 percentage points of the single-item baseline through batch sizes of 100, and stacking up to 10 dimensions gives results comparable to single-variable coding. Within that range, the error introduced by batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
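The call counts in the claim follow from simple arithmetic, sketched below. The function name `api_calls` and its parameters are illustrative and do not come from the paper.

```python
import math

def api_calls(n_texts: int, n_variables: int, batch_size: int = 1, stack: bool = False) -> int:
    """Count the prompts needed to code every text on every variable.

    batch_size: number of texts placed in one prompt.
    stack: if True, all variables are coded in the same prompt;
           otherwise each variable needs its own pass over the corpus.
    """
    prompts_per_pass = math.ceil(n_texts / batch_size)
    passes = 1 if stack else n_variables
    return prompts_per_pass * passes

baseline = api_calls(100_000, 4)                           # one text, one variable per prompt
batched = api_calls(100_000, 4, batch_size=25, stack=True)
print(baseline, batched)  # 400000 4000
```

Note that a 99% reduction in calls is not the same as a 99% reduction in cost: every text still has to be sent and every label still has to be generated, which is consistent with the more modest "over 80%" token-cost figure.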
What carries the argument
The batch-and-stack prompting technique that processes multiple items and multiple coding dimensions in one prompt to reduce API calls.
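A minimal sketch of what a batch-and-stack prompt could look like is given below. The paper's actual templates are not reproduced on this page, so the layout, the JSON reply format, and the variable names `relevance` and `stance` are all hypothetical.

```python
import json

def build_prompt(texts, variables):
    """Assemble one prompt that codes every text on every variable.

    `variables` maps a variable name to its coding instruction.
    The layout is a hypothetical template, not the one used in the paper.
    """
    lines = ["Code each numbered text on every variable listed below.", "", "Variables:"]
    for name, instruction in variables.items():
        lines.append(f"- {name}: {instruction}")
    lines += ["", "Texts:"]
    for i, text in enumerate(texts, start=1):
        lines.append(f"[{i}] {text}")
    lines += ["", 'Reply with a JSON list containing one object per text, '
                  'each with an "id" field and one field per variable.']
    return "\n".join(lines)

def parse_response(raw, n_texts, variables):
    """Validate a JSON reply and return rows ordered by text id.

    Raises ValueError if any text or any variable is missing, which is
    one way skipped items in a long batch can be caught.
    """
    by_id = {row["id"]: row for row in json.loads(raw)}
    missing = [i for i in range(1, n_texts + 1) if i not in by_id]
    if missing:
        raise ValueError(f"no labels returned for texts {missing}")
    for i in range(1, n_texts + 1):
        absent = [v for v in variables if v not in by_id[i]]
        if absent:
            raise ValueError(f"text {i} is missing variables {absent}")
    return [by_id[i] for i in range(1, n_texts + 1)]

# Hypothetical variables and texts, for illustration only.
variables = {
    "relevance": "1 if the text concerns the topic, else 0",
    "stance": "one of: positive, negative, neutral",
}
prompt = build_prompt(["first tweet", "second tweet"], variables)
fake_reply = ('[{"id": 1, "relevance": 1, "stance": "neutral"}, '
              '{"id": 2, "relevance": 0, "stance": "negative"}]')
rows = parse_response(fake_reply, 2, variables)
```

Validating the reply against the expected ids matters more as batches grow, since a dropped or duplicated item would otherwise shift labels onto the wrong texts.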
If this is right
- Annotation of large text corpora becomes much cheaper without major quality trade-offs for most models.
- Social science researchers can scale up content analysis projects that were previously too expensive.
- Batch sizes up to 100 and up to 10 stacked variables form a safe operating range for six of the eight models tested.
- Measurement error from this method is smaller than typical human disagreement, supporting its adoption.
- Future prompts should prioritize reducing task complexity over minimizing length.
Where Pith is reading between the lines
- This approach may extend to other generative tasks like summarization if adapted carefully.
- Models from different providers tolerate batching to different degrees, so batch limits should be tested per model.
- Applying this in non-English or specialized domains could require smaller batches to stay accurate.
- Overall, it lowers the barrier for reproducible large-scale text analysis in research.
Load-bearing premise
Performance on these four tweet tasks with expert coding will hold for other domains, languages, and types of variables.
What would settle it
A study on news articles or legal texts showing accuracy drops exceeding 2 points at batch size 100 would show the safe range does not generalize.
Original abstract
Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that batching up to 100 texts and stacking up to 10 variables into single LLM prompts reduces annotation API calls (and thus token costs) by over 80% while keeping accuracy within 2 percentage points of the single-item baseline for six of eight tested models, based on experiments with 3,962 expert-coded tweets across four classification tasks; degradation is attributed to task complexity rather than prompt length, and batching/stacking error is smaller than typical inter-coder disagreement.
Significance. If the results hold, the work has clear significance for computational social science by supplying concrete, empirically grounded guidelines for cost-efficient LLM annotation at scale. The sizable expert-coded dataset, multi-model testing, and direct comparison to inter-coder reliability provide a practical benchmark that could immediately inform research design decisions.
major comments (1)
- [Results] Results section: the central claim that batching/stacking error is 'smaller than typical inter-coder disagreement' requires explicit reporting of the inter-coder agreement statistics (e.g., Cohen's kappa or percent agreement) from the ground-truth data so readers can verify the comparison; without these numbers the claim remains qualitative.
minor comments (2)
- [Abstract] Abstract: replace the summary phrase 'within 2 pp' with the actual per-model accuracy deltas, confidence intervals, and the four specific tasks so the abstract is self-contained and reproducible.
- [Methods] Methods: specify the exact prompt templates used for batching and stacking (including any formatting or separator tokens) and report the token counts per condition to allow direct replication of the cost calculations.
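To see why reported token counts matter for replicating the cost claim, here is a rough cost model. The token counts (300 instruction tokens per variable, 60 tokens per tweet, 5 output tokens per label) are invented for illustration and are not from the paper; the prices are assumed to be dollars per million input and output tokens.

```python
import math

def annotation_cost(n_texts, n_vars, batch_size, stack,
                    instr_tokens, text_tokens, label_tokens,
                    price_in, price_out):
    """Estimate total cost in dollars; prices are per one million tokens.

    instr_tokens: instruction tokens needed per variable (assumed).
    text_tokens:  tokens per text (assumed).
    label_tokens: output tokens per text-variable label (assumed).
    """
    passes = 1 if stack else n_vars
    vars_per_call = n_vars if stack else 1
    calls = math.ceil(n_texts / batch_size) * passes
    # Instructions are repeated in every call; texts are sent once per pass.
    input_tokens = calls * vars_per_call * instr_tokens + n_texts * text_tokens * passes
    # Every text-variable pair still needs one label, however it is prompted.
    output_tokens = n_texts * n_vars * label_tokens
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical token counts; prices assumed to be $0.10 / $0.40 per million tokens.
common = dict(instr_tokens=300, text_tokens=60, label_tokens=5,
              price_in=0.10, price_out=0.40)
single = annotation_cost(100_000, 4, batch_size=1, stack=False, **common)
batched = annotation_cost(100_000, 4, batch_size=25, stack=True, **common)
savings = 1 - batched / single
print(f"single: ${single:.2f}  batched: ${batched:.2f}  savings: {savings:.0%}")
```

Under these assumed numbers the saving comes almost entirely from not repeating the instructions in every call, so the real percentage depends heavily on the ratio of instruction length to text length. That is why per-condition token counts are needed to check the paper's figure.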
Simulated Author's Rebuttal
We thank the referee for the constructive comment on strengthening the comparison to inter-coder reliability. We address the point below and will revise the manuscript accordingly.
- Referee: [Results] Results section: the central claim that batching/stacking error is 'smaller than typical inter-coder disagreement' requires explicit reporting of the inter-coder agreement statistics (e.g., Cohen's kappa or percent agreement) from the ground-truth data so readers can verify the comparison; without these numbers the claim remains qualitative.
  Authors: We agree that explicit reporting of the inter-coder agreement statistics is needed to make the comparison verifiable rather than qualitative. The ground-truth dataset was produced by multiple expert coders, and we will add the percent agreement and Cohen's kappa values (computed on the overlapping annotations) to the Results section in the revised manuscript, allowing readers to directly compare these figures against the observed batching/stacking error rates.
  Revision: yes
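For reference, the two agreement statistics named in this exchange can be computed as follows. This is a generic sketch of percent agreement and Cohen's kappa for two coders; the label lists are made-up examples, not data from the paper.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two coders assigned the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each coder's marginal label proportions.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Made-up labels from two hypothetical coders.
coder_1 = [1, 1, 0, 0]
coder_2 = [1, 0, 0, 0]
print(percent_agreement(coder_1, coder_2))  # 0.75
print(cohens_kappa(coder_1, coder_2))       # 0.5
```

The same two functions can be applied to model output versus ground truth, which puts the batching error and the human disagreement on a common scale.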
Circularity Check
No significant circularity detected
The paper reports direct empirical accuracy measurements from running eight LLMs on 3,962 expert-coded tweets across four classification tasks, comparing single-item prompts against batch sizes up to 1,000 and variable stacking up to 25 dimensions. All reported figures are computed against an external ground-truth benchmark; no equations, fitted parameters, or self-citations are used to derive the accuracy numbers from the experimental inputs themselves. The central claims remain observational and externally falsifiable within the tested regime.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Accuracy can be meaningfully compared across prompting regimes using percentage-point differences against expert-coded ground truth.
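The comparison this assumption rests on can be written out in a few lines. The sketch below computes accuracy against ground truth and checks whether a batched condition stays within a tolerance of the baseline; the 2-point default mirrors the threshold quoted in the abstract, while the function names and example accuracies are illustrative.

```python
def accuracy(predicted, gold):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def pp_drop(baseline_acc, condition_acc):
    """Accuracy lost relative to the baseline, in percentage points."""
    return (baseline_acc - condition_acc) * 100

def within_safe_range(baseline_acc, condition_acc, tolerance_pp=2.0):
    """True if the condition loses no more than `tolerance_pp` points."""
    return pp_drop(baseline_acc, condition_acc) <= tolerance_pp

# Illustrative accuracies, not results from the paper.
print(within_safe_range(0.90, 0.89))  # True: a 1-point drop
print(within_safe_range(0.90, 0.85))  # False: a 5-point drop
```

A percentage-point difference treats a drop from 90% to 88% the same as a drop from 60% to 58%, which is part of what the assumption takes for granted.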
Reference graph
Works this paper leans on
- [1] Alizadeh, M., Kubli, M., Samei, Z., Dehghani, S., Bermeo, J. D., Korobeynikova, M., & Gilardi, F. (2025). Open-source LLMs for text annotation: A practical guide for model setting and fine-tuning. Journal of Computational Social Science, 8(1), Article
- [2] Bail, C. A. (2024). Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21), e2314021121. Baumann, J., Röttger, P., Urman, A., Wendsjö, A., Plaza-del-Arco, F. M., Gruber, J. B., & Hovy, D. (2025). Large language model hacking: Quantifying the hidden risks of using LLMs for text annotation. arXiv preprint arXiv:2509....
- [3] Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30), e2305016120. Hartman, V., & Törnberg, P. (2025). Who attacks, and why? Using LLMs to identify negative campaigning in 18M tweets across 19 countries. arXiv preprint arXiv:2507.17636. Heseltin...
- [4] The data were re-used by Alizadeh et al. (2025) to compare multiple LLMs on the same classification tasks. Our evaluation dataset contains 3,962 tweets with human-coded ground-truth labels across four classification variables. Two trained research assistants independently coded each variable. Disagreements were adjudicated to produce the final ground-tr...
- [5] Google: Gemini 3 Flash ($0.50/$3.00), Gemini 3.1 Flash-Lite ($0.25/$1.50), Gemini 2.5 Flash-Lite ($0.10/$0.40). Alibaba: Qwen 3.5 Plus ($0.40/$2.40), Qwen 3.5 Flash ($0.10/$0.40). All models (except GPT-5-mini and GPT-5-nano) were run at temperature 0 to maximize determinism with reasoning turned off. GPT-5-mini and GPT-5-nano were run at the default temperature 1...