Researchers waste 80% of LLM annotation costs by classifying one text at a time

Christian Pipal; Eva-Maria Vogel; Frank Esser; Morgan Wack

arxiv: 2604.03684 · v1 · submitted 2026-04-04 · 💻 cs.CL

Pith reviewed 2026-05-13 17:15 UTC · model grok-4.3

keywords: LLM annotation · text classification · batching · prompt stacking · cost efficiency · accuracy maintenance · social science research · inter-coder agreement

The pith

Batching texts and stacking variables cuts LLM classification costs by over 80% with little accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers typically classify one text per prompt per variable when using LLMs for annotation, leading to high costs. This paper tests whether batching multiple texts and stacking multiple variables into fewer prompts degrades performance. On four tweet classification tasks with expert ground truth, six of eight models kept accuracy nearly the same up to batches of 100 items and stacks of 10 variables. The resulting cost savings exceed 80%, and added error stays below usual human coder disagreement levels. Task complexity, not prompt size, drives any observed drops.

Core claim

Batching 25 items and stacking variables reduces API calls from 400,000 to 4,000 for 100,000 texts on four variables. Six models maintain accuracy within 2 percentage points of the single-item baseline at batch size 100, and stacking up to 10 dimensions matches single-variable results, with error smaller than inter-coder disagreement.
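The call counts in this claim follow from simple arithmetic. The sketch below reproduces them; the function name `api_calls` and its parameters are illustrative and do not come from the paper.

```python
import math

def api_calls(n_texts: int, n_variables: int, batch_size: int = 1, stacked: bool = False) -> int:
    """Number of prompts needed to code every text on every variable."""
    prompts_per_pass = math.ceil(n_texts / batch_size)
    # Stacking codes all variables in one pass; otherwise one pass per variable.
    passes = 1 if stacked else n_variables
    return prompts_per_pass * passes

baseline = api_calls(100_000, 4)                              # one text, one variable per prompt
batched = api_calls(100_000, 4, batch_size=25, stacked=True)  # 25 texts, all variables per prompt
print(baseline, batched)  # 400000 4000
```

Note that the number of calls falls by 99% in this example, while the paper's headline figure of over 80% refers to token costs, not calls.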

What carries the argument

The batch-and-stack prompting technique that processes multiple items and multiple coding dimensions in one prompt to reduce API calls.
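A minimal sketch of what a batch-and-stack prompt could look like is given below. The paper's actual prompt templates are not reproduced on this page, so the layout, the JSON reply format, and the names `build_prompt` and `parse_response` are all assumptions made for illustration.

```python
import json

def build_prompt(texts, variables):
    """Build one prompt that codes several texts on several variables.

    `variables` maps a variable name to its coding instruction.
    The layout is an illustrative assumption, not the paper's template.
    """
    codebook = "\n".join(f"- {name}: {rule}" for name, rule in variables.items())
    items = "\n".join(f"[{i}] {text}" for i, text in enumerate(texts, start=1))
    return (
        "Code each numbered text on every variable below.\n"
        f"Variables:\n{codebook}\n\n"
        f"Texts:\n{items}\n\n"
        "Return a JSON list with one object per text, each with the key "
        '"id" plus one key per variable.'
    )

def parse_response(raw, n_texts, variables):
    """Parse the model's JSON reply and check that no text or variable is missing."""
    rows = json.loads(raw)
    ids = sorted(row["id"] for row in rows)
    if ids != list(range(1, n_texts + 1)):
        raise ValueError("reply does not cover every text exactly once")
    for row in rows:
        missing = set(variables) - set(row)
        if missing:
            raise ValueError(f"text {row['id']} lacks {sorted(missing)}")
    return {row["id"]: {name: row[name] for name in variables} for row in rows}
```

Validating the reply matters more as batches grow, because a skipped or duplicated item silently misaligns every label after it.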

If this is right

Annotation of large text corpora becomes much cheaper without major quality trade-offs for most models.
Social science researchers can scale up content analysis projects that were previously too expensive.
Within the safe range (batches of up to 100 items and up to 10 stacked variables), the method can be used reliably in practice.
Measurement error from this method is smaller than typical human disagreement, supporting its adoption.
Future prompts should prioritize reducing task complexity over minimizing length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach may extend to other generative tasks like summarization if adapted carefully.
Models from different providers show varying tolerance to batching, which suggests testing each model before relying on it.
Applying this in non-English or specialized domains could require smaller batches to stay accurate.
Overall, it lowers the barrier for reproducible large-scale text analysis in research.

Load-bearing premise

Performance on these four tweet tasks with expert coding will hold for other domains, languages, and types of variables.

What would settle it

A study on news articles or legal texts showing accuracy drops exceeding 2 points at batch size 100 would show the safe range does not generalize.

Figures

Figures reproduced from arXiv: 2604.03684 by Christian Pipal, Eva-Maria Vogel, Frank Esser, Morgan Wack.

Figure 2. Accuracy by model, variable, and batch size (Study 1). Each panel shows one of the four...

Original abstract

Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
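One way to see why a 99% drop in calls yields a smaller (though still large) drop in token costs is a simple additive model of input tokens: instructions are paid once per prompt, but every text must still be sent at least once. The sketch below is only that model. The token counts (300 instruction tokens per variable, 50 tokens per tweet) are made-up placeholders, not figures from the paper, and output tokens are ignored.

```python
def input_tokens(n_texts, n_variables, batch_size, stacked,
                 instr_tokens_per_variable, text_tokens):
    """Total input tokens under a simple additive prompt model (an assumption)."""
    n_batches = -(-n_texts // batch_size)  # ceiling division
    passes = 1 if stacked else n_variables
    # Each prompt carries the instructions for the variable(s) it codes;
    # across all passes that is n_variables instruction blocks per batch.
    instr_total = n_batches * n_variables * instr_tokens_per_variable
    # Each text is sent once per pass.
    text_total = passes * n_texts * text_tokens
    return instr_total + text_total

# Placeholder token counts, chosen only to illustrate the mechanism.
baseline = input_tokens(100_000, 4, 1, False, 300, 50)
batched = input_tokens(100_000, 4, 25, True, 300, 50)
saving = 1 - batched / baseline
print(baseline, batched, round(saving, 2))
```

Under this model the saving is bounded by the share of tokens spent on the texts themselves, which is why shorter instructions or longer documents would shrink the benefit.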

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Batching up to 100 items and stacking up to 10 variables keeps accuracy within 2 pp of single-item prompts on this tweet data, with batching error below inter-coder disagreement.

The main takeaway is straightforward: on these four tweet classification tasks with 3,962 expert-coded items, batching texts and stacking variables in the same prompt cuts token costs by over 80% while holding accuracy close to the single-item baseline for six of the eight models tested. The added measurement error stays smaller than typical inter-coder disagreement in the ground-truth labels, which is the practical threshold that matters for most users. That comparison is the part that feels most useful.

The experiment runs eight production models across batch sizes from 1 to 1,000 and stacking up to 25 dimensions, and it reports that degradation tracks task complexity more than prompt length. This is a clean empirical check against real expert labels rather than synthetic data or self-reported consistency. The numbers line up internally and give concrete safe ranges (batch 100, stack 10) that researchers can try immediately.

The soft spots are the usual ones for this kind of work. Everything is on short English tweets and four specific tasks, so generalization to longer documents, other languages, or different variable types is untested. The abstract also skips exact accuracy figures, confidence intervals, and statistical tests, which makes it harder to judge how stable the 2 pp claim really is. If the full paper has those details and a clear limitations section, the contribution holds up.

This is aimed at social scientists who already use LLMs for large-scale coding and want to reduce costs without reinventing their pipeline. Anyone running annotation on tens of thousands of items will get direct value from the reported operating ranges.

I would send it to peer review. The empirical test is large enough and grounded enough to deserve referee attention, even if the scope needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper claims that batching up to 100 texts and stacking up to 10 variables into single LLM prompts keeps accuracy within 2 percentage points of the single-item baseline for six of eight tested models, based on experiments with 3,962 expert-coded tweets across four classification tasks. In the paper's worked example, batching 25 items and stacking all four variables reduces API calls from 400,000 to 4,000 and cuts token costs by over 80%. Degradation is attributed to task complexity rather than prompt length, and batching/stacking error is smaller than typical inter-coder disagreement.

Significance. If the results hold, the work has clear significance for computational social science by supplying concrete, empirically grounded guidelines for cost-efficient LLM annotation at scale. The sizable expert-coded dataset, multi-model testing, and direct comparison to inter-coder reliability provide a practical benchmark that could immediately inform research design decisions.

major comments (1)

[Results] Results section: the central claim that batching/stacking error is 'smaller than typical inter-coder disagreement' requires explicit reporting of the inter-coder agreement statistics (e.g., Cohen's kappa or percent agreement) from the ground-truth data so readers can verify the comparison; without these numbers the claim remains qualitative.

minor comments (2)

[Abstract] Abstract: replace the summary phrase 'within 2 pp' with the actual per-model accuracy deltas, confidence intervals, and the four specific tasks so the abstract is self-contained and reproducible.
[Methods] Methods: specify the exact prompt templates used for batching and stacking (including any formatting or separator tokens) and report the token counts per condition to allow direct replication of the cost calculations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on strengthening the comparison to inter-coder reliability. We address the point below and will revise the manuscript accordingly.

Referee: [Results] Results section: the central claim that batching/stacking error is 'smaller than typical inter-coder disagreement' requires explicit reporting of the inter-coder agreement statistics (e.g., Cohen's kappa or percent agreement) from the ground-truth data so readers can verify the comparison; without these numbers the claim remains qualitative.

Authors: We agree that explicit reporting of the inter-coder agreement statistics is needed to make the comparison verifiable rather than qualitative. The ground-truth dataset was produced by multiple expert coders, and we will add the percent agreement and Cohen's kappa values (computed on the overlapping annotations) to the Results section in the revised manuscript, allowing readers to directly compare these figures against the observed batching/stacking error rates.

Revision: yes
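For readers unfamiliar with the two statistics named in this exchange, the sketch below computes percent agreement and Cohen's kappa for two coders. The label lists are toy values invented for the example and have no connection to the paper's data.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders assigned the same label."""
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Toy labels, invented for illustration only.
coder_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_b = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(percent_agreement(coder_a, coder_b), round(cohens_kappa(coder_a, coder_b), 2))  # 0.8 0.6
```

Reporting both figures per variable would let readers set the human disagreement rate directly beside the accuracy lost to batching.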

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports direct empirical accuracy measurements from running eight LLMs on 3,962 expert-coded tweets across four classification tasks, comparing single-item prompts against batch sizes up to 1,000 and variable stacking up to 25 dimensions. All reported figures are computed against an external ground-truth benchmark; no equations, fitted parameters, or self-citations are used to derive the accuracy numbers from the experimental inputs themselves. The central claims remain observational and externally falsifiable within the tested regime.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation that relies on standard statistical comparison of accuracies and inter-coder agreement; it introduces no new free parameters, axioms beyond basic measurement assumptions, or invented entities.

axioms (1)

Domain assumption: Accuracy can be meaningfully compared across prompting regimes using percentage-point differences against expert-coded ground truth.
Invoked when claiming that a 2 pp degradation is acceptable.
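The comparison this assumption supports can be sketched as follows: compute accuracy for each prompting regime against the same ground truth, then express the difference in percentage points. The helper names and the 2 pp default are illustrative, and the labels are synthetic.

```python
def accuracy(predicted, gold):
    """Share of predictions that match the ground-truth labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def delta_pp(baseline_pred, batched_pred, gold):
    """Accuracy change from baseline to batched, in percentage points."""
    return 100 * (accuracy(batched_pred, gold) - accuracy(baseline_pred, gold))

def within_tolerance(baseline_pred, batched_pred, gold, tol_pp=2.0):
    """True if batched accuracy stays within tol_pp of the baseline."""
    return abs(delta_pp(baseline_pred, batched_pred, gold)) <= tol_pp

def flip_first(labels, k):
    """Return a copy of binary labels with the first k entries flipped."""
    return [1 - x for x in labels[:k]] + labels[k:]

# Synthetic labels: 100 items, with a known number of errors injected.
gold = [1] * 50 + [0] * 50
baseline_pred = flip_first(gold, 10)   # 90% accurate
batched_pred = flip_first(gold, 11)    # 89% accurate
print(within_tolerance(baseline_pred, batched_pred, gold))  # True
```

A point estimate like this says nothing about sampling noise, so a fuller check would attach a confidence interval to the difference.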

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1] Alizadeh, M., Kubli, M., Samei, Z., Dehghani, S., Bermeo, J. D., Korobeynikova, M., & Gilardi, F. (2025). Open-source LLMs for text annotation: A practical guide for model setting and fine-tuning. Journal of Computational Social Science, 8(1), Article...

[2] Bail, C. A. (2024). Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21), e2314021121.
Baumann, J., Röttger, P., Urman, A., Wendsjö, A., Plaza-del-Arco, F. M., Gruber, J. B., & Hovy, D. (2025). Large language model hacking: Quantifying the hidden risks of using LLMs for text annotation. arXiv preprint arXiv:2509....

[3] Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30), e2305016120.
Hartman, V., & Törnberg, P. (2025). Who attacks, and why? Using LLMs to identify negative campaigning in 18M tweets across 19 countries. arXiv preprint arXiv:2507.17636.
Heseltin...

[4] The data were re-used by Alizadeh et al. (2025) to compare multiple LLMs on the same classification tasks. Our evaluation dataset contains 3,962 tweets with human-coded ground-truth labels across four classification variables. Two trained research assistants independently coded each variable. Disagreements were adjudicated to produce the final ground-tr...

[5] Google: Gemini 3 Flash ($0.50/$3.00), Gemini 3.1 Flash-Lite ($0.25/$1.50), Gemini 2.5 Flash-Lite ($0.10/$0.40). Alibaba: Qwen 3.5 Plus ($0.40/$2.40), Qwen 3.5 Flash ($0.10/$0.40). All models (except GPT-5-mini and GPT-5-nano) were run at temperature 0 to maximize determinism with reasoning turned off. GPT-5-mini and GPT-5-nano were run at the default temperature 1...
