arXiv: 2604.03684 · v1 · submitted 2026-04-04 · 💻 cs.CL

Researchers waste 80% of LLM annotation costs by classifying one text at a time

Pith reviewed 2026-05-13 17:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords: LLM annotation · text classification · batching · prompt stacking · cost efficiency · accuracy maintenance · social science research · inter-coder agreement

The pith

Batching texts and stacking variables cuts LLM classification costs by over 80% with little accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers using LLMs for annotation typically classify one text per prompt per variable, which drives up costs. This paper tests whether batching multiple texts and stacking multiple variables into fewer prompts degrades performance. On four tweet classification tasks with expert ground truth, six of eight models stayed within 2 percentage points of the single-item baseline at batch sizes up to 100, and stacking up to 10 variables produced results comparable to single-variable coding. The resulting cost savings exceed 80%, and the added error stays below typical inter-coder disagreement. Where accuracy does drop, task complexity rather than prompt length drives it.

Core claim

For 100,000 texts coded on four variables, batching 25 items and stacking all variables into one prompt reduces API calls from 400,000 to 4,000. Six of eight models maintain accuracy within 2 percentage points of the single-item baseline through batch size 100, and stacking up to 10 dimensions gives results comparable to single-variable coding, with error smaller than typical inter-coder disagreement.
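The call-count arithmetic behind this claim can be checked with a few lines of Python. This is a minimal sketch; the function name `api_calls` and its parameters are illustrative and do not come from the paper.

```python
import math

def api_calls(n_texts: int, n_variables: int, batch_size: int = 1, stack: bool = False) -> int:
    """Number of prompts needed to code every text on every variable."""
    # With stacking, one prompt covers all variables; without it, one prompt per variable.
    prompts_per_batch = 1 if stack else n_variables
    return math.ceil(n_texts / batch_size) * prompts_per_batch

# One text and one variable per prompt.
baseline = api_calls(100_000, 4)
# 25 texts per prompt, all four variables stacked.
batched = api_calls(100_000, 4, batch_size=25, stack=True)

reduction = 1 - batched / baseline
print(baseline, batched, f"{reduction:.0%}")
```

Note that the drop in the number of calls (99% here) is larger than the reported token-cost saving of over 80%. One plausible reason is that each text still has to be sent to the model, so only the repeated instructions are amortized.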

What carries the argument

A batch-and-stack prompting technique that processes multiple items and multiple coding dimensions in a single prompt, reducing the number of API calls.
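To make the technique concrete, here is a hedged sketch of how a batched and stacked prompt might be assembled and its reply validated. The prompt wording, the JSON reply format, and the names `build_prompt` and `parse_response` are assumptions for illustration; the exact templates used in the paper are not given on this page.

```python
import json

def build_prompt(texts, variables):
    """Assemble one prompt covering several texts (batching) and several variables (stacking).

    `variables` maps a hypothetical variable name to its coding instruction.
    """
    lines = ["Code every text on every variable listed below.", "Variables:"]
    for name, instruction in variables.items():
        lines.append(f"- {name}: {instruction}")
    lines.append("Texts:")
    for i, text in enumerate(texts, start=1):
        lines.append(f"[{i}] {text}")
    lines.append(
        "Return a JSON list with one object per text, "
        'e.g. {"id": 1, "<variable>": "<label>"}.'
    )
    return "\n".join(lines)

def parse_response(raw, n_texts, variables):
    """Parse the model's JSON reply and check that no text or variable was dropped."""
    rows = json.loads(raw)
    by_id = {row["id"]: row for row in rows}
    missing = [i for i in range(1, n_texts + 1) if i not in by_id]
    if missing:
        raise ValueError(f"missing texts: {missing}")
    for i in range(1, n_texts + 1):
        absent = [v for v in variables if v not in by_id[i]]
        if absent:
            raise ValueError(f"text {i} lacks variables: {absent}")
    return [by_id[i] for i in range(1, n_texts + 1)]

# Hypothetical variables and texts, for illustration only.
variables = {"relevance": "relevant or irrelevant", "stance": "in favor, against, or neutral"}
texts = ["First example tweet.", "Second example tweet."]
prompt = build_prompt(texts, variables)

# A hand-written reply standing in for a model response.
reply = '[{"id": 1, "relevance": "relevant", "stance": "neutral"},' \
        ' {"id": 2, "relevance": "irrelevant", "stance": "against"}]'
coded = parse_response(reply, len(texts), variables)
```

Validating that every item and every variable comes back is a design choice worth keeping in any real pipeline, since a larger prompt gives the model more opportunities to skip or misalign entries.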

If this is right

  • Annotation of large text corpora becomes much cheaper without major quality trade-offs for most models.
  • Social science researchers can scale up content analysis projects that were previously too expensive.
  • Batch sizes up to 100 and stacks of up to 10 variables form a safe operating range for practical use with most of the tested models.
  • Measurement error from this method is smaller than typical human disagreement, supporting its adoption.
  • Future prompts should prioritize reducing task complexity over minimizing length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may extend to other generative tasks like summarization if adapted carefully.
  • Models from different providers show varying tolerance to batching, suggesting model-specific testing.
  • Applying this in non-English or specialized domains could require smaller batches to stay accurate.
  • Overall, it lowers the barrier for reproducible large-scale text analysis in research.

Load-bearing premise

Performance on these four tweet tasks with expert coding will hold for other domains, languages, and types of variables.

What would settle it

A study on news articles or legal texts showing accuracy drops exceeding 2 points at batch size 100 would show the safe range does not generalize.

Figures

Figures reproduced from arXiv: 2604.03684 by Christian Pipal, Eva-Maria Vogel, Frank Esser, Morgan Wack.

Figure 1: Batching and stacking are safe within bounds. […]
Figure 2: Accuracy by model, variable, and batch size (Study 1). Each panel shows one of the four […]

Original abstract

Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that batching up to 100 texts and stacking up to 10 variables into single LLM prompts reduces annotation API calls (and thus token costs) by over 80% while keeping accuracy within 2 percentage points of the single-item baseline for six of eight tested models, based on experiments with 3,962 expert-coded tweets across four classification tasks; degradation is attributed to task complexity rather than prompt length, and batching/stacking error is smaller than typical inter-coder disagreement.

Significance. If the results hold, the work has clear significance for computational social science by supplying concrete, empirically grounded guidelines for cost-efficient LLM annotation at scale. The sizable expert-coded dataset, multi-model testing, and direct comparison to inter-coder reliability provide a practical benchmark that could immediately inform research design decisions.

major comments (1)
  1. [Results] Results section: the central claim that batching/stacking error is 'smaller than typical inter-coder disagreement' requires explicit reporting of the inter-coder agreement statistics (e.g., Cohen's kappa or percent agreement) from the ground-truth data so readers can verify the comparison; without these numbers the claim remains qualitative.
minor comments (2)
  1. [Abstract] Abstract: replace the summary phrase 'within 2 pp' with the actual per-model accuracy deltas, confidence intervals, and the four specific tasks so the abstract is self-contained and reproducible.
  2. [Methods] Methods: specify the exact prompt templates used for batching and stacking (including any formatting or separator tokens) and report the token counts per condition to allow direct replication of the cost calculations.
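The second minor comment asks for token counts because the cost saving depends on them. The sketch below shows one simple way such a calculation could be set up. All token counts in it (300 instruction tokens per variable, 50 tokens per text) are made-up placeholders, not figures from the paper, and the model ignores output tokens and per-provider pricing.

```python
def input_tokens(n_texts, n_variables, batch_size, stacked,
                 instruction_tokens, text_tokens):
    """Total input tokens under a simple model.

    Every call repeats the instructions for the variables it covers,
    and every text is sent once per call that codes it.
    """
    vars_per_call = n_variables if stacked else 1
    calls_per_batch = 1 if stacked else n_variables
    n_batches = -(-n_texts // batch_size)  # ceiling division
    n_calls = n_batches * calls_per_batch
    instruction_total = n_calls * vars_per_call * instruction_tokens
    text_total = calls_per_batch * n_texts * text_tokens
    return instruction_total + text_total

# Hypothetical token counts, chosen only to illustrate the calculation.
INSTRUCTION_TOKENS = 300  # assumed length of one variable's coding instructions
TEXT_TOKENS = 50          # assumed average length of one tweet

baseline = input_tokens(100_000, 4, 1, False, INSTRUCTION_TOKENS, TEXT_TOKENS)
batched = input_tokens(100_000, 4, 25, True, INSTRUCTION_TOKENS, TEXT_TOKENS)
saving = 1 - batched / baseline
print(baseline, batched, f"{saving:.1%}")
```

Under these placeholder numbers most of the baseline cost is repeated instructions, which is why amortizing them over a batch saves so much. With the actual per-condition token counts, the same function would let readers reproduce the paper's cost figures.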

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on strengthening the comparison to inter-coder reliability. We address the point below and will revise the manuscript accordingly.

  1. Referee: [Results] Results section: the central claim that batching/stacking error is 'smaller than typical inter-coder disagreement' requires explicit reporting of the inter-coder agreement statistics (e.g., Cohen's kappa or percent agreement) from the ground-truth data so readers can verify the comparison; without these numbers the claim remains qualitative.

    Authors: We agree that explicit reporting of the inter-coder agreement statistics is needed to make the comparison verifiable rather than qualitative. The ground-truth dataset was produced by multiple expert coders, and we will add the percent agreement and Cohen's kappa values (computed on the overlapping annotations) to the Results section in the revised manuscript, allowing readers to directly compare these figures against the observed batching/stacking error rates.

    Revision: yes
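For readers unfamiliar with the two statistics named here, the following is a minimal pure-Python sketch of percent agreement and Cohen's kappa for two coders. The labels and the ten-item example are invented for illustration and are not taken from the paper's data.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of items on which both coders assigned the same label."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement from each coder's marginal label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1:
        # Degenerate case: both coders used one identical label throughout.
        # Kappa is formally undefined; this sketch treats it as perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Invented example: two coders, ten items, two disagreements.
coder_1 = ["relevant"] * 6 + ["irrelevant"] * 4
coder_2 = ["relevant"] * 5 + ["irrelevant"] * 4 + ["relevant"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

In this toy case the coders agree on 8 of 10 items, and kappa comes out lower than raw agreement because part of that agreement would be expected by chance. Reporting both lets readers compare the batching error against human disagreement on the same scale.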

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports direct empirical accuracy measurements from running eight LLMs on 3,962 expert-coded tweets across four classification tasks, comparing single-item prompts against batch sizes up to 1,000 and variable stacking up to 25 dimensions. All reported figures are computed against an external ground-truth benchmark; no equations, fitted parameters, or self-citations are used to derive the accuracy numbers from the experimental inputs themselves. The central claims remain observational and externally falsifiable within the tested regime.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation that relies on standard statistical comparison of accuracies and inter-coder agreement; it introduces no new free parameters, axioms beyond basic measurement assumptions, or invented entities.

axioms (1)
  • domain assumption: Accuracy can be meaningfully compared across prompting regimes using percentage-point differences against expert-coded ground truth.
    Invoked when claiming that a 2 pp degradation is acceptable.
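The comparison this axiom relies on can be sketched as follows: compute accuracy against the ground truth for each prompting regime, then take the difference in percentage points. The function names and the synthetic labels are illustrative assumptions, and treating "within 2 pp" as an absolute difference is one reading of the claim.

```python
def accuracy(predictions, gold):
    """Share of predictions that match the expert-coded ground truth."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def delta_pp(baseline_preds, batched_preds, gold):
    """Accuracy change in percentage points (negative means batching did worse)."""
    return 100 * (accuracy(batched_preds, gold) - accuracy(baseline_preds, gold))

def within_tolerance(baseline_preds, batched_preds, gold, tolerance_pp=2.0):
    """True if batched accuracy is within `tolerance_pp` points of the baseline."""
    return abs(delta_pp(baseline_preds, batched_preds, gold)) <= tolerance_pp

# Synthetic labels for illustration: 100 items, all with gold label "yes".
gold = ["yes"] * 100
single_item = ["yes"] * 90 + ["no"] * 10   # 90% accurate
batch_small = ["yes"] * 89 + ["no"] * 11   # 89% accurate
batch_large = ["yes"] * 85 + ["no"] * 15   # 85% accurate

ok_small = within_tolerance(single_item, batch_small, gold)
ok_large = within_tolerance(single_item, batch_large, gold)
```

A percentage-point difference is an absolute gap between two accuracies, not a relative change, which is why a drop from 90% to 89% counts as 1 pp here.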

pith-pipeline@v0.9.0 · 5486 in / 1427 out tokens · 58517 ms · 2026-05-13T17:15:33.301371+00:00 · methodology
