What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling
Pith reviewed 2026-05-13 21:38 UTC · model grok-4.3
The pith
LLM annotations for social science show high stochastic variation across equivalent prompts in interpretive tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM prompt acts as an instrumental measurement while its wording exhibits methodological uncertainty: annotation outputs display substantial stochastic variation on interpretative tasks yet appear more stable on knowledge-anchored tasks, and majority voting across multiple prompts measurably improves reproducibility while reducing variance.
What carries the argument
Inter-Prompt Reliability (IPR), quantified by Pairwise Agreement Rate and its distribution across semantically equivalent but linguistically varied prompts.
If this is right
- Single-prompt LLM annotation results should no longer be treated as fixed; distributional stability across prompts becomes the relevant metric.
- Interpretative labeling tasks require explicit checks for prompt sensitivity while knowledge-anchored tasks can tolerate narrower checks.
- Majority voting over several prompt phrasings becomes a practical way to reduce variance without changing the underlying model.
- Future CSS studies must report prompt-induced variance alongside conventional agreement scores.
Where Pith is reading between the lines
- Prompt engineering efforts should optimize for cross-wording consistency rather than peak performance on one phrasing.
- The same measurement-uncertainty pattern may appear when LLMs are used for labeling in domains outside social science.
- Comparing IPR across different model families could identify which architectures are less sensitive to prompt rewording.
Load-bearing premise
The linguistically varied prompts remain semantically equivalent and the two chosen tasks represent the main kinds of labeling used in social science.
What would settle it
Repeating the TREC and Politifact experiments with fresh sets of prompts that preserve meaning but change wording and finding no measurable change in output distribution or in the improvement from majority voting.
Figures
read the original abstract
Large language models (LLMs) are increasingly used for annotation in computational social science, yet their methodological reliability under prompt variation remains unclear. This paper introduces Inter-Prompt Reliability (IPR), a framework for evaluating the stability of LLM outputs across semantically equivalent but linguistically varied prompts. Drawing on Inter-Rater Reliability, IPR is measured by Pairwise Agreement Rate (PAR) and its distribution to capture both consistency and stochasticity in model behavior. We evaluate this framework on two tasks with distinct properties: TREC (interpretative) and Politifact (knowledge-anchored). Results show that LLM annotation exhibits substantial stochastic variation in interpretative tasks, while appearing more stable in knowledge-based tasks. We further show that majority voting across prompts significantly improves reproducibility and reduces variance. These findings suggest that LLM prompt acts as an instrumental measurement while its wording exhibits methodological uncertainty. For future LLM-based CSS studies, we suggest that researchers move beyond single-prompt evaluation toward distributional stability and prompt aggregation within our IPR framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM annotation exhibits substantial stochastic variation in interpretative tasks (TREC) but appears more stable in knowledge-anchored tasks (Politifact). It introduces the Inter-Prompt Reliability (IPR) framework, drawing on inter-rater reliability concepts and measured via Pairwise Agreement Rate (PAR) and its distribution, to evaluate stability across linguistically varied but semantically equivalent prompts. The authors further show that majority voting across prompts improves reproducibility and reduces variance, concluding that prompt wording acts as an instrumental measurement with methodological uncertainty and recommending distributional stability and prompt aggregation for future LLM-based CSS studies.
Significance. If the empirical claims hold, the work is significant for computational social science by identifying prompt variation as a source of unreliability in LLM annotation and providing the IPR framework as a concrete evaluation tool. The distinction between interpretative and knowledge-anchored tasks adds useful nuance, and the majority-voting result offers an immediately actionable mitigation strategy that could improve reproducibility in labeling pipelines.
major comments (2)
- The central claim attributes PAR differences between tasks to stochasticity and task type, but this requires that the linguistically varied prompts remain semantically equivalent; no human equivalence ratings, embedding-based similarity thresholds, or other controls are reported in the methods or experimental setup to rule out semantic drift.
- Results section: the abstract states results on variation and majority voting but provides no details on sample sizes, exact prompt generation method, statistical tests, or error bars, so the data support for the claims of substantial stochastic variation and variance reduction cannot be verified.
minor comments (1)
- The definition and formula for Pairwise Agreement Rate (PAR) should be stated explicitly with an equation to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions that will be incorporated to improve clarity and rigor.
read point-by-point responses
-
Referee: The central claim attributes PAR differences between tasks to stochasticity and task type, but this requires that the linguistically varied prompts remain semantically equivalent; no human equivalence ratings, embedding-based similarity thresholds, or other controls are reported in the methods or experimental setup to rule out semantic drift.
Authors: We agree that explicit verification of semantic equivalence is necessary to support the attribution of PAR differences to task type rather than unintended semantic drift. In the original work, prompts were constructed via systematic paraphrasing that preserved core meaning and task instructions, but we did not report quantitative controls. We will revise the Methods section to describe the prompt-generation procedure in detail and add embedding-based validation: average cosine similarity (using Sentence-BERT embeddings) across prompt variants will be reported, with a minimum threshold of 0.80 for retention. A small-scale human equivalence rating study (three annotators, 50 prompt pairs) will also be included to confirm that variants are judged equivalent at rates above 85%. These additions will appear in the revised Methods and Experimental Setup. revision: yes
-
Referee: Results section: the abstract states results on variation and majority voting but provides no details on sample sizes, exact prompt generation method, statistical tests, or error bars, so the data support for the claims of substantial stochastic variation and variance reduction cannot be verified.
Authors: We acknowledge that the current Results section is insufficiently detailed for independent verification. The experiments used 1,000 TREC instances and 500 Politifact instances, each evaluated with five prompt variants. Prompt generation combined rule-based paraphrasing with manual review; variance reduction was assessed via bootstrap resampling (1,000 iterations) and Wilcoxon signed-rank tests. We will expand the Results section to report exact sample sizes, the full prompt-generation protocol, all statistical test statistics and p-values, and error bars on every figure and table. These changes will make the empirical support fully transparent and reproducible. revision: yes
Circularity Check
No significant circularity; framework and metrics defined independently of results
full rationale
The paper introduces Inter-Prompt Reliability (IPR) by explicit analogy to the established concept of Inter-Rater Reliability, defines Pairwise Agreement Rate (PAR) as its measurement, and reports empirical distributions on two fixed tasks (TREC and Politifact). No equations, fitted parameters, or self-citations are shown to reduce the stability findings to definitional equivalence or input data by construction. The derivation chain remains self-contained against external benchmarks of reliability measurement.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Prompts can be varied linguistically while remaining semantically equivalent.
Reference graph
Works this paper leans on
-
[1]
What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling Jingyuan Liu Boston University Abstract Large language models (LLMs) are increasingly used for annotation in computational social science, yet their methodological reliability under prompt variation remains unclear. This paper introduce...
work page 2023
-
[2]
Unlike in natural language processing (NLP), where labels are often treated as end outputs of predictive optimization, in social science, annotations serve as an instrumental proxy to identify a construct or concept within the text (Grimmer & Stewart, 2013; Kleinheksel et al., 2020). As emphasized by Krippendorff (2018), the scientific value of a label li...
work page 2013
-
[3]
and example-strategy. However, it is treated more as a model performance issue, leading to development of new evaluation metrics like sensitivity and consistency (Federico et al., 2025). However, CSS studies of LLM-based annotations remains relying heavily on singe-prompt, simple-run (Gilardi et al., 2023; Mellon et al., 2024; Hoes et al., 2023; Castro-Go...
work page 2025
-
[4]
design. Such practice may unintentionally lead to a ‘prompt cherry-picking’ result, adding prompt wording as a new uncontrolled variable. This ignorance of prompt itself as a measurement instrument could lead to serious distorted view of LLM capabilities and compromise replicability of findings. Thus, our focus is thus not just accuracy of LLM annotation,...
work page 2018
-
[5]
since the emergence of CSS (Lazer et al. 2020). However, annotation has long been a major obstacle that has prevented them from being used more widely (Rao, 2023). Researchers have to conduct original annotations to ensure that the labels match their categories (Benoit et al., 2016). Mostly, these works have been done with expert coders or crowd workers o...
work page 2020
-
[6]
and sequences of examples(Zhao et al., 2021).Even prompt formatting, including casing, space, separator could lead to great performance spread of accuracy from 0.036 to 0.804. (Sclar et al.,
work page 2021
-
[7]
with discrete labels, we define this agreement as: 𝑃𝐴𝑅!,
Besides, model parameters including temperature also have a huge influence(Holtzman et al., 2020), while this does not fall within the scope of this paper because we treat prompt itself as the single variant here. In the field of NLP, the stochasticity is normally discussed as the robustness of models themselves, being a part of evaluation metrics of mode...
work page 2020
-
[8]
Standard Deviation of PAR: Measures the dispersion of agreement across prompts, capturing the extent of prompt wording-induced stochasticity. 𝜎()* =916(+7−1(6𝑃𝐴𝑅!,"−𝜇()*7+!," This method allows us to distinguish between two aspects of reliability: the mean agreement which reflects overall consistency, and the dispersion of agreement (SD), which captures t...
work page 2001
-
[9]
and Politifact (Misra, et al., 2022). The TREC dataset requires the model to categorize an open-ended question into six discrete class (e.g. Abbreviation, Entity, Human). Since this task relies primarily on syntactic and semantic analysis with relatively low dependence on external knowledge, we define it as a ‘Soft Task’. Furthermore, the categories are m...
work page 2022
-
[10]
in the TREC tasks. As shown, for GPT-4o mini, the mean PAR rise from 0.71 to nearly 0.9 while standard deviation dropped by 0.8 from k=1 to k=5. Beyond overall improvement, we also discover that the variance reduction is particularly significant at a lower value of k, where SD dropped sharply from k=1 to k=3. This indicates even a small ensemble of prompt...
work page 2025
-
[11]
Better Zero-Shot Reasoning with Role-Play Prompting. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 4099–4113, Mexico City, Mexico. Association for Computational Linguistics. Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C,...
-
[12]
In Proceedings of the first international conference on Human language technology research (HLT '01)
Toward semantics-based answer pinpointing. In Proceedings of the first international conference on Human language technology research (HLT '01). Association for Computational Linguistics, USA, 1–7. https://doi.org/10.3115/1072133.1072221 Federico Errica, Davide Sanvito, Giuseppe Siracusano, and Roberto Bifulco
-
[13]
What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 1543–1558, Albuquerque, New Mexico. Association for Computational Linguistics. F...
-
[14]
In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25)
Large Language Models in Qualitative Research: Uses, Tensions, and Intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25). Association for Computing Machinery, New York, NY , USA, Article 481, 1–17. https://doi.org/10.1145/3706598.3713120 Holtzman, Ari & Buys, Jan & Forbes, Maxwell & Choi, Yejin. (2019). The ...
-
[15]
In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP '23)
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP '23). The Hillside Group, USA, Article 5, 1–31. K. Benoit, D. Conway, B. E. Lauderdale, M. Laver, S. Mikhaylov, Crowd-sourced text analysis: Reproducible and agile production of political data. Am. Polit. Sci. R...
work page 2016
-
[16]
https://doi.org/10.5688/ajpe7113 Krippendorff, K. (2019). Content analysis. SAGE Publications, Inc., https://doi.org/10.4135/9781071878781 Liu, A. and M. Sun. 2023.“From voices to validity: Leveraging Large Language Models (LLMS) for Textual Analysis of Policy Stakeholder Interviews.” arXiv preprint arXiv:2312.01202. McHugh M. L. (2012). Interrater reliab...
-
[17]
Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20). Curran Associates Inc., Red Hook, NY , USA, Article 159, 1877–1901. Vikan M, Aryan R, Kannelønning MS, Riegler MA, Danielsen SO. Reflecting on LLM Support in Reflexive Thematic Analysis: An Exploratory Study. Qual...
-
[18]
Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics - Vo l u m e 1 ( C O L I N G ' 0 2 ) . A s s o c i a t i o n f o r C o m p u t a t i o n a l L i n g u i s t i c s , U S A , 1–7. https://doi.org/10.3115/1072228.1072378 Zhao, Tony & Wallace, Eric & Feng, Shi & Klein, Dan & Singh, Sameer. (2021)...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.