Prompt Stability Scoring for Text Annotation with Large Language Models
Pith reviewed 2026-05-23 23:13 UTC · model grok-4.3
The pith
A metric called the Prompt Stability Score adapts traditional coder reliability methods to assess how consistent large language model text annotations remain across prompt variations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adapting traditional approaches to intra- and inter-coder reliability scoring, the Prompt Stability Score (PSS) provides a general framework for diagnosing prompt stability in LLM text annotation tasks, with a Python package available for its estimation.
What carries the argument
The Prompt Stability Score (PSS), which quantifies stability by treating different prompts as different 'coders' and applying reliability metrics to their outputs on the same data.
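The "prompts as coders" idea can be sketched with a small, self-contained Krippendorff's alpha for nominal labels. This is an illustration only, not the promptstability package's API: the function name `krippendorff_alpha_nominal`, the toy labels, and the simplifying assumption that every prompt variant labels every row (no missing annotations) are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(labels_by_prompt):
    """Krippendorff's alpha for nominal labels with no missing values.

    labels_by_prompt: one list of labels per prompt variant, all of equal
    length, so that position i in every list refers to the same text row.
    """
    n_prompts = len(labels_by_prompt)
    if n_prompts < 2:
        raise ValueError("need at least two prompt variants")
    n_units = len(labels_by_prompt[0])
    if any(len(row) != n_units for row in labels_by_prompt):
        raise ValueError("every prompt variant must label every row")

    # Coincidence matrix: each ordered pair of labels given to the same row
    # by two different prompts contributes 1 / (n_prompts - 1).
    coincidences = Counter()
    for unit in zip(*labels_by_prompt):
        for a, b in permutations(unit, 2):
            coincidences[(a, b)] += 1.0 / (n_prompts - 1)

    # Marginal totals per label value, and the grand total.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        # Only one label value ever appears, so there is nothing to disagree on.
        return 1.0
    return 1.0 - observed / expected


# Toy example: two prompt variants labelling the same four rows.
original = ["a", "a", "b", "b"]
paraphrase = ["a", "b", "b", "b"]
print(krippendorff_alpha_nominal([original, paraphrase]))
```

Values near 1 indicate that the prompt variants label the rows almost identically; values near 0 indicate agreement no better than chance.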
If this is right
- Prompt stability can be diagnosed before scaling up annotation tasks.
- The package enables computation on large-scale datasets with millions of rows.
- Best practice recommendations help applied researchers improve prompt robustness.
- Stability issues can be identified across multiple datasets and outcomes.
Where Pith is reading between the lines
- If the PSS is widely adopted, papers using LLM annotations may start reporting stability scores as standard practice.
- This framework could be extended to measure stability in other LLM applications like generation or reasoning tasks.
- Low PSS values might prompt the development of more robust prompt engineering techniques.
Load-bearing premise
Traditional intra- and inter-coder reliability scoring methods can be directly and meaningfully adapted to quantify the stability of LLM outputs across prompt variations.
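To make the premise concrete, the sketch below computes Cohen's kappa between two label sequences. Feeding it two runs of the same prompt gives an intra-coder-style comparison; feeding it an original prompt and a paraphrase gives an inter-coder-style comparison. The function name `cohens_kappa` and the toy labels are hypothetical, and this is not presented as how the paper or its package computes the PSS.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)

    # Observed agreement: share of rows on which the two sequences match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label distributions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_chance == 1.0:
        # Both sequences use one identical label throughout.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example: labels from an original prompt and from a paraphrase of it.
original = ["pos", "pos", "neg", "neg"]
paraphrase = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(original, paraphrase))  # prints 0.5
```

Kappa only compares two sequences at a time, so with many prompt variants one would either average over pairs or switch to a multi-coder statistic such as Krippendorff's alpha.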
What would settle it
Observing that the PSS indicates high stability for a set of prompts, yet human experts find that the annotations differ substantively in ways that affect downstream analysis, would challenge the metric's validity.
Original abstract
Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad-hoc and task specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package promptstability for its estimation. Using six different datasets and twelve outcomes, we classify ~3.1m rows of data and ~300m input tokens to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a general framework for diagnosing prompt stability in LLM-based text annotation by adapting traditional intra- and inter-coder reliability scoring methods (e.g., Cohen's kappa, Krippendorff's alpha), resulting in the Prompt Stability Score (PSS). It provides a Python package promptstability for estimation and demonstrates the approach via large-scale experiments on six datasets and twelve outcomes, classifying ~3.1m rows and ~300m tokens to diagnose low stability and showcase functionality, concluding with best-practice recommendations.
Significance. If the adaptation is valid, PSS provides a standardized, generalizable diagnostic for prompt sensitivity, moving beyond ad-hoc testing and improving replicability of LLM annotation pipelines. The open-source package and scale of the empirical demonstration (~3.1m rows) are explicit strengths supporting reproducibility and adoption.
major comments (2)
- [Abstract] Abstract, paragraph 2: the central claim that traditional intra- and inter-coder reliability methods transfer directly when prompts replace coders is presented without derivation, error analysis, or validation of the adaptation; this assumption is load-bearing for all diagnostic claims but remains untested in the reported design.
- [Empirical section] Empirical design (six datasets, twelve outcomes): the ~3.1m-row test does not include any adjustment or diagnostic for correlated errors arising from shared token embeddings and attention patterns across semantically similar prompts, which standard reliability formulas do not correct; this risks misrepresenting stability.
Simulated Author's Rebuttal
Thank you for the detailed review. We appreciate the opportunity to clarify the theoretical basis and empirical design of the Prompt Stability Score (PSS). We address the major comments below and commit to revisions where appropriate.
Point-by-point responses
- Referee: [Abstract] Abstract, paragraph 2: the central claim that traditional intra- and inter-coder reliability methods transfer directly when prompts replace coders is presented without derivation, error analysis, or validation of the adaptation; this assumption is load-bearing for all diagnostic claims but remains untested in the reported design.
Authors: We acknowledge that the manuscript presents the adaptation as a direct mapping without an explicit derivation or formal error analysis in the abstract. However, the full paper grounds the PSS in the reliability literature by treating distinct prompts as analogous to distinct coders, with stability measured across prompt variations. The large-scale experiments across six datasets serve as empirical validation by demonstrating the score's ability to identify low-stability cases consistently. To strengthen this, we will add a dedicated subsection in the methods section providing a step-by-step derivation of the adaptation, including discussion of assumptions and potential biases. This will include a brief error analysis comparing PSS to traditional metrics under simulated conditions.
Revision: yes
- Referee: [Empirical section] Empirical design (six datasets, twelve outcomes): the ~3.1m-row test does not include any adjustment or diagnostic for correlated errors arising from shared token embeddings and attention patterns across semantically similar prompts, which standard reliability formulas do not correct; this risks misrepresenting stability.
Authors: The referee correctly identifies that standard reliability measures like Krippendorff's alpha assume independent coders, and our adaptation inherits this. In our design, semantically similar prompts may induce correlated errors due to shared model internals. However, the PSS is intended to measure observed stability as it would be experienced in practice, where such correlations are part of the prompt sensitivity. We did not explicitly model or correct for these correlations, which could indeed affect interpretation in some cases. We will revise the discussion section to explicitly note this limitation, provide guidance on when it may be relevant, and suggest future extensions such as using diverse prompt sets or embedding-based diagnostics to mitigate it. No changes to the core experiments are needed as the current scale already highlights practical stability issues.
Revision: partial
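One simple way to look for the correlated errors discussed in this exchange, assuming a set of reference labels is available, is to correlate the error indicators of two prompt variants. The sketch below computes that correlation (the phi coefficient of two binary error vectors). It is a hypothetical diagnostic for illustration: neither the function name `error_correlation` nor the use of reference labels comes from the paper, the referee, or the authors' reply.

```python
from math import sqrt


def error_correlation(reference, labels_a, labels_b):
    """Correlation between the error indicators of two prompt variants.

    Returns None when either variant makes no errors or only errors,
    because the correlation is undefined in that case.
    """
    if not (len(reference) == len(labels_a) == len(labels_b)) or not reference:
        raise ValueError("all sequences must be non-empty and equally long")
    n = len(reference)

    # 1 where a prompt's label differs from the reference label, else 0.
    errs_a = [int(label != ref) for label, ref in zip(labels_a, reference)]
    errs_b = [int(label != ref) for label, ref in zip(labels_b, reference)]

    mean_a = sum(errs_a) / n
    mean_b = sum(errs_b) / n
    var_a = mean_a * (1.0 - mean_a)
    var_b = mean_b * (1.0 - mean_b)
    if var_a == 0 or var_b == 0:
        return None

    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(errs_a, errs_b)) / n
    return cov / sqrt(var_a * var_b)


# Toy example: both variants are wrong on exactly the same two rows.
reference = [1, 1, 0, 0, 1, 0]
variant_a = [0, 0, 0, 0, 1, 0]
variant_b = [0, 0, 0, 0, 1, 0]
print(error_correlation(reference, variant_a, variant_b))
```

A value close to 1 means the two variants fail on the same rows, in which case high agreement between them says little about whether either one is correct.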
Circularity Check
No significant circularity in derivation of Prompt Stability Score
Full rationale
The paper defines the Prompt Stability Score (PSS) explicitly as an adaptation of established intra- and inter-coder reliability metrics (e.g., Cohen's kappa, Krippendorff's alpha) to LLM prompt variations, as stated in the abstract. No equations, fitted parameters, or self-citations are shown that would reduce the PSS to its inputs by construction, rename a fit as a prediction, or rely on load-bearing self-referential steps. The framework is presented as a direct methodological extension applied to six datasets, making the derivation self-contained without any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Traditional intra- and inter-coder reliability metrics can be directly adapted to measure consistency of LLM outputs across prompt variations
invented entities (1)
- Prompt Stability Score (PSS): no independent evidence
Forward citations
Cited by 1 Pith paper
- LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.