Prompt Stability Scoring for Text Annotation with Large Language Models

Christopher Barrie; Elli Palaiologou; Petter Törnberg

arxiv: 2407.02039 · v3 · pith:VHKYNHJDnew · submitted 2024-07-02 · 💻 cs.CL

Pith reviewed 2026-05-23 23:13 UTC · model grok-4.3

keywords prompt stability · large language models · text annotation · reliability scoring · reproducibility · LLM annotation · intra-coder reliability

The pith

A metric called the Prompt Stability Score adapts traditional coder reliability methods to assess how consistent large language model text annotations remain across prompt variations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the vulnerability of LLM-based text annotation to small changes in prompt design, which can undermine the replicability of results. It adapts established intra- and inter-coder reliability scoring techniques into a general framework for diagnosing prompt stability. The resulting metric is the Prompt Stability Score (PSS), accompanied by a Python package for its calculation. The method is demonstrated on six datasets and twelve outcomes, classifying roughly 3.1 million rows of data. If effective, it would allow researchers to systematically evaluate and improve the reliability of their LLM annotation pipelines.

Core claim

By adapting traditional approaches to intra- and inter-coder reliability scoring, the Prompt Stability Score (PSS) provides a general framework for diagnosing prompt stability in LLM text annotation tasks, with a Python package available for its estimation.

What carries the argument

The Prompt Stability Score (PSS), which quantifies stability by treating different prompts as different 'coders' and applying reliability metrics to their outputs on the same data.
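
As a rough illustration of the prompts-as-coders idea, the sketch below computes a nominal Krippendorff's alpha over the labels that several prompt variants assigned to the same texts. It is not the promptstability package's API or the authors' implementation; the function name `krippendorff_alpha_nominal`, the input layout, and the toy labels are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` is a list of lists (hypothetical layout): each inner list holds
    the labels that the different prompt variants gave to one text.
    Returns NaN when alpha is undefined (e.g. only one label ever appears).
    """
    # Coincidence matrix: every ordered pair of labels from different
    # "coders" within a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label carries no pairing information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return float("nan")

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected


# Toy data (made up): four texts, each labelled by three prompt variants.
toy_units = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "positive"],
    ["negative", "negative", "negative"],
    ["positive", "negative", "positive"],
]
print(krippendorff_alpha_nominal(toy_units))
```

Values near 1 indicate that the prompt variants label the texts almost identically, while values near 0 indicate agreement no better than chance. The same function could in principle be applied to repeated runs of a single prompt, which is one plausible reading of the intra-coder case.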

If this is right

Prompt stability can be diagnosed before scaling up annotation tasks.
The package enables computation on large-scale datasets with millions of rows.
Best practice recommendations help applied researchers improve prompt robustness.
Stability issues can be identified across multiple datasets and outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the PSS is widely adopted, papers using LLM annotations may start reporting stability scores as standard practice.
This framework could be extended to measure stability in other LLM applications like generation or reasoning tasks.
Low PSS values might prompt the development of more robust prompt engineering techniques.

Load-bearing premise

Traditional intra- and inter-coder reliability scoring methods can be directly and meaningfully adapted to quantify the stability of LLM outputs across prompt variations.

What would settle it

Observing that the PSS indicates high stability for a set of prompts, yet human experts find that the annotations differ substantively in ways that affect downstream analysis, would challenge the metric's validity.

Original abstract

Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad-hoc and task specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package promptstability for its estimation. Using six different datasets and twelve outcomes, we classify ~3.1m rows of data and ~300m input tokens to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a general framework for diagnosing prompt stability in LLM-based text annotation by adapting traditional intra- and inter-coder reliability scoring methods (e.g., Cohen's kappa, Krippendorff's alpha), resulting in the Prompt Stability Score (PSS). It provides a Python package promptstability for estimation and demonstrates the approach via large-scale experiments on six datasets and twelve outcomes, classifying ~3.1m rows and ~300m tokens to diagnose low stability and showcase functionality, concluding with best-practice recommendations.

Significance. If the adaptation is valid, PSS provides a standardized, generalizable diagnostic for prompt sensitivity, moving beyond ad-hoc testing and improving replicability of LLM annotation pipelines. The open-source package and scale of the empirical demonstration (~3.1m rows) are explicit strengths supporting reproducibility and adoption.

major comments (2)

[Abstract] Abstract, paragraph 2: the central claim that traditional intra- and inter-coder reliability methods transfer directly when prompts replace coders is presented without derivation, error analysis, or validation of the adaptation; this assumption is load-bearing for all diagnostic claims but remains untested in the reported design.
[Empirical section] Empirical design (six datasets, twelve outcomes): the ~3.1m-row test does not include any adjustment or diagnostic for correlated errors arising from shared token embeddings and attention patterns across semantically similar prompts, which standard reliability formulas do not correct; this risks misrepresenting stability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the opportunity to clarify the theoretical basis and empirical design of the Prompt Stability Score (PSS). We address the major comments below and commit to revisions where appropriate.

Referee: [Abstract] Abstract, paragraph 2: the central claim that traditional intra- and inter-coder reliability methods transfer directly when prompts replace coders is presented without derivation, error analysis, or validation of the adaptation; this assumption is load-bearing for all diagnostic claims but remains untested in the reported design.

Authors: We acknowledge that the manuscript presents the adaptation as a direct mapping without an explicit derivation or formal error analysis in the abstract. However, the full paper grounds the PSS in the reliability literature by treating distinct prompts as analogous to distinct coders, with stability measured across prompt variations. The large-scale experiments across six datasets serve as empirical validation by demonstrating the score's ability to identify low-stability cases consistently. To strengthen this, we will add a dedicated subsection in the methods section providing a step-by-step derivation of the adaptation, including discussion of assumptions and potential biases. This will include a brief error analysis comparing PSS to traditional metrics under simulated conditions.
revision: yes

Referee: [Empirical section] Empirical design (six datasets, twelve outcomes): the ~3.1m-row test does not include any adjustment or diagnostic for correlated errors arising from shared token embeddings and attention patterns across semantically similar prompts, which standard reliability formulas do not correct; this risks misrepresenting stability.

Authors: The referee correctly identifies that standard reliability measures like Krippendorff's alpha assume independent coders, and our adaptation inherits this. In our design, semantically similar prompts may induce correlated errors due to shared model internals. However, the PSS is intended to measure observed stability as it would be experienced in practice, where such correlations are part of the prompt sensitivity. We did not explicitly model or correct for these correlations, which could indeed affect interpretation in some cases. We will revise the discussion section to explicitly note this limitation, provide guidance on when it may be relevant, and suggest future extensions such as using diverse prompt sets or embedding-based diagnostics to mitigate it. No changes to the core experiments are needed as the current scale already highlights practical stability issues.
revision: partial
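
One simple way to look for the correlated-prompt concern raised in this exchange is to inspect agreement pair by pair instead of relying on a single pooled score. The sketch below computes Cohen's kappa for every pair of prompt variants; it is an illustrative diagnostic, not something the paper or the rebuttal describes, and the names `cohen_kappa`, `pairwise_kappa`, and the toy labels are assumptions.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(x, y):
    """Cohen's kappa for two equally long label sequences.

    Returns NaN when chance agreement is 1 (both sequences are the
    same constant label), where kappa is undefined.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(x)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(x, y)) / n
    # Chance agreement from the two marginal label distributions.
    count_x, count_y = Counter(x), Counter(y)
    p_chance = sum(count_x[c] * count_y[c] for c in count_x) / (n * n)
    if p_chance == 1.0:
        return float("nan")
    return (p_observed - p_chance) / (1.0 - p_chance)


def pairwise_kappa(labels_by_prompt):
    """Map each unordered pair of prompt names to its Cohen's kappa."""
    names = sorted(labels_by_prompt)
    return {
        (p, q): cohen_kappa(labels_by_prompt[p], labels_by_prompt[q])
        for p, q in combinations(names, 2)
    }


# Toy data (made up): three prompt variants labelling the same six texts.
toy_labels = {
    "prompt_a": [1, 1, 0, 0, 1, 0],
    "prompt_b": [1, 1, 0, 0, 1, 1],  # close paraphrase of prompt_a
    "prompt_c": [0, 1, 0, 1, 1, 0],  # more distant rewording
}
for pair, kappa in pairwise_kappa(toy_labels).items():
    print(pair, round(kappa, 3))
```

If high agreement is confined to pairs of near-identical paraphrases, that may suggest the pooled score is being driven by prompt similarity, which is the situation the referee describes.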

Circularity Check

0 steps flagged

No significant circularity in derivation of Prompt Stability Score

The paper defines the Prompt Stability Score (PSS) explicitly as an adaptation of established intra- and inter-coder reliability metrics (e.g., Cohen's kappa, Krippendorff's alpha) to LLM prompt variations, as stated in the abstract. No equations, fitted parameters, or self-citations are shown that would reduce the PSS to its inputs by construction, rename a fit as a prediction, or rely on load-bearing self-referential steps. The framework is presented as a direct methodological extension applied to six datasets, making the derivation self-contained without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on the untested premise that human-coder reliability metrics transfer to LLM prompt variation; no free parameters or invented physical entities are described.

axioms (1)

domain assumption: Traditional intra- and inter-coder reliability metrics can be directly adapted to measure consistency of LLM outputs across prompt variations.
Core framework depends on this transfer being valid and meaningful.

invented entities (1)

Prompt Stability Score (PSS) · no independent evidence
purpose: To quantify prompt stability for LLM text annotation
Newly defined metric whose validity rests on the adaptation assumption rather than independent external evidence.

pith-pipeline@v0.9.0 · 5716 in / 1290 out tokens · 27520 ms · 2026-05-23T23:13:45.142966+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
cs.CL · 2026-05 · accept · novelty 7.0

LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.