
arxiv: 2605.28802 · v1 · pith:WLJFF332new · submitted 2026-05-27 · 💻 cs.CL

Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

Pith reviewed 2026-06-29 13:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords human label variation · annotator-specific behavior · explanation generation · preference optimization · natural language inference · paraphrase judgment · large language models

The pith

Large language models can learn stable annotator-specific explanation patterns from human label variation when trained with cross-annotator preference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can reproduce the distinct reasoning styles shown in individual annotators' free-text explanations. It finds that these styles are difficult to detect in single annotations because input content dominates, yet they become clear once content effects are reduced and responses are aggregated by annotator. The authors introduce cross-annotator preference optimization, a method that trains by contrasting a target annotator's response against other valid but less distinctive annotations for the same input. Experiments on natural language inference and paraphrase judgment tasks show that this approach improves over prompting and standard fine-tuning in matching specific behaviors while preserving distinct reasoning under human checks.

Core claim

Annotators show stable individual patterns in label-explanation behavior. These patterns are weak at the single-annotation level because input content dominates, but they become detectable after input-content reduction and annotator-level aggregation. LLMs can learn them through cross-annotator preference optimization, which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. The result is better aggregation-aware imitation and judge-based attribution, with target-specific reasoning patterns preserved under human validation.
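As a minimal sketch of the two preprocessing steps named in the core claim, the Python below subtracts an input embedding from each explanation embedding (one possible form of input-content reduction, in the spirit of the Figure 10 caption) and then averages the residuals in groups of m per annotator. The function names, the record layout, the toy vectors, and the use of consecutive rather than sampled groups are all assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict


def subtract(u, v):
    """Element-wise difference of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reduce_and_aggregate(records, m):
    """Content-reduce each explanation, then average in groups of m per annotator.

    records: iterable of (annotator, explanation_embedding, input_embedding).
    Returns {annotator: [group_mean, ...]}; leftover items that do not fill
    a complete group of size m are dropped.
    """
    residuals = defaultdict(list)
    for annotator, explanation_emb, input_emb in records:
        residuals[annotator].append(subtract(explanation_emb, input_emb))

    groups = {}
    for annotator, vecs in residuals.items():
        groups[annotator] = [
            mean_vector(vecs[i:i + m])
            for i in range(0, len(vecs) - m + 1, m)
        ]
    return groups


# Hypothetical toy data: two annotators, two items, 2-d embeddings.
records = [
    ("A0", [3.0, 1.0], [2.0, 1.0]),
    ("A0", [0.0, 4.0], [-1.0, 3.0]),
    ("A1", [2.0, 3.0], [2.0, 1.0]),
    ("A1", [-1.0, 5.0], [-1.0, 3.0]),
]
group_means = reduce_and_aggregate(records, m=2)
```

In this toy setup the raw explanation embeddings differ mostly by item, while the group means differ by annotator, which is the effect the core claim describes.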

What carries the argument

Cross-annotator preference optimization (CAPO), a training method that contrasts a target annotator's response with other valid but less target-specific annotations for the same input to capture annotator-specific label-explanation behavior.
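The contrast described above can be sketched as preference-pair construction: for each input, the target annotator's response is the preferred side and every other annotator's response to the same input is the dispreferred side. The `Annotation` type, `build_preference_pairs`, and the placeholder response strings are hypothetical. The loss function is a generic DPO-style objective included only as one plausible way to train on such pairs; the page does not state which objective CAPO actually uses.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    item_id: str
    annotator: str
    response: str  # label plus free-text explanation, as one string


def build_preference_pairs(annotations, target):
    """Pair the target annotator's response with each other annotator's
    response to the same item. Items the target did not annotate yield no pairs."""
    by_item = defaultdict(list)
    for ann in annotations:
        by_item[ann.item_id].append(ann)

    pairs = []
    for item_id, group in by_item.items():
        chosen = [a for a in group if a.annotator == target]
        rejected = [a for a in group if a.annotator != target]
        pairs.extend(
            {"item_id": item_id, "chosen": c.response, "rejected": r.response}
            for c in chosen
            for r in rejected
        )
    return pairs


def dpo_style_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Assumed DPO-style loss: -log(sigmoid(margin)), where margin is the
    beta-scaled difference of policy-vs-reference log-ratios."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical placeholder annotations.
annotations = [
    Annotation("item1", "A0", "A0 response to item1"),
    Annotation("item1", "A1", "A1 response to item1"),
    Annotation("item1", "A2", "A2 response to item1"),
    Annotation("item2", "A1", "A1 response to item2"),
]
pairs = build_preference_pairs(annotations, target="A0")
```

Because the rejected side is itself a valid human annotation, the pair carries no signal about correctness, only about which annotator wrote it.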

If this is right

  • Prompting methods remain limited and unstable for reproducing annotator-specific behavior.
  • Supervised fine-tuning captures annotator-specific behavior more effectively than prompting.
  • CAPO improves aggregation-aware imitation and judge-based attribution over supervised fine-tuning.
  • Target-specific reasoning patterns stay preserved under human validation after CAPO training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Annotation pipelines could shift from majority labels toward histories of individual explanations for greater consistency.
  • The same contrastive approach might extend to other subjective tasks where personal reasoning style matters.
  • Models trained this way could serve as proxies for specific annotators when scaling explanation collection.

Load-bearing premise

Annotator-specific patterns in explanations exist and become detectable and learnable only after input-content reduction and annotator-level aggregation.

What would settle it

The claim would be undermined if human judges cannot attribute CAPO-generated explanations to the intended target annotator more often than explanations from standard supervised fine-tuning, or if per-annotator aggregation fails to reveal consistent patterns beyond input effects.

Figures

Figures reproduced from arXiv: 2605.28802 by Anna Korhonen, Barbara Plank, Beiduo Chen, Benjamin Roth, Pingjun Hong, Ziyun Zhang.

Figure 1: VariErr label variation across annotators. An…
Figure 2: VariErr explanation-style features averaged by…
Figure 4: VariErr group-averaged E4 for increasing m. Each visualization samples 80 groups per annotator. Averaging annotations by annotator suppresses item-specific content and reveals stable annotator-specific structure.
  Adjacent body text: … is heavily mixed across annotators. We hypothesize it is because E1 contains the input, making explanations for the same or similar inputs semantically similar regardless of annotator-specific ha…
Figure 5: VariErr annotator-classifier accuracy by rep…
Figure 6: CAPO policy and checkpoint effects. Panels…
Figure 7: VariErr individual-annotation UMAPs for content-reduced explanation representations.
Figure 8: VariErr group-averaged E1 embeddings. Raw context-containing embeddings become more separable with aggregation, but remain weaker than content-reduced representations.
  Adjacent body text: For each representation $E_k$, we fit a ridge regression probe on the training items:
    $\hat{E}_0 = W E_k + b$  (19)
  and evaluate on held-out items using the coefficient of determination:
    $R^2_k = 1 - \frac{\lVert E_0 - \hat{E}_0 \rVert^2}{\lVert E_0 - \bar{E}_0 \rVert^2}$  (20)
  Higher $R^2$ indicates…
Figure 9: VariErr group-averaged E2 embeddings. Removing the input text from the embedded string improves group-level separability.
  Adjacent table (confusion-matrix rows separated by semicolons):
    Representation  Acc.  Confusion matrix
    Features        50.2  [104 47 38 41; 38 102 53 40; 35 29 103 57; 23 29 32 157]
    E1              41.9  [115 35 48 32; 69 54 79 31; 32 30 116 46; 27 36 74 104]
    E2              52.9  [117 47 47 19; 56 89 59 29; 28 29 130 37; 16 19 51 155]
    E3              52.5  [116 51 37 26; 59 …
Figure 10: VariErr group-averaged E3 embeddings. Directly subtracting the input embedding gives a stronger group-level signal than raw E1.
  Adjacent table (confusion-matrix rows separated by semicolons):
    Representation  Acc.  Confusion matrix
    Features        76.7  [177 28 20 25; 8 177 6 59; 22 3 204 21; 18 14 9 209]
    E1              76.6  [182 13 9 46; 30 203 7 10; 27 14 200 9; 53 6 10 181]
    E2              86.7  [201 3 10 36; 7 237 1 5; 23 2 222 3; 38 3 2 207]
    E3              85.3  [188 3 9 50; 9 236 1 4; 18…
Figure 11: Train- and test-time effects of aggregation on annotator-classifier accuracy. We compare classifiers…
Figure 12: R2 score variation across annotators. R2…
Figure 14: R2 E1 UMAP. Compared with VariErr, annotator structure is more visible, but aggregation still provides a cleaner signal.
  Adjacent body text: … for learning fine-grained annotator-specific label-explanation behavior. Since the shared adapter receives gradients from all annotators in the same parameter space, the model may learn an averaged annotation policy and use the ID only weakly, especially when the differences among ann…
Figure 13: R2 explanation-style features averaged by…
Figure 15: R2 group-averaged E4 embeddings. Annotator clusters become nearly perfectly separable after modest aggregation.
  [Adjacent plot, "r2 Group Classifier Accuracy": x-axis group size m (1, 3, 5, 10, 15, 20, 30, 50); y-axis accuracy (0.0 to 1.0); series E1 to E4, each as group and single, plus random.]
Figure 16: R2 annotator-classifier accuracy by represen…
Figure 17: R2 individual-annotation UMAPs for content-reduced explanation representations.
Figure 18: R2 group-averaged E1 embeddings. Even the raw context-containing representation becomes highly separable after aggregation.
  Adjacent body text: … single-instance judgments. E CAPO Ablations. This appendix provides additional details for the CAPO ablations summarized in…
Figure 19: R2 group-averaged E2 embeddings. Label-score and explanation embeddings separate annotators quickly under aggregation.
  Adjacent table:
    Model     Dataset  SFT                   CAPO                  ∆
    Llama3.2  R2       0.971 [0.947, 0.991]  0.995 [0.984, 1.010]  +0.024
    Qwen3     R2       0.934 [0.902, 0.964]  0.973 [0.949, 0.992]  +0.039
    Llama3.2  VariErr  0.951 [0.887, 1.011]  0.969 [0.910, 1.028]  +0.018
    Qwen3     VariErr  0.864 [0.795, 0.936]  0.893 [0.823, 0.969]  +0.029
Figure 20: R2 group-averaged E3 embeddings. The residual representation also saturates to near-perfect group-level separability.
  Adjacent table:
    Model     Policy               Acc ↑  R-L ↑  Feature KL ↓  GC Conf ↑  ImiScore ↑  Judge Acc ↑
    Qwen3     CAPO-strict          0.627  0.281  0.081         0.867      0.888       0.328
              CAPO-near-label      0.580  0.272  0.068         0.907      0.940       0.302
              CAPO-no-restriction  0.575  0.269  0.064         0.918      0.955       0.343
    Llama3.2  CAPO-strict          0.512  0.262  0.121         0.924      0.964       0.297
    …
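
The probe score defined in the text captured with Figure 8 can be sketched in plain Python. The function below computes the coefficient of determination from target embeddings E0 and probe predictions; it does not fit the ridge probe itself. Taking the baseline mean over the evaluated items, the function name, and the toy vectors are assumptions for illustration.

```python
def r_squared(targets, predictions):
    """Coefficient of determination over all embedding dimensions:
    1 - ||E0 - E0_hat||^2 / ||E0 - mean(E0)||^2.

    targets, predictions: equal-length lists of equal-length vectors.
    Assumes the targets are not all identical (non-zero total variance).
    """
    n = len(targets)
    dim = len(targets[0])
    mean = [sum(row[j] for row in targets) / n for j in range(dim)]

    ss_res = sum(
        (t - p) ** 2
        for row_t, row_p in zip(targets, predictions)
        for t, p in zip(row_t, row_p)
    )
    ss_tot = sum(
        (t - mu) ** 2
        for row_t in targets
        for t, mu in zip(row_t, mean)
    )
    return 1.0 - ss_res / ss_tot


# Hypothetical toy targets.
targets = [[1.0, 0.0], [3.0, 0.0]]
perfect = r_squared(targets, targets)                      # exact predictions
baseline = r_squared(targets, [[2.0, 0.0], [2.0, 0.0]])    # always predict the mean
```

A probe that reproduces the targets exactly scores 1, and one that always predicts the mean scores 0, which gives the scale on which the per-representation scores are read.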
Original abstract

Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators' decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each -- natural language inference and paraphrase judgment -- we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that human label variation (HLV) in free-text explanations can be learned as stable annotator-specific behaviors by LLMs. On NLI and paraphrase judgment tasks with four annotators each, single-annotation patterns are weak due to input effects but become detectable after input-content reduction and annotator-level aggregation. Prompting is limited, SFT captures some behavior, and the proposed cross-annotator preference optimization (CAPO) further improves aggregation-aware imitation and judge-based attribution while preserving target-specific patterns per human validation, suggesting scalable explanation-based annotation from annotator histories.

Significance. If the empirical results hold, the work demonstrates a concrete method for modeling individual annotator reasoning preferences beyond aggregate labels, with potential to improve fidelity in explanation generation and annotation systems that leverage per-annotator histories.

major comments (2)
  1. [Experiments] The central empirical claim that CAPO outperforms SFT on aggregation-aware metrics rests on experiments whose details (error bars, statistical tests, data exclusion rules) are not reported in the abstract or summary; this load-bearing support for outperformance is therefore moderate and requires explicit quantification in the results section.
  2. [Analysis of annotator patterns] The stability analysis treats input-content reduction and annotator-level aggregation as the precondition that makes patterns detectable; the specific reduction procedure and controls for whether it introduces artifacts rather than revealing true annotator signals need to be shown to be robust, as this underpins the claim that HLV is learnable as annotator-specific behavior.
minor comments (1)
  1. [Abstract] The abstract mentions 'human validation' but does not specify the protocol or inter-annotator agreement for the judge-based attribution; adding this would improve clarity without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Experiments] The central empirical claim that CAPO outperforms SFT on aggregation-aware metrics rests on experiments whose details (error bars, statistical tests, data exclusion rules) are not reported in the abstract or summary; this load-bearing support for outperformance is therefore moderate and requires explicit quantification in the results section.

    Authors: We agree that the abstract does not contain these details. The full results section reports multiple runs with error bars and basic comparisons, but we will revise it to explicitly include statistical significance tests (e.g., paired t-tests), precise data exclusion criteria, and quantified error bars to strengthen the empirical support for CAPO outperforming SFT. revision: yes

  2. Referee: [Analysis of annotator patterns] The stability analysis treats input-content reduction and annotator-level aggregation as the precondition that makes patterns detectable; the specific reduction procedure and controls for whether it introduces artifacts rather than revealing true annotator signals need to be shown to be robust, as this underpins the claim that HLV is learnable as annotator-specific behavior.

    Authors: Section 3.2 describes the input-content reduction procedure (removal of lexical overlap while retaining annotator-specific reasoning). To demonstrate robustness, we will add controls comparing multiple reduction variants and artifact checks (e.g., shuffled annotator baselines) in a revised analysis subsection. revision: yes
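
The paired t-test named in the first response can be sketched as follows. This is a minimal illustration of the statistic only, assuming per-run scores for CAPO and SFT are paired by seed or split; the function name and the numbers are hypothetical and are not results from the paper.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (sd(d) / sqrt(n)), with d = a - b and sd the sample
    standard deviation (n - 1 in the denominator). Raises ZeroDivisionError
    if all differences are identical.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(variance / n), n - 1


# Hypothetical per-run scores, paired by run.
capo_scores = [0.52, 0.63, 0.41, 0.74]
sft_scores = [0.50, 0.60, 0.40, 0.70]
t_stat, dof = paired_t_statistic(capo_scores, sft_scores)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom; with only a handful of runs, a permutation or bootstrap test may be a more cautious choice.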

Circularity Check

0 steps flagged

No significant circularity; empirical method comparison is self-contained

full rationale

The paper's core argument proceeds from empirical observation (stability of annotator patterns only after aggregation) to method comparison (prompting vs SFT vs CAPO) on two tasks with human validation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The aggregation step is treated as an observed precondition rather than a derived necessity, and CAPO is introduced as a contrastive training procedure whose performance is measured externally. This is a standard empirical ML study whose claims rest on experimental outcomes rather than internal definitional reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on standard machine-learning assumptions about supervised training and preference optimization plus the domain assumption that annotator patterns stabilize under aggregation; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • CAPO contrast and optimization hyperparameters
    The preference optimization procedure requires choices of learning rates, contrast weights, or batch sizes that are not specified in the abstract and must be tuned to achieve the reported gains.
axioms (2)
  • domain assumption: Annotator patterns become detectable and stable after input-content reduction and annotator-level aggregation
    Invoked when the abstract states that single-annotation patterns are weak due to input effects but detectable after reduction and aggregation.
  • standard math: Standard LLM fine-tuning and preference optimization procedures can be applied to explanation data without additional unstated constraints
    Background assumption required for comparing prompting, SFT, and CAPO.


