Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations

Barbara Plank; Beiduo Chen; Benjamin Roth; Marie-Catherine de Marneffe; Pingjun Hong; Siyao Peng

arxiv: 2510.16458 · v2 · submitted 2025-10-18 · 💻 cs.CL

Pingjun Hong, Beiduo Chen, Siyao Peng, Marie-Catherine de Marneffe, Benjamin Roth, Barbara Plank

Pith reviewed 2026-05-18 06:19 UTC · model grok-4.3

keywords: natural language inference, human label variation, explanations, taxonomy, annotator agreement, reasoning categories, NLI datasets

The pith

Annotators often disagree on NLI labels while sharing similar reasoning in their explanations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper broadens the study of human label variation in natural language inference by applying the LiTEx taxonomy to free-text explanations from two datasets. It aligns three measures of annotator behavior: whether they pick the same NLI label, whether their explanations are semantically similar, and whether those explanations fall into the same reasoning category according to the taxonomy, while also tracking selection bias. The analysis identifies cases where label disagreement co-occurs with both high explanation similarity and matching reasoning categories. It further documents consistent individual preferences in how annotators choose both labels and explanation strategies. The central observation is that agreement on reasoning categories tracks explanation similarity more closely than agreement on the final label does.

Core claim

When annotators select different NLI labels for the same premise-hypothesis pair, their explanations can still belong to the same LiTEx reasoning category and exhibit high semantic similarity, indicating that label disagreement can conceal underlying agreement in how the text is interpreted.
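
As a rough illustration of the three-way alignment described above, the sketch below compares every pair of annotators on one item along label agreement, reasoning-category agreement, and explanation similarity. Everything in it is an assumption made for illustration: the field names, the toy annotations, the placeholder category names `cat_A` and `cat_B` (not actual LiTEx categories), and the bag-of-words cosine, which is only a crude stand-in for whatever similarity measure the paper uses.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy annotations for a single premise-hypothesis pair.
# Labels, categories, and explanations are invented for illustration.
annotations = [
    {"annotator": "a1", "label": "entailment", "category": "cat_A",
     "explanation": "the man is outside so he is not indoors"},
    {"annotator": "a2", "label": "neutral", "category": "cat_A",
     "explanation": "the man is outside but he may go indoors"},
    {"annotator": "a3", "label": "neutral", "category": "cat_B",
     "explanation": "park and garden are different words"},
]

def bow_cosine(text_a, text_b):
    """Cosine similarity over word counts (stand-in for embedding similarity)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def align_pairs(items):
    """Return one row per annotator pair with the three agreement signals."""
    rows = []
    for x, y in combinations(items, 2):
        rows.append({
            "pair": (x["annotator"], y["annotator"]),
            "same_label": x["label"] == y["label"],
            "same_category": x["category"] == y["category"],
            "similarity": bow_cosine(x["explanation"], y["explanation"]),
        })
    return rows

pairs = align_pairs(annotations)

# Pairs where the label differs but the reasoning category matches:
# the pattern the core claim describes.
masked = [r for r in pairs if not r["same_label"] and r["same_category"]]
for row in masked:
    print(row["pair"], round(row["similarity"], 3))
```

In this toy data only the pair `a1`/`a2` disagrees on the label while sharing a category. A real analysis would replace `bow_cosine` with a proper sentence-similarity model.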

What carries the argument

The LiTEx taxonomy, which assigns free-text explanations to discrete reasoning categories, serves as the lens for measuring agreement beyond surface labels while accounting for annotator selection bias.

If this is right

NLI datasets can be enriched with explanation data to distinguish superficial label conflict from genuine interpretive difference.
Evaluation of NLI systems may shift from single-label accuracy toward consistency with observed reasoning categories.
Annotation protocols could prioritize alignment on reasoning steps rather than final labels to reduce apparent variation.
Individual annotator profiles become visible through stable preferences in both label choice and explanation category.
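
One simple way to make such annotator profiles visible is to tabulate, per annotator, the relative frequency of each label and each reasoning category. The sketch below is a minimal version under stated assumptions: the `(annotator, label, category)` record layout, the toy records, and the placeholder category names are all hypothetical and do not come from the paper's data.

```python
from collections import Counter, defaultdict

# Hypothetical (annotator, NLI label, reasoning category) records.
records = [
    ("a1", "entailment", "cat_A"),
    ("a1", "entailment", "cat_A"),
    ("a1", "entailment", "cat_A"),
    ("a1", "neutral", "cat_B"),
    ("a2", "contradiction", "cat_B"),
    ("a2", "contradiction", "cat_B"),
    ("a2", "neutral", "cat_B"),
    ("a2", "neutral", "cat_B"),
]

def annotator_profiles(rows):
    """Map each annotator to relative frequencies of labels and categories."""
    counts = defaultdict(lambda: {"labels": Counter(), "categories": Counter()})
    for annotator, label, category in rows:
        counts[annotator]["labels"][label] += 1
        counts[annotator]["categories"][category] += 1
    profiles = {}
    for annotator, kinds in counts.items():
        profiles[annotator] = {
            kind: {key: n / sum(counter.values()) for key, n in counter.items()}
            for kind, counter in kinds.items()
        }
    return profiles

profiles = annotator_profiles(records)
for annotator, profile in profiles.items():
    print(annotator, profile)
```

Comparing these distributions across annotators, or across datasets for the same annotator, gives a first look at whether preferences are stable. It does not by itself separate annotator preference from differences in the items each annotator saw.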

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar explanation-based decomposition could be tested on annotation variation in other tasks such as question answering or summarization.
Models might be trained or evaluated on multiple valid reasoning paths per instance instead of a single gold label.
Collecting explanations at scale could support new reliability metrics that treat reasoning agreement as a signal of shared understanding.

Load-bearing premise

The LiTEx taxonomy supplies a stable, unbiased categorization of explanations that remains consistent across different annotators and datasets even in cases of label disagreement.

What would settle it

Have a new set of independent annotators re-categorize the same explanations with the LiTEx taxonomy. The premise would fail if their agreement on reasoning categories is low, or if category agreement does not correlate with the measured semantic similarity of the explanations.

Original abstract

Natural Language Inference (NLI) datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning categories. However, previous work applying LiTEx has focused on within-label variation: cases where annotators agree on the NLI label but provide different explanations. This paper broadens the scope by examining how annotators may diverge not only in the reasoning category but also in the labeling. We use explanations as a lens to analyze variation in NLI annotations and to examine individual differences in reasoning. We apply LiTEx to two NLI datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators' selection bias. We observe instances where annotators disagree on the label but provide similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning categories better reflects the semantic similarity of explanations than label agreement alone. Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Extends LiTEx analysis to label-disagreement cases in NLI and surfaces cases where similar explanations cross labels, but the comparative claim on taxonomy agreement rests on thin validation.

This paper applies the existing LiTEx taxonomy to NLI explanations and includes cases where annotators pick different labels, not just different explanations within one label. The authors report instances of similar explanations across disagreeing labels and note that taxonomy category agreement lines up with explanation similarity more closely than label agreement does. They also flag individual annotator preferences in reasoning style and label choice, plus the role of selection bias in the data collection.

Referee Report

2 major / 2 minor

Summary. The paper applies the LiTEx taxonomy to free-text explanations from two existing NLI datasets to decompose human label variation. It aligns instances across NLI label agreement, explanation semantic similarity, and taxonomy category agreement while noting annotator selection bias. The central observation is that annotators sometimes disagree on labels yet provide similar explanations, and that agreement on reasoning categories tracks explanation similarity more closely than label agreement alone, revealing individual differences in reasoning strategies and cautioning against treating labels as ground truth.

Significance. If the findings hold, the work is significant for computational linguistics because it extends prior within-label explanation analyses to cross-label disagreement cases and demonstrates that reasoning categories can surface underlying interpretive agreements masked by surface label differences. The observational use of existing datasets without new parameters or derivations is a strength, as is the explicit acknowledgment of selection bias; the result could inform more nuanced dataset curation and evaluation practices that incorporate explanations rather than labels alone.

major comments (2)

[§3 and §4] §3 (Taxonomy Application) and §4 (Results): The claim that taxonomy agreement better reflects explanation similarity than label agreement is load-bearing for the central contribution, yet the manuscript provides no inter-annotator reliability metrics or agreement scores for LiTEx category assignment specifically on the label-disagreement subsets. Without these controls, it remains possible that category assignments correlate with label choice or dataset phrasing patterns, rendering the comparative advantage over labels potentially tautological rather than independently informative.
[§4.2] §4.2 (Alignment Analysis): The reported instances of label disagreement with similar explanations are presented observationally; the paper does not include explicit statistical tests (e.g., correlation coefficients or permutation baselines) that isolate the effect of taxonomy agreement from the acknowledged annotator selection bias. This weakens the support for the stronger claim that reasoning categories are a superior lens.

minor comments (2)

[§2] The related-work section would benefit from a short explicit contrast with prior NLI disagreement studies that also use explanations, to clarify the incremental contribution.
[Figures] Figure captions should explicitly state whether similarity is measured by embedding cosine or human judgment to avoid ambiguity in interpreting the alignment plots.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on strengthening the quantitative support for our claims regarding taxonomy agreement versus label agreement. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [§3 and §4] §3 (Taxonomy Application) and §4 (Results): The claim that taxonomy agreement better reflects explanation similarity than label agreement is load-bearing for the central contribution, yet the manuscript provides no inter-annotator reliability metrics or agreement scores for LiTEx category assignment specifically on the label-disagreement subsets. Without these controls, it remains possible that category assignments correlate with label choice or dataset phrasing patterns, rendering the comparative advantage over labels potentially tautological rather than independently informative.

Authors: We agree that explicit inter-annotator reliability metrics for LiTEx category assignment on the label-disagreement subsets would strengthen the manuscript and help rule out potential confounds with label choice or phrasing. While the original LiTEx work reported high agreement during taxonomy development, we did not recompute these metrics on our specific disagreement subsets. In the revised version, we will add a new analysis section reporting agreement scores (e.g., Cohen's kappa) obtained by having an independent annotator re-label a representative sample of the label-disagreement instances. This will provide the requested controls and demonstrate that category assignments are reliable and informative beyond label patterns.
Revision: yes
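
For readers unfamiliar with the agreement score named in this response, the sketch below computes Cohen's kappa for two annotators from its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each annotator's marginal category frequencies. The category names and both label sequences are hypothetical toy values, not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same category.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal category frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical category assignments on eight label-disagreement instances.
original = ["cat_A", "cat_A", "cat_B", "cat_B", "cat_C", "cat_C", "cat_A", "cat_B"]
relabel = ["cat_A", "cat_A", "cat_B", "cat_C", "cat_C", "cat_C", "cat_A", "cat_A"]

kappa = cohens_kappa(original, relabel)
print(round(kappa, 4))
```

Here the annotators agree on 6 of 8 items (p_o = 0.75) and chance agreement is 21/64, giving kappa = 27/43, roughly 0.63. With more than two annotators, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual choice.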
Referee: [§4.2] §4.2 (Alignment Analysis): The reported instances of label disagreement with similar explanations are presented observationally; the paper does not include explicit statistical tests (e.g., correlation coefficients or permutation baselines) that isolate the effect of taxonomy agreement from the acknowledged annotator selection bias. This weakens the support for the stronger claim that reasoning categories are a superior lens.

Authors: We acknowledge that the alignment analysis in §4.2 is observational in nature, consistent with our use of existing datasets and the explicit discussion of annotator selection bias in the manuscript. To provide stronger quantitative backing, we will incorporate additional statistical measures in the revision, including Pearson or Spearman correlations between taxonomy agreement and explanation semantic similarity (measured via sentence embeddings), alongside comparisons to label agreement. We will also add a permutation baseline to assess whether observed patterns exceed chance levels. While fully isolating the taxonomy effect from selection bias may remain partially limited by dataset constraints, these tests will better support the claim that reasoning categories offer a superior lens.
Revision: partial
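
A permutation baseline of the kind mentioned in this response could look like the sketch below. It is an assumption about one possible design, not the authors' planned analysis: the test statistic is the gap in mean explanation similarity between pairs that agree on a binary flag (same category, or same label) and pairs that do not, and the null distribution comes from shuffling that flag across pairs. The similarity scores and flags are hypothetical toy values.

```python
import random
from statistics import mean

def mean_gap(similarities, flags):
    """Mean similarity of flagged pairs minus mean similarity of unflagged pairs."""
    flagged = [s for s, f in zip(similarities, flags) if f]
    unflagged = [s for s, f in zip(similarities, flags) if not f]
    return mean(flagged) - mean(unflagged)

def permutation_test(similarities, flags, n_permutations=2000, seed=0):
    """One-sided permutation p-value for the observed mean gap."""
    rng = random.Random(seed)
    observed = mean_gap(similarities, flags)
    shuffled = list(flags)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between flag and similarity
        if mean_gap(similarities, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)

# Hypothetical annotator pairs: similarity score plus two agreement flags.
similarities = [0.82, 0.77, 0.91, 0.35, 0.40, 0.28, 0.85, 0.31]
same_category = [True, True, True, False, False, False, True, False]
same_label = [True, False, True, False, True, False, False, True]

gap_category, p_category = permutation_test(similarities, same_category)
gap_label, p_label = permutation_test(similarities, same_label)
print("category:", round(gap_category, 4), round(p_category, 4))
print("label:   ", round(gap_label, 4), round(p_label, 4))
```

Two caveats apply to any such sketch. Pairs that share an annotator or an item are not independent, so a real test would likely shuffle within items or within annotators. And shuffling flags does not remove selection bias in which items received explanations, which matches the limitation the response itself concedes.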

Circularity Check

0 steps flagged

No significant circularity in observational taxonomy application

The paper applies the pre-existing LiTEx taxonomy to two established NLI datasets to align label agreement, explanation similarity, and taxonomy categories while noting annotator selection bias. No equations, derivations, fitted parameters presented as predictions, or self-referential definitions appear in the described methodology or findings. The central claim that taxonomy agreement better tracks explanation similarity rests on direct empirical observation rather than any reduction to inputs by construction or load-bearing self-citation chains. This is a standard observational study self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis depends on the validity of the pre-existing LiTEx taxonomy and standard NLI datasets; no new free parameters, invented entities, or ad-hoc axioms are introduced.

axioms (1)

Domain assumption: The LiTEx taxonomy accurately and consistently categorizes free-text explanations into reasoning types across different annotators and NLI datasets.
All alignment of variation and conclusions about reasoning agreement rest on this categorization being reliable.

pith-pipeline@v0.9.0 · 5782 in / 1265 out tokens · 32803 ms · 2026-05-18T06:19:35.933465+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Quantifying and Predicting Disagreement in Graded Human Ratings
cs.CL · 2026-05 · unverdicted · novelty 5.0

Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.