Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
Pith reviewed 2026-05-18 06:19 UTC · model grok-4.3
The pith
Annotators often disagree on NLI labels while sharing similar reasoning in their explanations
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When annotators select different NLI labels for the same premise-hypothesis pair, their explanations can still belong to the same LiTEx reasoning category and exhibit high semantic similarity, indicating that label disagreement can conceal underlying agreement in how the text is interpreted.
What carries the argument
The LiTEx taxonomy, which assigns free-text explanations to discrete reasoning categories, serves as the lens for measuring agreement beyond surface labels while accounting for annotator selection bias.
If this is right
- NLI datasets can be enriched with explanation data to distinguish superficial label conflict from genuine interpretive difference.
- Evaluation of NLI systems may shift from single-label accuracy toward consistency with observed reasoning categories.
- Annotation protocols could prioritize alignment on reasoning steps rather than final labels to reduce apparent variation.
- Individual annotator profiles become visible through stable preferences in both label choice and explanation category.
Where Pith is reading between the lines
- Similar explanation-based decomposition could be tested on annotation variation in other tasks such as question answering or summarization.
- Models might be trained or evaluated on multiple valid reasoning paths per instance instead of a single gold label.
- Collecting explanations at scale could support new reliability metrics that treat reasoning agreement as a signal of shared understanding.
Load-bearing premise
The LiTEx taxonomy supplies a stable, unbiased categorization of explanations that remains consistent across different annotators and datasets even in cases of label disagreement.
What would settle it
Having a new set of independent annotators re-categorize the same explanations with the LiTEx taxonomy. Low agreement on reasoning categories, or category agreement that fails to correlate with the measured semantic similarity of the explanations, would undercut the claim.
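One way to quantify the re-categorization check is Cohen's kappa between the original category assignments and those of an independent re-annotator. The following is a minimal sketch, not the paper's procedure: the category names (`cat_A`, `cat_B`, `cat_C`) are hypothetical placeholders rather than actual LiTEx categories, and the six items are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same category by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal category rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical category assignments for six explanations (illustrative only).
original = ["cat_A", "cat_A", "cat_B", "cat_B", "cat_C", "cat_C"]
relabel = ["cat_A", "cat_A", "cat_B", "cat_C", "cat_C", "cat_C"]
kappa = cohens_kappa(original, relabel)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected from the annotators' marginal category frequencies, which matters when a few categories dominate. What counts as "low" agreement is a judgment call that the source does not fix.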
Original abstract
Natural Language Inference (NLI) datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning categories. However, previous work applying LiTEx has focused on within-label variation: cases where annotators agree on the NLI label but provide different explanations. This paper broadens the scope by examining how annotators may diverge not only in the reasoning category but also in the labeling. We use explanations as a lens to analyze variation in NLI annotations and to examine individual differences in reasoning. We apply LiTEx to two NLI datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators' selection bias. We observe instances where annotators disagree on the label but provide similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning categories better reflects the semantic similarity of explanations than label agreement alone. Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.
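The alignment the abstract describes can be pictured as sorting explanation pairs into cells by whether the two annotators agree on the NLI label and on the reasoning category, then comparing explanation similarity across cells. The sketch below assumes a hypothetical record layout and invented similarity scores; the category names are placeholders, and none of the numbers come from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical explanation pairs (illustrative values only):
# (label_1, label_2, category_1, category_2, explanation_similarity)
pairs = [
    ("entailment", "neutral", "cat_A", "cat_A", 0.82),
    ("entailment", "entailment", "cat_A", "cat_B", 0.41),
    ("neutral", "contradiction", "cat_B", "cat_C", 0.30),
    ("neutral", "neutral", "cat_C", "cat_C", 0.90),
    ("entailment", "neutral", "cat_B", "cat_B", 0.78),
]


def cell_means(pairs):
    """Mean similarity per (label_agree, category_agree) cell."""
    cells = defaultdict(list)
    for label_1, label_2, cat_1, cat_2, similarity in pairs:
        key = (label_1 == label_2, cat_1 == cat_2)
        cells[key].append(similarity)
    return {key: mean(values) for key, values in cells.items()}


means = cell_means(pairs)
for (label_agree, category_agree), value in sorted(means.items()):
    print(f"label_agree={label_agree} category_agree={category_agree}: {value:.2f}")
```

The cell of interest for the core claim is `(False, True)`: pairs whose labels differ while the reasoning category matches.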
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies the LiTEx taxonomy to free-text explanations from two existing NLI datasets to decompose human label variation. It aligns instances across NLI label agreement, explanation semantic similarity, and taxonomy category agreement while noting annotator selection bias. The central observation is that annotators sometimes disagree on labels yet provide similar explanations, and that agreement on reasoning categories tracks explanation similarity more closely than label agreement alone, revealing individual differences in reasoning strategies and cautioning against treating labels as ground truth.
Significance. If the findings hold, the work is significant for computational linguistics because it extends prior within-label explanation analyses to cross-label disagreement cases and demonstrates that reasoning categories can surface underlying interpretive agreements masked by surface label differences. The observational use of existing datasets without new parameters or derivations is a strength, as is the explicit acknowledgment of selection bias; the result could inform more nuanced dataset curation and evaluation practices that incorporate explanations rather than labels alone.
major comments (2)
- [§3 and §4] §3 (Taxonomy Application) and §4 (Results): The claim that taxonomy agreement better reflects explanation similarity than label agreement is load-bearing for the central contribution, yet the manuscript provides no inter-annotator reliability metrics or agreement scores for LiTEx category assignment specifically on the label-disagreement subsets. Without these controls, it remains possible that category assignments correlate with label choice or dataset phrasing patterns, rendering the comparative advantage over labels potentially tautological rather than independently informative.
- [§4.2] §4.2 (Alignment Analysis): The reported instances of label disagreement with similar explanations are presented observationally; the paper does not include explicit statistical tests (e.g., correlation coefficients or permutation baselines) that isolate the effect of taxonomy agreement from the acknowledged annotator selection bias. This weakens the support for the stronger claim that reasoning categories are a superior lens.
minor comments (2)
- [§2] The related-work section would benefit from a short explicit contrast with prior NLI disagreement studies that also use explanations, to clarify the incremental contribution.
- [Figures] Figure captions should explicitly state whether similarity is measured by embedding cosine or human judgment to avoid ambiguity in interpreting the alignment plots.
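For reference, if similarity is measured by embedding cosine, the quantity in question is the following. This is a generic sketch over plain Python lists; the vectors are made up, and the embedding model that would produce real ones is not specified in the text above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Made-up stand-ins for the embeddings of two explanations.
explanation_1 = [0.2, 0.7, 0.1]
explanation_2 = [0.25, 0.6, 0.2]
score = cosine_similarity(explanation_1, explanation_2)
print(f"cosine similarity = {score:.3f}")
```

A human similarity judgment would not reduce to a formula like this, which is why the caption should say which measure the plots use.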
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on strengthening the quantitative support for our claims regarding taxonomy agreement versus label agreement. We address each major comment below and indicate the revisions we will make to the manuscript.
- Referee: [§3 and §4] §3 (Taxonomy Application) and §4 (Results): The claim that taxonomy agreement better reflects explanation similarity than label agreement is load-bearing for the central contribution, yet the manuscript provides no inter-annotator reliability metrics or agreement scores for LiTEx category assignment specifically on the label-disagreement subsets. Without these controls, it remains possible that category assignments correlate with label choice or dataset phrasing patterns, rendering the comparative advantage over labels potentially tautological rather than independently informative.
Authors: We agree that explicit inter-annotator reliability metrics for LiTEx category assignment on the label-disagreement subsets would strengthen the manuscript and help rule out potential confounds with label choice or phrasing. While the original LiTEx work reported high agreement during taxonomy development, we did not recompute these metrics on our specific disagreement subsets. In the revised version, we will add a new analysis section reporting agreement scores (e.g., Cohen's kappa) obtained by having an independent annotator re-label a representative sample of the label-disagreement instances. This will provide the requested controls and demonstrate that category assignments are reliable and informative beyond label patterns. revision: yes
- Referee: [§4.2] §4.2 (Alignment Analysis): The reported instances of label disagreement with similar explanations are presented observationally; the paper does not include explicit statistical tests (e.g., correlation coefficients or permutation baselines) that isolate the effect of taxonomy agreement from the acknowledged annotator selection bias. This weakens the support for the stronger claim that reasoning categories are a superior lens.
Authors: We acknowledge that the alignment analysis in §4.2 is observational in nature, consistent with our use of existing datasets and the explicit discussion of annotator selection bias in the manuscript. To provide stronger quantitative backing, we will incorporate additional statistical measures in the revision, including Pearson or Spearman correlations between taxonomy agreement and explanation semantic similarity (measured via sentence embeddings), alongside comparisons to label agreement. We will also add a permutation baseline to assess whether observed patterns exceed chance levels. While fully isolating the taxonomy effect from selection bias may remain partially limited by dataset constraints, these tests will better support the claim that reasoning categories offer a superior lens. revision: partial
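A permutation baseline of the kind discussed in this exchange could look like the sketch below. It is an assumption about how such a test might be set up, not the authors' analysis: the similarity scores and agreement flags are invented, and the statistic is the gap in mean similarity between category-agreeing and category-disagreeing pairs.

```python
import random
from statistics import mean


def mean_gap(similarities, agree_flags):
    """Mean similarity of category-agreeing pairs minus that of the rest."""
    agree = [s for s, flag in zip(similarities, agree_flags) if flag]
    disagree = [s for s, flag in zip(similarities, agree_flags) if not flag]
    return mean(agree) - mean(disagree)


def permutation_p_value(similarities, agree_flags, n_permutations=2000, seed=0):
    """One-sided p-value: how often shuffled flags give a gap >= the observed one."""
    rng = random.Random(seed)
    observed = mean_gap(similarities, agree_flags)
    flags = list(agree_flags)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(flags)  # breaks any link between agreement and similarity
        if mean_gap(similarities, flags) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Invented data: eight explanation pairs, four of which share a category.
sims = [0.82, 0.78, 0.90, 0.85, 0.41, 0.30, 0.35, 0.44]
flags = [True, True, True, True, False, False, False, False]
gap, p_value = permutation_p_value(sims, flags)
print(f"observed gap = {gap:.4f}, p = {p_value:.4f}")
```

Shuffling flags across all pairs tests only whether the gap exceeds chance. It does not separate taxonomy agreement from annotator selection bias; restricting the shuffles to pairs from the same annotators would be one possible refinement, and whether the datasets support that is not stated here.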
Circularity Check
No significant circularity in observational taxonomy application
The paper applies the pre-existing LiTEx taxonomy to two established NLI datasets to align label agreement, explanation similarity, and taxonomy categories while noting annotator selection bias. No equations, derivations, fitted parameters presented as predictions, or self-referential definitions appear in the described methodology or findings. The central claim that taxonomy agreement better tracks explanation similarity rests on direct empirical observation rather than any reduction to inputs by construction or load-bearing self-citation chains. This is a standard observational study self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The LiTEx taxonomy accurately and consistently categorizes free-text explanations into reasoning types across different annotators and NLI datasets.
Forward citations
Cited by 1 Pith paper
- Quantifying and Predicting Disagreement in Graded Human Ratings
Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.