Investigating Biases in Textual Entailment Datasets

Aaron Courville; Chin-Wei Huang; Shawn Tan; Yikang Shen

arxiv: 1906.09635 · v1 · pith:SQSNADOLnew · submitted 2019-06-23 · 💻 cs.CL · cs.LG

Investigating Biases in Textual Entailment Datasets

Shawn Tan , Yikang Shen , Chin-Wei Huang , Aaron Courville This is my paper

Pith reviewed 2026-05-25 17:29 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords textual entailmentdataset biasSNLIMultiNLIhypothesis-only classificationcrowdsourcing artifactsstatistical patterns

0 comments

The pith

Hypothesis-only classification reaches 64% accuracy on SNLI, revealing dataset biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training a classifier on only the hypothesis sentences from the SNLI dataset produces 64% accuracy for the three-class entailment task. This result indicates that the hypotheses contain statistical patterns that align with the labels, patterns likely introduced during crowdsourced collection. The authors measure similar patterns in MultiNLI and outline a simple approach to lessen their effect. If the patterns are artifacts rather than signals of entailment, then standard benchmarks may credit models for shortcut learning instead of genuine reasoning.

Core claim

Performing classification on just the hypotheses on the SNLI dataset yields an accuracy of 64%. The paper analyzes the bias extent in the SNLI and the MultiNLI dataset, discusses its implication, and proposes a simple method to reduce the biases in the datasets.

What carries the argument

Hypothesis-only classification, a baseline that predicts entailment labels from the second sentence alone.

If this is right

Both SNLI and MultiNLI contain measurable hypothesis-only biases.
These biases affect how model performance on entailment should be interpreted.
A simple debiasing procedure can lower the amount of exploitable patterns in the datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New entailment datasets would benefit from collection methods designed to break hypothesis-label correlations.
Reporting hypothesis-only baseline accuracy should become standard when releasing entailment benchmarks.
Models that continue to perform well after debiasing may be closer to learning actual logical relations.

Load-bearing premise

The high accuracy achieved by hypothesis-only models comes from unintended statistical artifacts created during crowdsourced data collection.

What would settle it

Collecting a new entailment dataset with a non-crowdsourced protocol and finding that a hypothesis-only classifier then scores near 33% accuracy would falsify the claim that the bias is an artifact of the original collection process.

Figures

Figures reproduced from arXiv: 1906.09635 by Aaron Courville, Chin-Wei Huang, Shawn Tan, Yikang Shen.

**Figure 1.** Figure 1: The top most informative bigrams in the SNLI dataset. Red represents proportion of contradiction labels, Blue for neutral, and Green for entailment. Numbers on the bars represent the proportion of the bigram in the dataset (A bar labeled with 0.5 means that portion of the bigram constitutes half of that partition of the dataset). The experiment attempts to test sentenceembedding models for their reliance… view at source ↗

**Figure 3.** Figure 3: The top most informative bigrams in the Pruned SNLI dataset. without taking this shift into account, a new set of instances would become the most informative. To deal with this, a classifier should be retrained for every iteration of the pruning. The reason Naive Bayes was used for pruning was because it was easy to retrain to optimality given the original dataset by simply subtracting the counts. Using th… view at source ↗

read the original abstract

The ability to understand logical relationships between sentences is an important task in language understanding. To aid in progress for this task, researchers have collected datasets for machine learning and evaluation of current systems. However, like in the crowdsourced Visual Question Answering (VQA) task, some biases in the data inevitably occur. In our experiments, we find that performing classification on just the hypotheses on the SNLI dataset yields an accuracy of 64%. We analyze the bias extent in the SNLI and the MultiNLI dataset, discuss its implication, and propose a simple method to reduce the biases in the datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A hypothesis-only classifier hits 64% on SNLI, confirming dataset artifacts from crowdsourcing, with a basic mitigation proposed.

read the letter

The key takeaway is that a model seeing only the hypothesis reaches 64% accuracy on SNLI. That number is higher than random and shows the dataset contains surface patterns that do not require the premise, which matches what we already suspected from VQA-style biases but now has a concrete measurement here. The paper extends the check to MultiNLI as well and sketches a simple correction step, probably something like rebalancing or filtering based on hypothesis statistics. That part is useful because it gives practitioners a quick diagnostic they can run on new data collections. The empirical observation itself looks solid and rests on public data rather than any fitted trick. The main limitation is that the mitigation is presented as straightforward without much detail on how much it preserves the original task difficulty or whether it changes downstream model rankings in a meaningful way. The abstract also skips model architecture and training specifics, though the central claim does not seem to depend on those choices. This is the kind of short, targeted note that helps the field tighten its benchmarks. It is worth sending to review because the measurement is falsifiable and directly actionable for dataset work, even if the fix is incremental rather than a new framework.

Referee Report

1 major / 1 minor

Summary. The paper claims that crowdsourced textual entailment datasets such as SNLI contain unintended statistical biases, demonstrated by the empirical result that a classifier trained solely on hypotheses achieves 64% accuracy; the work extends the analysis to MultiNLI, discusses implications for NLI modeling, and proposes a simple mitigation method.

Significance. If the 64% hypothesis-only result holds under scrutiny, the finding provides a concrete, easily replicable diagnostic for annotation artifacts in NLI data collection. This could influence future dataset construction practices and encourage routine hypothesis-only baselines in the field.

major comments (1)

[Abstract] The central empirical claim (64% accuracy) is presented without any description of the classifier architecture, training procedure, or statistical testing in the provided abstract; if these details are absent from the full manuscript as well, the result cannot be independently verified and the bias interpretation rests on an unreproducible measurement.

minor comments (1)

[Abstract] The abstract mentions a proposed mitigation method but gives no indication of its effectiveness or how it was evaluated; a brief quantitative result would strengthen the contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment. The details supporting the 64% result are present in the full manuscript; we address the concern below and note that a minor clarification to the abstract can be added if requested.

read point-by-point responses

Referee: [Abstract] The central empirical claim (64% accuracy) is presented without any description of the classifier architecture, training procedure, or statistical testing in the provided abstract; if these details are absent from the full manuscript as well, the result cannot be independently verified and the bias interpretation rests on an unreproducible measurement.

Authors: The abstract is intentionally concise, but Section 3 of the full manuscript specifies the hypothesis-only classifier (a linear model over unigram and bigram features), the exact training procedure (including hyperparameter selection and data splits), and the evaluation protocol with accuracy reported alongside a majority-class baseline. These elements enable replication. We are happy to append a one-sentence summary of the model to the abstract in the revised version. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central result is an empirical measurement: a hypothesis-only classifier achieves 64% accuracy on the public SNLI dataset. This is obtained by standard supervised training and evaluation on fixed data splits, with no equations, fitted parameters, or derivations that reduce to the target quantity by construction. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify the measurement itself. The interpretation of the result as evidence of unintended bias is a modeling choice external to the reported number and does not create circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that hypothesis-only classification is possible; no free parameters, axioms, or invented entities are introduced or required beyond standard supervised classification assumptions.

pith-pipeline@v0.9.0 · 5626 in / 1047 out tokens · 19564 ms · 2026-05-25T17:29:21.814786+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

performing classification on just the hypotheses on the SNLI dataset yields an accuracy of 64%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 5 internal anchors

[1]

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2017. Don't just assume; look and answer: Overcoming priors for visual question answering. arXiv preprint arXiv:1712.00377

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425--2433

work page 2015
[3]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics

work page 2015
[4]

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1657--1668

work page 2017
[5]

Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv preprint arXiv:1709.04348

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, volume 1, page 9

work page 2017
[7]

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Bill MacCartney and Christopher D Manning. 2009. An extended model of natural logic. In Proceedings of the eighth international conference on computational semantics, pages 140--156. Association for Computational Linguistics

work page 2009
[9]

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A sick cure for the evaluation of compositional distributional semantic models. In LREC, pages 216--223

work page 2014
[10]

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426

work page internal anchor Pith review Pith/arXiv arXiv 2017
[12]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67--78

work page 2014
[13]

Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2016. Yin and yang: Balancing and answering binary visual questions. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 5014--5022. IEEE

work page 2016
[14]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[15]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2017. Don't just assume; look and answer: Overcoming priors for visual question answering. arXiv preprint arXiv:1712.00377

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425--2433

work page 2015

[3] [3]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics

work page 2015

[4] [4]

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1657--1668

work page 2017

[5] [5]

Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv preprint arXiv:1709.04348

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, volume 1, page 9

work page 2017

[7] [7]

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Bill MacCartney and Christopher D Manning. 2009. An extended model of natural logic. In Proceedings of the eighth international conference on computational semantics, pages 140--156. Association for Computational Linguistics

work page 2009

[9] [9]

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A sick cure for the evaluation of compositional distributional semantic models. In LREC, pages 216--223

work page 2014

[10] [10]

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426

work page internal anchor Pith review Pith/arXiv arXiv 2017

[12] [12]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67--78

work page 2014

[13] [13]

Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2016. Yin and yang: Balancing and answering binary visual questions. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 5014--5022. IEEE

work page 2016

[14] [14]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[15] [15]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page