Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

R. Thomas McCoy; Ellie Pavlick; Tal Linzen

arxiv: 1902.01007 · v4 · pith:OM4WY75Onew · submitted 2019-02-04 · 💻 cs.CL

classification 💻 cs.CL

keywords heuristics, heuristic, HANS, models, adopted, inference, language, natural

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
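To make two of the named heuristics concrete, here is a minimal Python sketch. The abstract names the heuristics without defining them, so the sketch rests on assumptions: the lexical overlap heuristic is taken to mean "predict entailment whenever every hypothesis word also appears in the premise", and the subsequence heuristic "predict entailment whenever the hypothesis is a contiguous subsequence of the premise". The constituent heuristic is left out because it would need a syntactic parser. The function names and the whitespace tokenizer are hypothetical and are not taken from the HANS code.

```python
def tokenize(sentence):
    # Simplified tokenizer (an assumption): lowercase, split on whitespace,
    # and strip sentence punctuation from each token.
    tokens = (word.strip(".,!?").lower() for word in sentence.split())
    return [tok for tok in tokens if tok]


def lexical_overlap(premise, hypothesis):
    # True when every hypothesis word also occurs somewhere in the premise.
    premise_words = set(tokenize(premise))
    return all(word in premise_words for word in tokenize(hypothesis))


def is_subsequence(premise, hypothesis):
    # True when the hypothesis tokens appear as one contiguous run in the premise.
    p, h = tokenize(premise), tokenize(hypothesis)
    n = len(h)
    return n > 0 and any(p[i:i + n] == h for i in range(len(p) - n + 1))


def heuristic_predict(premise, hypothesis):
    # A purely heuristic "model": it looks at word overlap and ignores syntax.
    if lexical_overlap(premise, hypothesis):
        return "entailment"
    return "non-entailment"


# Illustrative pairs where the heuristics fire but entailment does not hold.
passive = ("The doctor was paid by the actor.", "The doctor paid the actor.")
modifier = ("The doctor near the actor danced.", "The actor danced.")

print(heuristic_predict(*passive))  # the overlap heuristic fires on this pair
print(is_subsequence(*modifier))    # the subsequence heuristic fires on this pair
```

In both pairs the hypothesis does not follow from the premise, yet the word-level checks succeed. A model that behaves like `heuristic_predict` would therefore label such pairs as entailment, which is the kind of failure a controlled evaluation set can expose.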

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Neural Collapse in Test-Time Adaptation
cs.CV 2025-12 unverdicted novelty 7.0

Sample-wise neural collapse reveals that feature-classifier misalignment drives TTA degradation under shifts, which NCTTA corrects via hybrid geometric-predictive targets.
Test-Time Distillation for Continual Model Adaptation
cs.CV 2025-06 conditional novelty 7.0

CoDiRe blends VLM and target model predictions via MSP-based weighting and Optimal Transport rectification to enable stable continual test-time adaptation, outperforming CoTTA by 10.55% on ImageNet-C at 48% of the com...
Multitask Prompted Training Enables Zero-Shot Task Generalization
cs.LG 2021-10 conditional novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
cs.CL 2019-05 accept novelty 7.0

BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
Decaf: Improving Neural Decompilation with Automatic Feedback and Search
cs.SE 2026-05 unverdicted novelty 6.0

Decaf uses compiler feedback and search to improve neural decompilation, boosting the semantic success rate from 26.0% to 83.9% on the ExeBench Real -O2 split.
Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
cs.AI 2026-05 unverdicted novelty 6.0

Evolutionary game theory shows gradient descent and stochastic gradient descent drive neural networks to distinct stable states favoring shortcut or core subnetworks, with data and optimization noise shaping shortcut ...
TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings
cs.SE 2025-08 conditional novelty 6.0

TriagerX combines dual-transformer content rankings with developer interaction history to improve top-k accuracy for developer and component recommendations in bug triaging across five datasets.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
cs.CL 2019-05 accept novelty 6.0

SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
Interactive Evaluation Requires a Design Science
cs.AI 2026-05 unverdicted novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...