Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference

Adam Poliak; Alexander M. Rush; Benjamin Van Durme; Stuart M. Shieber; Yonatan Belinkov

arxiv: 1907.04380 · v1 · pith:6YR54WLZnew · submitted 2019-07-09 · 💻 cs.CL

Yonatan Belinkov, Adam Poliak, Stuart M. Shieber, Benjamin Van Durme, Alexander M. Rush

Pith reviewed 2026-05-25 00:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords natural language inference · hypothesis-only bias · dataset artifacts · model robustness · cross-dataset transfer · probabilistic methods · entailment

The pith

Predicting the probability of the premise given the hypothesis and label makes NLI models less reliant on hypothesis-only biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes two probabilistic methods that reverse the usual NLI modeling direction by estimating the probability of the premise given the hypothesis and the inference label. This change is meant to discourage models from ignoring the premise and instead exploiting artifacts that allow high performance from the hypothesis alone. A sympathetic reader would care because many NLI datasets contain such biases, so models that avoid them should generalize better when moved to new datasets that lack or change those biases. The evaluation trains on biased data and tests on datasets with different or no biases, finding improved transfer over a baseline in 9 of 12 cases. The work also analyzes how the methods interact with known dataset biases and the effects of fine-tuning.

Core claim

In contrast to standard NLI approaches, the two proposed methods predict the probability of a premise given a hypothesis and NLI label. This discourages models from ignoring the premise. When trained on datasets with biases and tested on datasets with no or different hypothesis-only biases, the methods produce models that transfer better than a baseline architecture in 9 out of 12 NLI datasets.

What carries the argument

Probabilistic reversal that computes the likelihood of the premise conditioned on the hypothesis and label.
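One way to see why the reversed direction involves the premise is Bayes' rule. The identity below is standard probability rather than a formula quoted from the paper, and the symbols $p$ (premise), $h$ (hypothesis), and $y$ (label) are notation assumed here for illustration.

```latex
P(p \mid h, y) \;=\; \frac{P(y \mid p, h)\, P(p \mid h)}{P(y \mid h)}
```

Read this way, the denominator $P(y \mid h)$ is the quantity a hypothesis-only classifier estimates. With $P(p \mid h)$ held fixed, raising the left-hand side favors label predictions that depend on the premise and disfavors labels that are recoverable from the hypothesis alone. Whether the paper's two methods follow exactly this decomposition is not stated in the abstract.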

If this is right

NLI models become more robust to dataset-specific artifacts when trained with the reversed prediction.
Transfer performance improves on datasets that contain different or absent hypothesis-only biases.
The methods interact with known biases in NLI datasets in ways that can be analyzed directly.
Combining the methods with fine-tuning on target datasets produces measurable effects on final performance.
Encouraging models to ignore biases through this reversal has observable consequences on cross-dataset behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The reversal strategy could extend to other NLP tasks where one input side contains shortcuts that the model should not rely on.
Models built this way might capture entailment relations more faithfully when premises vary independently of hypotheses.
The approach highlights that the modeling direction itself can serve as a lever against shortcut learning without changing the architecture.

Load-bearing premise

That training to predict the premise probability given the hypothesis and label will force the model to use premise information rather than still exploiting hypothesis-only patterns through other means.

What would settle it

A test showing that a model trained with these methods still achieves high accuracy on a hypothesis-biased dataset after the premises are modified in ways that should alter the label, with the model's predictions failing to track the change.
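A check of this kind could be sketched as follows. This is a minimal illustration, not a procedure from the paper: the `Predictor` interface, the function `premise_sensitivity`, and both toy models are hypothetical names introduced here, and the "altered premises" are assumed to be supplied by whoever runs the test.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a predictor maps (premise, hypothesis) to a label.
Predictor = Callable[[str, str], str]


def premise_sensitivity(
    predict: Predictor,
    pairs: Sequence[Tuple[str, str]],
    altered_premises: Sequence[str],
) -> float:
    """Fraction of examples whose predicted label changes when only the
    premise is replaced. A value near 0 suggests the premise is ignored."""
    if len(pairs) != len(altered_premises):
        raise ValueError("need exactly one altered premise per example")
    if not pairs:
        raise ValueError("no examples given")
    changed = 0
    for (premise, hypothesis), new_premise in zip(pairs, altered_premises):
        if predict(premise, hypothesis) != predict(new_premise, hypothesis):
            changed += 1
    return changed / len(pairs)


def hypothesis_only_model(premise: str, hypothesis: str) -> str:
    # Toy stand-in (assumption): ignores the premise, keys on a negation word.
    return "contradiction" if "not" in hypothesis.split() else "entailment"


def overlap_model(premise: str, hypothesis: str) -> str:
    # Toy stand-in (assumption): predicts entailment only if every
    # hypothesis word also appears in the premise.
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "neutral"


pairs = [("a dog runs", "a dog runs"), ("a cat sleeps", "a cat is not awake")]
altered = ["a bird flies", "a fish swims"]
biased_score = premise_sensitivity(hypothesis_only_model, pairs, altered)
reading_score = premise_sensitivity(overlap_model, pairs, altered)
```

A model that never changes its output under premise edits scores 0 here, which is the failure pattern the test above is meant to expose. A real evaluation would need premise edits whose correct labels are known.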

Original abstract

Natural Language Inference (NLI) datasets often contain hypothesis-only biases: artifacts that allow models to achieve non-trivial performance without learning whether a premise entails a hypothesis. We propose two probabilistic methods to build models that are more robust to such biases and better transfer across datasets. In contrast to standard approaches to NLI, our methods predict the probability of a premise given a hypothesis and NLI label, discouraging models from ignoring the premise. We evaluate our methods on synthetic and existing NLI datasets by training on datasets containing biases and testing on datasets containing no (or different) hypothesis-only biases. Our results indicate that these methods can make NLI models more robust to dataset-specific artifacts, transferring better than a baseline architecture in 9 out of 12 NLI datasets. Additionally, we provide an extensive analysis of the interplay of our methods with known biases in NLI datasets, as well as the effects of encouraging models to ignore biases and fine-tuning on target datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The premise-prediction methods improve cross-dataset transfer in 9 of 12 cases, but the abstract leaves the core mechanism unverified.

The paper's main idea is to train NLI models on P(premise | hypothesis, label) rather than the usual P(label | premise, hypothesis) so that the model cannot ignore the premise. They report better transfer than a baseline when training on biased data and testing on datasets with different or no hypothesis-only biases, with gains on 9 out of 12 datasets. They also include analysis of how the methods interact with known biases and what happens with fine-tuning on target data.

The multi-dataset transfer setup is a clear strength and gives the results some weight. The probabilistic framing is a straightforward way to push the model toward using premise information.

The soft spot is exactly the one in the stress-test note. Because the generative model only ever sees the hypothesis and label, it can still learn to emit the right premise text by picking up on whatever statistical links exist between hypotheses and labels in the training data. Nothing in the abstract shows a diagnostic that rules this out, such as a hypothesis-only probe after training or a check that premise semantics are actually required. Without that, the transfer numbers could have other explanations. The abstract also gives no numbers on statistical significance or controls for other factors, which makes the evidence harder to assess.

This work is aimed at people already working on NLI bias mitigation. It has a concrete method and enough results to deserve a serious referee, even though the central claim would need stronger verification in revision.

Referee Report

2 major / 1 minor

Summary. The paper proposes two probabilistic methods for NLI that train models to predict P(premise | hypothesis, label) rather than the standard P(label | premise, hypothesis), with the goal of reducing reliance on hypothesis-only biases. It evaluates transfer performance by training on biased datasets and testing on datasets with different or no such biases, reporting better results than a baseline in 9 out of 12 cases, plus analysis of bias interactions and fine-tuning effects.

Significance. If the central claim holds after verification, the work provides a concrete training objective for building NLI models with improved cross-dataset robustness, directly targeting a well-documented artifact problem. The reported extensive analysis of bias interplay adds value beyond the transfer numbers.

major comments (2)

[Results] Results section: the claim of superior transfer in 9/12 datasets is presented without reported statistical significance tests, confidence intervals, or controls for implementation details and confounding factors, leaving the quantitative support for the robustness claim difficult to assess.
[Methods] Methods and analysis sections: the generative objective is motivated as discouraging premise-ignoring behavior, yet no post-training hypothesis-only probe accuracies or equivalent diagnostics are described to confirm that premise semantics are actually required rather than hypothesis-label correlations still sufficing; this is load-bearing for the claim that the method forces premise use.

minor comments (1)

[Methods] Notation for the two proposed methods could be clarified with explicit equations showing how P(premise | hypothesis, label) is converted to an NLI prediction at inference time.
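To make the request concrete, here is one possible conversion, sketched under stated assumptions. The abstract does not give the paper's inference rule, so this is not the authors' method: it simply applies Bayes' rule, scoring each label by log P(premise | hypothesis, label) plus a log prior over labels given the hypothesis. The function name `predict_label` and the uniform-prior default are hypothetical.

```python
import math
from typing import Dict, Optional


def predict_label(
    log_p_premise: Dict[str, float],
    log_prior: Optional[Dict[str, float]] = None,
) -> str:
    """Return argmax over labels y of log P(p | h, y) + log P(y | h).

    log_p_premise maps each candidate label to the model's log-likelihood
    of the premise given the hypothesis and that label.
    """
    if not log_p_premise:
        raise ValueError("no candidate labels given")
    if log_prior is None:
        # Assumption: a uniform prior over labels, so only the premise
        # likelihood decides the prediction.
        uniform = -math.log(len(log_p_premise))
        log_prior = {label: uniform for label in log_p_premise}
    return max(log_p_premise, key=lambda y: log_p_premise[y] + log_prior[y])


# Illustrative scores only; these numbers are made up for the example.
scores = {"entailment": -12.0, "neutral": -15.0, "contradiction": -20.0}
uniform_choice = predict_label(scores)
skewed_prior = {
    "entailment": math.log(0.01),
    "neutral": math.log(0.98),
    "contradiction": math.log(0.01),
}
skewed_choice = predict_label(scores, skewed_prior)
```

The second call shows why the choice of prior matters: a prior estimated from hypothesis-label statistics could reintroduce the very bias the method is meant to remove, which is one reason explicit equations would help readers.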

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major concerns point by point below and outline the revisions we plan to make.

Referee: [Results] Results section: the claim of superior transfer in 9/12 datasets is presented without reported statistical significance tests, confidence intervals, or controls for implementation details and confounding factors, leaving the quantitative support for the robustness claim difficult to assess.

Authors: We agree that statistical significance tests and confidence intervals would strengthen the quantitative support for the transfer results. In the revised manuscript, we will add bootstrap confidence intervals for the reported accuracies and paired significance tests against the baseline. We will also expand the description of implementation details and controls to address potential confounding factors.
Revision: yes

Referee: [Methods] Methods and analysis sections: the generative objective is motivated as discouraging premise-ignoring behavior, yet no post-training hypothesis-only probe accuracies or equivalent diagnostics are described to confirm that premise semantics are actually required rather than hypothesis-label correlations still sufficing; this is load-bearing for the claim that the method forces premise use.

Authors: We acknowledge that explicit post-training diagnostics such as hypothesis-only probes would provide stronger empirical confirmation. While the generative objective P(premise | hypothesis, label) is theoretically motivated to require premise semantics for accurate generation (in contrast to standard discriminative models), we will add hypothesis-only probe experiments in the revised version to directly measure whether premise information is utilized.
Revision: yes
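The bootstrap confidence intervals and paired comparison promised in the first response could look roughly like the sketch below. It is a generic percentile bootstrap over per-example correctness, written with the standard library only. The function name `paired_bootstrap`, its defaults, and the example data are assumptions for illustration, not results or code from the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(
    correct_a: Sequence[bool],
    correct_b: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap for the accuracy difference A - B.

    Both systems must be scored on the same test examples, in the same
    order. Returns (observed difference, CI lower bound, CI upper bound).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long result lists")
    n = len(correct_a)
    rng = random.Random(seed)
    # Per-example difference: +1 if only A is right, -1 if only B is right.
    per_example = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    observed = sum(per_example) / n
    resampled = []
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping pairs together.
        sample = rng.choices(per_example, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Made-up correctness vectors: system A is right on 80 of 100 examples,
# system B on 60 of 100.
system_a = [True] * 80 + [False] * 20
system_b = [True] * 60 + [False] * 40
observed, lower, upper = paired_bootstrap(system_a, system_b)
```

If the interval excludes zero, the difference on that dataset is unlikely to be resampling noise alone. Applied to each of the 12 transfer datasets, this would show how many of the 9 reported wins are distinguishable from the baseline.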

Circularity Check

0 steps flagged

No significant circularity detected in the derivation or evaluation chain.

The paper proposes two generative training objectives (predicting P(premise | hypothesis, label)) as an alternative to standard discriminative NLI training and evaluates them via cross-dataset transfer experiments on synthetic and existing NLI corpora. The central claim—that these methods yield more robust models that transfer better than a baseline in 9/12 cases—is an empirical result obtained by training on biased data and testing on data with different or absent biases. No step reduces by construction to its own inputs (no self-definitional relations, no fitted parameters renamed as predictions, no load-bearing self-citations, and no uniqueness theorems imported from prior author work). The evaluation is externally falsifiable via the reported transfer metrics and does not rely on any ansatz or renaming that collapses the result to the training data by definition. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the methods are described at a high level without equations or implementation details.
