A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

Adina Williams; Nikita Nangia; Samuel R. Bowman

arxiv: 1704.05426 · v4 · pith:SS53DYPJnew · submitted 2017-04-18 · 💻 cs.CL

classification 💻 cs.CL

keywords corpus, available, evaluation, inference, offers, sentence, understanding, adaptation

This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding. In addition to being one of the largest corpora available for the task of NLI, at 433k examples, this corpus improves upon available resources in its coverage: it offers data from ten distinct genres of written and spoken English--making it possible to evaluate systems on nearly the full complexity of the language--and it offers an explicit setting for the evaluation of cross-genre domain adaptation.

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents
cs.CL 2025-12 accept novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.
Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
cs.LG 2024-02 unverdicted novelty 8.0

For comparing two binary classifiers using a budget of noisy labels, collecting one label per sample across more samples outperforms aggregating multiple labels per sample.
RoFormer: Enhanced Transformer with Rotary Position Embedding
cs.CL 2021-04 accept novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
SimCSE: Simple Contrastive Learning of Sentence Embeddings
cs.CL 2021-04 conditional novelty 8.0

SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
Probabilistic Attribution For Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Develops a model-agnostic attribution score as the log-ratio of conditional response probabilities with and without a marginalized prompt token, derived via Bayes inversion of next-token distributions, and relates it ...
Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations
cs.LG 2026-05 unverdicted novelty 7.0

CAML meta-learns a progressively refined inductive bias from active-learning queries to improve robustness to spurious correlations, reporting accuracy gains on minority groups across several benchmarks.
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
cs.CL 2026-05 unverdicted novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
cs.LG 2026-04 unverdicted novelty 7.0

NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
cs.AI 2026-04 accept novelty 7.0

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
Norm Anchors Make Model Edits Last
cs.LG 2026-01 conditional novelty 7.0

Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.
Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
cs.CL 2024-08 unverdicted novelty 7.0

Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of i...
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
LoRA: Low-Rank Adaptation of Large Language Models
cs.CL 2021-06 accept novelty 7.0

Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
cs.CL 2019-10 accept novelty 7.0

BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
cs.CL 2019-05 accept novelty 7.0

BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios
cs.LG 2026-05 conditional novelty 6.0

Text embeddings are robust to truncation without MRL except when reducing size by at least 80%.
On the Burden of Achieving Fairness in Conformal Prediction
stat.ML 2026-05 unverdicted novelty 6.0

Pooled conformal calibration incurs irreducible group-wise coverage distortion set by cross-group quantile heterogeneity, and Equalized Coverage and Equalized Set Size are in fundamental tension.
On the Burden of Achieving Fairness in Conformal Prediction
stat.ML 2026-05 unverdicted novelty 6.0

Pooled conformal calibration incurs irreducible group-wise coverage distortion scaled by cross-group quantile heterogeneity, with Equalized Coverage and Equalized Set Size in fundamental tension.
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews
cs.CL 2026-05 unverdicted novelty 6.0

Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence
cs.CL 2026-05 unverdicted novelty 6.0

PrimeFacts extracts decontextualized premises from fact-check articles, raising evidence retrieval MRR by up to 30% and verdict prediction Macro-F1 by 10-20 points over baselines.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
cs.CR 2026-05 conditional novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
cs.CL 2026-05 unverdicted novelty 6.0

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
cs.CV 2026-04 unverdicted novelty 6.0

MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
cs.LG 2026-04 unverdicted novelty 6.0

MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
cs.CL 2025-11 unverdicted novelty 6.0

PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and tra...
Should We Still Pretrain Encoders with Masked Language Modeling?
cs.CL 2025-07 accept novelty 6.0

Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improv...
Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products
cs.CV 2025-05 unverdicted novelty 6.0

Proposes ACH module with differentiable sampling and softsign normalization for efficient feature expansion, integrated via NAS into Hadaptive-Net to claim SOTA accuracy/speed trade-offs on image classification.
SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation
cs.LG 2024-12 unverdicted novelty 6.0

SyMerge merges models via single-layer adaptation and expert-guided self-labeling to achieve task synergy, reporting SOTA results on vision, dense prediction, and NLP tasks.
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
cs.CL 2023-05 conditional novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
cs.CL 2023-03 unverdicted novelty 6.0

SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
cs.CL 2023-02 unverdicted novelty 6.0

Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.
REPLUG: Retrieval-Augmented Black-Box Language Models
cs.CL 2023-01 conditional novelty 6.0

REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
Convex Dataset Valuation for Post-Training
cs.LG 2026-05 unverdicted novelty 5.0

A convex KMM-based valuation method that accounts for both target-task alignment and inter-dataset redundancy in gradient space outperforms standard gradient-alignment baselines for LLM post-training data selection.
Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
cs.CL 2026-04 unverdicted novelty 5.0

Augmenting commonsense knowledge corpora with negation produces over 2M new triples that benefit LLM negation understanding when used for pre-training.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
cs.CL 2025-12 unverdicted novelty 5.0

PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
Vanishing Contributions: A Unified Framework for Smooth and Iterative Model Compression
cs.LG 2025-10 unverdicted novelty 5.0

VCON is a unified framework for smooth iterative DNN compression that uses parallel execution and an affine combination to progressively replace the original model with its compressed form during fine-tuning.
Investigating Biases in Textual Entailment Datasets
cs.CL 2019-06 unverdicted novelty 5.0

Hypothesis-only classification reaches 64% accuracy on SNLI, revealing dataset biases in SNLI and MultiNLI that the authors quantify and propose a simple mitigation for.
Learning Compressed Sentence Representations for On-Device Text Processing
cs.CL 2019-06 unverdicted novelty 5.0

Four binarization strategies turn continuous sentence embeddings into binary form, cutting storage by over 98% with only about 2% performance drop on downstream tasks.
Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?
cs.CL 2025-07 unverdicted novelty 4.0

LLM accuracy on reasoning tasks differs significantly by question type, with step-by-step reasoning accuracy often uncorrelated to final answer selection.
To Tune or Not To Tune? How About the Best of Both Worlds?
cs.CL 2019-07 unverdicted novelty 3.0

A sequential fine-tuning strategy for pre-trained language models reports modest accuracy gains of 4.7%, 0.99%, and 0.72% on semantic similarity, sequence labeling, and text classification tasks.
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
cs.CL 2026-05 unverdicted novelty 2.0

A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.