WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-14 17:12 UTC · model grok-4.3
The pith
WinoGrande is a 44k-problem adversarial Winograd-style dataset on which the best models score 15-35 points below humans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WinoGrande is a new 44k-problem dataset for commonsense reasoning that, after bias reduction via AfLite, exposes a significant gap between model performance (59.4-79.1%) and human performance (94.0%) on pronoun resolution tasks.
What carries the argument
AfLite algorithm that extends human-detectable word associations to machine-detectable embedding associations for systematic bias reduction in the dataset construction.
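The mechanics of this step can be illustrated with a small sketch. The following is a simplified, illustrative reimplementation of an AfLite-style adversarial filter, not the paper's exact procedure: an ensemble of weak linear probes is trained on random subsets of precomputed embeddings, and instances that held-out probes classify correctly too often are treated as carrying spurious embedding-level cues and removed. Function names, thresholds, and iteration counts here are assumptions chosen for illustration.

```python
import random

def train_perceptron(X, y, epochs=10, lr=0.1):
    """Tiny linear probe: a perceptron over precomputed embedding vectors."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0
            if pred != yi:
                sign = 1 if yi == 1 else -1
                w = [wj + lr * sign * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

def aflite(embeddings, labels, n_iters=3, n_models=16,
           cutoff=0.75, max_remove=2, seed=0):
    """AfLite-style filtering sketch: drop instances that held-out
    linear probes solve too reliably from embeddings alone."""
    rng = random.Random(seed)
    keep = list(range(len(labels)))  # surviving instance indices
    for _ in range(n_iters):
        correct = [0] * len(keep)
        counted = [0] * len(keep)
        for _ in range(n_models):
            order = list(range(len(keep)))
            rng.shuffle(order)
            split = len(order) // 2
            train, held = order[:split], order[split:]
            w = train_perceptron([embeddings[keep[i]] for i in train],
                                 [labels[keep[i]] for i in train])
            for i in held:  # score each held-out instance
                correct[i] += predict(w, embeddings[keep[i]]) == labels[keep[i]]
                counted[i] += 1
        scores = [c / n if n else 0.0 for c, n in zip(correct, counted)]
        # remove the most predictable instances above the cutoff
        ranked = sorted(range(len(keep)), key=lambda i: -scores[i])
        drop = {i for i in ranked[:max_remove] if scores[i] > cutoff}
        if not drop:
            break
        keep = [k for j, k in enumerate(keep) if j not in drop]
    return keep
```

On toy data where the label leaks directly into the embedding, this loop removes the most predictable instances first, which is the behavior the paper attributes to AfLite at scale with far stronger probes.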
If this is right
- Transfer learning with WinoGrande yields new state-of-the-art results on WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%).
- High performance on prior benchmarks likely overestimates the true commonsense capabilities of current models.
- Algorithmic bias reduction should be applied to existing and new benchmarks to prevent overestimation of model abilities.
Where Pith is reading between the lines
- Applying similar bias reduction methods to other NLP benchmarks could reveal comparable overestimations in model capabilities.
- Closing the gap on WinoGrande may require architectural changes that go beyond scaling up language models.
Load-bearing premise
That the AfLite algorithm successfully removes only spurious statistical associations without eliminating genuine commonsense signals needed for the task.
What would settle it
A neural model reaching 90% or higher accuracy on WinoGrande while using only the bias-reduced training data would challenge the claim that the dataset requires genuine commonsense reasoning beyond pattern matching.
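Whether a measured accuracy credibly clears a 90% bar also depends on test-set size. A minimal sketch of that check (my own illustrative calculation, not from the paper) using the Wilson score interval:

```python
from math import sqrt

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial accuracy estimate."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# 95% observed accuracy: on 100 items the lower bound still dips below 0.90,
# while on 10,000 items it clears 0.90.
lo_small, hi_small = wilson_interval(95, 100)
lo_large, hi_large = wilson_interval(9500, 10000)
```

The point of the sketch is that "reaching 90%" only settles anything if the evaluation set is large enough for the interval's lower bound to sit above the bar.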
Original abstract
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed. Furthermore, we establish new state-of-the-art results on five related benchmarks - WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WinoGrande, a large-scale dataset of 44k Winograd Schema-style pronoun resolution problems constructed via crowdsourcing followed by the AfLite algorithm for systematic reduction of machine-detectable statistical biases. It reports that state-of-the-art models achieve 59.4-79.1% accuracy on WinoGrande (depending on training data regime) versus 94% human performance, while training on WinoGrande yields new state-of-the-art results on five related benchmarks: WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). The central claim is that prior WSC variants overestimate machine commonsense due to spurious associations and that WinoGrande provides a harder, more reliable diagnostic.
Significance. If the results hold, the work is significant for exposing overestimation in existing commonsense benchmarks and for supplying both a diagnostic resource and a transfer-learning corpus that improves performance on multiple related tasks. The multi-model-family evaluation and cross-benchmark transfer gains constitute concrete, reproducible evidence that genuine reasoning signals can be retained after bias filtering; this directly supports the paper's emphasis on algorithmic bias reduction as a general methodological contribution.
major comments (2)
- §3.2 (AfLite description and evaluation): the claim that AfLite removes only spurious associations while preserving all required commonsense signals rests on transfer results rather than a direct ablation; an exhaustive comparison of model performance and human-validated reasoning quality on AfLite-filtered versus unfiltered splits is needed to confirm that the performance gap (59.4-79.1% vs. 94%) is not partly attributable to removal of genuine signals.
- Table 3 (or equivalent results table): the reported accuracies for different training-data regimes lack error bars or statistical significance tests; without them it is difficult to assess whether the 15-35% gap to human performance is robust across random seeds and model initializations.
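The second request is cheap to satisfy with the standard library alone. A minimal sketch, using made-up per-seed accuracies rather than the paper's numbers:

```python
import statistics
from math import sqrt

def summarize(seed_accs):
    """Mean and sample standard deviation of accuracy across random seeds."""
    return statistics.mean(seed_accs), statistics.stdev(seed_accs)

def welch_t(a, b):
    """Welch's t statistic for two independent groups of seed accuracies."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical per-seed accuracies for one model regime versus human annotators.
model = [0.791, 0.785, 0.793, 0.788, 0.790]
human = [0.942, 0.938, 0.940, 0.941, 0.939]

mean_acc, std_acc = summarize(model)
t = welch_t(human, model)
```

Reporting mean, standard deviation, and a t statistic per training regime would directly answer whether the 15-35% gap survives seed-to-seed variation.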
minor comments (3)
- §2.1: the crowdsourcing prompt templates and exact quality-control criteria (e.g., agreement thresholds, adversarial example generation rules) are summarized at a high level; including the full templates as an appendix would improve reproducibility.
- Figure 1: the AfLite pipeline diagram would benefit from explicit annotation of the embedding-association step and the filtering threshold parameters.
- References: verify that all prior WSC variants cited in the abstract (including any recent neural-model results reaching ~90%) appear with full bibliographic details.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: §3.2 (AfLite description and evaluation): the claim that AfLite removes only spurious associations while preserving all required commonsense signals rests on transfer results rather than a direct ablation; an exhaustive comparison of model performance and human-validated reasoning quality on AfLite-filtered versus unfiltered splits is needed to confirm that the performance gap (59.4-79.1% vs. 94%) is not partly attributable to removal of genuine signals.
  Authors: We agree that a direct ablation would strengthen the evidence. The current manuscript supports preservation of commonsense signals via the observed transfer gains on five related benchmarks after AfLite filtering. However, we will add an exhaustive comparison of model performance on AfLite-filtered versus unfiltered splits, together with human validation of reasoning quality on sampled instances from each split, to rule out unintended removal of genuine signals.
  Revision: yes
- Referee: Table 3 (or equivalent results table): the reported accuracies for different training-data regimes lack error bars or statistical significance tests; without them it is difficult to assess whether the 15-35% gap to human performance is robust across random seeds and model initializations.
  Authors: We acknowledge this limitation in the current results. In the revised manuscript we will report mean accuracies and standard deviations over multiple random seeds for all training regimes in Table 3 (and equivalent tables), and we will include statistical significance tests to confirm robustness of the gaps relative to human performance.
  Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper constructs WinoGrande via a new crowdsourcing procedure and the AfLite bias-reduction algorithm, then reports empirical model accuracies (59.4-79.1%) against human performance (94.0%) and transfer results on five external benchmarks. These outcomes rest on held-out data splits and standard training/evaluation pipelines rather than any equation or parameter that is defined in terms of the target performance numbers. AfLite is introduced and specified within the paper itself; no load-bearing step reduces to a self-citation, fitted input renamed as prediction, or ansatz smuggled from prior author work. The derivation chain is therefore self-contained through explicit data-generation steps and controlled experiments.
Axiom & Free-Parameter Ledger
free parameters (1)
- AfLite filtering thresholds
axioms (1)
- domain assumption: Crowdsourced pronoun resolutions reflect genuine commonsense reasoning when statistical cues are removed
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%"
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations"
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "establish new state-of-the-art results on five related benchmarks - WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
- Projection-Free Transformers via Gaussian Kernel Attention
  Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
- Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
  Calibration objectives influence redundant layer identification in LLM depth pruning more than search algorithms do, with different objectives producing different layer rankings.
- A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
  SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
- Scaling Latent Reasoning via Looped Language Models
  Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
- Multitask Prompted Training Enables Zero-Shot Task Generalization
  Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
- SocialIQA: Commonsense Reasoning about Social Interactions
  SocialIQA is the first large-scale benchmark with 38k crowdsourced questions testing commonsense about social interactions, where pretrained language models trail humans by over 20% but transfer to improve performance...
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
- Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
  Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
- Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
  Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
  Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
- Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
  Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
- Diversity in Large Language Models under Supervised Fine-Tuning
  TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
  Different calibration objectives produce distinct layer pruning patterns in LLMs, while search algorithms converge to similar solutions under a fixed objective.
- Kimi Linear: An Expressive, Efficient Attention Architecture
  Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
- Efficient Streaming Language Models with Attention Sinks
  StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
- Textbooks Are All You Need II: phi-1.5 technical report
  phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- Diversity in Large Language Models under Supervised Fine-Tuning
  Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
- SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
  SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
- Hyperloop Transformers
  Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
  Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- Qwen2.5-Coder Technical Report
  Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
- Gemma: Open Models Based on Gemini Research and Technology
  Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
- Gemma 2: Improving Open Language Models at a Practical Size
  Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.