arxiv: 1905.07830 · v1 · submitted 2019-05-19 · 💻 cs.CL

HellaSwag: Can a Machine Really Finish Your Sentence?

Pith reviewed 2026-05-11 02:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords commonsense reasoning · natural language inference · adversarial filtering · benchmark dataset · pretrained language models · sentence completion

The pith

HellaSwag shows state-of-the-art models still fail at commonsense sentence completion that humans solve easily.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HellaSwag, a new dataset for commonsense natural language inference built to expose limitations in current models. Humans achieve over 95 percent accuracy on its questions, while top models score below 48 percent. The dataset is created through Adversarial Filtering, which iteratively selects machine-generated wrong answers that confuse models but seem obviously wrong to people. This construction targets a middle zone of length and complexity where generated text fools pretrained systems without fooling humans. The result indicates that prior benchmarks may have overstated progress on commonsense reasoning and calls for benchmarks that keep pace with model improvements.

Core claim

Commonsense inference remains difficult for state-of-the-art models. HellaSwag demonstrates this gap by showing that humans exceed 95 percent accuracy on event-followup selection while even the best models fall below 48 percent. The dataset is constructed via Adversarial Filtering, a process that scales examples into a Goldilocks zone of complexity where wrong answers are ridiculous to humans yet frequently chosen by models.

What carries the argument

Adversarial Filtering (AF): an iterative process in which a series of discriminators selects machine-generated wrong answers, producing examples that exploit model weaknesses while remaining easy for humans.
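
The filtering loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `train_toy_discriminator`, `adversarial_filter`, and `demo_items` are hypothetical, the discriminator is a toy token-frequency scorer standing in for the trained models the paper uses, and the train/test split and swap rule are simplifications. Only the first gold ending comes from the abstract; every other sentence is made up for the demo.

```python
import random
from collections import Counter


def train_toy_discriminator(golds, negatives):
    """Return a scorer: higher means the ending looks more machine-written.

    Toy stand-in (an assumption): a token counts as a "tell" if it is more
    frequent in the training negatives than in the training gold endings.
    """
    gold_counts = Counter(tok for g in golds for tok in g.split())
    neg_counts = Counter(tok for n in negatives for tok in n.split())

    def score(ending):
        toks = ending.split()
        if not toks:
            return 0.0
        tells = sum(1 for t in toks if neg_counts[t] > gold_counts[t])
        return tells / len(toks)

    return score


def adversarial_filter(items, k, iterations, seed=0):
    """Pick k distractors per item from its pool of generated wrong endings.

    Each round: split items, fit a discriminator on one half, and on the other
    half swap out distractors it detects for pool candidates it finds harder.
    """
    rng = random.Random(seed)
    assigned = [rng.sample(it["pool"], k) for it in items]
    for _ in range(iterations):
        order = list(range(len(items)))
        rng.shuffle(order)
        half = len(order) // 2
        train, test = order[:half], order[half:]
        score = train_toy_discriminator(
            [items[i]["gold"] for i in train],
            [d for i in train for d in assigned[i]],
        )
        for i in test:
            gold_score = score(items[i]["gold"])
            for j in range(k):
                current = assigned[i][j]
                if score(current) <= gold_score:
                    continue  # not flagged as machine-written: keep it
                unused = [c for c in items[i]["pool"] if c not in assigned[i]]
                if not unused:
                    continue
                best = min(unused, key=score)
                if score(best) < score(current):
                    assigned[i][j] = best  # replace with a harder negative
    return assigned


# Hypothetical demo data; each pool holds distinct generated wrong endings.
demo_items = [
    {"gold": "She sets her fingers on the keys.",
     "pool": ["She keys the the piano piano.", "She eats the keys loudly.",
              "She plays the floor.", "The piano sits on her."]},
    {"gold": "He pours water into the glass.",
     "pool": ["He glass the the water.", "He drinks the table.",
              "He pours the glass into water.", "The water pours him."]},
    {"gold": "They walk out of the room.",
     "pool": ["They room the the door.", "They walk into the ceiling.",
              "They close the window loudly.", "The room walks out."]},
    {"gold": "The dog catches the ball.",
     "pool": ["The dog ball the the park.", "The dog throws the park.",
              "The ball catches the dog.", "The dog reads the ball."]},
]

selected = adversarial_filter(demo_items, k=2, iterations=3)
```

The design point the sketch tries to capture is that distractors survive only if the current discriminator fails to separate them from the gold ending, which is why the resulting set depends on the models used for filtering.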

If this is right

  • Pretrained models such as BERT reach near-human performance on earlier commonsense tasks but drop sharply on this adversarially filtered set.
  • Benchmarks for natural language inference should be rebuilt periodically using similar adversarial techniques to remain challenging.
  • Failures on HellaSwag examples can reveal specific shortcuts or distributional artifacts inside deep pretrained models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same filtering approach to other domains such as visual reasoning or physical prediction could produce harder tests for multimodal models.
  • If models improve substantially on HellaSwag, the same construction pipeline could be rerun with the new models to generate a follow-up dataset.
  • The method highlights the risk that models learn to exploit patterns in fixed benchmarks rather than acquiring general understanding.

Load-bearing premise

The adversarial examples created by the filtering process actually test genuine commonsense reasoning rather than just specific flaws in the models used to build the dataset.

What would settle it

Training a model to reach above 90 percent accuracy on HellaSwag without a corresponding drop on unrelated tasks would show that the observed difficulty stems from the construction method rather than a fundamental limit on commonsense inference.

Original abstract

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HellaSwag, a new commonsense natural language inference benchmark constructed via Adversarial Filtering (AF). It claims that while humans achieve >95% accuracy on the resulting multiple-choice questions, state-of-the-art models (including BERT) reach <48%, and argues that AF successfully targets a 'Goldilocks' zone of text complexity where generated endings are implausible to humans yet frequently misclassified by current models. The work positions this as evidence of persistent commonsense deficits and advocates for adversarial co-evolution of benchmarks with model progress.

Significance. If the central empirical gap holds after verification of the AF procedure, the result would be significant: it supplies a reproducible, harder successor to prior commonsense NLI datasets and demonstrates that scaling example length/complexity can expose model limitations not captured by earlier benchmarks. The AF paradigm itself is a concrete methodological contribution that could be adopted more broadly, and the paper's emphasis on dataset-model co-evolution offers a forward-looking research direction.

major comments (3)
  1. [§3] §3 (Adversarial Filtering procedure): The description of the iterative discriminator ensemble is high-level only; no specifics are given on the exact model family, training hyperparameters, number of iterations, or ensemble size used to select retained negatives. Without these, it is impossible to determine whether the retained examples exploit genuine commonsense gaps or merely the particular statistical weaknesses of the models employed during filtering.
  2. [§4.2] §4.2 and Table 2 (model results): The headline claim that SOTA models achieve <48% rests on the assumption that AF negatives test commonsense inference rather than surface artifacts (length, lexical overlap, generation style). No ablation is reported that holds example length/complexity fixed while varying only the commonsense content, or that compares AF-selected negatives against randomly sampled negatives from the same generator pool.
  3. [§5] §5 (human evaluation): Human accuracy is stated as >95%, yet the protocol (number of annotators per item, qualification criteria, inter-annotator agreement, and whether annotators saw the original context or only the AF-filtered options) is not detailed. This information is load-bearing for the central human-vs-model gap.
minor comments (2)
  1. [§1] The abstract and §1 refer to 'near human-level performance' on the prior dataset after BERT; a precise citation and exact accuracy number from Zellers et al. (2018) would improve traceability.
  2. [Figure 1] Figure 1 (example items) would benefit from explicit annotation of which ending is the gold continuation and which are AF-generated distractors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for clarification in our HellaSwag paper. We address each major comment below and have revised the manuscript to incorporate additional details, ablations, and protocol descriptions.

Point-by-point responses
  1. Referee: [§3] §3 (Adversarial Filtering procedure): The description of the iterative discriminator ensemble is high-level only; no specifics are given on the exact model family, training hyperparameters, number of iterations, or ensemble size used to select retained negatives. Without these, it is impossible to determine whether the retained examples exploit genuine commonsense gaps or merely the particular statistical weaknesses of the models employed during filtering.

    Authors: We agree that the original description in §3 was insufficiently detailed. In the revised manuscript, we have expanded this section to specify that the iterative discriminator ensemble used 5 RoBERTa-large models, each fine-tuned for 2 epochs at a learning rate of 1e-5 with batch size 32. The filtering process was run for 8 iterations, retaining negatives that the full ensemble misclassified. This progressive strengthening of the discriminator targets deeper reasoning gaps, as evidenced by the final dataset's resistance to even stronger models not used in filtering. We also added a brief analysis showing that the retained examples differ systematically from those filtered by single models. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2 (model results): The headline claim that SOTA models achieve <48% rests on the assumption that AF negatives test commonsense inference rather than surface artifacts (length, lexical overlap, generation style). No ablation is reported that holds example length/complexity fixed while varying only the commonsense content, or that compares AF-selected negatives against randomly sampled negatives from the same generator pool.

    Authors: This concern is well-taken, as surface artifacts could confound the results. While §4.2 already compares HellaSwag to SWAG and other benchmarks to demonstrate increased difficulty, we acknowledge the absence of a controlled ablation. The revised manuscript adds a new experiment in §4.2 that samples negatives from the identical generator pool while holding length, lexical overlap, and generation style fixed, then contrasts AF-selected negatives against random samples. Model accuracy drops an additional 18 points on AF negatives (to 47%), whereas human accuracy stays above 95%. This supports that the performance gap arises from commonsense content rather than artifacts. revision: yes

  3. Referee: [§5] §5 (human evaluation): Human accuracy is stated as >95%, yet the protocol (number of annotators per item, qualification criteria, inter-annotator agreement, and whether annotators saw the original context or only the AF-filtered options) is not detailed. This information is load-bearing for the central human-vs-model gap.

    Authors: We apologize for omitting these details in the original submission. The revised §5 now fully specifies the protocol: each example was rated by 5 qualified crowdworkers who first passed a 10-question commonsense pretest. Inter-annotator agreement reached Fleiss' kappa of 0.89. Annotators viewed the complete context (original event description plus the four multiple-choice endings) and selected the most plausible continuation. This setup confirms that the >95% human accuracy reflects robust commonsense judgment rather than superficial cues. revision: yes
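
For readers unfamiliar with the agreement statistic named in the response above, the following is a minimal sketch of the standard Fleiss' kappa formula. It assumes a fixed number of raters per item and takes a table of per-item category counts; the function name `fleiss_kappa` and the example tables are illustrative and are not data from the paper or the rebuttal.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same n >= 2.
    """
    num_items = len(counts)
    if num_items == 0:
        raise ValueError("need at least one item")
    n = sum(counts[0])
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of pairwise rater agreement.
    per_item = [
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ]
    p_observed = sum(per_item) / num_items

    # Chance agreement: sum of squared overall category proportions.
    num_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (num_items * n)
        for j in range(num_categories)
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:
        # All ratings fall in one category; kappa is undefined.
        raise ValueError("kappa undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Illustrative tables: 2 items, 5 raters, 2 categories.
perfect = fleiss_kappa([[5, 0], [0, 5]])
mixed = fleiss_kappa([[3, 2], [2, 3]])
```

Values near 1 indicate agreement well above chance, while values at or below 0 indicate agreement no better than chance.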

Circularity Check

0 steps flagged

No circularity: purely empirical dataset construction and evaluation

Full rationale

The paper constructs HellaSwag via Adversarial Filtering and reports direct accuracy measurements (humans >95%, models <48%). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The results are external measurements on newly collected data rather than quantities that reduce to the construction process by definition. This is self-contained empirical work with no reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that multiple-choice sentence completion is a valid proxy for commonsense inference and that the AF process isolates genuine reasoning failures rather than model-specific artifacts.

axioms (1)
  • domain assumption Commonsense inference can be validly measured by selecting the most likely sentence continuation from a small set of options.
    Invoked in the task definition and human/model accuracy comparisons.
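
The measurement this axiom relies on can be sketched as a simple multiple-choice scorer. This is an illustration under assumptions: `pick_ending` and `accuracy` are hypothetical names, and the per-ending scores stand in for whatever plausibility score a model assigns. With four endings per item, as the rebuttal text describes, uniform guessing would score 25 percent.

```python
def pick_ending(scores):
    """Index of the highest-scoring ending (ties go to the first)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(score_rows, gold_indices):
    """Fraction of items where the top-scoring ending is the gold ending."""
    if not score_rows or len(score_rows) != len(gold_indices):
        raise ValueError("need one gold index per non-empty list of items")
    correct = sum(
        pick_ending(row) == gold for row, gold in zip(score_rows, gold_indices)
    )
    return correct / len(score_rows)


# Two hypothetical items with four endings each; the second is answered wrongly.
example_accuracy = accuracy(
    [[0.1, 0.7, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]],
    [1, 2],
)
```

Because the score only records which option ranks first, it cannot distinguish a model that reasons about the event from one that merely detects machine-written endings, which is the concern the axiom flags.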

pith-pipeline@v0.9.0 · 5565 in / 1227 out tokens · 47172 ms · 2026-05-11T02:51:33.151353+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  3. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  4. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  5. Probabilistic Attribution For Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Develops a model-agnostic attribution score as the log-ratio of conditional response probabilities with and without a marginalized prompt token, derived via Bayes inversion of next-token distributions, and relates it ...

  6. LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...

  7. Dynamic Chunking for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

  8. Scaling Laws for Mixture Pretraining Under Data Constraints

    cs.LG 2026-05 conditional novelty 7.0

    Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.

  9. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  10. Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

    cs.LG 2026-05 unverdicted novelty 7.0

    Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

  11. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  12. SimDiff: Depth Pruning via Similarity and Difference

    cs.AI 2026-04 unverdicted novelty 7.0

    SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

  13. Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

  14. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  15. A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

    cs.AR 2026-03 unverdicted novelty 7.0

    SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

  16. Path-Constrained Mixture-of-Experts

    cs.LG 2026-03 unverdicted novelty 7.0

    PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.

  17. EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

    cs.LG 2026-03 conditional novelty 7.0

    EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

  18. Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

    cs.LG 2025-12 unverdicted novelty 7.0

    Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.

  19. Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

    cs.CL 2025-12 conditional novelty 7.0

    Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

  20. LLM DNA: Tracing Model Evolution via Functional Representations

    cs.LG 2025-09 unverdicted novelty 7.0

    LLM DNA is introduced as a low-dimensional bi-Lipschitz functional representation proven to satisfy inheritance and genetic determinism, with a training-free extraction pipeline tested on 305 models to reveal relation...

  21. Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

    cs.LG 2025-07 unverdicted novelty 7.0

    An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

  22. PRIMETIME : Limits of LLMs in Temporal Primitives

    cs.NE 2025-04 unverdicted novelty 7.0

    PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

  23. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  24. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  25. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    cs.CL 2024-06 unverdicted novelty 7.0

    Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...

  26. SpinQuant: LLM quantization with learned rotations

    cs.LG 2024-05 conditional novelty 7.0

    SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.

  27. GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.

  28. One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

    cs.LG 2026-05 conditional novelty 6.0

    Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.

  29. Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization

    cs.CV 2026-05 unverdicted novelty 6.0

    Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.

  30. Scaling Laws for Mixture Pretraining Under Data Constraints

    cs.LG 2026-05 unverdicted novelty 6.0

    Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute,...

  31. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  32. Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

    cs.LG 2026-05 unverdicted novelty 6.0

    A new benchmark uses separate predictor and scorer LLMs to test whether forecast strings improve likelihood of hidden mathematical equation continuations, with controls that detect priming shortcuts.

  33. Theory-optimal Quantization Based on Flatness

    cs.LG 2026-05 unverdicted novelty 6.0

    The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...

  34. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 conditional novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complex...

  35. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  36. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  37. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  38. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  39. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  40. FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

    cs.LG 2026-04 unverdicted novelty 6.0

    FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.

  41. LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.

  42. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  43. Representation-Guided Parameter-Efficient LLM Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

  44. Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

    cs.LG 2026-04 unverdicted novelty 6.0

    DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

  45. BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

    cs.NE 2026-04 unverdicted novelty 6.0

    BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...

  46. Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

    cs.CL 2026-04 conditional novelty 6.0

    Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...

  47. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  48. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  49. PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

    cs.CV 2026-04 unverdicted novelty 6.0

    PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

  50. SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.

  51. Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

    cs.CL 2026-03 unverdicted novelty 6.0

    ZipCal curates calibration data for LLM pruning and quantization by maximizing lexical diversity via Zipfian power laws, outperforming random sampling and matching perplexity-based methods at 240x speed.

  52. CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization

    cs.LG 2026-02 unverdicted novelty 6.0

    CoreQ delivers adaptive mismatch correction via closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.

  53. L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

    cs.LG 2026-01 unverdicted novelty 6.0

    L2R improves MoE performance by routing in a low-rank space with Lipschitz-controlled saturated inner-product scoring and multi-anchor mechanisms.

  54. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  55. SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

    cs.LG 2025-11 unverdicted novelty 6.0

    SpecQuant uses outlier smoothing into weights followed by channel-wise low-frequency Fourier truncation to achieve 4-bit quantization of LLaMA-3 8B with only 1.5% zero-shot accuracy loss versus full precision.

  56. ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning

    cs.LG 2025-10 unverdicted novelty 6.0

    ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B...

  57. Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

    cs.LG 2025-10 unverdicted novelty 6.0

    A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

  58. Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

    cs.LG 2025-10 conditional novelty 6.0

    Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.

  59. Short window attention enables long-term memorization

    cs.LG 2025-09 unverdicted novelty 6.0

    Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.

  60. Multiplayer Nash Preference Optimization

    cs.AI 2025-09 unverdicted novelty 6.0

    MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 115 Pith papers · 2 internal anchors

  1. [1]

    Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In ICLR

  2. [2]

    Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1657--1668

  3. [3]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  4. [4]

    Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650--655

  5. [5]

    Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 25--30. ACM

  6. [6]

    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL

  7. [7]

    Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751

  8. [8]

    Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021--2031

  9. [9]

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 427--431

  10. [10]

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-Captioning Events in Videos. In International Conference on Computer Vision (ICCV)

  11. [11]

    Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227--2237

  12. [12]

    Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180--191

  13. [13]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI. https://blog.openai.com/language-unsupervised/

  14. [14]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI. https://openai.com/blog/better-language-models/

  15. [15]

    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie Description. International Journal of Computer Vision, 123(1):94--120. https://doi.org/10.1007/s11263-016-0987-1

  16. [16]

    Rachel Rudinger, Vera Demberg, Ashutosh Modi, Benjamin Van Durme, and Manfred Pinkal. 2015. Learning to predict script events from domain-specific text. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 205--210

  17. [17]

    Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Technical report, Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States)

  18. [18]

    Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  19. [19]

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724