hub

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=

HellaSwag: Can a Machine Really Finish Your Sentence? , author=

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

browse 12 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Simply Stabilizing the Loop via Fully Looped Transformer

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.

Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.

Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

cs.CL · 2026-05-13 · conditional · novelty 6.0

OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.

Compute Optimal Tokenization

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

In compute-optimal regimes, language model parameter count scales proportionally with data bytes rather than tokens, and the optimal compression rate decreases with increasing compute.

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

cs.CL · 2024-10-23 · conditional · novelty 6.0

Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.

The Efficiency Gap in Byte Modeling

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.

Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

cs.CL · 2026-05-03 · unverdicted · novelty 5.0

A hypernetwork produces a condition-dependent beta that meta-gates SwiGLU nonlinearity, giving LLMs adaptive behavior across task, domain, persona and style inputs without finetuning.

GiVA: Gradient-Informed Bases for Vector-Based Adaptation

cs.CL · 2026-04-23 · unverdicted · novelty 5.0

GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.

Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

cs.CL · 2026-04-22 · unverdicted · novelty 4.0

Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.

citing papers explorer

Showing 10 of 10 citing papers after filters.

Simply Stabilizing the Loop via Fully Looped Transformer cs.LG · 2026-05-11 · unverdicted · none · ref 60
Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 38
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions cs.CL · 2026-05-08 · unverdicted · none · ref 16
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion cs.LG · 2026-05-05 · unverdicted · none · ref 8
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks cs.LG · 2026-05-11 · unverdicted · none · ref 37
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
Compute Optimal Tokenization cs.CL · 2026-05-02 · unverdicted · none · ref 20
In compute-optimal regimes, language model parameter count scales proportionally with data bytes rather than tokens, and the optimal compression rate decreases with increasing compute.
The Efficiency Gap in Byte Modeling cs.LG · 2026-05-13 · unverdicted · none · ref 34
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM cs.CL · 2026-05-03 · unverdicted · none · ref 110
A hypernetwork produces a condition-dependent beta that meta-gates SwiGLU nonlinearity, giving LLMs adaptive behavior across task, domain, persona and style inputs without finetuning.
GiVA: Gradient-Informed Bases for Vector-Based Adaptation cs.CL · 2026-04-23 · unverdicted · none · ref 9
GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.
Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection cs.CL · 2026-04-22 · unverdicted · none · ref 35
Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer