Language Models are Unsupervised Multitask Learners , url =

Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, Sutskever, Ilya , biburl =

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

browse 6 citing papers

representative citing papers

Language Acquisition Device in Large Language Models

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

Surprisal minimization over goal-directed alternatives generated by language models provides the strongest account of production choices in open-ended dialogue compared to uniform information density or length-based costs.

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger scales up to 7B models and 425B tokens.

When and Why Grouping Attention Heads Accelerates Muon Optimization

cs.LG · 2026-05-09 · unverdicted · novelty 5.0

Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.

Multi-Gate Residuals

cs.LG · 2026-05-22 · unverdicted · novelty 3.0

Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.

citing papers explorer

Showing 6 of 6 citing papers.

Language Acquisition Device in Large Language Models cs.CL · 2026-05-16 · unverdicted · none · ref 96
Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.
From Mechanistic to Compositional Interpretability cs.LG · 2026-05-09 · unverdicted · none · ref 130
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue cs.CL · 2026-05-01 · unverdicted · none · ref 119
Surprisal minimization over goal-directed alternatives generated by language models provides the strongest account of production choices in open-ended dialogue compared to uniform information density or length-based costs.
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition cs.CL · 2026-05-04 · unverdicted · none · ref 34
InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger scales up to 7B models and 425B tokens.
When and Why Grouping Attention Heads Accelerates Muon Optimization cs.LG · 2026-05-09 · unverdicted · none · ref 1
Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.
Multi-Gate Residuals cs.LG · 2026-05-22 · unverdicted · none · ref 6
Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.

Language Models are Unsupervised Multitask Learners , url =

fields

years

verdicts

representative citing papers

citing papers explorer