Language Models are Unsupervised Multitask Learners

6 Pith papers cite this work. Polarity classification is still indexing.

fields: cs.CL 3 · cs.LG 3

years: 2026 6

verdicts: UNVERDICTED 6

representative citing papers

Language Acquisition Device in Large Language Models

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency, adds human-like resistance to implausible languages, and challenges the need for C-RASP definability in effective PPT languages.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

Multi-Gate Residuals

cs.LG · 2026-05-22 · unverdicted · novelty 3.0

Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.

citing papers explorer

  • Language Acquisition Device in Large Language Models cs.CL · 2026-05-16 · unverdicted · none · ref 96

    Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency, adds human-like resistance to implausible languages, and challenges the need for C-RASP definability in effective PPT languages.

  • From Mechanistic to Compositional Interpretability cs.LG · 2026-05-09 · unverdicted · none · ref 130

    Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

  • Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue cs.CL · 2026-05-01 · unverdicted · none · ref 119

    Surprisal minimization over goal-directed alternatives generated by language models gives a stronger account of production choices in open-ended dialogue than uniform information density or length-based costs.

  • InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition cs.CL · 2026-05-04 · unverdicted · none · ref 34

    InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger scales up to 7B models and 425B tokens.

  • When and Why Grouping Attention Heads Accelerates Muon Optimization cs.LG · 2026-05-09 · unverdicted · none · ref 1

    Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.

  • Multi-Gate Residuals cs.LG · 2026-05-22 · unverdicted · none · ref 6

    Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.