Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever · 2019

5 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Dynamic Latent Routing

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across four datasets and six models.

The EΔ-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

The EΔ-MHC-Geo Transformer achieves input-adaptive unconditionally orthogonal residual connections via a Cayley-based rotation that works for all parameters, combined with a learned hybrid gate for reflections.

Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction

cs.LG · 2026-03-13 · unverdicted · novelty 6.0

GICON combines graph message passing with example-aware positional encoding to enable in-context operator learning that outperforms classical operator learning on air quality prediction tasks across regions.

Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning

cs.LG · 2026-02-15 · unverdicted · novelty 6.0

Task information structure determines ML scaling success, with code's dense verifiable signals enabling predictable progress while sparse-feedback tasks like typical RL do not.

Graph Memory Transformer (GMT)

cs.LG · 2026-04-26

citing papers explorer

Showing 5 of 5 citing papers.

Dynamic Latent Routing · cs.LG · 2026-05-14 · unverdicted · none · ref 36
The EΔ-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality · cs.LG · 2026-05-07 · unverdicted · none · ref 28
Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction · cs.LG · 2026-03-13 · unverdicted · none · ref 13
Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning · cs.LG · 2026-02-15 · unverdicted · none · ref 31
Graph Memory Transformer (GMT) · cs.LG · 2026-04-26 · unreviewed · ref 28
