Language models are unsupervised multitask learners
5 Pith papers cite this work.
fields: cs.LG (5)
years: 2026 (5)
citing papers explorer
- Dynamic Latent Routing
  Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search. In low-data settings it matches or exceeds supervised fine-tuning, by 6.6 points on average across four datasets and six models.
- The EΔ-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality
  The EΔ-MHC-Geo Transformer achieves input-adaptive, unconditionally orthogonal residual connections via a Cayley-based rotation that works for all parameters, combined with a learned hybrid gate for reflections.
- Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction
  GICON combines graph message passing with example-aware positional encoding to enable in-context operator learning that outperforms classical operator learning on air quality prediction tasks across regions.
- Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning
  Task information structure determines ML scaling success: code's dense, verifiable signals enable predictable progress, whereas sparse-feedback tasks such as typical RL do not.
- Graph Memory Transformer (GMT)