Transformers learn to implement multi-step gradient descent with chain of thought
2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Scaling Latent Reasoning via Looped Language Models
  Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
- The Power of Power Law: Asymmetry Enables Compositional Reasoning
  Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.