Transformers provably learn chain-of-thought reasoning with length generalization
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End
  End-to-end sample complexity for autoregressive generators can realize any scaling rate r(T) between constant and linear, while chain-of-thought supervision eliminates all dependence on T.
- The Power of Power Law: Asymmetry Enables Compositional Reasoning
  Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.