Transformers provably learn chain-of-thought reasoning with length generalization

Huang, Y · 2025 · arXiv 2511.07378

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End

cs.LG · 2026-04-13 · unverdicted · novelty 8.0

End-to-end sample complexity for autoregressive generators can realize any scaling rate r(T) between constant and linear, while chain-of-thought supervision eliminates all dependence on T.

The Power of Power Law: Asymmetry Enables Compositional Reasoning

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.

citing papers explorer

Showing 2 of 2 citing papers.

Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End

cs.LG · 2026-04-13 · unverdicted · none · ref 6

End-to-end sample complexity for autoregressive generators can realize any scaling rate r(T) between constant and linear, while chain-of-thought supervision eliminates all dependence on T.

The Power of Power Law: Asymmetry Enables Compositional Reasoning

cs.AI · 2026-04-24 · unverdicted · none · ref 24

Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.
