pith. machine review for the scientific record.

arxiv: 2604.22951 · v1 · submitted 2026-04-24 · 💻 cs.AI · cs.CL · cs.LG

Recognition: unknown

The Power of Power Law: Asymmetry Enables Compositional Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:46 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords compositional reasoning · power-law distribution · data efficiency · loss landscape · skill composition · state tracking · multi-step arithmetic

The pith

Power-law distributions let models learn rare compositional skills with far less data than uniform sampling does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that training on power-law distributions, in which most skills appear rarely, outperforms training on uniform distributions for tasks that require combining skills, such as state tracking and multi-step arithmetic. It introduces a minimalist skill-composition task and proves that the advantage arises because power-law sampling creates an asymmetry in the loss landscape: models master frequent skill combinations first, then use them as building blocks to acquire rare combinations more efficiently. The result challenges the common practice of reweighting data toward uniformity and indicates that preserving natural frequency imbalances can reduce the data needed for compositional reasoning.

Core claim

Training under power-law distributions consistently outperforms training under uniform distributions across a wide range of compositional reasoning tasks such as state tracking and multi-step arithmetic. The paper proves this advantage in a minimalist skill-composition task, showing that power-law sampling requires significantly less training data. The mechanism is that the induced asymmetry improves the pathological loss landscape, enabling models to first acquire high-frequency skill compositions with low data complexity; these compositions then serve as stepping stones to efficiently learn rare long-tailed skills.
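To make the mechanism concrete, here is a minimal sketch, not the paper's code, of how training examples for an S5-style composition task (the state-tracking setup described in the figure captions below) could be drawn under power-law versus uniform skill distributions. The function names, the exponent α = 1.5, and the sequence length are illustrative assumptions.

```python
import numpy as np
from itertools import permutations

# The 120 elements of S5 play the role of atomic "skills" to be composed,
# mirroring the state-tracking task described in the figures.
SKILLS = list(permutations(range(5)))

def zipf_probs(n, alpha=1.5):
    """Rank-based power-law (Zipf) probabilities: p_i proportional to 1 / i**alpha."""
    ranks = np.arange(1, n + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

def sample_training_example(rng, probs, length=4):
    """Draw `length` skills i.i.d. from `probs` and compose them.

    The input is the sequence of permutation indices; the target is the index
    of their composition, so answering requires composing the sampled skills.
    """
    idx = rng.choice(len(SKILLS), size=length, p=probs)
    state = tuple(range(5))
    for i in idx:
        state = tuple(state[j] for j in SKILLS[i])  # apply the i-th permutation
    return idx.tolist(), SKILLS.index(state)

rng = np.random.default_rng(0)
power_law = zipf_probs(len(SKILLS), alpha=1.5)      # a few head skills dominate
uniform = np.full(len(SKILLS), 1.0 / len(SKILLS))   # every skill equally rare

x_pl, y_pl = sample_training_example(rng, power_law)
x_un, y_un = sample_training_example(rng, uniform)
```

Under the power-law sampler most drawn sequences are built from a handful of head permutations, which is exactly the frequency asymmetry the paper credits for the improved loss landscape.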

What carries the argument

The asymmetry induced by power-law sampling in the loss landscape of skill-composition tasks.
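Schematically, and not as the paper's exact statement, the training objective under a skill distribution p can be written as a weighted mixture of per-skill-combination losses; the notation below is illustrative.

```latex
% Illustrative mixture objective over K skill combinations (notation assumed, not quoted from the paper)
\mathcal{L}_p(\theta) = \sum_{k=1}^{K} p_k \, \ell_k(\theta),
\qquad p_k \propto k^{-\alpha} \ \text{(power law)},
\qquad p_k = \tfrac{1}{K} \ \text{(uniform)}.
```

With p_k ∝ k^(-α), the gradient of L_p at initialization is dominated by a few head terms, which gives a steep, usable descent direction; under the uniform mixture the K per-combination gradients are equally weighted, which matches the flat initial region the paper reports in Figure 2.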

If this is right

  • High-frequency skill compositions are learned early with low data complexity.
  • These compositions act as stepping stones that reduce the data needed for rare combinations.
  • Overall training data requirements drop for achieving compositional reasoning.
  • Data curation that flattens natural frequencies can slow progress on long-tail skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Data pipelines for language models might benefit from keeping rather than correcting power-law imbalances.
  • The stepping-stone effect could appear in other hierarchical domains such as planning or visual composition.
  • Scaling experiments on larger models could quantify the data savings in practice.
  • The loss-landscape asymmetry offers a new angle on why certain pretraining distributions work well.

Load-bearing premise

The minimalist skill-composition task faithfully captures the structure of real compositional reasoning problems so that the data-efficiency advantage transfers.

What would settle it

A controlled experiment on a real compositional task where uniform sampling reaches target performance with equal or fewer examples than power-law sampling.
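A minimal harness for that deciding experiment might look like the sketch below; `train_and_eval` and `sampler` are hypothetical callables standing in for whatever model and data pipeline are used, and the target accuracy and step budget are illustrative.

```python
def examples_to_target(train_and_eval, sampler, target_acc=0.9, batch=256, max_steps=10_000):
    """Count training examples consumed before held-out accuracy reaches `target_acc`.

    `sampler(batch)` draws a batch from either the power-law or the uniform
    distribution; `train_and_eval(batch_of_examples)` is assumed to update the
    model in place and return accuracy on a fixed, uniformly sampled test set.
    """
    seen = 0
    for _ in range(max_steps):
        accuracy = train_and_eval(sampler(batch))
        seen += batch
        if accuracy >= target_acc:
            return seen
    return None  # target accuracy never reached within the budget
```

Running the same harness once with a power-law sampler and once with a uniform sampler, with the model and hyperparameters held fixed, would directly compare how many examples each distribution needs to reach the target.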

Figures

Figures reproduced from arXiv: 2604.22951 by Jason D. Lee, Kaifeng Lyu, Xingyu Dang, Zixuan Wang.

Figure 1
Figure 1: Compositional reasoning tasks require composing skills. Changing only the distribution of skills from uniform to power law lets language models learn compositional reasoning tasks faster (arithmetic), or even turns an unlearnable task (e.g., state tracking) into a learnable one.
Figure 2
Figure 2: Initial loss landscape comparison. The loss landscape is plotted under both the uniform and the power-law distribution, with the two plotting directions determined by PCA of the trained checkpoints along the trajectory. The training loss landscape under the power-law distribution has a visibly steeper slope along the descent direction, while training on the uniform distribution fails to escape the initial flat region.
Figure 3
Figure 3: Training dynamics under the power-law loss on S5. Left: total test loss and test loss on subsets composed from different groups of permutations (ordered by rank). Middle: test accuracy of each group. Right: gradient norm on samples that require tail skills. Once the head skills are learned after stage 1, the gradient norm increases and speeds up the learning of tail skills.
Figure 4
Figure 4: Left: test accuracy on synthetic iGSM data, with the number of operations restricted to 2-8 and all arithmetic done modulo p = 211. Right: a multi-hop task with |E| = 50 individuals, each with |R| = 20 relations, and hop count k = 3.
Figure 5
Figure 5: Left: test loss on subsets of samples composed from different groups of permutations (ordered by rank). Right: loss landscape comparison for multi-hop QA tasks. The power-law loss landscape still has a steeper slope, indicating that the benefit of the power-law distribution for training is general.
Figure 6
Figure 6: Results for arithmetic trained from scratch (Qwen3-0.6B, 4 operations over [1, 50], operators {+, −, ×}). Zipf(α = 1.0) with a shuffled rank-to-value mapping vs. uniform sampling, evaluated on uniformly sampled test expressions. Solid lines show the mean over 4 seeds; shaded regions show the min-max range. Both the eval set and the loss-validation set are sampled from the uniform distribution. Zipf reaches ∼82% accuracy.
Figure 7
Figure 7: Loss and accuracy curves for α = 0.5 (top) and α = 0.75 (bottom). When the exponent is not large enough, the optimization is still not benign enough for successful learning of composition.
Figure 8
Figure 8: Loss and accuracy curves for α = 1.0 (top), α = 1.25 (middle), and α = 1.5 (bottom).
Figure 9
Figure 9: Landscape visualization for different values of α. Larger α yields a better initial landscape, while the landscapes for small α remain flat.
Figure 10
Figure 10: Order ablation experiments on S5. The top two panels use lexicographical order and the middle two use the random order employed in the power-law sampling; learning with the lexicographical order is slightly faster. Here α = 1.5. The bottom panel uses reverse lexicographical order, and the per-bin training accuracy still shows similar performance.
Figure 14
Figure 14: Test accuracy on the multi-hop QA tasks. Training with a fine-grained power law outperforms the coarser-grained (binned) power law and the uniform distribution.
Figure 11
Figure 11: (Top) Explicit curriculum with a mixture of k = 1, 2, 3, 4 hops under the uniform distribution. The model eventually learns group-element composition, but the loss plateaus occasionally, suggesting that a curriculum alone does not fully remove the optimization difficulty. (Bottom) Power law plus explicit curriculum further accelerates training and substantially reduces the plateaus.
Figure 12
Figure 12: Granularity ablation experiments on S5. From top to bottom, the number of bins is {5, 10, 20, 40, 60}; bin i receives total probability P_i,sum ∝ 1/i^α, and with 120 bins the setup falls back to the original fine-grained power law. Here α = 1.5. The coarse-grained power law learns much more slowly than the fine-grained power law, suggesting that fine-grained asymmetry is key to improving the landscape when the task is intrinsically symmetric.
Figure 13
Figure 13: Accuracy plots for the multi-hop QA task. Across different data settings, the power-law distribution generally accelerates learning of such multi-hop natural-language reasoning tasks. The difficulty increases as the hop count k and the number of individuals |E| grow, but the power law consistently helps training.
Figure 15
Figure 15: Accuracy plots for GSM tasks. Left: non-modular arithmetic with maximum leaf value p = 200. Right: modular arithmetic with p = 211, with a multi-hop template that randomly combines two steps. Power-law distributions generally help the model learn to solve synthetic grade-school math problems much faster than the uniform distribution.
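As a reference point for the ablation figures, here is a minimal sketch of the two sampling schemes they describe: the fine-grained rank-based power law (Figures 6 to 8) and its binned, coarse-grained variant (Figures 12 and 14), in which bin i receives total probability proportional to 1/i^α. Function names are illustrative, and sharing each bin's mass uniformly among its skills is an assumption of this sketch rather than a detail quoted from the paper.

```python
import numpy as np

def binned_power_law(n_skills=120, n_bins=10, alpha=1.5):
    """Coarse-grained power law over skills (illustrative construction).

    Skills are split into `n_bins` contiguous bins; bin i gets total probability
    proportional to 1 / i**alpha, assumed here to be shared uniformly inside the
    bin. With n_bins == n_skills this reduces to the fine-grained power law.
    """
    bins = np.array_split(np.arange(n_skills), n_bins)
    bin_mass = np.arange(1, n_bins + 1, dtype=float) ** (-alpha)
    bin_mass /= bin_mass.sum()
    probs = np.empty(n_skills)
    for i, members in enumerate(bins):
        probs[members] = bin_mass[i] / len(members)
    return probs

fine = binned_power_law(n_bins=120)   # fine-grained power law over all 120 skills
coarse = binned_power_law(n_bins=5)   # coarsest setting in the granularity ablation
```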
read the original abstract

Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that training under power-law distributions outperforms uniform distributions on compositional reasoning tasks such as state tracking and multi-step arithmetic. It supports this empirically across multiple tasks and theoretically via a minimalist skill-composition task, where power-law sampling induces asymmetry in the loss landscape that lets high-frequency skill compositions serve as stepping stones, provably reducing the data needed to learn rare long-tail skills.

Significance. If the central claim holds, the result challenges the common practice of reweighting data toward uniformity for long-tail coverage and instead highlights the value of preserving natural power-law statistics for compositional generalization. The work earns credit for running experiments across several tasks that support the performance claim and for providing a minimalist-task analysis that yields a concrete data-complexity advantage; these elements make the contribution more than purely empirical.

major comments (2)
  1. [theoretical analysis / minimalist skill-composition task] The provable data-efficiency result is derived from the authors' own definition of the minimalist task rather than from an external benchmark, and the manuscript supplies no quantitative argument or ablation showing that the clean separability and tree-like composition assumed there survive the overlapping sub-components and context coupling present in the empirical state-tracking tasks.
  2. [discussion of loss-landscape asymmetry] Mapping from toy model to full-scale loss landscapes: the paper asserts that the asymmetry benefit transfers to real compositional reasoning, yet no diagnostic (e.g., loss-surface visualization or gradient-flow analysis on the actual state-tracking or arithmetic models) is provided to verify that high-frequency stepping stones remain useful once shared sub-skills and contextual dependencies are introduced.
minor comments (1)
  1. [abstract] The abstract and introduction could more explicitly delimit the range of tasks and model scales on which the power-law advantage was measured, to help readers assess the scope of the empirical claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the scope and limitations of our theoretical analysis. We address each major point below, providing additional context on the design of the minimalist task and the nature of the loss-landscape argument while outlining targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [theoretical analysis / minimalist skill-composition task] The provable data-efficiency result is derived from the authors' own definition of the minimalist task rather than from an external benchmark, and the manuscript supplies no quantitative argument or ablation showing that the clean separability and tree-like composition assumed there survive the overlapping sub-components and context coupling present in the empirical state-tracking tasks.

    Authors: The minimalist task is deliberately constructed as an abstracted model that isolates the effect of power-law sampling on compositional acquisition, enabling a clean proof of reduced data complexity via the stepping-stone mechanism. The separability and tree structure are modeling choices that make the asymmetry in the loss landscape mathematically tractable; they are not claimed to be literal replicas of natural language. In the empirical sections, state-tracking and arithmetic tasks do contain overlapping sub-skills and contextual dependencies, yet the consistent performance advantage under power-law sampling provides supporting evidence that the core dynamic persists. We do not provide a new quantitative ablation bridging the two in the current version. We will revise the manuscript to add an explicit discussion paragraph that articulates the modeling assumptions, notes the gap between the toy setting and full tasks, and explains why the empirical results are still consistent with the predicted mechanism. revision: partial

  2. Referee: [discussion of loss-landscape asymmetry] Mapping from toy model to full-scale loss landscapes: the paper asserts that the asymmetry benefit transfers to real compositional reasoning, yet no diagnostic (e.g., loss-surface visualization or gradient-flow analysis on the actual state-tracking or arithmetic models) is provided to verify that high-frequency stepping stones remain useful once shared sub-skills and contextual dependencies are introduced.

    Authors: The loss-landscape asymmetry is rigorously characterized only within the minimalist task, where the loss function can be analyzed directly. Extending the same visualization or gradient-flow diagnostics to the high-dimensional parameter spaces of the state-tracking and arithmetic models is computationally prohibitive and rarely yields interpretable surfaces even when attempted. The transfer argument therefore rests on the theoretical insight plus the observed generalization gap between power-law and uniform training across multiple tasks. We agree that additional diagnostics would be valuable. We will revise the discussion section to acknowledge this limitation explicitly and to outline feasible future probes, such as measuring the acquisition order of high-frequency versus low-frequency compositions via intermediate checkpoints or representation similarity. revision: partial
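One such probe could be as simple as the sketch below: evaluate each saved checkpoint on test examples bucketed by the frequency rank of their rarest skill, and check whether head buckets saturate before tail buckets start to move. `checkpoints` and `eval_bin_accuracy` are hypothetical placeholders for the actual training stack.

```python
def acquisition_order(checkpoints, eval_bin_accuracy, n_bins=5):
    """Per-frequency-bin accuracy for each checkpoint, in training order.

    `checkpoints` is an ordered list of saved model states; `eval_bin_accuracy(model, b)`
    is assumed to return accuracy on test examples whose rarest skill falls in
    frequency bin `b` (bin 0 = head skills, bin n_bins - 1 = tail). A stepping-stone
    dynamic shows up as head bins saturating before tail bins begin to improve.
    """
    return [[eval_bin_accuracy(model, b) for b in range(n_bins)] for model in checkpoints]
```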

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained.

full rationale

The paper introduces a minimalist skill-composition task as an explicit theoretical construct to isolate the effect of power-law sampling on compositional learning. The claimed provable data-efficiency advantage is derived as a consequence of the sampling distribution within this model (asymmetry improving the loss landscape for high-frequency stepping stones), which does not reduce to a tautology or self-referential definition. Empirical results on state tracking and arithmetic tasks are reported separately and do not depend on the theory for validity. No load-bearing step equates a prediction to its fitted input by construction, nor does any uniqueness theorem or ansatz reduce to prior self-citation. The analysis remains within standard bounds for model-based proofs and does not import external results that collapse back to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the authors' minimalist skill-composition task is representative and that power-law sampling induces a beneficial asymmetry not captured by uniform sampling; no new entities are postulated.

axioms (2)
  • domain assumption Natural language data follows a power-law distribution over skills and compositions.
    Invoked in the opening sentence of the abstract as the starting point for the data-distribution comparison.
  • ad hoc to paper The minimalist skill-composition task isolates the essential structure of real compositional reasoning.
    Required for the provable data-efficiency result to transfer to the listed downstream tasks.

pith-pipeline@v0.9.0 · 5463 in / 1327 out tokens · 60620 ms · 2026-05-08T11:46:09.427692+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 39 canonical work pages · 6 internal anchors

  1. [1]
  2. [2] Abbe, E., Boix-Adsera, E., Brennan, M. S., Bresler, G., and Nagaraj, D. The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989–27002, 2021.
  3. [3] Abbe, E., Adsera, E. B., and Misiakiewicz, T. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pp. 4782–4887. PMLR, 2022.
  4. [4] Abbe, E., Cornacchia, E., and Lotfi, A. Provable advantage of curriculum learning on parity targets with mixed inputs. Advances in Neural Information Processing Systems, 36:24291–24321, 2023.
  5. [5] Allen-Zhu, Z. and Li, Y. Physics of language models: Part 3.2, knowledge manipulation. arXiv preprint arXiv:2309.14402, 2023.
  6. [6] Arnaboldi, L., Loureiro, B., Stephan, L., Krzakala, F., and Zdeborova, L. Asymptotics of SGD in sequence-single index models and single-layer attention networks. arXiv preprint arXiv:2506.02651, 2025.
  7. [7] Arora, S. and Goyal, A. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023.
  8. [8] Barak, B., Edelman, B., Goel, S., Kakade, S., Malach, E., and Zhang, C. Hidden progress in deep learning: SGD learns parities near the computational limit. Advances in Neural Information Processing Systems, 35:21750–21764, 2022.
  9. [9] Biran, E., Gottesman, D., Yang, S., Geva, M., and Globerson, A. Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775, 2024.
  10. [10] Chen, L., Peng, B., and Wu, H. Theoretical limitations of multi-layer transformer. arXiv preprint arXiv:2412.02975, 2024.
  11. [11] Chen, M., Roberts, N., Bhatia, K., Wang, J., Zhang, C., Sala, F., and Ré, C. Skill-it! A data-driven skills framework for understanding and training language models. Advances in Neural Information Processing Systems, 36:36000–36040, 2023.
  12. [12] Cornacchia, E. and Mossel, E. A mathematical model for curriculum learning for parities. In International Conference on Machine Learning, pp. 6402–6423. PMLR, 2023.
  13. [13] Cornacchia, E., Mikulincer, D., and Mossel, E. Low-dimensional functions are efficiently learnable under randomly biased distributions. arXiv preprint arXiv:2502.06443, 2025.
  14. [14] Damian, A., Lee, J., and Soltanolkotabi, M. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, pp. 5413–5452. PMLR, 2022.
  15. [15] Daniely, A. and Malach, E. Learning parities with neural networks. Advances in Neural Information Processing Systems, 33:20356–20365, 2020.
  16. [16] Deng, Y., Choi, Y., and Shieber, S. From explicit CoT to implicit CoT: Learning to internalize CoT step by step. arXiv preprint arXiv:2405.14838, 2024.
  17. [17] Didolkar, A., Goyal, A., Ke, N. R., Guo, S., Valko, M., Lillicrap, T., Jimenez Rezende, D., Bengio, Y., Mozer, M. C., and Arora, S. Metacognitive capabilities of LLMs: An exploration in mathematical problem solving. Advances in Neural Information Processing Systems, 37:19783–19812, 2024.
  18. [18] Dudeja, R. and Hsu, D. Learning single-index models in Gaussian space. In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 1887–1930. PMLR, 2018. URL https://proceedings.mlr.press/v75/dudeja18a.html
  19. [19] Gordon, M. A., Duh, K., and Kaplan, J. Data and parameter scaling laws for neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5915–5922, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  20. [20] Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
  21. [21] Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
  22. [22] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  23. [23] Huang, J., Wang, Z., and Lee, J. D. Transformers learn to implement multi-step gradient descent with chain of thought. arXiv preprint arXiv:2502.21212, 2025a.
  24. [24] Huang, Y., Wen, Z., Singh, A., Chi, Y., and Chen, Y. Transformers provably learn chain-of-thought reasoning with length generalization. arXiv preprint arXiv:2511.07378, 2025b.
  25. [25] Jamal, M. A., Brown, M., Yang, M.-H., Wang, L., and Gong, B. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7610–7619, 2020.
  26. [26] Ju, T., Chen, Y., Yuan, X., Zhang, Z., Du, W., Zheng, Y., and Liu, G. Investigating multi-hop factual shortcuts in knowledge editing of large language models. arXiv preprint arXiv:2402.11900, 2024.
  27. [27] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  28. [28] Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
  29. [29] Kassner, N., Krojer, B., and Schütze, H. Are pretrained language models symbolic reasoners over knowledge? arXiv preprint arXiv:2006.10413, 2020.
  30. [30] Kim, J. and Suzuki, T. Transformers provably solve parity efficiently with chain of thought. arXiv preprint arXiv:2410.08633, 2024.
  31. [31] Li, Z., Jiang, G., Xie, H., Song, L., Lian, D., and Wei, Y. Understanding and patching compositional reasoning in LLMs. arXiv preprint arXiv:2402.14328, 2024a.
  32. [32] Li, Z., Liu, H., Zhou, D., and Ma, T. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 2024b.
  33. [33] Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=De4FYqjFueZ
  34. [34] Liu, Z., Liu, Y., Michaud, E. J., Gore, J., and Tegmark, M. Physics of skill learning. arXiv preprint arXiv:2501.12391, 2025.
  35. [35] Martinez-Taboada, D. and Ramdas, A. Empirical Bernstein in smooth Banach spaces. arXiv preprint arXiv:2409.06060, 2024.
  36. [36] Medvedev, M., Lyu, K., Li, Z., and Srebro, N. Shift is good: Mismatched data mixing improves test performance. arXiv preprint arXiv:2510.25108, 2025.
  37. [37] Medvedev, M., Attias, I., Cornacchia, E., Misiakiewicz, T., Vardi, G., and Srebro, N. Positive distribution shift as a framework for understanding tractable learning. arXiv preprint arXiv:2602.08907, 2026a.
  38. [38] Medvedev, M., Lyu, K., Li, Z., and Srebro, N. Shift is good: Mismatched data mixing improves test performance. In The 29th International Conference on Artificial Intelligence and Statistics, 2026b.
  39. [39] Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023a.
  40. [40] Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023b.
  41. [41] Michaud, E., Liu, Z., Girit, U., and Tegmark, M. The quantization model of neural scaling. Advances in Neural Information Processing Systems, 36:28699–28722, 2023.
  42. [42] Mousavi-Hosseini, A., Wu, D., Suzuki, T., and Erdogdu, M. A. Gradient-based feature learning under structured data. Advances in Neural Information Processing Systems, 36:71449–71485, 2023.
  43. [43] Ortiz-Jimenez, G., Favero, A., and Frossard, P. Task arithmetic in the tangent space: Improved editing of pre-trained models. Advances in Neural Information Processing Systems, 36:66727–66754, 2023.
  44. [44] Peng, B., Narayanan, S., and Papadimitriou, C. On limitations of the transformer architecture. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=KidynPuLNW
  45. [45] Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N., and Lewis, M. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711, Singapore, December 2023. Association for Computational Linguistics.
  46. [46] Ren, Y., Wang, Z., and Lee, J. D. Learning and transferring sparse contextual bigrams with linear transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  47. [47] Rudelson, M. and Vershynin, R. Hanson-Wright inequality and sub-Gaussian concentration. 2013.
  48. [48] Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers. In Advances in Neural Information Processing Systems 36, 2023.
  49. [49] Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., and Morcos, A. Beyond neural scaling laws: beating power law scaling via data pruning. In Advances in Neural Information Processing Systems, volume 35, pp. 19523–19536. Curran Associates, Inc., 2022.
  50. [50] Szörényi, B. Characterizing statistical query learning: simplified notions and proofs. In International Conference on Algorithmic Learning Theory, pp. 186–200. Springer, 2009.
  51. [51] Wang, B., Yue, X., Su, Y., and Sun, H. Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization. arXiv preprint arXiv:2405.15071, 2024.
  52. [52] Wang, T. and Lu, W. Learning multi-step reasoning by solving arithmetic tasks, 2023. URL https://arxiv.org/abs/2306.01707
  53. [53] Wang, Z., Nichani, E., Bietti, A., Damian, A., Hsu, D., Lee, J. D., and Wu, D. Learning compositional functions with transformers from easy-to-hard data. arXiv preprint arXiv:2505.23683, 2025.
  54. [54] Wen, K., Zhang, H., Lin, H., and Zhang, J. From sparse dependence to sparse attention: unveiling how chain-of-thought enhances transformer sample efficiency. arXiv preprint arXiv:2410.05459, 2024.
  55. [55] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.
  56. [56] Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837, 2024.
  57. [57] Yao, Y., Du, Y., Zhu, D., Hahn, M., and Koller, A. Language models can learn implicit multi-hop reasoning, but only if they have lots of training data. arXiv preprint arXiv:2505.17923, 2025a.
  58. [58] Yao, Y., Fang, J., Gu, J.-C., Zhang, N., Deng, S., Chen, H., and Peng, N. Cake: Circuit-aware editing enables generalizable knowledge learners. arXiv preprint arXiv:2503.16356, 2025b.
  59. [59] Ye, J., Yao, Z., Huang, Z., Pan, L., Liu, J., Bai, Y., Xin, A., Weichuan, L., Che, X., Hou, L., et al. How does transformer learn implicit reasoning? arXiv preprint arXiv:2505.23653, 2025.
  60. [60] Ye, T., Xu, Z., Li, Y., and Allen-Zhu, Z. Physics of language models: Part 2.1, grade-school math and the hidden reasoning process. arXiv preprint arXiv:2407.20311, 2024.
  61. [61] Yu, D., Kaur, S., Gupta, A., Brown-Cohen, J., Goyal, A., and Arora, S. Skill-Mix: a flexible and expandable family of evaluations for AI models. In The Twelfth International Conference on Learning Representations, 2023.
  62. [62] Zevallos, R., Farrús, M., and Bel, N. Frequency balanced datasets lead to better language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7859–7872, 2023.
  63. [63] Zhang, M., Fang, B., Liu, Q., Ren, P., Wu, S., Chen, Z., and Wang, L. Enhancing multi-hop reasoning through knowledge erasure in large language model editing. arXiv preprint arXiv:2408.12456, 2024.
  64. [64] Zhao, H., Kaur, S., Yu, D., Goyal, A., and Arora, S. Can models learn skill composition from examples? Advances in Neural Information Processing Systems, 37:102393–102427, 2024.
  65. [65] Zhong, Z., Wu, Z., Manning, C. D., Potts, C., and Chen, D. MQuAKE: Assessing knowledge editing in language models via multi-hop questions. arXiv preprint arXiv:2305.14795, 2023.
  66. [66] Zhou, Y., Liu, H., Chen, Z., Tian, Y., and Chen, B. GSM-Infinite: How do your LLMs behave over infinitely increasing context length and reasoning complexity? arXiv preprint arXiv:2502.05252, 2025.
  67. [67] Zhu, R.-J., Wang, Z., Hua, K., Zhang, T., Li, Z., Que, H., Wei, B., Wen, Z., Yin, F., Xing, H., et al. Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741, 2025.
  68. [68] Zipf, G. K. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books, 2016.