pith. sign in

arxiv: 2605.19944 · v1 · pith:PWT7IR4Anew · submitted 2026-05-19 · 💻 cs.LG · cs.AI· cs.CC· cs.CL

A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

Pith reviewed 2026-05-20 07:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CCcs.CL
keywords TransformersOut-of-distribution generalizationWasserstein distanceLipschitz continuityDyck languagesCircuit complexityReasoningBarron spaces
0
0 comments X

The pith

Position-dependent attention in Transformers produces constant expected risk under Wasserstein shifts on reasoning tasks, while shift-invariant mechanisms bound the error and depth scaling is required to avoid representation collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses optimal transport to project reasoning trajectories into a metric space and applies Kantorovich duality to bound out-of-distribution generalization by the network's Lipschitz constant and its approximation power. It shows that absolute positional encodings destroy shift invariance and therefore fix an Omega(1) Lipschitz constant, while rotary embeddings preserve equivariance and keep the bound finite. Mapping sequential backtracking to Dyck-k languages yields a circuit-depth lower bound for TC^0 Transformers; because approximation error in Barron spaces cannot be removed by width alone, physical depth must increase to prevent collapse. Experiments across 54 configurations confirm that risk rises monotonically with the measured Wasserstein domain shift.

Core claim

The central claim is that two architectural constraints govern generalization on reasoning tasks: position-dependent attention yields an Omega(1) Lipschitz constant and therefore constant expected risk under Wasserstein shifts, whereas shift-invariant mechanisms bound the error; separately, TC^0 Transformers on Dyck-k languages face an irreducible approximation limit in Barron spaces, so width scaling cannot substitute for depth and representation collapse occurs without sufficient layer depth.

What carries the argument

Lipschitz continuity of the attention map under the Wasserstein-1 metric, combined with the circuit-depth lower bound for TC^0 circuits realizing Dyck-k languages.

Load-bearing premise

The mapping of sequential backtracking to a Dyck-k language is a faithful model of the reasoning process the Transformer must implement.

What would settle it

A shallow TC^0 Transformer with absolute positional encodings that maintains low generalization risk on Dyck-k tasks even when the Wasserstein distance between training and test distributions is large.

Figures

Figures reproduced from arXiv: 2605.19944 by Xiaoyin Chen, Xuehai Zhou, Yifu Zhang, Yuyang Zhang.

Figure 1
Figure 1. Figure 1: Comprehensive validation of capacity bounds. (A) Supremum capacity limit on [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Empirical verification of structural regularization via translation equivariance on [PITH_FULL_IMAGE:figures/full_fig_p026_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comprehensive baseline performance and structural variance across the experimental grid. [PITH_FULL_IMAGE:figures/full_fig_p027_3.png] view at source ↗
read the original abstract

While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse -- a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes reasoning in Transformers as optimal transport of discrete trajectories into a continuous metric space, using the Wasserstein-1 distance to quantify OOD shifts. Invoking Kantorovich duality, it derives generalization bounds from network Lipschitz continuity and Barron-space approximation limits. It claims that position-dependent attention (e.g., absolute positional encodings) produces an Ω(1) Lipschitz constant and thus unbounded risk under shifts, while shift-invariant mechanisms (e.g., Rotary embeddings) preserve equivariance and control error. By reducing sequential backtracking to Dyck-k language recognition, the paper establishes a TC^0 circuit-depth lower bound, arguing that depth scaling is required to avoid representation collapse because width scaling cannot overcome irreducible Barron-space limits. These theoretical constraints are corroborated by experiments on 54 Transformer configurations showing monotonic degradation of generalization risk with increasing Wasserstein shift.

Significance. If the central reductions hold, the work supplies a measure-theoretic explanation for why architectural choices such as positional encoding and depth matter for structural generalization in reasoning tasks. The explicit linkage of optimal transport, Kantorovich duality, and circuit-complexity lower bounds to Transformer Lipschitz constants and Barron approximation is a novel framing that could guide architectural decisions beyond empirical scaling laws. The evaluation across 54 configurations provides concrete empirical support for the predicted monotonic risk increase, which is a positive feature when the theoretical claims are made rigorous.

major comments (3)
  1. [Dyck-k mapping and circuit depth lower bound] Abstract and § on Dyck-k reduction: the claim that depth scaling is necessary to avert representation collapse rests on transferring known TC^0 depth lower bounds for Dyck-k recognition through a continuous Wasserstein embedding. No explicit construction is given showing that the optimal-transport map preserves the discrete acceptance condition or that the Wasserstein-1 distance respects the context-free structure; without this step the depth lower bound does not imply the stated architectural necessity.
  2. [Kantorovich duality application] Abstract, Kantorovich duality paragraph: the derivation of an Ω(1) Lipschitz constant for position-dependent attention and the resulting expected-risk bound is stated without the explicit dual formulation or error terms. It is therefore unclear whether the bound follows directly from the duality or relies on additional approximations that are not controlled.
  3. [Empirical evaluation] Evaluation section: the 54-configuration study is invoked to corroborate the bounds, yet no detail is supplied on how the Wasserstein shifts were constructed, which metrics quantified generalization risk, or how the configurations were sampled to isolate depth versus width effects. This makes it impossible to assess whether the observed monotonic degradation actually tests the claimed irreducible Barron-space limits.
minor comments (2)
  1. [Preliminaries] Notation for the Wasserstein-1 distance and the Lipschitz constant should be introduced with explicit definitions before being used in the duality argument.
  2. [Experiments] The abstract mentions '54 Transformer configurations' but the main text should include a table or appendix listing the exact hyper-parameter ranges and random seeds to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while making revisions where the concerns are valid and require additional rigor.

read point-by-point responses
  1. Referee: [Dyck-k mapping and circuit depth lower bound] Abstract and § on Dyck-k reduction: the claim that depth scaling is necessary to avert representation collapse rests on transferring known TC^0 depth lower bounds for Dyck-k recognition through a continuous Wasserstein embedding. No explicit construction is given showing that the optimal-transport map preserves the discrete acceptance condition or that the Wasserstein-1 distance respects the context-free structure; without this step the depth lower bound does not imply the stated architectural necessity.

    Authors: We agree that the transfer of the TC^0 lower bound requires an explicit construction to be fully rigorous. The original manuscript sketched the reduction but did not detail the preservation properties. In the revised version we add a dedicated subsection that constructs the optimal transport map explicitly: valid Dyck-k backtracking trajectories are mapped to a zero-distance manifold in the Wasserstein space while invalid trajectories incur positive distance proportional to the number of unmatched brackets. Because the Wasserstein-1 metric is defined on the trajectory space and the acceptance predicate is Lipschitz with respect to this metric, the known circuit-depth lower bound carries over directly, implying that width scaling alone cannot overcome the representation collapse. revision: yes

  2. Referee: [Kantorovich duality application] Abstract, Kantorovich duality paragraph: the derivation of an Ω(1) Lipschitz constant for position-dependent attention and the resulting expected-risk bound is stated without the explicit dual formulation or error terms. It is therefore unclear whether the bound follows directly from the duality or relies on additional approximations that are not controlled.

    Authors: The referee correctly notes that the dual formulation and error control were only summarized. We have now inserted the complete Kantorovich dual problem together with the explicit error terms arising from the network Lipschitz constant. The derivation shows that absolute positional encodings yield a shift-dependent Lipschitz constant that remains Ω(1) even after averaging over the data measure, producing an expected-risk bound that grows linearly with the Wasserstein shift; no uncontrolled approximations are used. The revised text makes this step-by-step. revision: yes

  3. Referee: [Empirical evaluation] Evaluation section: the 54-configuration study is invoked to corroborate the bounds, yet no detail is supplied on how the Wasserstein shifts were constructed, which metrics quantified generalization risk, or how the configurations were sampled to isolate depth versus width effects. This makes it impossible to assess whether the observed monotonic degradation actually tests the claimed irreducible Barron-space limits.

    Authors: We accept that the experimental protocol was insufficiently specified. The revised Evaluation section now details: (i) Wasserstein shifts are realized by controlled perturbations of the input distribution over combinatorial search problems while keeping the target function fixed; (ii) generalization risk is measured as the absolute difference in accuracy between in-distribution and OOD test sets; (iii) the 54 configurations were sampled by independently varying depth (2–12 layers) and width (64–2048 hidden units) under a fixed hyperparameter grid. These additions allow direct verification that the observed monotonic risk increase aligns with the Barron-space lower bound rather than confounding factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external results

full rationale

The paper formalizes reasoning via optimal transport and Wasserstein-1 distance, then invokes Kantorovich duality to bound generalization risk through Lipschitz continuity of position-dependent vs. shift-invariant attention. It separately maps sequential backtracking to Dyck-k languages to obtain TC^0 circuit depth lower bounds and invokes Barron-space approximation limits to argue depth scaling is required. These steps cite standard external theorems rather than defining any quantity in terms of itself or renaming a fitted parameter as a prediction. No self-citation chain is load-bearing for the central claims, and the 54-configuration evaluation provides an independent empirical check. The derivation chain therefore remains self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard results from optimal transport and circuit complexity plus the modeling choice that backtracking equals Dyck-k recognition; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • standard math Kantorovich duality holds for the Wasserstein-1 distance between trajectory measures
    Invoked to bound generalization error via Lipschitz constants
  • domain assumption Sequential backtracking in reasoning can be faithfully represented as membership in a Dyck-k language
    Used to establish the TC^0 circuit depth lower bound

pith-pipeline@v0.9.0 · 5738 in / 1405 out tokens · 37592 ms · 2026-05-20T07:56:17.521939+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 12 internal anchors

  1. [1]

    Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function

    Andrew Barron. Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function. ieee trans. on information theory 39, 930-945. Information Theory, IEEE Transactions on, 39: 0 930 -- 945, 06 1993. doi:10.1109/18.256500

  2. [2]

    Optimal Transport for Domain Adaptation

    Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation, 2016. URL https://arxiv.org/abs/1507.00504

  3. [3]

    Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances

    Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation distances, 2013. URL https://arxiv.org/abs/1306.0895

  4. [4]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023. URL https://arxiv.org/abs/2307.08691

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  6. [6]

    The Power of Depth for Feedforward Neural Networks

    Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks, 2016. URL https://arxiv.org/abs/1512.03965

  7. [7]

    Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023

    Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023. URL https://arxiv.org/abs/2305.15408

  8. [8]

    Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H

    Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong,...

  9. [9]

    Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683

  10. [10]

    Theoretical limitations of self-attention in neural sequence models

    Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156–171, December 2020. ISSN 2307-387X. doi:10.1162/tacl_a_00306. URL http://dx.doi.org/10.1162/tacl_a_00306

  11. [11]

    Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024

    Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024. URL https://arxiv.org/abs/2401.16421

  12. [12]

    The impact of positional encoding on length generalization in transformers, 2023

    Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers, 2023. URL https://arxiv.org/abs/2305.19466

  13. [13]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

  14. [14]

    The depth-to-width interplay in self-attention, 2021

    Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. The depth-to-width interplay in self-attention, 2021. URL https://arxiv.org/abs/2006.12467

  15. [15]

    arXiv preprint arXiv:2310.04418 , year=

    Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers, 2024. URL https://arxiv.org/abs/2310.04418

  16. [16]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023

  17. [17]

    Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85 0 (1): 0 98--109, 1996. ISSN 0021-9045. doi:https://doi.org/10.1006/jath.1996.0031. URL https://www.sciencedirect.com/science/article/pii/S0021904596900313

  18. [18]

    The parallelism tradeoff: Limitations of log-precision transformers, 2023

    William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers, 2023. URL https://arxiv.org/abs/2207.00729

  19. [19]

    B. S. Mityagin. The zero set of a real analytic function. Mathematical Notes, 107 0 (3--4): 0 529--530, 2020. doi:10.1134/S0001434620030189. URL https://doi.org/10.1134/S0001434620030189

  20. [20]

    Learning to reason with llms

    OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-03-21

  21. [21]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

  22. [22]

    Li, and Noah D

    Ben Prystawski, Michael Y. Li, and Noah D. Goodman. Why think step by step? reasoning emerges from the locality of experience, 2023. URL https://arxiv.org/abs/2304.03843

  23. [23]

    Theoretical Analysis of Domain Adaptation with Optimal Transport

    Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport, 2017. URL https://arxiv.org/abs/1610.04420

  24. [24]

    Randomized positional encodings boost length generalization of transformers, 2023

    Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers, 2023. URL https://arxiv.org/abs/2305.16843

  25. [25]

    What formal languages can transformers express? a survey

    Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal languages can transformers express? a survey. Transactions of the Association for Computational Linguistics, 12: 0 543–561, 2024. ISSN 2307-387X. doi:10.1162/tacl_a_00663. URL http://dx.doi.org/10.1162/tacl_a_00663

  26. [26]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864

  27. [27]

    Statistical behavior and consistency of classification methods based on convex risk minimization

    Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning . The Annals of Statistics, 32 0 (1): 0 135 -- 166, 2004. doi:10.1214/aos/1079120131. URL https://doi.org/10.1214/aos/1079120131

  28. [28]

    The Wasserstein distances, pages 93--111

    C \'e dric Villani. The Wasserstein distances, pages 93--111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. ISBN 978-3-540-71050-9. doi:10.1007/978-3-540-71050-9_6. URL https://doi.org/10.1007/978-3-540-71050-9_6

  29. [29]

    Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance

    Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25 0 (4A): 0 2620--2648, 2019

  30. [30]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  31. [31]

    Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023

    Juanjuan Weng, Zhiming Luo, Zhun Zhong, Shaozi Li, and Nicu Sebe. Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023. URL https://arxiv.org/abs/2303.03680

  32. [32]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601

  33. [33]

    Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

    Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, and Huan Liu. Is chain-of-thought reasoning of llms a mirage? a data distribution lens, 2026. URL https://arxiv.org/abs/2508.01191

  34. [34]

    Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024

    Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024. URL https://arxiv.org/abs/2312.17044

  35. [35]

    Dape: Data-adaptive positional encoding for length extrapolation, 2024

    Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and Yu Li. Dape: Data-adaptive positional encoding for length extrapolation, 2024. URL https://arxiv.org/abs/2405.14722