
arxiv: 2605.21724 · v1 · pith:J4ZEHADBnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

Pith reviewed 2026-05-22 09:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords hyper-connections · doubly stochastic matrices · Birkhoff polytope · transportation polytopes · residual networks · language model pre-training · manifold constraints

The pith

Transportation polytope parameterizations produce exactly doubly stochastic mixing matrices for hyper-connections with only (n-1)^2 free parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hyper-connections let residual networks mix multiple streams with learned weights, but free mixing often destabilizes training. Earlier fixes either approximate double stochasticity through iterative Sinkhorn steps or enforce it exactly via permutations at factorial cost or via restricted Kronecker structures. The paper replaces those with Transportation Birkhoff Polytope and recursive variants that directly output exact doubly stochastic matrices. The construction uses exactly the dimension of the Birkhoff polytope and skips both normalization loops and combinatorial enumeration. Experiments on language-model pre-training report competitive accuracy together with gains in training stability and scaling behavior.
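
To make the "approximate" in the Sinkhorn route concrete, here is a minimal sketch of alternating row and column normalization in plain Python. It is not the mHC implementation; the function names, the 4x4 size, the Gaussian logits, and the iteration counts are all illustrative assumptions. It only shows that a finite number of steps leaves a nonzero marginal error that shrinks as iterations are added.

```python
import math
import random


def sinkhorn(logits, iterations):
    """Alternately normalize rows and columns of exp(logits)."""
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: make every column sum to 1 (this perturbs the rows again).
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def marginal_error(m):
    """Total absolute deviation of all row and column sums from 1."""
    n = len(m)
    row_err = sum(abs(sum(row) - 1.0) for row in m)
    col_err = sum(abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n))
    return row_err + col_err


# Hypothetical unconstrained parameters for a 4x4 mixing matrix.
random.seed(0)
n = 4
logits = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

for k in (1, 5, 20):
    print(k, marginal_error(sinkhorn(logits, k)))
```

The error reported for each iteration count is the quantity an exact parameterization would hold at zero (up to floating point) on every forward pass.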

Core claim

TBP and RTBP parameterizations construct exactly doubly stochastic mixing matrices with (n-1)^2 degrees of freedom. The approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope.

What carries the argument

Transportation Birkhoff Polytope (TBP) parameterization, which maps unconstrained parameters onto the full set of doubly stochastic matrices via transportation polytopes.

If this is right

  • Mixing matrices satisfy exact double stochasticity at every forward pass without Sinkhorn iterations.
  • The number of trainable parameters per mixing matrix matches the intrinsic dimension of the Birkhoff polytope.
  • Complexity remains polynomial in n instead of factorial.
  • Language-model pre-training reaches competitive performance while showing improved stability and scalability.
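
The parameter count and the factorial comparison in the bullets above follow from a standard dimension argument, sketched here. The symbol $\mathcal{B}_n$ for the Birkhoff polytope is notation assumed for this sketch and may differ from the paper's.

```latex
% Dimension count for the Birkhoff polytope B_n (standard convex geometry).
\begin{align*}
\mathcal{B}_n &= \Bigl\{ X \in \mathbb{R}^{n \times n} : X_{ij} \ge 0,\
  \textstyle\sum_{j} X_{ij} = 1 \ \forall i,\ \sum_{i} X_{ij} = 1 \ \forall j \Bigr\}, \\
\dim \mathcal{B}_n &= \underbrace{n^2}_{\text{entries}}
  - \underbrace{(2n - 1)}_{\text{independent equalities}} = (n-1)^2 .
\end{align*}
% One of the 2n equalities is redundant: the row sums and the column sums
% both add up to the total mass n.
% The vertices of B_n are the n! permutation matrices (Birkhoff-von Neumann):
%   n = 4:  (n-1)^2 = 9,   n! = 24
%   n = 8:  (n-1)^2 = 49,  n! = 40320
```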

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parameterization could be substituted into any architecture that already imposes manifold constraints on weight matrices.
  • If the parameterization's image is the whole Birkhoff polytope, gradient flow on the mixing weights should avoid the projection steps that sometimes slow Sinkhorn-based variants.
  • Direct comparison of the learned mixing matrices against those produced by KromHC would quantify how much additional expressivity is actually used.

Load-bearing premise

The chosen parameterization is assumed to reach every doubly stochastic matrix rather than only a lower-dimensional subset of them.

What would settle it

Exhibiting, for any tested matrix size n, at least one doubly stochastic matrix that cannot be exactly recovered from the TBP map.

Figures

Figures reproduced from arXiv: 2605.21724 by Anton Lyubinin.

Figure 1: Gradient norm dynamics across four experiments. Each panel reports the gradient norm …
Original abstract

Hyper-Connections (HC) improve residual networks by introducing learnable mixing across multiple residual streams, but unconstrained mixing leads to training instability. Manifold-Constrained Hyper-Connections (mHC) address this by enforcing approximate double stochasticity via Sinkhorn normalization, while mHC-lite ensures exact constraints through convex combinations of permutation matrices at the cost of factorial complexity. KromHC reduces this cost using Kronecker-product parameterizations, but restricts the mixing matrices to a structured submanifold of the Birkhoff polytope. We propose Transportation Birkhoff Polytope (TBP) parameterizations and their Recursive variants (RTBP), which construct exactly doubly stochastic mixing matrices with $(n-1)^2$ degrees of freedom. Our approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope. Empirical results on language model pre-training' demonstrate competitive performance with improved stability and scalability.
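
The two exact baselines named in the abstract can be illustrated with a small sketch in plain Python. This is an illustration of the underlying matrix facts, not code from mHC-lite or KromHC: the helper names, the weights, and the choice of two 2x2 Kronecker factors are assumptions, and the factorization KromHC actually uses is not given here.

```python
import itertools
import math


def convex_combination_of_permutations(weights, n):
    """Mix all n! permutation matrices with nonnegative weights summing to 1."""
    perms = list(itertools.permutations(range(n)))
    assert len(weights) == len(perms) == math.factorial(n)
    m = [[0.0] * n for _ in range(n)]
    for w, perm in zip(weights, perms):
        for row, col in enumerate(perm):
            m[row][col] += w  # a permutation matrix has a single 1 per row
    return m


def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    p, q = len(a), len(b)
    return [[a[i // q][j // q] * b[i % q][j % q] for j in range(p * q)]
            for i in range(p * q)]


def marginals(m):
    """Return (row sums, column sums) of a square matrix."""
    n = len(m)
    return [sum(row) for row in m], [sum(m[i][j] for i in range(n)) for j in range(n)]


# Permutation route for n = 3: 3! = 6 weights for a (3-1)^2 = 4 dimensional polytope.
weights = [0.4, 0.1, 0.2, 0.05, 0.15, 0.1]
b3 = convex_combination_of_permutations(weights, 3)

# Kronecker route for n = 4: two 2x2 doubly stochastic factors, one free
# parameter each, against the (4-1)^2 = 9 dimensions of the 4x4 polytope.
s, t = 0.3, 0.8
b4 = kron([[s, 1 - s], [1 - s, s]], [[t, 1 - t], [1 - t, t]])

print(marginals(b3))
print(marginals(b4))
```

Both constructions give row and column sums of 1, but the first needs a weight per permutation and the second reaches only a two-parameter family inside a nine-dimensional polytope, which is the trade-off the abstract describes.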

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Transportation Birkhoff Polytope (TBP) parameterizations and their recursive variants (RTBP) for manifold-constrained hyper-connections (mHC). It claims these construct exactly doubly stochastic mixing matrices with exactly (n-1)^2 degrees of freedom, achieve full expressivity over the Birkhoff polytope, avoid iterative normalization and combinatorial explosion, and yield competitive performance with improved stability in language-model pre-training.

Significance. If the surjectivity claim holds, the work would supply a practical, exact parameterization of the full Birkhoff polytope that matches its known dimension while remaining differentiable and free of iterative projections. This would directly address the expressivity–scalability trade-off left open by Sinkhorn-based mHC and permutation-based mHC-lite, and could be adopted in any architecture that requires learnable doubly stochastic mixing.

major comments (2)
  1. [§3 and §4] §3 (TBP construction) and §4 (RTBP): the central claim that the parameterization is surjective onto the entire Birkhoff polytope is load-bearing for the title and abstract. The parameter count (n-1)^2 matches the dimension, yet surjectivity is not automatic for a transportation-polytope encoding; an explicit inverse map, a density argument, or a constructive proof that every interior and boundary point is attainable must be supplied. Without it the “full expressivity” assertion remains an unverified assertion rather than a theorem.
  2. [Experimental section] Experimental section (language-model runs): the reported n values and the concrete mixing matrices used in the LM experiments should be checked against the claimed surjectivity. If the recursive Kronecker-style construction in RTBP introduces additional linear dependencies, the effective image may be a proper submanifold for the n appearing in the tables; an ablation that samples random doubly stochastic targets and measures reconstruction error under the learned parameterization is required to substantiate the claim.
minor comments (2)
  1. [Abstract] Abstract: stray apostrophe in “pre-training'” should be removed.
  2. [§3] Notation: define the precise mapping from free variables to the transportation polytope entries (e.g., the role of the marginal vectors) before the recursive construction is introduced; the current presentation leaves the base TBP map implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points regarding the rigor of our surjectivity claim and the need for additional empirical validation. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (TBP construction) and §4 (RTBP): the central claim that the parameterization is surjective onto the entire Birkhoff polytope is load-bearing for the title and abstract. The parameter count (n-1)^2 matches the dimension, yet surjectivity is not automatic for a transportation-polytope encoding; an explicit inverse map, a density argument, or a constructive proof that every interior and boundary point is attainable must be supplied. Without it the “full expressivity” assertion remains an unverified assertion rather than a theorem.

    Authors: We agree that a formal proof of surjectivity is necessary to substantiate the central claim. In the revised manuscript we will add an explicit constructive proof in a new subsection of §3. For any target doubly stochastic matrix B, we exhibit a closed-form inverse that recovers the (n-1)×(n-1) transportation parameters whose associated transportation polytope projects exactly onto B; the construction handles both interior points and all permutation-matrix boundary points. For the recursive RTBP construction in §4 we will prove by induction that surjectivity is preserved and that no additional linear dependencies arise, so the image remains the full Birkhoff polytope with exactly (n-1)^2 degrees of freedom. revision: yes

  2. Referee: [Experimental section] Experimental section (language-model runs): the reported n values and the concrete mixing matrices used in the LM experiments should be checked against the claimed surjectivity. If the recursive Kronecker-style construction in RTBP introduces additional linear dependencies, the effective image may be a proper submanifold for the n appearing in the tables; an ablation that samples random doubly stochastic targets and measures reconstruction error under the learned parameterization is required to substantiate the claim.

    Authors: We will add the requested ablation study to the experimental section. For each n appearing in the language-model tables we will sample several thousand random doubly stochastic matrices (including boundary permutations) and report the reconstruction error obtained by applying the inverse TBP/RTBP map. We will also explicitly state the n values used and confirm that the mixing matrices realized during pre-training lie in the attainable set. This empirical check will be presented alongside the existing results. revision: yes
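
The reconstruction ablation discussed in this exchange could be organized roughly as follows. This is a sketch under loud assumptions: the TBP/RTBP maps are not specified on this page, so `encode` and `decode` below are a stand-in (affine completion of the top-left block), not the paper's construction, and the sizes 2, 4, 8 and the sample counts are arbitrary. The stand-in recovers every doubly stochastic matrix from (n-1)^2 numbers, but unlike the claimed TBP map it does not keep entries nonnegative for arbitrary unconstrained inputs.

```python
import random


def random_doubly_stochastic(n, terms, rng):
    """Random convex combination of `terms` random permutation matrices."""
    raw = [1.0 - rng.random() for _ in range(terms)]  # values in (0, 1]
    total = sum(raw)
    m = [[0.0] * n for _ in range(n)]
    for w in raw:
        perm = list(range(n))
        rng.shuffle(perm)
        for row, col in enumerate(perm):
            m[row][col] += w / total
    return m


def encode(b):
    """Stand-in inverse map (assumption): keep the top-left (n-1) x (n-1) block."""
    return [row[:-1] for row in b[:-1]]


def decode(x):
    """Stand-in forward map (assumption): complete the block so all marginals are 1."""
    k = len(x)  # k = n - 1
    rows = [row + [1.0 - sum(row)] for row in x]
    last = [1.0 - sum(x[i][j] for i in range(k)) for j in range(k)]
    corner = sum(sum(row) for row in x) - (k - 1)
    return rows + [last + [corner]]


def reconstruction_error(b):
    """Largest entrywise gap between a target and its decode(encode(.)) round trip."""
    rec = decode(encode(b))
    return max(abs(u - v) for r1, r2 in zip(b, rec) for u, v in zip(r1, r2))


rng = random.Random(0)
worst = 0.0
for n in (2, 4, 8):
    for _ in range(1000):
        # terms=1 yields a pure permutation matrix, i.e. a vertex of the polytope.
        target = random_doubly_stochastic(n, rng.choice([1, 3, 10]), rng)
        worst = max(worst, reconstruction_error(target))
print(worst)
```

Swapping the stand-in pair for the actual TBP or RTBP maps would turn this harness into the check the referee asks for: a worst-case error well above floating-point noise for some target would indicate a matrix the parameterization cannot reach.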

Circularity Check

0 steps flagged

No circularity: TBP/RTBP parameterization rests on transportation polytope definitions

Full rationale

The paper's central construction defines TBP and RTBP directly from the geometry of transportation polytopes to produce exactly doubly stochastic matrices with (n-1)^2 free parameters. This matches the known dimension of the Birkhoff polytope by standard convex geometry, without defining the target expressivity in terms of the parameterization itself or invoking fitted quantities. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the derivation chain. The avoidance of Sinkhorn iteration and combinatorial enumeration is achieved by explicit construction rather than by re-expressing the desired property. The claim of full expressivity is therefore an independent mathematical assertion grounded in polytope theory, not a reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven assertion that transportation polytopes yield a parameterization of the entire Birkhoff polytope; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Transportation polytopes can be used to construct every doubly stochastic matrix using exactly (n-1)^2 independent parameters.
    This is the load-bearing mathematical premise that enables the claim of full expressivity without iterative normalization.


