TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

Anton Lyubinin

arxiv: 2605.21724 · v1 · pith:J4ZEHADBnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 09:56 UTC · model grok-4.3

keywords hyper-connections · doubly stochastic matrices · Birkhoff polytope · transportation polytopes · residual networks · language model pre-training · manifold constraints

The pith

Transportation polytope parameterizations produce exactly doubly stochastic mixing matrices for hyper-connections with only (n-1)^2 free parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hyper-connections let residual networks mix multiple streams with learned weights, but free mixing often destabilizes training. Earlier fixes either approximate double stochasticity through iterative Sinkhorn steps or enforce it exactly via permutations at factorial cost or via restricted Kronecker structures. The paper replaces those with Transportation Birkhoff Polytope and recursive variants that directly output exact doubly stochastic matrices. The construction uses exactly the dimension of the Birkhoff polytope and skips both normalization loops and combinatorial enumeration. Experiments on language-model pre-training report competitive accuracy together with gains in training stability and scaling behavior.
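
To make the "approximate versus exact" distinction concrete, here is a minimal sketch of Sinkhorn-style alternating normalization in plain Python. It is an illustration of the general technique only: the function names and the example weights are assumptions made for this sketch, not the mHC implementation.

```python
def sinkhorn(matrix, iterations):
    """Alternately normalize the rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        # Row step: rescale every row so it sums to 1.
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        # Column step: rescale every column so it sums to 1.
        # This perturbs the row sums again, hence the need to iterate.
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m


def max_row_error(m):
    """Largest deviation of any row sum from 1."""
    return max(abs(sum(row) - 1.0) for row in m)


# Hypothetical positive weights, chosen only for illustration.
weights = [[1.0, 2.0, 3.0], [4.0, 1.0, 1.0], [2.0, 5.0, 1.0]]
error_after_one = max_row_error(sinkhorn(weights, 1))
error_after_many = max_row_error(sinkhorn(weights, 50))
print(error_after_one, error_after_many)
```

With a truncated iteration budget the row sums are only approximately 1, which is the sense in which a Sinkhorn-based constraint is approximate; an exact parameterization has no such residual to drive down.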

Core claim

TBP and RTBP parameterizations construct exactly doubly stochastic mixing matrices with (n-1)^2 degrees of freedom. The approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope.

What carries the argument

Transportation Birkhoff Polytope (TBP) parameterization, which maps unconstrained parameters onto the full set of doubly stochastic matrices via transportation polytopes.

If this is right

Mixing matrices satisfy exact double stochasticity at every forward pass without Sinkhorn iterations.
The number of trainable parameters per mixing matrix matches the intrinsic dimension of the Birkhoff polytope.
Complexity remains polynomial in n instead of factorial.
Language-model pre-training reaches competitive performance while showing improved stability and scalability.
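
The parameter count in the second point is a standard dimension count, spelled out here for reference. It is textbook convex geometry and says nothing yet about how the paper's map reaches the polytope.

$$
\mathcal{B}_n = \Bigl\{ X \in \mathbb{R}^{n \times n} : X_{ij} \ge 0,\ \sum_{j=1}^{n} X_{ij} = 1 \ \forall i,\ \sum_{i=1}^{n} X_{ij} = 1 \ \forall j \Bigr\},
\qquad
\dim \mathcal{B}_n = n^2 - (2n - 1) = (n-1)^2 .
$$

The $2n$ equality constraints contain exactly one redundancy, because the row sums and the column sums both add up to $n$; that leaves $2n - 1$ independent constraints on $n^2$ entries. For $n = 4$ this gives $9$ degrees of freedom.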

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parameterization could be substituted into any architecture that already imposes manifold constraints on weight matrices.
If the parameterization really covers the whole polytope, gradient flow on the mixing weights should avoid the projection steps that sometimes slow Sinkhorn-based variants.
Direct comparison of the learned mixing matrices against those produced by KromHC would quantify how much additional expressivity is actually used.
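
As a rough picture of what a "structured submanifold" can look like, the sketch below builds 4×4 mixing matrices as Kronecker products of two 2×2 doubly stochastic factors. This toy family is an assumption made for illustration and is not KromHC's actual construction; the helper names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a)) for l in range(len(b))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def ds2(t):
    """The general 2x2 doubly stochastic matrix, one free parameter t in [0, 1]."""
    return [[t, 1.0 - t], [1.0 - t, t]]


def is_doubly_stochastic(m, tol=1e-12):
    n = len(m)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in m)
    cols_ok = all(abs(sum(m[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    nonneg = all(x >= -tol for row in m for x in row)
    return rows_ok and cols_ok and nonneg


def is_symmetric(m, tol=1e-12):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


# Two free parameters in total, versus (4 - 1)^2 = 9 for the full polytope.
mixed = kron(ds2(0.7), ds2(0.2))

# A 4-cycle permutation matrix: doubly stochastic but not symmetric.
cycle = [[0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0, 0.0]]

print(is_doubly_stochastic(mixed), is_symmetric(mixed))
print(is_doubly_stochastic(cycle), is_symmetric(cycle))
```

Every member of this toy family is symmetric, so the cyclic permutation is a doubly stochastic matrix it can never produce. Measuring how far trained mixing matrices sit from such a restricted family is the kind of comparison the point above has in mind.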

Load-bearing premise

The chosen parameterization is assumed to reach every doubly stochastic matrix rather than only a lower-dimensional subset of them.

What would settle it

Exhibiting, for any tested matrix size n, at least one doubly stochastic matrix that cannot be exactly recovered from the TBP map.

Figures

Figures reproduced from arXiv: 2605.21724 by Anton Lyubinin.

Original abstract

Hyper-Connections (HC) improve residual networks by introducing learnable mixing across multiple residual streams, but unconstrained mixing leads to training instability. Manifold-Constrained Hyper-Connections (mHC) address this by enforcing approximate double stochasticity via Sinkhorn normalization, while mHC-lite ensures exact constraints through convex combinations of permutation matrices at the cost of factorial complexity. KromHC reduces this cost using Kronecker-product parameterizations, but restricts the mixing matrices to a structured submanifold of the Birkhoff polytope. We propose Transportation Birkhoff Polytope (TBP) parameterizations and their Recursive variants (RTBP), which construct exactly doubly stochastic mixing matrices with $(n-1)^2$ degrees of freedom. Our approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope. Empirical results on language model pre-training' demonstrate competitive performance with improved stability and scalability.
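
The factorial cost mentioned in the abstract can be seen in a small sketch of a mixing matrix built as a convex combination of all permutation matrices. This follows the Birkhoff-von Neumann idea in spirit; the softmax weighting, the function names, and the example logits are assumptions of this sketch, not the mHC-lite code.

```python
import itertools
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_from_permutations(logits, n):
    """Convex combination of all n! permutation matrices, weighted by softmax."""
    perms = list(itertools.permutations(range(n)))
    if len(logits) != len(perms):
        raise ValueError("need one logit per permutation")
    mixing = [[0.0] * n for _ in range(n)]
    for weight, perm in zip(softmax(logits), perms):
        for i, j in enumerate(perm):
            mixing[i][j] += weight  # the permutation matrix has a 1 at (i, perm[i])
    return mixing


# Hypothetical logits for n = 3, which needs 3! = 6 of them.
matrix = mix_from_permutations([0.5, -1.0, 2.0, 0.0, 0.3, -0.7], n=3)
row_sums = [sum(row) for row in matrix]
col_sums = [sum(matrix[i][j] for i in range(3)) for j in range(3)]

# Logit count n! versus the polytope dimension (n - 1)^2.
growth = [(n, math.factorial(n), (n - 1) ** 2) for n in (2, 4, 8)]
print(row_sums, col_sums, growth)
```

The result is doubly stochastic up to floating-point rounding for any logits, but the number of logits grows as n! while the polytope itself only has (n-1)^2 dimensions, which is the gap the abstract's $(n-1)^2$ claim targets.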

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TBP/RTBP gives a clean parameterization for exactly doubly stochastic hyper-connections with the right parameter count, but the claim of full expressivity onto the Birkhoff polytope still needs an explicit inverse or small-n verification to be convincing.

The main thing to know is that the paper introduces transportation-polytope (TBP) and recursive (RTBP) parameterizations that produce exactly doubly stochastic mixing matrices for hyper-connections using exactly (n-1)^2 free parameters. This sits between the approximate Sinkhorn approach in mHC and the factorial cost of mHC-lite, while trying to remove the submanifold restriction that KromHC imposes. The recursive construction is the practical piece that could scale to the matrix sizes appearing in residual streams for language models. The pre-training experiments report competitive performance with gains in stability and scalability, which is the result that would matter if the parameterization works as stated.

The soft spot is the surjectivity question. Matching the dimension of the Birkhoff polytope is necessary but not sufficient; the map from free variables through the transportation constraints could still miss some matrices, especially after the recursive Kronecker-style layering. The abstract asserts full expressivity without showing an inverse construction or even a numerical check that recovers arbitrary target matrices for small n. If the full text contains a clear proof or such a verification, that would close the gap. Otherwise the central claim rests more on the parameter count than on demonstrated reachability.

This is for people already working on manifold-constrained mixing in residuals or on polytope parameterizations in deep networks. A reader following the mHC line would find the direct comparison useful and might want to plug the parameterization into their own code. It deserves a serious referee because the motivation is concrete, the empirical signal is positive, and the idea is specific enough that referees can check the math and the small-n cases directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes Transportation Birkhoff Polytope (TBP) parameterizations and their recursive variants (RTBP) for manifold-constrained hyper-connections (mHC). It claims these construct exactly doubly stochastic mixing matrices with exactly (n-1)^2 degrees of freedom, achieve full expressivity over the Birkhoff polytope, avoid iterative normalization and combinatorial explosion, and yield competitive performance with improved stability in language-model pre-training.

Significance. If the surjectivity claim holds, the work would supply a practical, exact parameterization of the full Birkhoff polytope that matches its known dimension while remaining differentiable and free of iterative projections. This would directly address the expressivity–scalability trade-off left open by Sinkhorn-based mHC and permutation-based mHC-lite, and could be adopted in any architecture that requires learnable doubly stochastic mixing.

major comments (2)

[§3 and §4] §3 (TBP construction) and §4 (RTBP): the central claim that the parameterization is surjective onto the entire Birkhoff polytope is load-bearing for the title and abstract. The parameter count (n-1)^2 matches the dimension, yet surjectivity is not automatic for a transportation-polytope encoding; an explicit inverse map, a density argument, or a constructive proof that every interior and boundary point is attainable must be supplied. Without it, “full expressivity” remains an unverified claim rather than a theorem.
[Experimental section] Experimental section (language-model runs): the reported n values and the concrete mixing matrices used in the LM experiments should be checked against the claimed surjectivity. If the recursive Kronecker-style construction in RTBP introduces additional linear dependencies, the effective image may be a proper submanifold for the n appearing in the tables; an ablation that samples random doubly stochastic targets and measures reconstruction error under the learned parameterization is required to substantiate the claim.
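
The point that a matching parameter count settles little can be illustrated with the textbook affine chart of the Birkhoff polytope: choose the top-left (n-1)×(n-1) block freely and let the sum constraints determine the rest. The sketch below is that generic chart under hypothetical names; it is not the paper's TBP map, whose details this page does not give.

```python
def complete(block):
    """Extend an (n-1) x (n-1) block to the unique n x n matrix whose
    rows and columns all sum to 1. Entries may come out negative."""
    k = len(block)  # k = n - 1
    rows = [row[:] + [1.0 - sum(row)] for row in block]
    last = [1.0 - sum(block[i][j] for i in range(k)) for j in range(k)]
    # Corner entry: total of the block minus (n - 2).
    corner = sum(sum(row) for row in block) - (k - 1)
    rows.append(last + [corner])
    return rows


def min_entry(m):
    return min(min(row) for row in m)


# Hypothetical 2 x 2 blocks, so n = 3 and there are (3 - 1)^2 = 4 free numbers.
good = complete([[0.5, 0.2], [0.1, 0.6]])
bad = complete([[0.9, 0.3], [0.3, 0.1]])

print(good, min_entry(good))
print(bad, min_entry(bad))
```

Both completions have unit row and column sums, but the second contains a negative entry and so lies outside the polytope. Getting (n-1)^2 numbers is the easy part; the work is in a map from unconstrained parameters that always lands inside the polytope and still reaches all of it, which is what the requested proof would have to establish.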

minor comments (2)

[Abstract] Abstract: stray apostrophe in “pre-training'” should be removed.
[§3] Notation: define the precise mapping from free variables to the transportation polytope entries (e.g., the role of the marginal vectors) before the recursive construction is introduced; the current presentation leaves the base TBP map implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points regarding the rigor of our surjectivity claim and the need for additional empirical validation. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [§3 and §4] §3 (TBP construction) and §4 (RTBP): the central claim that the parameterization is surjective onto the entire Birkhoff polytope is load-bearing for the title and abstract. The parameter count (n-1)^2 matches the dimension, yet surjectivity is not automatic for a transportation-polytope encoding; an explicit inverse map, a density argument, or a constructive proof that every interior and boundary point is attainable must be supplied. Without it, “full expressivity” remains an unverified claim rather than a theorem.

Authors: We agree that a formal proof of surjectivity is necessary to substantiate the central claim. In the revised manuscript we will add an explicit constructive proof in a new subsection of §3. For any target doubly stochastic matrix B, we exhibit a closed-form inverse that recovers the (n-1)×(n-1) transportation parameters whose associated transportation polytope projects exactly onto B; the construction handles both interior points and all permutation-matrix boundary points. For the recursive RTBP construction in §4 we will prove by induction that surjectivity is preserved and that no additional linear dependencies arise, so the image remains the full Birkhoff polytope with exactly (n-1)^2 degrees of freedom.
revision: yes
Referee: [Experimental section] Experimental section (language-model runs): the reported n values and the concrete mixing matrices used in the LM experiments should be checked against the claimed surjectivity. If the recursive Kronecker-style construction in RTBP introduces additional linear dependencies, the effective image may be a proper submanifold for the n appearing in the tables; an ablation that samples random doubly stochastic targets and measures reconstruction error under the learned parameterization is required to substantiate the claim.

Authors: We will add the requested ablation study to the experimental section. For each n appearing in the language-model tables we will sample several thousand random doubly stochastic matrices (including boundary permutations) and report the reconstruction error obtained by applying the inverse TBP/RTBP map. We will also explicitly state the n values used and confirm that the mixing matrices realized during pre-training lie in the attainable set. This empirical check will be presented alongside the existing results.
revision: yes
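
A round-trip check of the kind described here could be organized roughly as follows. Since this page does not specify the TBP/RTBP maps, the `chart_forward` and `chart_inverse` functions are stand-ins (the generic block-completion chart) used only so the harness runs; all names are hypothetical, and the sampler draws convex combinations of random permutations rather than uniform points of the polytope.

```python
import random


def random_doubly_stochastic(n, terms, rng):
    """Random convex combination of `terms` permutation matrices (not uniform)."""
    weights = [rng.random() + 1e-9 for _ in range(terms)]
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for w in weights:
        perm = list(range(n))
        rng.shuffle(perm)
        for i, j in enumerate(perm):
            matrix[i][j] += w / total
    return matrix


def chart_inverse(matrix):
    """Stand-in inverse: read off the top-left (n-1) x (n-1) block."""
    return [row[:-1] for row in matrix[:-1]]


def chart_forward(block):
    """Stand-in forward map: rebuild the last row and column from the sums."""
    k = len(block)
    rows = [row[:] + [1.0 - sum(row)] for row in block]
    last = [1.0 - sum(block[i][j] for i in range(k)) for j in range(k)]
    corner = sum(sum(row) for row in block) - (k - 1)
    rows.append(last + [corner])
    return rows


def max_roundtrip_error(forward, inverse, targets):
    """Worst entrywise gap between each target and forward(inverse(target))."""
    worst = 0.0
    for target in targets:
        rebuilt = forward(inverse(target))
        n = len(target)
        gap = max(abs(rebuilt[i][j] - target[i][j]) for i in range(n) for j in range(n))
        worst = max(worst, gap)
    return worst


rng = random.Random(0)
targets = [random_doubly_stochastic(4, 6, rng) for _ in range(1000)]
targets += [random_doubly_stochastic(4, 1, rng) for _ in range(24)]  # boundary permutations
worst_error = max_roundtrip_error(chart_forward, chart_inverse, targets)
print(worst_error)
```

Swapping the stand-ins for the paper's actual forward and inverse maps would turn this into the proposed ablation; an error near floating-point precision on every sampled target, boundary permutations included, is the outcome the surjectivity claim predicts.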

Circularity Check

0 steps flagged

No circularity: TBP/RTBP parameterization rests on transportation polytope definitions

The paper's central construction defines TBP and RTBP directly from the geometry of transportation polytopes to produce exactly doubly stochastic matrices with (n-1)^2 free parameters. This matches the known dimension of the Birkhoff polytope by standard convex geometry, without defining the target expressivity in terms of the parameterization itself or invoking fitted quantities. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the derivation chain. The avoidance of Sinkhorn iteration and combinatorial enumeration is achieved by explicit construction rather than by re-expressing the desired property. The claim of full expressivity is therefore an independent mathematical assertion grounded in polytope theory, not a reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven assertion that transportation polytopes yield a parameterization of the entire Birkhoff polytope; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Transportation polytopes can be used to construct every doubly stochastic matrix using exactly (n-1)^2 independent parameters.
This is the load-bearing mathematical premise that enables the claim of full expressivity without iterative normalization.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

[1] Mohamed Aboulsaad and Adnan Shaout. Adamn: Accelerating deep learning training via nested momentum and exact bias handling. Electronics, 15(3), 2026. ISSN 2079-9292. doi:10.3390/electronics15030670. URL https://www.mdpi.com/2079-9292/15/3/670

[2] Garrett Birkhoff. Three observations on linear algebra. Univ. Nac. Tucumán. Revista A, 5:147-151, 1946.

[3] Richard A. Brualdi. Combinatorial Matrix Classes. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2006.

[4] Torque Dandachi and Sophia Diggs-Galligan. go-mHC: Direct parameterization of manifold-constrained hyper-connections via generalized orthostochastic matrices, 2026. URL https://arxiv.org/abs/2604.02309

[5] George B. Dantzig. Linear Programming and Extensions. Princeton Landmarks in Mathematics and Physics. Princeton University Press, August 1998. First published and copyrighted 1963; Princeton Landmark in Mathematics paperback reissued Aug. 23, 1998. The foundational text that established mathematical linear programming.

[6] Sela Fried and Geoffrey Wolfer. On the -lazy version of markov chains in estimation and testing problems. 2021. URL https://arxiv.org/abs/2105.09536

[7] Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.

[9] Jonathan Hermon. Maximal Inequalities and Mixing Times. PhD thesis, University of California, Berkeley, 2016. URL https://escholarship.org/uc/item/7q665159. ProQuest ID: Hermon_berkeley_0028E_16704; Merritt ID: ark:/13030/m5906znj

[10] Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2022. GitHub repository.

[11] Amy N. Langville and Carl D. Meyer. Deeper inside pagerank. Internet Mathematics, 1(3):335-380, 2004. Published 2003/2004.

[12] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2024. URL https://arxiv.org/abs/2305.14342

[14] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is Scalable for LLM Training. arXiv preprint arXiv:2502.16982, 2025.

[15] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

[16] Sean Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2 edition, April 2009. ISBN 9780521731829.

[17] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237-354, 2006. ISSN 1551-305X. doi:10.1561/0400000003. URL https://doi.org/10.1561/0400000003

[18] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343-348, 1967.

[19] Hamdy Taha. Operations Research: An Introduction, Global Edition. Pearson, 10 edition, 2017. ISBN 978-1-292-16554-7. E-ISBN: 978-1-292-16556-1.

[20] John von Neumann. 1. A Certain Zero-sum Two-person Game Equivalent to the Optimal Assignment Problem, pages 5-12. Princeton University Press, Princeton, 1953. ISBN 9781400881970. doi:10.1515/9781400881970-002

[21] Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, et al. mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880, 2025.

[22] Yongyi Yang and Jianyang Gao. mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations. arXiv preprint arXiv:2601.05732, 2026. doi:10.48550/arXiv.2601.05732

[23] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.

[24] Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, and Danilo Mandic. Kromhc: Manifold-constrained hyper-connections with kronecker-product residual matrices, 2026. URL https://arxiv.org/abs/2601.21579

[25] Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-Connections. In Proceedings of The Thirteenth International Conference on Learning Representations, 2025.
