pith. machine review for the scientific record.

arxiv: 2605.02124 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · math.PR


Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts


Pith reviewed 2026-05-09 16:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.PR

keywords mixture of experts · soft routing · hard routing · temperature limit · boundary mass · gamma-convergence · teacher-student

The pith

The zero-temperature limit of softmax-routed mixture-of-experts is governed by a thin geometric layer around routing interfaces rather than the full input space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Softmax mixture-of-experts models are expected to approach hard routing as temperature drops to zero, yet the transition is singular wherever the router assigns nearly equal scores to two experts. The paper centers on boundary mass, the probability that the top two router outputs differ by only a small margin. Under smoothness and transversality conditions it proves that this mass grows linearly with margin width, the coefficient being a surface integral over the routing interface. The resulting estimates deliver explicit soft-to-hard risk bounds and Gamma-convergence of the objectives once compactness and margin control are added.
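
In symbols, with notation assumed here rather than taken from the paper (score gap $g = f_1 - f_2$, input density $\rho$, routing interface $\Sigma = \{x : g(x) = 0\}$), the central object and the claimed binary-case expansion read

\[ B(\delta) \;=\; \mathbb{P}\bigl(|g(X)| \le \delta\bigr) \;=\; \delta \int_{\Sigma} \frac{2\,\rho(x)}{|\nabla g(x)|}\, d\mathcal{H}^{d-1}(x) \;+\; O(\delta^2), \]

via the coarea formula $\mathbb{P}(|g| \le \delta) = \int_{-\delta}^{\delta} \int_{\{g = t\}} \rho\, |\nabla g|^{-1}\, d\mathcal{H}^{d-1}\, dt$, provided the inner integral is continuous at $t = 0$.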

Core claim

Under smoothness and transversality assumptions on the router and input law, coarea and tube estimates show that boundary mass is linear in slab width, with leading constant a surface integral over the routing interface in the binary case. These estimates produce quantitative soft-to-hard risk bounds and, under compactness and uniform margin control, Gamma-convergence of the soft objectives to the hard-routing objective. The zero-temperature limit is therefore controlled by a thin geometric layer around routing interfaces.
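
One standard route from this estimate to a quantitative bound, sketched under assumptions the abstract does not spell out (binary softmax gate $\sigma(g/\tau)$ at temperature $\tau$, experts and targets bounded so pointwise loss differences are controlled by a constant $C$): splitting the risk over the slab $\{|g| \le \delta\}$ and its complement, where the gate error obeys $|\sigma(g/\tau) - \mathbf{1}\{g > 0\}| \le e^{-\delta/\tau}$, gives

\[ |R_\tau - R_0| \;\le\; C\,B(\delta) + C\,e^{-\delta/\tau} \qquad \text{for every } \delta > 0, \]

and with $B(\delta) = O(\delta)$ the choice $\delta = \tau \log(1/\tau)$ yields $|R_\tau - R_0| = O(\tau \log(1/\tau))$. The paper's actual bounds may take a different form; the point is that boundary mass is the quantity that carries them.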

What carries the argument

Boundary mass, the probability that the top two router scores differ by at most a small margin, together with coarea/tube estimates that convert it into a surface integral over the routing interface.

Load-bearing premise

The router and input distribution must satisfy smoothness and transversality conditions so that the coarea and tube formulas apply near routing ties.

What would settle it

For a linear router and Gaussian inputs, compute boundary mass over a sequence of shrinking margins and test whether the observed scaling matches the predicted surface integral within numerical error.
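
A minimal sketch of that experiment, assuming a binary linear router with score gap $g(x) = w^\top x + b$ and inputs $X \sim N(0, I_d)$; every name and constant below is illustrative, not the paper's. In this toy case $g(X) \sim N(b, \|w\|^2)$, so the predicted slope of $B(\delta)/\delta$ is $2\varphi(b/\|w\|)/\|w\|$, with $\varphi$ the standard normal density.

# Monte Carlo check of the boundary-mass scaling for a toy binary
# linear router (illustrative setup, not the paper's experiment).
# Score gap: g(x) = w @ x + b with X ~ N(0, I_d), so g(X) ~ N(b, ||w||^2)
# and B(delta) = P(|g(X)| <= delta) should satisfy
#   B(delta)/delta -> 2 * phi(b/||w||) / ||w||   as delta -> 0,
# the surface-integral prefactor specialized to a flat interface.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # input dimension (illustrative)
w = rng.standard_normal(d)              # router weights (illustrative)
b = 0.3                                 # router bias (illustrative)

n = 1_000_000                           # Monte Carlo sample size
X = rng.standard_normal((n, d))         # inputs drawn from N(0, I_d)
g = X @ w + b                           # top-two router-score gap

w_norm = np.linalg.norm(w)
z = b / w_norm
# Standard normal density at z, divided by ||w||, times 2 (two-sided slab).
predicted_slope = 2.0 * np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * w_norm)

for delta in (0.2, 0.1, 0.05, 0.025):   # shrinking slab half-widths
    empirical_slope = np.mean(np.abs(g) <= delta) / delta
    print(f"delta={delta:5.3f}  B/delta={empirical_slope:.4f}  "
          f"predicted={predicted_slope:.4f}")

If the empirical ratio stabilizes at the predicted slope as the margin shrinks, the linear scaling with the surface-integral prefactor holds in this toy case; persistent drift would signal that the $O(\delta^2)$ term still dominates at the chosen margins.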

read the original abstract

Softmax-routed mixture-of-experts models approach hard routing as the temperature tends to zero, but this limit is singular near routing ties. This paper studies that singularity at the population level for squared-loss MoE regression. The central object is the \emph{boundary mass}, namely the probability that the top two router scores are separated by only a small margin. Under smoothness and transversality assumptions on the router and input law, we prove coarea/tube estimates showing that this mass is linear in the slab width, with leading constant given by a surface integral over the routing interface in the binary case. These estimates yield quantitative soft-to-hard risk bounds and, under compactness and uniform margin control, $\Gamma$-convergence of the soft objectives to the hard-routing objective. The main conclusion is that the zero-temperature limit is controlled by a thin geometric layer around routing interfaces, not by the full input space. We then use this geometric core in two more model-dependent directions. In a teacher--student setting, we prove a conditional landscape-transfer principle showing that, when the profiled hard-routing problem has favorable identifiability and curvature and the relevant derivatives transfer at boundary-layer scale, small-temperature soft routing inherits approximate teacher recovery and strict-saddle behavior away from teacher-equivalent partitions. We also give a reduced two-expert Gaussian calculation that illustrates a local symmetry-breaking mechanism aligned with the teacher separator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies the singular zero-temperature limit of softmax-routed mixture-of-experts models in squared-loss regression. It defines boundary mass as the probability that the top-two router scores differ by at most a small margin and, under smoothness and transversality assumptions on the router and input measure, proves via coarea and tube estimates that this mass scales linearly with slab width, with leading constant equal to a surface integral over the routing interface (binary case). These estimates are used to obtain quantitative soft-to-hard risk bounds and, under compactness plus uniform margin control, Γ-convergence of the soft objective to the hard-routing objective. The work further derives a conditional landscape-transfer result in a teacher-student setting and illustrates local symmetry breaking via a reduced two-expert Gaussian calculation. The central conclusion is that the limit is governed by a thin geometric layer around routing interfaces rather than the full input space.

Significance. If the stated assumptions hold and the derivations are complete, the paper supplies a rigorous geometric explanation for why soft routing approaches hard routing in a controlled, localized manner. This is potentially significant for theoretical analysis of MoE training dynamics and generalization. Credit is due for the explicit use of coarea/tube estimates from geometric measure theory to obtain linear scaling with a surface-integral prefactor, for the quantitative risk bounds, and for the Γ-convergence result under added compactness and margin hypotheses. The teacher-student landscape transfer and Gaussian symmetry-breaking example are useful model-dependent corollaries.

major comments (2)
  1. [§3] §3 (Coarea/tube estimates): the linear scaling of boundary mass with slab width is asserted with leading constant given by the surface integral over the routing interface, but the explicit error term in the tube estimate and the lower bound on |∇(router-score difference)| away from zero are not stated with sufficient precision to verify that the constant remains positive and finite under the transversality hypothesis; this is load-bearing for the claimed quantitative soft-to-hard risk bounds.
  2. [§4] §4 (Γ-convergence): the uniform margin control is invoked to pass to the hard-routing limit, yet no argument is given showing compatibility with the transversality assumption when the router gradient may approach zero at isolated interface points; without this, the linear scaling could degrade and the Γ-convergence claim would require additional justification.
minor comments (2)
  1. [Abstract] The abstract introduces 'boundary mass' without an inline formal definition; adding one sentence would improve immediate readability for readers unfamiliar with the geometric setting.
  2. [§2] Notation for the router-score difference function and the slab width parameter is introduced in §2 but used without a consolidated table of symbols; a short notation summary would aid cross-referencing in the estimates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The major comments identify areas where greater precision and explicit justification would strengthen the presentation of the coarea/tube estimates and the Γ-convergence argument. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (Coarea/tube estimates): the linear scaling of boundary mass with slab width is asserted with leading constant given by the surface integral over the routing interface, but the explicit error term in the tube estimate and the lower bound on |∇(router-score difference)| away from zero are not stated with sufficient precision to verify that the constant remains positive and finite under the transversality hypothesis; this is load-bearing for the claimed quantitative soft-to-hard risk bounds.

    Authors: We agree that explicit statements of the error term and the gradient lower bound would make verification immediate. Under the transversality assumption (Assumption 3.2), the router-score difference has |∇(f1−f2)| ≥ c > 0 uniformly on the compact interface by the implicit function theorem and C² smoothness. Lemma 3.3 applies the coarea formula to obtain the exact surface-integral leading term, with remainder O(δ²) controlled by the second derivatives and the input measure's regularity. In the revision we will insert the explicit lower bound c (depending only on the C² norm and transversality constant) and the precise O(δ²) error into the statement of Lemma 3.3, together with a short remark confirming that the leading constant remains positive and finite. This clarification supports the quantitative risk bounds in §4 without changing any claims. revision: yes

  2. Referee: [§4] §4 (Γ-convergence): the uniform margin control is invoked to pass to the hard-routing limit, yet no argument is given showing compatibility with the transversality assumption when the router gradient may approach zero at isolated interface points; without this, the linear scaling could degrade and the Γ-convergence claim would require additional justification.

    Authors: The referee correctly notes that transversality alone does not preclude |∇(f1−f2)| from becoming arbitrarily small at isolated points. The uniform margin control (Assumption 4.1) is imposed precisely to keep the soft-to-hard approximation uniform. Because the set where |∇(f1−f2)| is small has measure zero under transversality and the margin control is uniform over the compact domain, the linear scaling of boundary mass persists with the same surface-integral prefactor. In the revision we will add a short lemma (or remark) in §4 that combines the two assumptions to show that the Γ-convergence error remains O(τ) in the temperature τ without degradation. This supplies the missing compatibility argument while leaving the main Γ-convergence statement unchanged. revision: yes
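
For reference, the Γ-convergence claim being defended has the standard two-part form (generic rendering in assumed notation, with $J_\tau$ the soft objective at temperature $\tau$ and $J_0$ the hard-routing objective):

\[ \theta_\tau \to \theta \;\Rightarrow\; J_0(\theta) \le \liminf_{\tau \to 0} J_\tau(\theta_\tau), \qquad\text{and}\qquad \forall\,\theta\; \exists\,\theta_\tau \to \theta :\; \limsup_{\tau \to 0} J_\tau(\theta_\tau) \le J_0(\theta). \]

The compatibility lemma promised above must show that both inequalities survive isolated interface points where $|\nabla(f_1 - f_2)|$ is small, with the stated $O(\tau)$ error uniform over the compact domain.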

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper derives its central boundary-mass estimates and soft-to-hard limits by applying standard coarea and tube formulas from geometric measure theory to the router function under explicitly stated smoothness and transversality assumptions on the router and input measure. These yield the claimed linear scaling in slab width (with surface-integral prefactor in the binary case), quantitative risk bounds, and Γ-convergence under added compactness and margin control. The subsequent teacher-student landscape-transfer principle and reduced Gaussian calculation are presented as model-dependent corollaries that inherit the geometric core rather than feeding back into it. No step reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation; apart from standard external mathematical tools, the argument is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger constructed from abstract only; full paper may introduce additional fitted constants or background results.

axioms (1)
  • domain assumption · Smoothness and transversality assumptions on the router and input law
    Required to prove the coarea/tube estimates showing boundary mass is linear in slab width.
invented entities (1)
  • boundary mass · no independent evidence
    purpose: Quantify the probability that the top two router scores differ by only a small margin near routing ties.
    Central object introduced to analyze the singularity of the zero-temperature limit; no independent falsifiable evidence supplied beyond the definition.

pith-pipeline@v0.9.0 · 5546 in / 1500 out tokens · 54171 ms · 2026-05-09T16:45:21.115516+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 10 canonical work pages · 1 internal anchor

  1. A. Braides. Γ-Convergence for Beginners. Oxford University Press, 2002. DOI: 10.1093/acprof:oso/9780198507840.001.0001
  2. L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions (revised edition). CRC Press, 2015. DOI: 10.1201/b18333
  3. W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. JMLR: jmlr.org/papers/v23/21-0998.html. arXiv: arxiv.org/abs/2101.03961
  4. R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of COLT 2015, PMLR 40, 2015. PMLR: proceedings.mlr.press/v40/Ge15.html. arXiv: arxiv.org/abs/1503.02101
  5. A. Henrot and M. Pierre. Variation et Optimisation de Formes. Springer, 2005. DOI: 10.1007/3-540-37689-5
  6. R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. DOI: 10.1162/neco.1991.3.1.79
  7. M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994. DOI: 10.1162/neco.1994.6.2.181
  8. N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017. OpenReview: openreview.net/forum?id=B1ckMDqlg
  9. J. Sokołowski and J.-P. Zolésio. Introduction to Shape Optimization: Shape Sensitivity Analysis. Springer, 1992. DOI: 10.1007/978-3-642-58106-9
  10. J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18:1131–1198, 2018. DOI: 10.1007/s10208-017-9365-9
  11. S. E. Yüksel, J. N. Wilson, and P. D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012. DOI: 10.1109/TNNLS.2012.2200299
  12. Z. Chen, Y. Deng, Y. Wu, Q. Gu, and Y. Li. Towards understanding the mixture-of-experts layer in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022. Proceedings: neurips.cc/proceedings/2022
  13. N. Dikkala, N. Ghosh, R. Meka, R. Panigrahy, N. Vyas, and X. Wang. On the benefits of learning to route in mixture-of-experts models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9376–9396, 2023. ACL Anthology: aclanthology.org/2023.emnlp-main.583
  14. N. Ho, C.-Y. Yang, and M. I. Jordan. Convergence rates for Gaussian mixtures of experts. Journal of Machine Learning Research, 23(323):1–81, 2022. JMLR: jmlr.org/papers/v23/20-1129.html
  15. R. Kawata, K. Matsutani, Y. Kinoshita, N. Nishikawa, and T. Suzuki. Mixture of experts provably detect and learn the latent cluster structure in gradient-based learning. In Proceedings of the 42nd International Conference on Machine Learning (ICML), PMLR 267:29390–29448, 2025. PMLR: proceedings.mlr.press/v267/kawata25a.html
  16. A. Makkuva, P. Viswanath, S. Kannan, and S. Oh. Breaking the gridlock in mixture-of-experts: Consistent and efficient algorithms. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97:4304–4313, 2019. PMLR: proceedings.mlr.press/v97/makkuva19a.html
  17. A. Makkuva, S. Oh, S. Kannan, and P. Viswanath. Learning in gated neural networks. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 108:3338–3348, 2020. PMLR: proceedings.mlr.press/v108/makkuva20a.html
  18. H. Nguyen, T. Nguyen, and N. Ho. Demystifying softmax gating function in Gaussian mixture of experts. In Advances in Neural Information Processing Systems (NeurIPS), 2023. Proceedings: neurips.cc/proceedings/2023
  19. H. Nguyen, P. Akbarian, F. Yan, and N. Ho. Statistical perspective of top-K sparse softmax gating mixture of experts. In International Conference on Learning Representations (ICLR), 2024. Proceedings: proceedings.iclr.cc/2024
  20. H. Nguyen, P. Akbarian, T. Nguyen, and N. Ho. A general theory for softmax gating multinomial logistic mixture of experts. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:37617–37648, 2024. PMLR: proceedings.mlr.press/v235/nguyen24b.html
  21. H. Nguyen, N. Ho, and A. Rinaldo. On least square estimation in softmax gating mixture of experts. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:37707–37735, 2024. PMLR: proceedings.mlr.press/v235/nguyen24f.html
  22. H. Nguyen, N. Ho, and A. Rinaldo. Sigmoid gating is more sample efficient than softmax gating in mixture of experts. In Advances in Neural Information Processing Systems (NeurIPS), 2024. Proceedings: neurips.cc/proceedings/2024
  23. H. Nguyen, P. Akbarian, H. T. Pham, T. T. N. Vu, S. Zhang, and N. Ho. Statistical advantages of perturbing cosine router in mixture of experts. In International Conference on Learning Representations (ICLR), 2025. Proceedings: proceedings.iclr.cc/2025
  24. M. Wang and W. E. On the expressive power of mixture-of-experts for structured complex tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2025. OpenReview: openreview.net/forum?id=zSrb8rtH9M