Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts
Pith reviewed 2026-05-09 16:45 UTC · model grok-4.3
The pith
The zero-temperature limit of softmax-routed mixture-of-experts is governed by a thin geometric layer around routing interfaces rather than the full input space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under smoothness and transversality assumptions on the router and input law, coarea and tube estimates show that boundary mass is linear in slab width, with leading constant a surface integral over the routing interface in the binary case. These estimates produce quantitative soft-to-hard risk bounds and, under compactness and uniform margin control, Γ-convergence of the soft objectives to the hard-routing objective. The zero-temperature limit is therefore controlled by a thin geometric layer around routing interfaces.
What carries the argument
Boundary mass, the probability that the top two router scores differ by at most a small margin, together with coarea/tube estimates that convert it into a surface integral over the routing interface.
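In symbols (notation assumed here; the page itself fixes none): write g = s₁ − s₂ for the binary score gap, p for the input density, and Γ = {g = 0} for the routing interface. The coarea formula then gives the claimed linear law:

```latex
\mathbb{P}\bigl(\lvert g(x)\rvert \le \delta\bigr)
  \;=\; \int_{-\delta}^{\delta} \int_{\{g = t\}} \frac{p(x)}{\lvert \nabla g(x) \rvert}\, d\mathcal{H}^{d-1}(x)\, dt
  \;=\; 2\delta \int_{\Gamma} \frac{p(x)}{\lvert \nabla g(x) \rvert}\, d\mathcal{H}^{d-1}(x) \;+\; O(\delta^{2}).
```

This is linear in the slab width 2δ with the surface integral over Γ as leading constant, exactly as the core claim states; transversality (a lower bound on |∇g| near Γ) keeps that constant finite.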
Load-bearing premise
The router and input distribution must satisfy smoothness and transversality so that the coarea and tube formulas apply near the ties.
What would settle it
For a linear router and Gaussian inputs, compute boundary mass over a sequence of shrinking margins and test whether the observed scaling matches the predicted surface integral within numerical error.
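That check is cheap to run. The sketch below is illustrative, not from the paper: a hypothetical two-expert linear router with standard Gaussian inputs, where the score gap is itself Gaussian, so the predicted slope has a closed form to compare the Monte Carlo estimate against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-expert linear router: scores s1 = w1·x, s2 = w2·x,
# so the score gap g(x) = (w1 - w2)·x is Gaussian when x ~ N(0, I_d).
d = 8
w1, w2 = rng.normal(size=d), rng.normal(size=d)
a = w1 - w2
sigma = np.linalg.norm(a)              # std of the gap g ~ N(0, sigma^2)

x = rng.normal(size=(1_000_000, d))    # Monte Carlo sample of the input law
g = x @ a

# Boundary mass B(delta) = P(|g| <= delta) over shrinking margins.
deltas = np.array([0.4, 0.2, 0.1, 0.05, 0.025])
mass = np.array([(np.abs(g) <= de).mean() for de in deltas])
ratios = mass / deltas                 # should flatten at the predicted slope

# Predicted leading behavior: B(delta) ≈ 2 * delta * p_g(0), where
# p_g(0) = phi(0) / sigma, i.e. slope sqrt(2/pi) / sigma.
slope_pred = np.sqrt(2.0 / np.pi) / sigma
print(ratios, slope_pred)
```

If the linear-scaling claim holds, `ratios` is flat in `delta` and matches `slope_pred`, which is the one-dimensional instance of the surface integral over the interface {a·x = 0}.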
Original abstract
Softmax-routed mixture-of-experts models approach hard routing as the temperature tends to zero, but this limit is singular near routing ties. This paper studies that singularity at the population level for squared-loss MoE regression. The central object is the \emph{boundary mass}, namely the probability that the top two router scores are separated by only a small margin. Under smoothness and transversality assumptions on the router and input law, we prove coarea/tube estimates showing that this mass is linear in the slab width, with leading constant given by a surface integral over the routing interface in the binary case. These estimates yield quantitative soft-to-hard risk bounds and, under compactness and uniform margin control, $\Gamma$-convergence of the soft objectives to the hard-routing objective. The main conclusion is that the zero-temperature limit is controlled by a thin geometric layer around routing interfaces, not by the full input space. We then use this geometric core in two more model-dependent directions. In a teacher--student setting, we prove a conditional landscape-transfer principle showing that, when the profiled hard-routing problem has favorable identifiability and curvature and the relevant derivatives transfer at boundary-layer scale, small-temperature soft routing inherits approximate teacher recovery and strict-saddle behavior away from teacher-equivalent partitions. We also give a reduced two-expert Gaussian calculation that illustrates a local symmetry-breaking mechanism aligned with the teacher separator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the singular zero-temperature limit of softmax-routed mixture-of-experts models in squared-loss regression. It defines boundary mass as the probability that the top-two router scores differ by at most a small margin and, under smoothness and transversality assumptions on the router and input measure, proves via coarea and tube estimates that this mass scales linearly with slab width, with leading constant equal to a surface integral over the routing interface (binary case). These estimates are used to obtain quantitative soft-to-hard risk bounds and, under compactness plus uniform margin control, Γ-convergence of the soft objective to the hard-routing objective. The work further derives a conditional landscape-transfer result in a teacher-student setting and illustrates local symmetry breaking via a reduced two-expert Gaussian calculation. The central conclusion is that the limit is governed by a thin geometric layer around routing interfaces rather than the full input space.
Significance. If the stated assumptions hold and the derivations are complete, the paper supplies a rigorous geometric explanation for why soft routing approaches hard routing in a controlled, localized manner. This is potentially significant for theoretical analysis of MoE training dynamics and generalization. Credit is due for the explicit use of coarea/tube estimates from geometric measure theory to obtain linear scaling with a surface-integral prefactor, for the quantitative risk bounds, and for the Γ-convergence result under added compactness and margin hypotheses. The teacher-student landscape transfer and Gaussian symmetry-breaking example are useful model-dependent corollaries.
major comments (2)
- [§3] §3 (Coarea/tube estimates): the linear scaling of boundary mass with slab width is asserted with leading constant given by the surface integral over the routing interface, but the explicit error term in the tube estimate and the lower bound on |∇(router-score difference)| away from zero are not stated with sufficient precision to verify that the constant remains positive and finite under the transversality hypothesis; this is load-bearing for the claimed quantitative soft-to-hard risk bounds.
- [§4] §4 (Γ-convergence): the uniform margin control is invoked to pass to the hard-routing limit, yet no argument is given showing compatibility with the transversality assumption when the router gradient may approach zero at isolated interface points; without this, the linear scaling could degrade and the Γ-convergence claim would require additional justification.
minor comments (2)
- [Abstract] The abstract introduces 'boundary mass' without an inline formal definition; adding one sentence would improve immediate readability for readers unfamiliar with the geometric setting.
- [§2] Notation for the router-score difference function and the slab width parameter is introduced in §2 but used without a consolidated table of symbols; a short notation summary would aid cross-referencing in the estimates.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. The major comments identify areas where greater precision and explicit justification would strengthen the presentation of the coarea/tube estimates and the Γ-convergence argument. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [§3] §3 (Coarea/tube estimates): the linear scaling of boundary mass with slab width is asserted with leading constant given by the surface integral over the routing interface, but the explicit error term in the tube estimate and the lower bound on |∇(router-score difference)| away from zero are not stated with sufficient precision to verify that the constant remains positive and finite under the transversality hypothesis; this is load-bearing for the claimed quantitative soft-to-hard risk bounds.
Authors: We agree that explicit statements of the error term and the gradient lower bound would make verification immediate. Under the transversality assumption (Assumption 3.2), the router-score difference has |∇(f1−f2)| ≥ c > 0 uniformly on the compact interface by the implicit function theorem and C² smoothness. Lemma 3.3 applies the coarea formula to obtain the exact surface-integral leading term, with remainder O(δ²) controlled by the second derivatives and the input measure's regularity. In the revision we will insert the explicit lower bound c (depending only on the C² norm and transversality constant) and the precise O(δ²) error into the statement of Lemma 3.3, together with a short remark confirming that the leading constant remains positive and finite. This clarification supports the quantitative risk bounds in §4 without changing any claims. revision: yes
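One plausible explicit form of the promised statement (symbols assumed here, not quoted from the paper): with |∇g| ≥ c > 0 on the interface Γ = {g = 0} and C² control on g and C¹ control on the density p,

```latex
\Bigl|\, \mathbb{P}\bigl(\lvert g \rvert \le \delta\bigr)
  \;-\; 2\delta \int_{\Gamma} \frac{p}{\lvert \nabla g \rvert}\, d\mathcal{H}^{d-1} \Bigr|
  \;\le\; C\bigl(\lVert g \rVert_{C^{2}},\, \lVert p \rVert_{C^{1}},\, c\bigr)\, \delta^{2}.
```

Positivity and finiteness of the leading constant then follow directly: |∇g| ≤ its C¹ bound keeps the integrand bounded below, and |∇g| ≥ c with a compact interface keeps the surface integral finite.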
-
Referee: [§4] §4 (Γ-convergence): the uniform margin control is invoked to pass to the hard-routing limit, yet no argument is given showing compatibility with the transversality assumption when the router gradient may approach zero at isolated interface points; without this, the linear scaling could degrade and the Γ-convergence claim would require additional justification.
Authors: The referee correctly notes that transversality alone does not preclude |∇| from becoming arbitrarily small at isolated points. The uniform margin control (Assumption 4.1) is imposed precisely to keep the soft-to-hard approximation uniform. Because the set where |∇| is small has measure zero under transversality and the margin control is uniform over the compact domain, the linear scaling of boundary mass persists with the same surface-integral prefactor. In the revision we will add a short lemma (or remark) in §4 that combines the two assumptions to show that the Γ-convergence error remains O(τ) (temperature) without degradation. This supplies the missing compatibility argument while leaving the main Γ-convergence statement unchanged. revision: yes
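The claimed O(τ) behavior can also be eyeballed numerically. A minimal sketch (one-dimensional score gap under a standard Gaussian; the setup is illustrative, not the paper's): the mean gap between the soft gate σ(g/τ) and the hard gate 1{g > 0} should shrink linearly in τ, because only the O(τ)-thin layer around the tie contributes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D router score gap g under a standard Gaussian input law.
g = rng.normal(0.0, 1.0, size=2_000_000)

def mean_gate_gap(tau):
    """Mean |soft - hard| gate weight at temperature tau."""
    soft = 1.0 / (1.0 + np.exp(-g / tau))   # softmax gate for two experts
    hard = (g > 0).astype(float)            # hard (argmax) gate
    return np.abs(soft - hard).mean()

taus = np.array([0.1, 0.05, 0.025])
gaps = np.array([mean_gate_gap(t) for t in taus])

# For small tau the gap behaves like tau * 2*ln(2) * p_g(0): the integrand
# is bounded and the disagreement region carries mass proportional to tau.
slope_pred = 2.0 * np.log(2.0) / np.sqrt(2.0 * np.pi)
print(gaps / taus, slope_pred)
```

The ratio `gaps / taus` stabilizing near 2 ln 2 · φ(0) is the one-dimensional shadow of the boundary-layer picture the rebuttal appeals to.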
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper derives its central boundary-mass estimates and soft-to-hard limits by applying standard coarea and tube formulas from geometric measure theory to the router function under explicitly stated smoothness and transversality assumptions on the router and input measure. These yield the claimed linear scaling in slab width (with surface-integral prefactor in the binary case), quantitative risk bounds, and Γ-convergence under added compactness and margin control. The subsequent teacher-student landscape-transfer principle and reduced Gaussian calculation are presented as model-dependent corollaries that inherit the geometric core rather than feeding back into it. No step collapses into a fitted parameter, a self-referential definition, or a load-bearing self-citation; the argument is self-contained apart from standard external mathematical tools.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: smoothness and transversality assumptions on the router and input law
invented entities (1)
- boundary mass (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Braides. Γ-Convergence for Beginners. Oxford University Press, 2002. DOI: 10.1093/acprof:oso/9780198507840.001.0001
- [2] L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions (revised edition). CRC Press, 2015. DOI: 10.1201/b18333
- [3] W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. JMLR: jmlr.org/papers/v23/21-0998.html. arXiv: arxiv.org/abs/2101.03961
- [4]
- [5] A. Henrot and M. Pierre. Variation et Optimisation de Formes. Springer, 2005. DOI: 10.1007/3-540-37689-5
- [6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. DOI: 10.1162/neco.1991.3.1.79
- [7] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994. DOI: 10.1162/neco.1994.6.2.181
- [8] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017. OpenReview: openreview.net/forum?id=B1ckMDqlg
- [9] J. Sokołowski and J.-P. Zolésio. Introduction to Shape Optimization: Shape Sensitivity Analysis. Springer, 1992. DOI: 10.1007/978-3-642-58106-9
- [10] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18:1131–1198, 2018. DOI: 10.1007/s10208-017-9365-9
- [11] S. E. Yüksel, J. N. Wilson, and P. D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012. DOI: 10.1109/TNNLS.2012.2200299
- [12] Z. Chen, Y. Deng, Y. Wu, Q. Gu, and Y. Li. Towards understanding the mixture-of-experts layer in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022. Proceedings: neurips.cc/proceedings/2022
- [13] N. Dikkala, N. Ghosh, R. Meka, R. Panigrahy, N. Vyas, and X. Wang. On the benefits of learning to route in mixture-of-experts models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9376–9396, 2023. ACL Anthology: aclanthology.org/2023.emnlp-main.583
- [14] N. Ho, C.-Y. Yang, and M. I. Jordan. Convergence rates for Gaussian mixtures of experts. Journal of Machine Learning Research, 23(323):1–81, 2022. JMLR: jmlr.org/papers/v23/20-1129.html
- [15] R. Kawata, K. Matsutani, Y. Kinoshita, N. Nishikawa, and T. Suzuki. Mixture of experts provably detect and learn the latent cluster structure in gradient-based learning. In Proceedings of the 42nd International Conference on Machine Learning (ICML), PMLR 267:29390–29448, 2025. PMLR: proceedings.mlr.press/v267/kawata25a.html
- [17] A. Makkuva, P. Viswanath, S. Kannan, and S. Oh. Breaking the gridlock in mixture-of-experts: Consistent and efficient algorithms. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97:4304–4313, 2019. PMLR: proceedings.mlr.press/v97/makkuva19a.html
- [18] A. Makkuva, S. Oh, S. Kannan, and P. Viswanath. Learning in gated neural networks. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 108:3338–3348, 2020. PMLR: proceedings.mlr.press/v108/makkuva20a.html
- [19] H. Nguyen, T. Nguyen, and N. Ho. Demystifying softmax gating function in Gaussian mixture of experts. In Advances in Neural Information Processing Systems (NeurIPS), 2023. Proceedings: neurips.cc/proceedings/2023
- [20] H. Nguyen, P. Akbarian, F. Yan, and N. Ho. Statistical perspective of top-K sparse softmax gating mixture of experts. In International Conference on Learning Representations (ICLR), 2024. Proceedings: proceedings.iclr.cc/2024
- [22] H. Nguyen, P. Akbarian, T. Nguyen, and N. Ho. A general theory for softmax gating multinomial logistic mixture of experts. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:37617–37648, 2024. PMLR: proceedings.mlr.press/v235/nguyen24b.html
- [23] H. Nguyen, N. Ho, and A. Rinaldo. On least square estimation in softmax gating mixture of experts. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:37707–37735, 2024. PMLR: proceedings.mlr.press/v235/nguyen24f.html
- [24] H. Nguyen, N. Ho, and A. Rinaldo. Sigmoid gating is more sample efficient than softmax gating in mixture of experts. In Advances in Neural Information Processing Systems (NeurIPS), 2024. Proceedings: neurips.cc/proceedings/2024
- [25] H. Nguyen, P. Akbarian, H. T. Pham, T. T. N. Vu, S. Zhang, and N. Ho. Statistical advantages of perturbing cosine router in mixture of experts. In International Conference on Learning Representations (ICLR), 2025. Proceedings: proceedings.iclr.cc/2025
- [26] M. Wang and W. E. On the expressive power of mixture-of-experts for structured complex tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2025. OpenReview: openreview.net/forum?id=zSrb8rtH9M