Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts
Pith reviewed 2026-05-09 16:45 UTC · model grok-4.3
The pith
The zero-temperature limit of softmax-routed mixture-of-experts is governed by a thin geometric layer around routing interfaces rather than the full input space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under smoothness and transversality assumptions on the router and input law, coarea and tube estimates show that boundary mass is linear in slab width, with leading constant a surface integral over the routing interface in the binary case. These estimates produce quantitative soft-to-hard risk bounds and, under compactness and uniform margin control, Γ-convergence of the soft objectives to the hard-routing objective. The zero-temperature limit is therefore controlled by a thin geometric layer around routing interfaces.
What carries the argument
Boundary mass, the probability that the top two router scores differ by at most a small margin, together with coarea/tube estimates that convert it into a surface integral over the routing interface.
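In symbols (notation assumed here; the page itself fixes none): write g = s₁ − s₂ for the binary score gap, p for the input density, and Γ = {g = 0} for the routing interface. The coarea formula then gives the claimed linear law:

```latex
\mathbb{P}\bigl(\lvert g(x)\rvert \le \delta\bigr)
  \;=\; \int_{-\delta}^{\delta} \int_{\{g = t\}} \frac{p(x)}{\lvert \nabla g(x) \rvert}\, d\mathcal{H}^{d-1}(x)\, dt
  \;=\; 2\delta \int_{\Gamma} \frac{p(x)}{\lvert \nabla g(x) \rvert}\, d\mathcal{H}^{d-1}(x) \;+\; O(\delta^{2}).
```

This is linear in the slab width 2δ with the surface integral over Γ as leading constant, exactly as the core claim states; transversality (a lower bound on |∇g| near Γ) keeps that constant finite.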
Load-bearing premise
The router and input distribution must satisfy smoothness and transversality so that the coarea and tube formulas apply near the ties.
What would settle it
For a linear router and Gaussian inputs, compute boundary mass over a sequence of shrinking margins and test whether the observed scaling matches the predicted surface integral within numerical error.
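That check is cheap to run. The sketch below is illustrative, not from the paper: a hypothetical two-expert linear router with standard Gaussian inputs, where the score gap is itself Gaussian, so the predicted slope has a closed form to compare the Monte Carlo estimate against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-expert linear router: scores s1 = w1·x, s2 = w2·x,
# so the score gap g(x) = (w1 - w2)·x is Gaussian when x ~ N(0, I_d).
d = 8
w1, w2 = rng.normal(size=d), rng.normal(size=d)
a = w1 - w2
sigma = np.linalg.norm(a)              # std of the gap g ~ N(0, sigma^2)

x = rng.normal(size=(1_000_000, d))    # Monte Carlo sample of the input law
g = x @ a

# Boundary mass B(delta) = P(|g| <= delta) over shrinking margins.
deltas = np.array([0.4, 0.2, 0.1, 0.05, 0.025])
mass = np.array([(np.abs(g) <= de).mean() for de in deltas])
ratios = mass / deltas                 # should flatten at the predicted slope

# Predicted leading behavior: B(delta) ≈ 2 * delta * p_g(0), where
# p_g(0) = phi(0) / sigma, i.e. slope sqrt(2/pi) / sigma.
slope_pred = np.sqrt(2.0 / np.pi) / sigma
print(ratios, slope_pred)
```

If the linear-scaling claim holds, `ratios` is flat in `delta` and matches `slope_pred`, which is the one-dimensional instance of the surface integral over the interface {a·x = 0}.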
Original abstract
Softmax-routed mixture-of-experts models approach hard routing as the temperature tends to zero, but this limit is singular near routing ties. This paper studies that singularity at the population level for squared-loss MoE regression. The central object is the \emph{boundary mass}, namely the probability that the top two router scores are separated by only a small margin. Under smoothness and transversality assumptions on the router and input law, we prove coarea/tube estimates showing that this mass is linear in the slab width, with leading constant given by a surface integral over the routing interface in the binary case. These estimates yield quantitative soft-to-hard risk bounds and, under compactness and uniform margin control, $\Gamma$-convergence of the soft objectives to the hard-routing objective. The main conclusion is that the zero-temperature limit is controlled by a thin geometric layer around routing interfaces, not by the full input space. We then use this geometric core in two more model-dependent directions. In a teacher--student setting, we prove a conditional landscape-transfer principle showing that, when the profiled hard-routing problem has favorable identifiability and curvature and the relevant derivatives transfer at boundary-layer scale, small-temperature soft routing inherits approximate teacher recovery and strict-saddle behavior away from teacher-equivalent partitions. We also give a reduced two-expert Gaussian calculation that illustrates a local symmetry-breaking mechanism aligned with the teacher separator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the singular zero-temperature limit of softmax-routed mixture-of-experts models in squared-loss regression. It defines boundary mass as the probability that the top-two router scores differ by at most a small margin and, under smoothness and transversality assumptions on the router and input measure, proves via coarea and tube estimates that this mass scales linearly with slab width, with leading constant equal to a surface integral over the routing interface (binary case). These estimates are used to obtain quantitative soft-to-hard risk bounds and, under compactness plus uniform margin control, Γ-convergence of the soft objective to the hard-routing objective. The work further derives a conditional landscape-transfer result in a teacher-student setting and illustrates local symmetry breaking via a reduced two-expert Gaussian calculation. The central conclusion is that the limit is governed by a thin geometric layer around routing interfaces rather than the full input space.
Significance. If the stated assumptions hold and the derivations are complete, the paper supplies a rigorous geometric explanation for why soft routing approaches hard routing in a controlled, localized manner. This is potentially significant for theoretical analysis of MoE training dynamics and generalization. Credit is due for the explicit use of coarea/tube estimates from geometric measure theory to obtain linear scaling with a surface-integral prefactor, for the quantitative risk bounds, and for the Γ-convergence result under added compactness and margin hypotheses. The teacher-student landscape transfer and Gaussian symmetry-breaking example are useful model-dependent corollaries.
major comments (2)
- [§3] §3 (Coarea/tube estimates): the linear scaling of boundary mass with slab width is asserted with leading constant given by the surface integral over the routing interface, but the explicit error term in the tube estimate and the lower bound on |∇(router-score difference)| away from zero are not stated with sufficient precision to verify that the constant remains positive and finite under the transversality hypothesis; this is load-bearing for the claimed quantitative soft-to-hard risk bounds.
- [§4] §4 (Γ-convergence): the uniform margin control is invoked to pass to the hard-routing limit, yet no argument is given showing compatibility with the transversality assumption when the router gradient may approach zero at isolated interface points; without this, the linear scaling could degrade and the Γ-convergence claim would require additional justification.
minor comments (2)
- [Abstract] The abstract introduces 'boundary mass' without an inline formal definition; adding one sentence would improve immediate readability for readers unfamiliar with the geometric setting.
- [§2] Notation for the router-score difference function and the slab width parameter is introduced in §2 but used without a consolidated table of symbols; a short notation summary would aid cross-referencing in the estimates.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. The major comments identify areas where greater precision and explicit justification would strengthen the presentation of the coarea/tube estimates and the Γ-convergence argument. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [§3] §3 (Coarea/tube estimates): the linear scaling of boundary mass with slab width is asserted with leading constant given by the surface integral over the routing interface, but the explicit error term in the tube estimate and the lower bound on |∇(router-score difference)| away from zero are not stated with sufficient precision to verify that the constant remains positive and finite under the transversality hypothesis; this is load-bearing for the claimed quantitative soft-to-hard risk bounds.
Authors: We agree that explicit statements of the error term and the gradient lower bound would make verification immediate. Under the transversality assumption (Assumption 3.2), the router-score difference has |∇(f1−f2)| ≥ c > 0 uniformly on the compact interface by the implicit function theorem and C² smoothness. Lemma 3.3 applies the coarea formula to obtain the exact surface-integral leading term, with remainder O(δ²) controlled by the second derivatives and the input measure's regularity. In the revision we will insert the explicit lower bound c (depending only on the C² norm and transversality constant) and the precise O(δ²) error into the statement of Lemma 3.3, together with a short remark confirming that the leading constant remains positive and finite. This clarification supports the quantitative risk bounds in §4 without changing any claims. revision: yes
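One plausible explicit form of the promised statement (symbols assumed here, not quoted from the paper): with |∇g| ≥ c > 0 on the interface Γ = {g = 0} and C² control on g and C¹ control on the density p,

```latex
\Bigl|\, \mathbb{P}\bigl(\lvert g \rvert \le \delta\bigr)
  \;-\; 2\delta \int_{\Gamma} \frac{p}{\lvert \nabla g \rvert}\, d\mathcal{H}^{d-1} \Bigr|
  \;\le\; C\bigl(\lVert g \rVert_{C^{2}},\, \lVert p \rVert_{C^{1}},\, c\bigr)\, \delta^{2}.
```

Positivity and finiteness of the leading constant then follow directly: |∇g| ≤ its C¹ bound keeps the integrand bounded below, and |∇g| ≥ c with a compact interface keeps the surface integral finite.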
-
Referee: [§4] §4 (Γ-convergence): the uniform margin control is invoked to pass to the hard-routing limit, yet no argument is given showing compatibility with the transversality assumption when the router gradient may approach zero at isolated interface points; without this, the linear scaling could degrade and the Γ-convergence claim would require additional justification.
Authors: The referee correctly notes that transversality alone does not preclude |∇| from becoming arbitrarily small at isolated points. The uniform margin control (Assumption 4.1) is imposed precisely to keep the soft-to-hard approximation uniform. Because the set where |∇| is small has measure zero under transversality and the margin control is uniform over the compact domain, the linear scaling of boundary mass persists with the same surface-integral prefactor. In the revision we will add a short lemma (or remark) in §4 that combines the two assumptions to show that the Γ-convergence error remains O(τ) (temperature) without degradation. This supplies the missing compatibility argument while leaving the main Γ-convergence statement unchanged. revision: yes
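The claimed O(τ) behavior can also be eyeballed numerically. A minimal sketch (one-dimensional score gap under a standard Gaussian; the setup is illustrative, not the paper's): the mean gap between the soft gate σ(g/τ) and the hard gate 1{g > 0} should shrink linearly in τ, because only the O(τ)-thin layer around the tie contributes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D router score gap g under a standard Gaussian input law.
g = rng.normal(0.0, 1.0, size=2_000_000)

def mean_gate_gap(tau):
    """Mean |soft - hard| gate weight at temperature tau."""
    soft = 1.0 / (1.0 + np.exp(-g / tau))   # softmax gate for two experts
    hard = (g > 0).astype(float)            # hard (argmax) gate
    return np.abs(soft - hard).mean()

taus = np.array([0.1, 0.05, 0.025])
gaps = np.array([mean_gate_gap(t) for t in taus])

# For small tau the gap behaves like tau * 2*ln(2) * p_g(0): the integrand
# is bounded and the disagreement region carries mass proportional to tau.
slope_pred = 2.0 * np.log(2.0) / np.sqrt(2.0 * np.pi)
print(gaps / taus, slope_pred)
```

The ratio `gaps / taus` stabilizing near 2 ln 2 · φ(0) is the one-dimensional shadow of the boundary-layer picture the rebuttal appeals to.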
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper derives its central boundary-mass estimates and soft-to-hard limits by applying standard coarea and tube formulas from geometric measure theory to the router function under explicitly stated smoothness and transversality assumptions on the router and input measure. These yield the claimed linear scaling in slab width (with surface-integral prefactor in the binary case), quantitative risk bounds, and Γ-convergence under added compactness and margin control. The subsequent teacher-student landscape-transfer principle and reduced Gaussian calculation are presented as model-dependent corollaries that inherit the geometric core rather than feeding back into it. No step collapses into a fitted parameter, a self-referential definition, or a load-bearing self-citation; the argument is self-contained apart from standard external mathematical tools.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: smoothness and transversality assumptions on the router and input law
invented entities (1)
- boundary mass (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Braides. Γ-Convergence for Beginners. Oxford University Press, 2002. DOI: 10.1093/acprof:oso/9780198507840.001.0001
- [2] L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions (revised edition). CRC Press, 2015. DOI: 10.1201/b18333
- [3] W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. JMLR: jmlr.org/papers/v23/21-0998.html. arXiv: arxiv.org/abs/2101.03961
- [4]
- [5] A. Henrot and M. Pierre. Variation et Optimisation de Formes. Springer, 2005. DOI: 10.1007/3-540-37689-5
- [6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. DOI: 10.1162/neco.1991.3.1.79
- [7] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994. DOI: 10.1162/neco.1994.6.2.181
- [8] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017. OpenReview: openreview.net/forum?id=B1ckMDqlg
- [9] J. Sokołowski and J.-P. Zolésio. Introduction to Shape Optimization: Shape Sensitivity Analysis. Springer, 1992. DOI: 10.1007/978-3-642-58106-9
- [10] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18:1131–1198, 2018. DOI: 10.1007/s10208-017-9365-9
- [11] S. E. Yüksel, J. N. Wilson, and P. D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012. DOI: 10.1109/TNNLS.2012.2200299
- [12] Z. Chen, Y. Deng, Y. Wu, Q. Gu, and Y. Li. Towards understanding the mixture-of-experts layer in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022. Proceedings: neurips.cc/proceedings/2022
- [13] N. Dikkala, N. Ghosh, R. Meka, R. Panigrahy, N. Vyas, and X. Wang. On the benefits of learning to route in mixture-of-experts models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9376–9396, 2023. ACL Anthology: aclanthology.org/2023.emnlp-main.583
- [14] N. Ho, C.-Y. Yang, and M. I. Jordan. Convergence rates for Gaussian mixtures of experts. Journal of Machine Learning Research, 23(323):1–81, 2022. JMLR: jmlr.org/papers/v23/20-1129.html
- [15] R. Kawata, K. Matsutani, Y. Kinoshita, N. Nishikawa, and T. Suzuki. Mixture of experts provably detect and learn the latent cluster structure in gradient-based learning. In Proceedings of the 42nd International Conference on Machine Learning (ICML), PMLR 267:29390–29448, 2025. PMLR: proceedings.mlr.press/v267/kawata25a.html
- [17] A. Makkuva, P. Viswanath, S. Kannan, and S. Oh. Breaking the gridlock in mixture-of-experts: Consistent and efficient algorithms. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97:4304–4313, 2019. PMLR: proceedings.mlr.press/v97/makkuva19a.html
- [18] A. Makkuva, S. Oh, S. Kannan, and P. Viswanath. Learning in gated neural networks. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 108:3338–3348, 2020. PMLR: proceedings.mlr.press/v108/makkuva20a.html
- [19] H. Nguyen, T. Nguyen, and N. Ho. Demystifying softmax gating function in Gaussian mixture of experts. In Advances in Neural Information Processing Systems (NeurIPS), 2023. Proceedings: neurips.cc/proceedings/2023
- [20] H. Nguyen, P. Akbarian, F. Yan, and N. Ho. Statistical perspective of top-K sparse softmax gating mixture of experts. In International Conference on Learning Representations (ICLR), 2024. Proceedings: proceedings.iclr.cc/2024
- [22] H. Nguyen, P. Akbarian, T. Nguyen, and N. Ho. A general theory for softmax gating multinomial logistic mixture of experts. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:37617–37648, 2024. PMLR: proceedings.mlr.press/v235/nguyen24b.html
- [23] H. Nguyen, N. Ho, and A. Rinaldo. On least square estimation in softmax gating mixture of experts. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:37707–37735, 2024. PMLR: proceedings.mlr.press/v235/nguyen24f.html
- [24] H. Nguyen, N. Ho, and A. Rinaldo. Sigmoid gating is more sample efficient than softmax gating in mixture of experts. In Advances in Neural Information Processing Systems (NeurIPS), 2024. Proceedings: neurips.cc/proceedings/2024
- [25] H. Nguyen, P. Akbarian, H. T. Pham, T. T. N. Vu, S. Zhang, and N. Ho. Statistical advantages of perturbing cosine router in mixture of experts. In International Conference on Learning Representations (ICLR), 2025. Proceedings: proceedings.iclr.cc/2025
- [26] M. Wang and W. E. On the expressive power of mixture-of-experts for structured complex tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2025. OpenReview: openreview.net/forum?id=zSrb8rtH9M