Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

Haiying Sha

arxiv: 2605.19378 · v1 · pith:XKV7IR4Gnew · submitted 2026-05-12 · 💻 cs.CV

Pith reviewed 2026-05-20 22:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords Mixture of Experts, Sparse MoE, Diffusion Transformers, Routing Failure, Selective Deadlock, Functional Redundancy, Video Diffusion, Expert Routing

The pith

Token-Choice sparse MoE in video Diffusion Transformers exhibits five failure modes, with selective deadlock as a rational waiting strategy under the Functional Redundancy Hypothesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper diagnoses training failures when converting a 5-billion-parameter dense video Diffusion Transformer into a Token-Choice sparse MoE architecture. It identifies a hierarchy of five escalating issues: global expert homogenization with linear routers, selective deadlock where one-third of layers lock into single-expert mode despite auxiliary losses, partial self-recovery with cross-attention routers, U-shaped concentration of deadlocks in shallow and deep layers, and bfloat16 precision truncating small updates. Routing time series over 65 million tokens support the Functional Redundancy Hypothesis that deadlock serves as a rational pause until the shared expert matures inside the gate-shared-routed expert system, drawing on systems biology concepts. The work states three laws for dense-to-MoE conversion, fixes the precision trap, calibrates the paradigm's current limits, and sketches a three-step roadmap from visual unification toward world models.

Core claim

Starting from a pretrained dense model, MoE conversion clones FFN weights exactly for routed experts, initializes shared experts at zero or tiny noise, and randomizes gates. This process uncovers the five failure modes and leads to the Functional Redundancy Hypothesis: observed selective deadlock is a deliberate waiting behavior in the triadic gate-shared expert-routed expert system before the shared expert can contribute, an interpretation the paper supports with long-horizon routing decision analysis.
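The conversion recipe can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy dimensions, the two-layer ReLU FFN, the linear softmax gate with renormalized top-k weights, and the helper names `convert_dense_to_moe` and `moe_forward`. Under those assumptions, a zero shared expert plus cloned routed experts reproduces the dense output at conversion time, which is one plausible reading of the "zero for verification" step.

```python
import copy
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn(params, x):
    """Two-layer ReLU FFN: W2 @ relu(W1 @ x)."""
    W1, W2 = params
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def rand_matrix(rows, cols, rng, scale):
    # scale == 0 yields an exact all-zero matrix.
    return [[rng.gauss(0.0, scale) if scale else 0.0 for _ in range(cols)]
            for _ in range(rows)]

def convert_dense_to_moe(dense, num_experts, rng, shared_scale=0.0):
    """Hypothetical helper mirroring the three conversion rules in the text."""
    W1, W2 = dense
    d_hidden, d_model = len(W1), len(W1[0])
    return {
        # Routed experts: exact clones of the dense FFN weights.
        "routed": [copy.deepcopy(dense) for _ in range(num_experts)],
        # Shared expert: zeros (verification) or tiny noise (training).
        "shared": (rand_matrix(d_hidden, d_model, rng, shared_scale),
                   rand_matrix(d_model, d_hidden, rng, shared_scale)),
        # Gate: the only randomly initialized component.
        "gate": rand_matrix(num_experts, d_model, rng, 1.0),
    }

def moe_forward(moe, x, top_k=2):
    # Assumed combination rule: shared output + top-k experts with
    # softmax weights renormalized to sum to one.
    logits = matvec(moe["gate"], x)
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = ffn(moe["shared"], x)
    for i in chosen:
        expert_out = ffn(moe["routed"][i], x)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, expert_out)]
    return out

rng = random.Random(0)
d_model, d_hidden = 4, 16
dense = (rand_matrix(d_hidden, d_model, rng, 1.0),
         rand_matrix(d_model, d_hidden, rng, 1.0))
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

moe = convert_dense_to_moe(dense, num_experts=4, rng=rng)
dense_out = ffn(dense, x)
moe_out = moe_forward(moe, x)
max_diff = max(abs(a - b) for a, b in zip(dense_out, moe_out))
print(f"max |dense - moe| at conversion: {max_diff:.2e}")
```

Because every routed expert computes the same function at initialization, the gate's choice has no effect on the output in this sketch. That is consistent with the review's description of routed experts being functionally redundant until something (here, the shared expert) breaks the symmetry.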

What carries the argument

The Functional Redundancy Hypothesis, which interprets selective deadlock as a rational waiting strategy within the gate-shared expert-routed expert triadic system, backed by routing decision time series.

If this is right

Increasing the auxiliary loss cannot eliminate selective deadlock once it sets in.
Deadlocked layers follow a U-shaped pattern, hitting hardest in early visual layers and late semantic layers.
bfloat16 mixed precision must be replaced or augmented to prevent truncation of tiny expert weight updates.
The current Token-Choice paradigm reaches a calibrated capability boundary on video diffusion tasks.
A three-step evolutionary path can move from visual unification toward integrated world models.
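The bfloat16 item in the list above can be made concrete with a small simulation. This is a sketch under stated assumptions: `to_bfloat16` is a hypothetical helper that rounds a value to bfloat16 by keeping the top 16 bits of its float32 pattern (round-to-nearest-even, NaN and infinity not handled), and the weight and update magnitudes are illustrative rather than taken from the paper. Keeping a higher-precision master copy is a common mitigation and is shown only as one possibility; the paper's own fix is not described in this summary.

```python
import struct

def to_bfloat16(x):
    """Round x to the nearest bfloat16 value, returned as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the upper 16 bits of float32; round to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

update = 1e-4          # illustrative tiny weight update
bf16_weight = 1.0      # weight stored and updated purely in bfloat16
master_weight = 1.0    # higher-precision master copy (Python float stands in)

for _ in range(100):
    # Near 1.0 the bfloat16 spacing is 2**-7, so each 1e-4 step rounds away.
    bf16_weight = to_bfloat16(bf16_weight + to_bfloat16(update))
    # The master copy accumulates the same updates without loss.
    master_weight += update

print(bf16_weight)                  # weight never moved
print(to_bfloat16(master_weight))   # bfloat16 view of the accumulated master
```

In this toy run the pure-bfloat16 weight stays at 1.0 after 100 steps, while the bfloat16 view of the master copy moves to 1.0078125 once the accumulated change exceeds half the local spacing.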

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Initializing shared experts with more structured pre-training could shorten or bypass the observed waiting phase.
The triadic system framing may generalize to other MoE variants if functional redundancy proves common.
Cross-domain ideas from systems biology could suggest new regularization or detection methods for expert imbalance.
Explicit maturity monitoring for shared experts might become a standard engineering control in future sparse models.

Load-bearing premise

The claim that selective deadlock constitutes a rational waiting strategy explained by functional redundancy in the gate-shared-routed expert system rather than some other training dynamic.

What would settle it

Pre-train or pre-mature the shared expert to full capability before joint MoE training begins, then measure whether selective deadlock still appears or whether routing balances across experts.

Original abstract

This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.
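The abstract does not say which auxiliary loss was used. As a point of reference only, the sketch below implements the load-balancing loss from Switch Transformers [2], alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. Top-1 dispatch, the function name `load_balance_loss`, and the toy probability vectors are assumptions for illustration.

```python
def load_balance_loss(router_probs, num_experts, alpha=1.0):
    """Switch-style auxiliary loss over per-token router probability vectors."""
    num_tokens = len(router_probs)
    dispatch_fraction = [0.0] * num_experts   # f_i
    mean_prob = [0.0] * num_experts           # P_i
    for probs in router_probs:
        winner = max(range(num_experts), key=probs.__getitem__)
        dispatch_fraction[winner] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return alpha * num_experts * sum(
        f * p for f, p in zip(dispatch_fraction, mean_prob)
    )

# Balanced routing: each of four tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Single-expert mode: every token goes to expert 0.
collapsed = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]

balanced_loss = load_balance_loss(balanced, num_experts=4)
collapsed_loss = load_balance_loss(collapsed, num_experts=4)
print(balanced_loss, collapsed_loss)
```

With these inputs the balanced case evaluates to about 1.0 (the minimum for alpha = 1) and the collapsed case to about 3.88, so the loss does penalize single-expert routing. The paper's observation is that raising this kind of penalty still failed to release the deadlocked layers.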

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper maps out a hierarchy of routing failures when turning dense diffusion transformers into token-choice MoE, with useful practical observations, but its central hypothesis rests on post-hoc interpretation without causal tests.

The main point is that this work tracks how token-choice sparse MoE behaves when grafted onto a 5B video diffusion transformer. They start from a pretrained dense model, clone the routed experts, zero the shared ones at first, and randomize the gates. Over 65 million tokens they see a progression: linear routers saturate globally, MLP routers produce selective deadlock in roughly one-third of layers, cross-attention routers recover somewhat but still leave deadlocked layers, those layers cluster in a U-shape at shallow and deep positions, and bfloat16 truncates tiny updates. These patterns are described clearly enough that another group could try to reproduce the time series.
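For a group attempting such a reproduction, flagging deadlocked layers from a routing trace could look like the sketch below. The trace layout (a mapping from layer index to the expert chosen for each token), the 0.99 share threshold, and the function names are assumptions for illustration; the paper's actual criterion for single-expert mode is not given in this summary.

```python
from collections import Counter

def expert_shares(assignments, num_experts):
    """Fraction of tokens routed to each expert in one layer."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def deadlocked_layers(routing_trace, num_experts, threshold=0.99):
    """Return layers where one expert receives at least `threshold` of tokens."""
    flagged = []
    for layer, assignments in sorted(routing_trace.items()):
        if max(expert_shares(assignments, num_experts)) >= threshold:
            flagged.append(layer)
    return flagged

# Toy trace: 6 layers, 4 experts, 8 tokens per layer.
trace = {
    0: [2] * 8,                       # locked onto expert 2
    1: [0, 1, 2, 3, 0, 1, 2, 3],
    2: [0, 0, 1, 1, 2, 2, 3, 3],
    3: [3, 2, 1, 0, 3, 2, 1, 0],
    4: [1, 1, 1, 0, 2, 3, 0, 2],
    5: [0] * 8,                       # locked onto expert 0
}
flagged = deadlocked_layers(trace, num_experts=4)
print(flagged)
```

Running the same check at successive training steps would give a per-layer time series of the kind the paper analyzes.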

Referee Report

3 major / 2 minor

Summary. The paper claims to diagnose a hierarchy of five failure modes in Token-Choice sparse MoE routing for video Diffusion Transformers converted from a ~5B-parameter dense pretrained model. Routed experts clone original FFN weights, shared experts start at zero or tiny noise, and gates are randomly initialized. Using routing time series over 65M tokens across 5000 steps, it identifies linear-router saturation, selective deadlock (one-third of layers stuck in single-expert mode), partial self-recovery in cross-attention routers, a U-shaped distribution of deadlocked layers, and bfloat16 truncation of tiny updates. It proposes the Functional Redundancy Hypothesis that selective deadlock is a rational waiting strategy before the shared expert matures in the gate-shared-routed-expert triadic system, drawing from systems biology, and supplies the Three Laws of dense-to-MoE conversion plus a bfloat16 fix and three-step evolutionary roadmap.

Significance. If the observations and hypothesis are substantiated with causal evidence, the work could help explain and mitigate training instabilities in sparse MoE for large visual generative models, informing initialization protocols and routing design. The scale of the time-series analysis and the explicit triadic-system framing offer a concrete starting point for future MoE studies in diffusion transformers. The engineering contributions (Three Laws, bfloat16 solution) are immediately usable. Current interpretive framing and absence of interventions limit immediate impact.

major comments (3)

[Functional Redundancy Hypothesis] Functional Redundancy Hypothesis section: the claim that selective deadlock constitutes a 'rational waiting strategy' before shared-expert maturation rests on post-hoc interpretation of routing traces without interventions, gradient-flow ablations, or activation analyses that would causally link deadlock duration to subsequent maturation or measurable benefit in the triadic system. This is load-bearing for the central explanatory claim.
[Experimental results] Results on failure modes: the hierarchy of five modes (including the one-third layer deadlock fraction and U-shaped distribution) is presented without quantitative metrics, statistical tests, ablation tables, or error bars. The abstract and time-series description alone do not establish the claimed ordering or prevalence with sufficient rigor.
[Discussion] Discussion of alternatives: accounts such as bfloat16 truncation of tiny updates, router saturation under video DiT token statistics, or optimization saddle points are acknowledged but not ruled out by controlled comparisons or additional diagnostics. This weakens the uniqueness of the functional-redundancy explanation.
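One way to supply the statistical test requested in the second major comment is a permutation test on how far the deadlocked layers sit from the middle of the stack. This is a hypothetical sketch, not the authors' analysis: the 30-layer depth, the nine deadlocked indices, and the test statistic (mean distance from the center layer) are all assumptions chosen for illustration.

```python
import random

def edge_distance(layers, num_layers):
    """Mean distance of the given layer indices from the center of the stack."""
    center = (num_layers - 1) / 2
    return sum(abs(layer - center) for layer in layers) / len(layers)

def u_shape_p_value(deadlocked, num_layers, num_permutations=2000, seed=0):
    """One-sided permutation p-value: are deadlocked layers closer to the
    ends of the stack than equally many layers drawn at random?"""
    rng = random.Random(seed)
    observed = edge_distance(deadlocked, num_layers)
    hits = 0
    for _ in range(num_permutations):
        sample = rng.sample(range(num_layers), len(deadlocked))
        if edge_distance(sample, num_layers) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (num_permutations + 1)

num_layers = 30                                   # assumed depth
u_shaped = [0, 1, 2, 3, 4, 26, 27, 28, 29]        # hypothetical deadlocked set
mid_stack = list(range(10, 19))                   # control: clustered mid-depth

p_u = u_shape_p_value(u_shaped, num_layers)
p_mid = u_shape_p_value(mid_stack, num_layers)
print(p_u, p_mid)
```

A small p-value for the observed set, alongside a large one for a mid-stack control, would indicate that concentration at shallow and deep layers is unlikely under random placement. It would not by itself distinguish a symmetric U from a one-sided skew, so reporting the shallow and deep counts separately would still be needed.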

minor comments (2)

[Abstract] Abstract: specific model architecture details, dataset, and exact layer counts for the U-shaped distribution are omitted, hindering reproducibility.
[Introduction / Hypothesis] Notation: the 'gate-shared expert-routed expert triadic system' is introduced without an accompanying diagram or formal definition that would clarify the roles and interactions.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed report. We appreciate the acknowledgment of the scale of our time-series analysis and the immediate usability of the engineering contributions. Below we address each major comment point by point, indicating planned revisions where appropriate.

Referee: [Functional Redundancy Hypothesis] Functional Redundancy Hypothesis section: the claim that selective deadlock constitutes a 'rational waiting strategy' before shared-expert maturation rests on post-hoc interpretation of routing traces without interventions, gradient-flow ablations, or activation analyses that would causally link deadlock duration to subsequent maturation or measurable benefit in the triadic system. This is load-bearing for the central explanatory claim.

Authors: We acknowledge that the Functional Redundancy Hypothesis is interpretive and rests on observational patterns from the routing time series rather than direct causal interventions. The manuscript frames it as a hypothesis informed by consistent deadlock behaviors across 65M tokens and biological analogies, without asserting proven causality. In revision we will add an explicit limitations paragraph stating the observational basis and outline targeted future experiments (e.g., controlled shared-expert initialization sweeps and gradient-flow measurements) to test the waiting-strategy interpretation. This clarifies the current evidential scope while retaining the hypothesis as a generative framework.
revision: partial

Referee: [Experimental results] Results on failure modes: the hierarchy of five modes (including the one-third layer deadlock fraction and U-shaped distribution) is presented without quantitative metrics, statistical tests, ablation tables, or error bars. The abstract and time-series description alone do not establish the claimed ordering or prevalence with sufficient rigor.

Authors: We agree that the presentation of the five failure modes would be strengthened by additional quantitative support. The revised manuscript will include confidence intervals and error bars on key statistics (e.g., deadlock fraction and layer distribution), formal statistical tests for the U-shaped pattern, and ablation tables comparing router types and auxiliary-loss strengths. These will appear in the main results section and supplementary material to establish prevalence and ordering with greater rigor.
revision: yes

Referee: [Discussion] Discussion of alternatives: accounts such as bfloat16 truncation of tiny updates, router saturation under video DiT token statistics, or optimization saddle points are acknowledged but not ruled out by controlled comparisons or additional diagnostics. This weakens the uniqueness of the functional-redundancy explanation.

Authors: We recognize that alternative accounts are noted but not exhaustively differentiated. We will expand the discussion with additional diagnostics, including precision-ablation runs (bfloat16 vs. fp32) to quantify truncation effects and router-saturation metrics under video token statistics. While complete exclusion of every alternative may require further controlled studies, we will articulate how the triadic-system patterns and cross-layer U-shape provide a unifying account consistent with the observed data.
revision: partial

standing simulated objections not resolved

Direct causal interventions, gradient-flow ablations, or activation analyses to substantiate the Functional Redundancy Hypothesis, as these would require new experimental campaigns beyond the scope of the present observational study.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper describes an empirical workflow starting from a pretrained dense model converted to MoE via explicit initialization rules (routed experts cloned, shared experts zero then small noise, gates random), followed by direct observation of routing time series over 65M tokens. The Functional Redundancy Hypothesis is presented as a post-hoc interpretation of the five documented failure modes, explicitly supported by external systems biology theory rather than any self-referential equation, fitted parameter renamed as prediction, or load-bearing self-citation. No derivation reduces by construction to the inputs; the central claims remain observational and interpretive without tautological collapse.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the three conversion laws and the biology analogy for the hypothesis; these are introduced without prior independent verification in the abstract.

axioms (2)

domain assumption: Routed experts exactly clone the original FFN weights
Stated as the first law of dense-to-MoE conversion in the abstract.
domain assumption: Shared experts are initialized to zero, then to extremely small non-zero noise
Stated as the second law for verification and training.

invented entities (1)

Functional Redundancy Hypothesis (no independent evidence)
purpose: Explains selective deadlock as a rational waiting strategy in the triadic expert system
Introduced to account for observed layer behavior; independent evidence not provided in abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: "Experiments reveal a hierarchy of five failure modes... selective deadlock... U-shaped distribution"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1] Shazeer, N., Mirhoseini, A., Maziarz, K., et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR, 2017.
[2] Fedus, W., Zoph, B., Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 2022.
[3] DeepSeek-AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. Technical Report, 2026.
[4] Alibaba Cloud. Qwen3: Qwen3 Technical Report. 2025.
[5] Wei, Y., Zhang, S., Yuan, H., et al. Routing matters in MoE: Scaling diffusion transformers with explicit routing guidance. ICLR, 2026.
[6] Kuaishou Technology. DiffMoE: Mixture-of-experts for diffusion models. CVPR, 2025.
[7] Shi, Y., et al. Mamoda2.5: Enhancing unified multimodal model with DiT-MoE. arXiv:2605.02641, 2026.
[8] Jiang, K., et al. Mixture of distributions matters: Dynamic sparse attention for efficient video diffusion transformers. arXiv, 2026.
[9] Song, Y., Sohl-Dickstein, J., Kingma, D.P., et al. Score-based generative modeling through stochastic differential equations. ICLR, 2021.
[10] DREAM: Dynamic routing of experts via attention-based mixture for vision-language-action modeling. Knowledge-Based Systems, 2026. https://www.ebiotrade.com/newsf/2026-2/20260227000821269.htm
[11] Dai, D., et al. The myth of expert specialization in MoEs: Why routing reflects geometry, not necessarily domain expertise. arXiv, 2026.
[12] T. Lunyan. Analysis of Data Type Issue with TopKRouter Expert Bias in Megatron-LM. GitCode Blog, May 2025. https://blog.gitcode.com/fbce3dac83adb53fcf8c720b4b0c06dd.html
[13] Y. Li. The Stability Gap: Why Top-K Routing Breaks RL Optimization. Personal Blog, December 2025. https://richardli.xyz/post/topk-routing-stability-gap/
[14] Load balancing mixture of experts with similarity preserving routers. arXiv, 2025.
[15] Zhou, Y., et al. Mixture-of-Experts with Expert Choice Routing. arXiv, 2022.
[16] Chen, J., et al. Sparse VideoGen: Accelerating video generation with sparse attention. arXiv, 2025.
[17] Sha, H., Zheng, Y. UniGen-LingXi: A Resource-Efficient, Editing-First Framework for 9-in-1 Multi-Modal Generation and Editing. arXiv, 2026.
[18] Fei, Z., Fan, M., Yu, C., Li, D., Huang, J. Scaling Diffusion Transformers to 16 Billion Parameters. arXiv preprint arXiv:2407.11633, 2024.
