Layer-wise MoE Routing Locality under Shared-Prefix Code Generation: Token-Identity Decomposition and Compile-Equivalent Fork Redundancy
Pith reviewed 2026-05-10 06:42 UTC · model grok-4.3
The pith
MoE expert routing in shared-prefix code generation overlaps far above random levels even across different tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using tree-search branching from a shared prefix in Qwen3.5-35B-A3B-FP8 and gcc -S -O0 assembly alignment to control for token identity, the study shows Jaccard similarity of expert selections at 0.649 for same-token positions and 0.175 for different-token positions. Layer-wise decomposition reveals a crossing pattern: same-token similarity exceeds different-token similarity in every layer but dips in the middle layers while different-token similarity peaks there. In addition, 67 percent of successfully compiled codes concentrate in the top three assembly-equivalent groups, and 99.6 percent of within-group differences are only comments or blank lines.
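The "times random" multiples can be sanity-checked with a short sketch. It assumes each routing decision is a set of 8 expert indices out of 256 routed experts (the configuration given in the abstract) and that chance level means two independent, uniformly drawn top-8 sets. The paper's own baseline is described as a permutation of routing vectors, so this analytic value is only an approximation, and the function names are illustrative.

```python
from math import comb


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |A & B| / |A | B| of two expert-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def random_jaccard(num_experts: int, top_k: int) -> float:
    """Expected Jaccard of two independent uniform top_k subsets.

    The overlap size is hypergeometric, and two sets of size k that
    share i elements have a union of size 2k - i.
    """
    total = comb(num_experts, top_k)
    expected = 0.0
    for i in range(top_k + 1):
        p = comb(top_k, i) * comb(num_experts - top_k, top_k - i) / total
        expected += p * i / (2 * top_k - i)
    return expected


# Assumed configuration: 256 routed experts, top-8 routing.
baseline = random_jaccard(256, 8)
same_token_multiple = 0.649 / baseline
diff_token_multiple = 0.175 / baseline
```

Under this assumption the baseline comes out near 0.017, which puts the reported 0.649 and 0.175 at roughly 38x and 10x chance. That is in the same range as the 40x and 11x quoted, which presumably rest on the paper's empirical baseline.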
What carries the argument
Token-identity decomposition through gcc -S -O0 assembly alignment that isolates routing similarity due to shared token identity from other contextual factors during shared-prefix generation.
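As a rough illustration of the decomposition, the sketch below splits aligned position pairs into same-token and different-token buckets and averages Jaccard similarity per layer. The `AlignedPair` record layout is hypothetical, and the code assumes positions have already been aligned; the paper's alignment procedure itself is not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """One aligned position in two sibling sequences (hypothetical layout)."""
    token_a: int
    token_b: int
    experts_a: list[set[int]]  # experts_a[layer] = expert ids picked in sequence A
    experts_b: list[set[int]]  # same, for sequence B


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layerwise_similarity(pairs):
    """Return {"same": [...], "diff": [...]}: mean Jaccard per layer and bucket."""
    sums = {"same": defaultdict(float), "diff": defaultdict(float)}
    counts = {"same": defaultdict(int), "diff": defaultdict(int)}
    for pair in pairs:
        bucket = "same" if pair.token_a == pair.token_b else "diff"
        for layer, (ea, eb) in enumerate(zip(pair.experts_a, pair.experts_b)):
            sums[bucket][layer] += jaccard(ea, eb)
            counts[bucket][layer] += 1
    return {
        bucket: [sums[bucket][l] / counts[bucket][l] for l in sorted(counts[bucket])]
        for bucket in sums
    }


# Toy data with two layers; expert ids are arbitrary, not taken from the paper.
pairs = [
    AlignedPair(5, 5, [{1, 2}, {3, 4}], [{1, 2}, {3, 5}]),
    AlignedPair(5, 9, [{1, 2}, {3, 4}], [{6, 7}, {3, 4}]),
]
result = layerwise_similarity(pairs)
```

Plotting the two lists against layer index is what would expose a middle-layer dip in the "same" curve and a peak in the "diff" curve.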
If this is right
- Routing overlap persists even for different tokens, so parallel code sampling could share more expert computation than prefix-only KV caching allows.
- The middle-layer peak for different-token routing suggests those layers drive semantic branching decisions in code.
- High concentration of valid codes in few assembly groups means token-level diversity metrics overstate functional variety in sampling.
- Nearly all intra-group differences being non-functional implies many generated variants are machine-equivalent despite surface changes.
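The concentration figure in the third bullet can be illustrated with a small grouping sketch. It assumes candidates are C sources bucketed by a hash of their `gcc -S -O0` output. Reading the source from stdin, treating the input as C, and hashing the assembly without any normalization are choices made here for illustration, not details taken from the paper.

```python
import hashlib
import subprocess
from collections import Counter


def compile_to_asm(source: str):
    """Return `gcc -S -O0` assembly for C source, or None if compilation fails.

    Assumes gcc is on PATH and the candidate is C (hence `-x c`).
    """
    proc = subprocess.run(
        ["gcc", "-S", "-O0", "-x", "c", "-o", "-", "-"],
        input=source, capture_output=True, text=True,
    )
    return proc.stdout if proc.returncode == 0 else None


def group_sizes(asm_outputs):
    """Count candidates per assembly-equivalent group, keyed by SHA-256 digest."""
    digests = (hashlib.sha256(asm.encode()).hexdigest() for asm in asm_outputs)
    return Counter(digests)


def top_k_share(sizes, k=3):
    """Fraction of all candidates that fall into the k largest groups."""
    total = sum(sizes.values())
    return sum(n for _, n in sizes.most_common(k)) / total if total else 0.0


# Stand-in strings instead of real assembly, so the example runs without gcc.
toy_asm = ["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1
share = top_k_share(group_sizes(toy_asm))
```

In a real run, `compile_to_asm` would be applied to every candidate, failures dropped, and `top_k_share` compared against the reported 67 percent.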
Where Pith is reading between the lines
- Routing-aware pruning could discard low-diversity branches earlier in search algorithms.
- The observed locality may enable fused inference passes for similar sequences in production code tools.
- Similar layer-wise patterns could appear in non-code tasks, offering a route to general efficiency gains in MoE inference.
Load-bearing premise
The gcc -S -O0 assembly alignment fully removes token-identity confounds and the routing patterns hold beyond the single tested model and 851-sample tree-search setup.
What would settle it
Re-running the layer-wise Jaccard analysis on a different MoE model or with an alternative alignment such as semantic code equivalence would show whether the high similarities and crossing pattern disappear or remain.
Original abstract
In LLM-based code generation, multiple code candidates are often generated in parallel from the same prompt -- for example, in best-of-N sampling or multi-candidate code completion. These requests can share KV caches through a common prefix, yet the extent to which their Mixture-of-Experts (MoE) expert routing overlaps, and how this overlap varies across layers, remains insufficiently understood. We study Qwen3.5-35B-A3B-FP8 (256 routed experts, top-8) by performing tree-search-based branching generation from a shared prefix (851 completed codes, temperature 0.7) and analyzing the results with a compiler-output-based alignment (gcc -S -O0 assembly) that controls for token-identity confounds. Our findings are threefold: (1) At positions where both sequences generated the same token, Jaccard similarity reaches 0.649 (40x random), while even at positions with different tokens it remains 0.175 (11x random). (2) A layer-wise decomposition reveals a crossing pattern: same-token routing similarity exceeds different-token similarity across all layers, but dips in the middle layers (L14-20), while different-token similarity peaks in the middle layers at 14x random. (3) In tree-search code generation, 67% of successfully compiled codes concentrate in the top three assembly-equivalent groups, and 99.6% of within-group differences consist of comments and blank lines. We show that diversity in top-P search, including beam search, poses a significant challenge. These results refine the "context-independent routing" claim of prior work through layer-wise decomposition and suggest opportunities for improving search efficiency in LLM code generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically examines Mixture-of-Experts routing locality in LLM code generation under shared prefixes. Using tree-search branching from Qwen3.5-35B-A3B-FP8 (256 experts, top-8) on 851 samples at temperature 0.7, it applies gcc -S -O0 assembly alignment to compare routing vectors, reporting Jaccard similarity of 0.649 (40x random) at same-token positions and 0.175 (11x random) at different-token positions, a layer-wise crossing pattern (same-token similarity dips in L14-20 while different-token peaks at 14x random), and that 67% of compiled codes fall into the top three assembly-equivalent groups where 99.6% of differences are comments or blank lines. It concludes this refines prior context-independent routing claims and highlights redundancy challenges for top-P and beam search.
Significance. If substantiated with proper controls and statistics, the results would indicate meaningful routing overlap even across token differences in parallel code generation, offering layer-specific insights that refine existing MoE observations. This could support practical gains in KV-cache reuse and more efficient multi-candidate search for compilable code, while quantifying redundancy in diversity-oriented decoding methods.
major comments (2)
- [Abstract] Abstract (reported findings 1 and 2): The Jaccard similarities (0.649 and 0.175) and the described crossing pattern across layers L14-20 are given as concrete multiples of random baselines without error bars, confidence intervals, statistical significance tests, or any description of how the 851 samples were selected or filtered. This absence directly weakens the ability to evaluate whether the quantitative claims and layer-wise trends are reliable or generalizable.
- [Abstract] Abstract (finding 2 and the alignment description): The claim that 0.175 Jaccard similarity at different-token positions demonstrates routing locality beyond token identity depends on the gcc -S -O0 assembly alignment cleanly separating confounds. However, the paper states that 99.6% of within-group differences are comments and blank lines, implying most different-token positions reflect only superficial variations. This raises a concrete risk that the reported similarity and middle-layer peak are still largely attributable to near-identical semantic content rather than independent routing behavior.
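One way to check how superficial the within-group differences are is to classify the differing lines directly. The sketch below is a simplified illustration: it counts only blank lines and whole-line `//` comments as cosmetic, which is an assumption and is narrower than whatever rule the paper applied (block comments and trailing comments are ignored here).

```python
import difflib


def is_cosmetic(line: str) -> bool:
    """True for blank lines and whole-line // comments (a simplification)."""
    stripped = line.strip()
    return stripped == "" or stripped.startswith("//")


def cosmetic_fraction(source_a: str, source_b: str) -> float:
    """Fraction of differing lines that are blank or comment-only."""
    diff = difflib.ndiff(source_a.splitlines(), source_b.splitlines())
    # Keep only added/removed lines; "? " hint lines and unchanged lines are skipped.
    changed = [d[2:] for d in diff if d.startswith(("+ ", "- "))]
    if not changed:
        return 1.0
    return sum(is_cosmetic(line) for line in changed) / len(changed)


# Toy sources: the first pair differs only by a comment versus a blank line.
a = "int main() {\n  // add\n  return 0;\n}"
b = "int main() {\n\n  return 0;\n}"
cosmetic_only = cosmetic_fraction(a, b)
functional = cosmetic_fraction("return 0;", "return 1;")
```

A high cosmetic fraction inside a group would support the referee's concern that most different-token positions sit in near-identical programs.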
minor comments (1)
- The manuscript presents no tables or figures summarizing the layer-wise Jaccard values, clustering percentages, or alignment statistics, which reduces clarity when conveying the crossing pattern and group concentrations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work examining MoE routing locality in shared-prefix code generation. We address each major comment point-by-point below, providing clarifications from the full manuscript and committing to revisions that improve statistical transparency and interpretive precision without altering the core empirical observations.
- Referee: [Abstract] Abstract (reported findings 1 and 2): The Jaccard similarities (0.649 and 0.175) and the described crossing pattern across layers L14-20 are given as concrete multiples of random baselines without error bars, confidence intervals, statistical significance tests, or any description of how the 851 samples were selected or filtered. This absence directly weakens the ability to evaluate whether the quantitative claims and layer-wise trends are reliable or generalizable.
Authors: We agree that the abstract omits these details. The full manuscript (Section 3.1) specifies that the 851 samples comprise all successfully completed code generations obtained via tree-search branching from a shared prefix using Qwen3.5-35B-A3B-FP8 at temperature 0.7; no additional filtering was applied beyond requiring valid compilation. The random baseline is computed via position-wise permutation of routing vectors across the entire dataset. In the revision we will (i) add a one-sentence description of sample provenance to the abstract, (ii) report 95% bootstrap confidence intervals on the Jaccard values, and (iii) include a permutation-test p-value confirming the observed multiples exceed chance. The layer-wise crossing pattern remains stable under these controls (see revised Figure 3 and Appendix B). revision: yes
- Referee: [Abstract] Abstract (finding 2 and the alignment description): The claim that 0.175 Jaccard similarity at different-token positions demonstrates routing locality beyond token identity depends on the gcc -S -O0 assembly alignment cleanly separating confounds. However, the paper states that 99.6% of within-group differences are comments and blank lines, implying most different-token positions reflect only superficial variations. This raises a concrete risk that the reported similarity and middle-layer peak are still largely attributable to near-identical semantic content rather than independent routing behavior.
Authors: The gcc -S -O0 alignment is deliberately used to isolate token-identity effects while holding semantic content approximately constant: the 67% concentration in the top three assembly-equivalent groups, together with 99.6% intra-group differences being comments or blanks, confirms that the different-token positions we analyze occur inside near-semantically-identical outputs. The 0.175 Jaccard (11× random) and the middle-layer peak therefore reflect routing overlap that persists after token identity is removed but while semantic context remains shared. We do not claim routing independence from semantics; rather, we refine prior “context-independent” assertions by showing that even modest token divergence within the same semantic class still yields substantial routing locality. In revision we will (i) rephrase the abstract to state “within assembly-equivalent groups” explicitly and (ii) add a limitations paragraph acknowledging that the result does not isolate routing from semantic similarity. revision: partial
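The bootstrap confidence intervals promised in the first response could be computed along the following lines. This is a minimal percentile-bootstrap sketch over per-position Jaccard values; the resampling unit, the number of resamples, and the function name are assumptions, since the rebuttal does not specify them.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy per-position Jaccard scores, not data from the paper.
scores = [0.6, 0.7, 0.65, 0.5, 0.8, 0.62, 0.71, 0.58]
low, high = bootstrap_ci(scores)
```

Resampling individual positions treats them as independent. Positions within one generated sequence are likely correlated, so resampling whole sequences may give a more honest interval.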
Circularity Check
No circularity: purely empirical measurements of routing similarity
The paper reports direct observations from tree-search code generation on Qwen3.5-35B-A3B-FP8, followed by gcc -S -O0 assembly alignment and Jaccard similarity computations on expert routing vectors. No equations, fitted parameters, predictions, or first-principles derivations are present that could reduce the reported similarities (0.649 same-token, 0.175 different-token) or layer-wise crossing patterns back to inputs by construction. The concentration statistic (67% in top three groups) is likewise a direct count from the 851-sample dataset. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained empirical analysis; the skeptic concern about alignment controlling for token confounds is a methodological validity question, not a circularity reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Jaccard similarity is an appropriate metric for comparing sets of routed experts at each token position.