Layer-wise MoE Routing Locality under Shared-Prefix Code Generation: Token-Identity Decomposition and Compile-Equivalent Fork Redundancy

Daichi Mukunoki; Shun-ichiro Hayashi; Takahiro Katagiri; Tetsuya Hoshino

arxiv: 2604.17182 · v1 · submitted 2026-04-19 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-10 06:42 UTC · model grok-4.3

keywords mixture of experts, MoE routing, code generation, shared prefix, routing locality, layer-wise analysis, compiler alignment, tree search

The pith

MoE expert routing in shared-prefix code generation overlaps far above random levels even across different tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how a large Mixture-of-Experts model chooses experts when it generates many code completions from one starting prompt. It generates 851 candidates via tree search, aligns them using compiler assembly output to separate token effects, and measures expert overlap with Jaccard similarity. Overlap reaches 0.649 at matching-token positions and stays at 0.175 at differing-token positions, both well above chance. The overlap changes with depth: same-token routes stay more similar overall but dip in middle layers, while different-token routes peak there. Most runnable programs collapse into just three assembly-level groups.

Core claim

Using tree-search branching from a shared prefix in Qwen3.5-35B-A3B-FP8 and gcc -S -O0 assembly alignment to control for token identity, the study shows Jaccard similarity of expert selections at 0.649 for same-token positions and 0.175 for different-token positions. Layer-wise decomposition reveals a crossing pattern: same-token similarity exceeds different-token similarity in every layer but dips in the middle layers while different-token similarity peaks there. In addition, 67 percent of successfully compiled codes concentrate in the top three assembly-equivalent groups, and 99.6 percent of within-group differences are only comments or blank lines.
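To make the "times random" comparison concrete, the following is a minimal sketch of Jaccard similarity on two top-8 expert sets, together with the expected Jaccard of two independent, uniformly random 8-of-256 selections. The uniform-selection baseline and the example expert indices are assumptions for illustration; the paper's own random baseline may be computed differently.

```python
from math import comb


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two expert-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def uniform_random_jaccard(num_experts=256, top_k=8):
    """Expected Jaccard of two independent, uniformly random top_k subsets.

    The overlap size m is hypergeometric; with overlap m the union
    contains 2 * top_k - m experts.
    """
    total = comb(num_experts, top_k)
    expected = 0.0
    for m in range(top_k + 1):
        p = comb(top_k, m) * comb(num_experts - top_k, top_k - m) / total
        expected += p * m / (2 * top_k - m)
    return expected


# Hypothetical routing decisions at one token position in one layer.
experts_a = [3, 17, 42, 64, 99, 128, 200, 255]
experts_b = [3, 17, 42, 64, 99, 130, 201, 250]

similarity = jaccard(experts_a, experts_b)  # 5 shared experts, 11 in the union
baseline = uniform_random_jaccard()
print(f"jaccard={similarity:.3f} baseline={baseline:.4f} "
      f"ratio={similarity / baseline:.1f}x")
```

Dividing an observed mean Jaccard by such a baseline gives a multiple of the kind quoted above, although the exact multiple depends on how the baseline is defined.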

What carries the argument

Token-identity decomposition through gcc -S -O0 assembly alignment that isolates routing similarity due to shared token identity from other contextual factors during shared-prefix generation.
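One way to picture assembly-equivalent grouping is sketched below: compile each candidate with `gcc -S -O0`, normalize the output, and bucket candidates by a hash of the result. The function names, the normalization rule (dropping blank lines and the `.file`/`.ident` directives), and the toy assembly strings are assumptions for illustration, not the authors' pipeline.

```python
import hashlib
import subprocess
from collections import defaultdict


def compile_to_asm(c_source: str) -> str:
    """Return unoptimized assembly for a C source string (needs gcc on PATH)."""
    result = subprocess.run(
        ["gcc", "-S", "-O0", "-x", "c", "-", "-o", "-"],
        input=c_source, capture_output=True, text=True, check=True,
    )
    return result.stdout


def normalize_asm(asm: str) -> str:
    """Drop blank lines and .file/.ident directives (an assumed normalization)."""
    kept = []
    for line in asm.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith((".file", ".ident")):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def group_by_asm(asm_by_name):
    """Group candidate names by normalized-assembly hash, largest group first."""
    groups = defaultdict(list)
    for name, asm in asm_by_name.items():
        digest = hashlib.sha256(normalize_asm(asm).encode()).hexdigest()
        groups[digest].append(name)
    return sorted(groups.values(), key=len, reverse=True)


def top_n_share(groups, n=3):
    """Fraction of all candidates that fall in the n largest groups."""
    total = sum(len(g) for g in groups)
    return sum(len(g) for g in groups[:n]) / total if total else 0.0


# Toy assembly texts stand in for compiler output so the sketch runs without gcc.
toy = {
    "a.c": "\t.file\t\"a.c\"\n\tmovl\t$0, %eax\n\tret\n",
    "b.c": "\t.file\t\"b.c\"\n\n\tmovl\t$0, %eax\n\tret\n",
    "c.c": "\t.file\t\"c.c\"\n\tmovl\t$1, %eax\n\tret\n",
}
groups = group_by_asm(toy)
share = top_n_share(groups, n=1)
print(groups, f"top-1 share={share:.2f}")
```

Because the compiler discards comments and whitespace before emitting assembly, candidates that differ only in those respects land in the same group under this scheme.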

If this is right

Routing overlap persists even for different tokens, so parallel code sampling could share more expert computation than prefix-only KV caching allows.
The middle-layer peak for different-token routing suggests those layers drive semantic branching decisions in code.
High concentration of valid codes in few assembly groups means token-level diversity metrics overstate functional variety in sampling.
Nearly all intra-group differences being non-functional implies many generated variants are machine-equivalent despite surface changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Routing-aware pruning could discard low-diversity branches earlier in search algorithms.
The observed locality may enable fused inference passes for similar sequences in production code tools.
Similar layer-wise patterns could appear in non-code tasks, offering a route to general efficiency gains in MoE inference.

Load-bearing premise

The gcc -S -O0 assembly alignment fully removes token-identity confounds and the routing patterns hold beyond the single tested model and 851-sample tree-search setup.

What would settle it

Re-running the layer-wise Jaccard analysis on a different MoE model or with an alternative alignment such as semantic code equivalence would show whether the high similarities and crossing pattern disappear or remain.

Figures

Figures reproduced from arXiv: 2604.17182 by Daichi Mukunoki, Shun-ichiro Hayashi, Takahiro Katagiri, Tetsuya Hoshino.

**Figure 1.** Layer-wise Jaccard similarity crossing pattern (…

**Figure 3.** 256 × 256 expert co-activation matrix at L14 (n=851 completed codes). Each cell (i, j) shows the number of token positions where experts i and j were both selected in top-8 (log scale). Axes are sorted by L14 activation count in descending order; diagonal and zero-count cells are masked in gray.

**Figure 4.** Routing overlap decay after fork (same-branch …

**Figure 5.** Distribution of O0-equivalent groups (n=851 completed codes). G1+G2 account for 49%; the top three groups cover 54%.

B. Refinement of the OpenMoE Claim

OpenMoE [4] concluded that “routing is context-independent and dominated by token identity.” Our results refine this claim through layer-wise decomposition: token identity is dominant in the input layers (L0 diff-tok ≈ 0.09, low relative to the middle layer…

Original abstract

In LLM-based code generation, multiple code candidates are often generated in parallel from the same prompt -- for example, in best-of-N sampling or multi-candidate code completion. These requests can share KV caches through a common prefix, yet the extent to which their Mixture-of-Experts (MoE) expert routing overlaps, and how this overlap varies across layers, remains insufficiently understood. We study Qwen3.5-35B-A3B-FP8 (256 routed experts, top-8) by performing tree-search-based branching generation from a shared prefix (851 completed codes, temperature 0.7) and analyzing the results with a compiler-output-based alignment (gcc -S -O0 assembly) that controls for token-identity confounds. Our findings are threefold: (1) At positions where both sequences generated the same token, Jaccard similarity reaches 0.649 (40x random), while even at positions with different tokens it remains 0.175 (11x random). (2) A layer-wise decomposition reveals a crossing pattern: same-token routing similarity exceeds different-token similarity across all layers, but dips in the middle layers (L14-20), while different-token similarity peaks in the middle layers at 14x random. (3) In tree-search code generation, 67% of successfully compiled codes concentrate in the top three assembly-equivalent groups, and 99.6% of within-group differences consist of comments and blank lines. We show that diversity in top-P search, including beam search, poses a significant challenge. These results refine the "context-independent routing" claim of prior work through layer-wise decomposition and suggest opportunities for improving search efficiency in LLM code generation.
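The layer-wise decomposition in finding (2) can be illustrated with a small sketch that splits already-paired positions into same-token and different-token classes and averages the Jaccard similarity per layer. The data layout (`routes[layer][position]` as a list of expert ids), the assumption that positions are already aligned, and the toy values are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty expert-index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def layerwise_decomposition(tokens_a, tokens_b, routes_a, routes_b):
    """Per-layer mean Jaccard, split by whether the paired tokens match.

    routes_x[layer][pos] lists the expert ids selected at that layer/position.
    Returns one (same_token_mean, diff_token_mean) tuple per layer; an entry
    is None when no positions fall in that class.
    """
    pairs = list(zip(tokens_a, tokens_b))
    same_pos = [i for i, (x, y) in enumerate(pairs) if x == y]
    diff_pos = [i for i, (x, y) in enumerate(pairs) if x != y]

    def mean_at(layer_a, layer_b, positions):
        if not positions:
            return None
        total = sum(jaccard(layer_a[i], layer_b[i]) for i in positions)
        return total / len(positions)

    return [
        (mean_at(la, lb, same_pos), mean_at(la, lb, diff_pos))
        for la, lb in zip(routes_a, routes_b)
    ]


# Toy example: 2 layers, 3 aligned positions, top-2 routing.
tokens_a = [10, 11, 12]
tokens_b = [10, 99, 12]  # position 1 differs
routes_a = [[[0, 1], [2, 3], [4, 5]], [[0, 1], [2, 3], [4, 5]]]
routes_b = [[[0, 1], [6, 7], [4, 5]], [[0, 2], [2, 3], [4, 5]]]

per_layer = layerwise_decomposition(tokens_a, tokens_b, routes_a, routes_b)
for layer, (same, diff) in enumerate(per_layer):
    print(f"layer {layer}: same-token={same:.3f} diff-token={diff:.3f}")
```

Plotting the two per-layer means against layer index is one way a crossing pattern like the one described would become visible.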

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The different-token routing similarity looks overstated because most such positions are just comments and blanks that don't change the code.

This paper measures routing overlap in an MoE model during shared-prefix code generation and finds high similarity at same-token spots plus some at different-token spots, along with a layer crossing pattern and heavy clustering in assembly outputs. The different-token similarity may not show what they think it does.

They take tree-search generations from Qwen3.5-35B-A3B, align them using compiler assembly output to handle token differences, and compute Jaccard on routing vectors per layer. Same-token positions reach 0.649 similarity, 40 times random. Different-token positions still hit 0.175, 11 times random. The layer breakdown shows same-token similarity higher everywhere but dipping in middle layers, while different-token similarity rises in the middle to 14 times random. On top of that, 67 percent of the successful compiles land in the top three assembly groups, and nearly all differences inside those groups are comments or empty lines.

The compiler alignment is a reasonable way to try to separate token identity from routing decisions. The layer-wise numbers give a finer picture than just overall overlap. The redundancy finding is useful for anyone thinking about how to avoid duplicate work in multi-candidate generation.

The main issue is that the different-token result rests on shaky ground. Since 99.6 percent of within-group differences are comments and blanks, most positions where tokens differ are probably not changing the actual code meaning. That means the alignment might be flagging superficial changes as different tokens, so the remaining similarity could come from the codes being semantically close rather than from routing that stays local despite token changes. The crossing pattern could be an artifact of how those trivial differences align across layers. The study is also limited to one model and one setup with no error bars or details on sample selection, which makes the exact percentages hard to trust as general.

This is for people working on efficient inference for code-generating LLMs, especially MoE ones. It offers specific data on where routing overlaps and where redundancy happens in practice. A reader looking for ideas on pruning or caching in parallel generation would find the numbers worth looking at.

I would recommend sending it for peer review. The core measurements are new enough and the setup is clear, but the authors should clarify how the alignment handles non-semantic differences and add some robustness checks before it gets published.

Referee Report

2 major / 1 minor

Summary. The paper empirically examines Mixture-of-Experts routing locality in LLM code generation under shared prefixes. Using tree-search branching from Qwen3.5-35B-A3B-FP8 (256 experts, top-8) on 851 samples at temperature 0.7, it applies gcc -S -O0 assembly alignment to compare routing vectors, reporting Jaccard similarity of 0.649 (40x random) at same-token positions and 0.175 (11x random) at different-token positions, a layer-wise crossing pattern (same-token similarity dips in L14-20 while different-token peaks at 14x random), and that 67% of compiled codes fall into the top three assembly-equivalent groups where 99.6% of differences are comments or blank lines. It concludes this refines prior context-independent routing claims and highlights redundancy challenges for top-P and beam search.

Significance. If substantiated with proper controls and statistics, the results would indicate meaningful routing overlap even across token differences in parallel code generation, offering layer-specific insights that refine existing MoE observations. This could support practical gains in KV-cache reuse and more efficient multi-candidate search for compilable code, while quantifying redundancy in diversity-oriented decoding methods.

major comments (2)

[Abstract] Abstract (reported findings 1 and 2): The Jaccard similarities (0.649 and 0.175) and the described crossing pattern across layers L14-20 are given as concrete multiples of random baselines without error bars, confidence intervals, statistical significance tests, or any description of how the 851 samples were selected or filtered. This absence directly weakens the ability to evaluate whether the quantitative claims and layer-wise trends are reliable or generalizable.
[Abstract] Abstract (finding 2 and the alignment description): The claim that 0.175 Jaccard similarity at different-token positions demonstrates routing locality beyond token identity depends on the gcc -S -O0 assembly alignment cleanly separating confounds. However, the paper states that 99.6% of within-group differences are comments and blank lines, implying most different-token positions reflect only superficial variations. This raises a concrete risk that the reported similarity and middle-layer peak are still largely attributable to near-identical semantic content rather than independent routing behavior.

minor comments (1)

The manuscript presents no tables or figures summarizing the layer-wise Jaccard values, clustering percentages, or alignment statistics, which reduces clarity when conveying the crossing pattern and group concentrations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work examining MoE routing locality in shared-prefix code generation. We address each major comment point-by-point below, providing clarifications from the full manuscript and committing to revisions that improve statistical transparency and interpretive precision without altering the core empirical observations.

Referee: [Abstract] Abstract (reported findings 1 and 2): The Jaccard similarities (0.649 and 0.175) and the described crossing pattern across layers L14-20 are given as concrete multiples of random baselines without error bars, confidence intervals, statistical significance tests, or any description of how the 851 samples were selected or filtered. This absence directly weakens the ability to evaluate whether the quantitative claims and layer-wise trends are reliable or generalizable.

Authors: We agree that the abstract omits these details. The full manuscript (Section 3.1) specifies that the 851 samples comprise all successfully completed code generations obtained via tree-search branching from a shared prefix using Qwen3.5-35B-A3B-FP8 at temperature 0.7; no additional filtering was applied beyond requiring valid compilation. The random baseline is computed via position-wise permutation of routing vectors across the entire dataset. In the revision we will (i) add a one-sentence description of sample provenance to the abstract, (ii) report 95% bootstrap confidence intervals on the Jaccard values, and (iii) include a permutation-test p-value confirming the observed multiples exceed chance. The layer-wise crossing pattern remains stable under these controls (see revised Figure 3 and Appendix B). revision: yes

Referee: [Abstract] Abstract (finding 2 and the alignment description): The claim that 0.175 Jaccard similarity at different-token positions demonstrates routing locality beyond token identity depends on the gcc -S -O0 assembly alignment cleanly separating confounds. However, the paper states that 99.6% of within-group differences are comments and blank lines, implying most different-token positions reflect only superficial variations. This raises a concrete risk that the reported similarity and middle-layer peak are still largely attributable to near-identical semantic content rather than independent routing behavior.

Authors: The gcc -S -O0 alignment is deliberately used to isolate token-identity effects while holding semantic content approximately constant: the 67% concentration in the top three assembly-equivalent groups, together with 99.6% intra-group differences being comments or blanks, confirms that the different-token positions we analyze occur inside near-semantically-identical outputs. The 0.175 Jaccard (11× random) and the middle-layer peak therefore reflect routing overlap that persists after token identity is removed but while semantic context remains shared. We do not claim routing independence from semantics; rather, we refine prior “context-independent” assertions by showing that even modest token divergence within the same semantic class still yields substantial routing locality. In revision we will (i) rephrase the abstract to state “within assembly-equivalent groups” explicitly and (ii) add a limitations paragraph acknowledging that the result does not isolate routing from semantic similarity. revision: partial
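The first response mentions 95% bootstrap confidence intervals without describing the procedure. A percentile bootstrap over per-position Jaccard values is one common choice and is assumed in the sketch below; the function name, resample count, and toy values are hypothetical.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-position Jaccard values for one layer.
jaccard_values = [0.0, 0.0667, 0.1429, 0.2308, 0.3333,
                  0.1429, 0.0667, 0.2308, 0.1429, 0.0]

mean_value = sum(jaccard_values) / len(jaccard_values)
ci_low, ci_high = bootstrap_ci(jaccard_values)
print(f"mean={mean_value:.4f} 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```

Resampling individual positions treats them as independent; positions within one generated sequence are likely correlated, so resampling whole sequences may be the more conservative design.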

Circularity Check

0 steps flagged

No circularity: purely empirical measurements of routing similarity

The paper reports direct observations from tree-search code generation on Qwen3.5-35B-A3B-FP8, followed by gcc -S -O0 assembly alignment and Jaccard similarity computations on expert routing vectors. No equations, fitted parameters, predictions, or first-principles derivations are present that could reduce the reported similarities (0.649 same-token, 0.175 different-token) or layer-wise crossing patterns back to inputs by construction. The concentration statistic (67% in top three groups) is likewise a direct count from the 851-sample dataset. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained empirical analysis; the skeptic concern about alignment controlling for token confounds is a methodological validity question, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies on standard similarity metrics and compiler equivalence without introducing new free parameters, axioms beyond basic statistics, or invented entities.

axioms (1)

standard math: Jaccard similarity is an appropriate metric for comparing sets of routed experts at each token position.
Applied directly to routing decisions to quantify overlap.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

[1] Fuzhao Xue and others. Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[2] DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
[3] Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, and Hao Wang. Proceedings of the European Conference on Computer Systems (EuroSys).
[4] Daniil Vankov, Nikita Ivkin, Kyle Ulrich, Xiang Song, Ashish Khetan, and George Karypis. arXiv preprint arXiv:2602.07265.
[5] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, and others. Advances in Neural Information Processing Systems (NeurIPS).
[6] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456.
[7] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and others. arXiv preprint arXiv:2401.14361.
[8] Qwen Team. 2025.
[9] Anonymous. arXiv preprint arXiv:2511.14102.
[10] Kening Zheng and others. Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation. arXiv preprint arXiv:2604.03592.
[11] Amir Nuriyev and Gabriel Kulp. arXiv preprint arXiv:2602.04105.
[12] Jordan Juravsky and others. Proceedings of ICML.
