Cosine-Similarity Routing with Semantic Anchors for Interpretable Mixture-of-Experts Language Models
Pith reviewed 2026-05-18 17:44 UTC · model grok-4.3
The pith
Cosine-similarity routing with learnable semantic anchors matches linear routing performance in MoE models while providing direct interpretability of routing decisions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that routing tokens by cosine similarity to learnable semantic anchors delivers language-modeling performance on par with standard linear routing (e.g., 12.57 vs 12.45 perplexity for the top-1 to top-4 configuration, $K=1 \to 4$), while making routing decisions directly inspectable and showing advantages in routing stability and subtoken coherence in deeper layers.
What carries the argument
The Semantic Resonance Architecture (SRA), which computes routing scores as cosine similarities between token representations and a set of learnable semantic anchors, one per expert.
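A minimal sketch of the routing rule just described, in plain Python. The function names, the toy 2-D anchors, and the epsilon value are illustrative assumptions, not the authors' implementation; the only thing taken from the paper is the idea of scoring each expert by cosine similarity between the token representation and that expert's anchor, then keeping the top k.

```python
import math

def cosine_scores(h, anchors, eps=1e-8):
    """Cosine similarity between one token vector h and each anchor (one per expert)."""
    h_norm = math.sqrt(sum(x * x for x in h))
    scores = []
    for a in anchors:
        a_norm = math.sqrt(sum(x * x for x in a))
        dot = sum(x * y for x, y in zip(h, a))
        # eps in the denominator guards against zero-length vectors (assumed value)
        scores.append(dot / (h_norm * a_norm + eps))
    return scores

def route_top_k(h, anchors, k=2):
    """Return (expert_index, score) pairs for the k most similar anchors."""
    scores = cosine_scores(h, anchors)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy example: three hypothetical experts with hand-picked anchors.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
token = [0.9, 0.1]
chosen = route_top_k(token, anchors, k=2)
print(chosen)
```

Because every score lies in [-1, 1] and is tied to one named anchor, the returned pairs are themselves the explanation of the routing decision, which is the sense of "traceable" used in this review.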
If this is right
- Cosine routing offers built-in interpretability because each routing decision can be directly attributed to anchor-token similarity scores.
- The bandpass routing loss reduces dead experts from 30-45% to 0-6% and works for both routing types.
- Cosine routing shows significantly better word-level subtoken coherence in deeper layers compared to linear routing.
- Using k=5 at inference provides a free perplexity improvement of 0.08-0.16 over k=4.
- Cross-dataset validation on OpenWebText confirms similar performance and preserved specialization patterns.
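The dead-expert and bandpass claims in the list above can be made concrete with a small sketch. The exact form of the bandpass routing loss is not given on this page; the quadratic penalty below is an assumption that only mirrors the "floor-and-ceiling corridor on expert utilization" description, and the thresholds and assignments are made-up toy values.

```python
def expert_utilization(assignments, num_experts):
    """Fraction of routed tokens that each expert received."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    total = len(assignments)
    return [c / total for c in counts]

def dead_fraction(util):
    """Share of experts that received no tokens at all."""
    return sum(1 for u in util if u == 0.0) / len(util)

def bandpass_penalty(util, floor, ceiling):
    """Assumed corridor penalty: zero inside [floor, ceiling], quadratic outside."""
    penalty = 0.0
    for u in util:
        if u < floor:
            penalty += (floor - u) ** 2
        elif u > ceiling:
            penalty += (u - ceiling) ** 2
    return penalty

# Toy routing trace: experts 2 and 3 are never selected.
assignments = [0, 0, 0, 0, 0, 0, 1, 1]
util = expert_utilization(assignments, num_experts=4)
print(util, dead_fraction(util), bandpass_penalty(util, floor=0.1, ceiling=0.5))
```

This version works on hard assignment counts, so it is useful as a diagnostic only; a trainable loss would presumably need to be computed on soft router probabilities so that gradients can flow.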
Where Pith is reading between the lines
- Because the training recipe appears to drive specialization quality more than the choice of routing function, efforts to improve expert diversity might yield larger gains by targeting optimization schedules rather than router architecture.
- The bounded output range of cosine similarity could produce more predictable saturation behavior when scaling to models with thousands of experts.
- Direct inspection of the learned anchors might reveal whether individual experts capture syntactic categories or semantic clusters, offering a new lens for analyzing what sparse activation actually learns.
Load-bearing premise
That learnable semantic anchors remain stable and semantically meaningful across layers and datasets so that cosine similarity scores reliably reflect the intended routing logic rather than collapsing to arbitrary directions during training.
What would settle it
If a controlled multi-seed run on a larger model or different dataset showed cosine routing producing higher perplexity or more dead experts than linear routing even after applying the bandpass loss, the claim of competitive performance and structural advantages would be refuted.
Original abstract
Mixture-of-Experts (MoE) models improve efficiency through sparse activation, but their learned gating functions provide limited insight into routing decisions. This work introduces the Semantic Resonance Architecture (SRA), which routes tokens to experts via cosine similarity between token representations and learnable semantic anchors, making every routing decision directly traceable to anchor-token similarity scores. We evaluate SRA on WikiText-103 across 17 configurations. In a controlled multi-seed comparison (3 seeds x 4 configurations, 256 experts, $D_{ff}=256$), cosine routing achieves competitive perplexity with standard linear routing ($12.57 \pm 0.03$ vs $12.45 \pm 0.03$ for $K=1 \to 4$; $12.52 \pm 0.02$ vs $12.57 \pm 0.02$ for $K=2 \to 4$). The training recipe -- not the routing function -- drives specialization quality, while cosine routing provides inherent inspectability. We introduce a bandpass routing loss -- a floor-and-ceiling corridor on expert utilization -- that reduces dead experts from 30-45% to 0-6% and transfers to both routing types. Routing-space evaluation shows cosine routing provides significantly better word-level subtoken coherence in deeper layers ($p < 0.001$), with 44-54% of expert specialization being syntactic rather than semantic. Extended analysis reveals cosine routing maintains more stable router saturation and tighter per-expert vocabulary distributions -- structural advantages from the bounded cosine similarity range. An inference-time $k$-sweep shows that $k=5$ yields a free 0.08-0.16 perplexity gain over $k=4$. Cross-dataset validation on OpenWebText confirms generalization: cosine routing achieves comparable perplexity (44.88 vs 45.44), the bandpass loss eliminates dead experts, and specialization patterns are preserved.
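To see what "competitive" and "comparable" amount to numerically, the following sketch computes the absolute and relative perplexity gaps from the figures quoted in the abstract. The dictionary keys and the helper name are illustrative, and it assumes the OpenWebText pair is listed in the same order as the others (cosine first, linear second), which the abstract does not state explicitly.

```python
# Perplexities as reported in the abstract: (cosine routing, linear routing).
reported = {
    "WikiText-103, K=1->4": (12.57, 12.45),
    "WikiText-103, K=2->4": (12.52, 12.57),
    "OpenWebText": (44.88, 45.44),  # ordering assumed: cosine, then linear
}

def gaps(cosine_ppl, linear_ppl):
    """Absolute and relative (%) perplexity gap; positive means cosine is worse."""
    diff = cosine_ppl - linear_ppl
    return round(diff, 2), round(100.0 * diff / linear_ppl, 2)

summary = {name: gaps(c, l) for name, (c, l) in reported.items()}
for name, (abs_gap, rel_gap) in summary.items():
    print(f"{name}: {abs_gap:+.2f} perplexity ({rel_gap:+.2f}%)")
```

Under these numbers the gap is within roughly one percent in either direction, with the sign depending on the configuration.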
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Semantic Resonance Architecture (SRA) for Mixture-of-Experts language models, replacing the standard linear router with cosine-similarity routing to a set of learnable semantic anchors. On WikiText-103 with 256 experts, it reports perplexity competitive with linear routing for the $K=1 \to 4$ and $K=2 \to 4$ configurations in a 3-seed controlled comparison, introduces a bandpass routing loss that reduces dead experts from 30-45% to 0-6%, and claims that the training recipe, not the routing function, primarily drives specialization. Cosine routing is presented as inherently inspectable because each decision is directly traceable to anchor-token similarity scores, with additional reported advantages in subtoken coherence ($p < 0.001$ in deeper layers), router saturation stability, and tighter per-expert vocabulary distributions. Cross-dataset results on OpenWebText are included to support generalization.
Significance. If the learnable anchors remain stable and semantically coherent, the approach could offer a more directly inspectable routing mechanism for MoE models while preserving performance parity. The multi-seed design with reported standard deviations and a statistical test for coherence provides a reasonable empirical basis for the performance and stability claims; the bandpass loss is shown to transfer across routing types. These elements constitute a clear strength in experimental control and reproducibility of the quantitative results.
major comments (2)
- [§3.2] (Semantic Anchors and Routing): The central interpretability claim, that routing decisions are 'directly traceable to anchor-token similarity scores' and therefore inherently more inspectable, requires that the learned anchors encode stable, meaningful semantics rather than collapsing to arbitrary directions. No analysis (e.g., top-k nearest tokens per anchor, cosine similarity heatmaps across layers, or correlation with linguistic categories) is provided to verify this; the reported subtoken coherence and vocabulary tightness measure routing outcomes but do not test anchor semantics directly. This is load-bearing for the contrast to linear routing.
- [§4.3] (Routing-Space Evaluation): The claims of 'significantly better word-level subtoken coherence' ($p < 0.001$) and '44-54% of expert specialization being syntactic' are presented as evidence of structural advantages, yet these metrics are computed on routing assignments rather than on the semantic content of the anchors themselves. Without a direct probe of whether anchor directions align with human-interpretable concepts, the inspectability benefit remains mechanical rather than semantic.
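The kind of direct probe the referee asks for (top-k nearest tokens per anchor) could look roughly like the sketch below. It is not taken from the paper: the token vectors, anchor names, and helper functions are all hypothetical toy values chosen so the example runs on its own, and a real analysis would use the model's learned anchors and hidden states.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v + eps)

def nearest_tokens(anchor, token_vectors, k=3):
    """Rank tokens by similarity to one anchor and keep the top k."""
    scored = [(tok, cosine(anchor, vec)) for tok, vec in token_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 3-D token representations, for illustration only.
token_vectors = {
    "1995": [0.9, 0.1, 0.0],
    "December": [0.8, 0.3, 0.1],
    "film": [0.0, 1.0, 0.2],
    "the": [0.1, 0.1, 1.0],
}
# Hypothetical anchors for two experts.
anchors = {"expert_0": [1.0, 0.0, 0.0], "expert_1": [0.0, 1.0, 0.0]}

report = {name: nearest_tokens(a, token_vectors, k=2) for name, a in anchors.items()}
for name, hits in report.items():
    print(name, [(tok, round(score, 3)) for tok, score in hits])
```

If the top tokens for an anchor form a recognizable group, that is evidence for semantic rather than merely mechanical interpretability; if they look arbitrary, the referee's concern stands.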
minor comments (2)
- [§4.1] The experimental setup omits specifics on semantic-anchor initialization, learning-rate schedules, and the precise weighting of the bandpass loss term; these details are needed for full reproducibility even if the central performance trends hold.
- [Figure 4] Figure captions and axis labels in the routing-saturation and vocabulary-distribution plots could be expanded to explicitly note the number of experts and layers shown, improving immediate readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed report. The comments highlight important ways to strengthen the interpretability claims. We address each major comment below and agree to incorporate additional direct analyses of the semantic anchors in the revised manuscript.
- Referee: [§3.2] (Semantic Anchors and Routing): The central interpretability claim, that routing decisions are 'directly traceable to anchor-token similarity scores' and therefore inherently more inspectable, requires that the learned anchors encode stable, meaningful semantics rather than collapsing to arbitrary directions. No analysis (e.g., top-k nearest tokens per anchor, cosine similarity heatmaps across layers, or correlation with linguistic categories) is provided to verify this; the reported subtoken coherence and vocabulary tightness measure routing outcomes but do not test anchor semantics directly. This is load-bearing for the contrast to linear routing.
  Authors: We agree that direct verification of anchor semantics would strengthen the interpretability argument. The manuscript's core claim is that cosine routing supplies mechanical traceability (every decision is an explicit similarity score to a specific anchor), unlike opaque linear weights. To address the referee's point, we will add an analysis of the top-k nearest tokens per anchor (and selected layer-wise similarity patterns) in the revision. This will show whether anchors align with coherent semantic directions rather than arbitrary vectors, providing the missing direct evidence. Revision: yes.
- Referee: [§4.3] (Routing-Space Evaluation): The claims of 'significantly better word-level subtoken coherence' ($p < 0.001$) and '44-54% of expert specialization being syntactic' are presented as evidence of structural advantages, yet these metrics are computed on routing assignments rather than on the semantic content of the anchors themselves. Without a direct probe of whether anchor directions align with human-interpretable concepts, the inspectability benefit remains mechanical rather than semantic.
  Authors: The subtoken coherence and syntactic-specialization statistics are intentionally computed on routing outcomes because they quantify the downstream structural benefits (stability, tighter vocabulary distributions) that arise from the bounded cosine space. We view these as complementary to, rather than substitutes for, anchor semantics. We accept that a direct probe is needed to move from mechanical to semantic interpretability and will include one (e.g., nearest-token inspection and correlation with linguistic categories) in the revised manuscript. Revision: yes.
Circularity Check
No significant circularity in empirical results or claims
The paper's core results consist of direct empirical comparisons of perplexity, expert utilization, subtoken coherence, and vocabulary distributions between cosine-similarity routing and standard linear routing, obtained via controlled multi-seed training on WikiText-103 and validated on OpenWebText. These quantities are measured on held-out data and are not defined in terms of other fitted parameters that are then repurposed as predictions. The stated inspectability follows immediately from the architectural definition of routing as cosine similarity to learnable anchors rather than from any theorem or derivation that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes imported from prior work are used to support the central claims. The bandpass loss is introduced and evaluated empirically as a regularization technique. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of semantic anchors
- bandpass loss floor and ceiling thresholds
axioms (1)
- domain assumption: Cosine similarity is a valid measure of semantic relatedness between token embeddings and anchors.
invented entities (1)
- semantic anchors: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $r_i = \cos(h, a_i) = \frac{h \cdot a_i}{\|h\|_2 \, \|a_i\|_2 + \epsilon}$; $L_{\text{dispersion}} = \frac{1}{N(N-1)} \sum_{i \neq j} \cos(a_i, a_j)$
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: orthogonal initialization ... Dispersion Loss ... semantic anchors
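The dispersion term quoted in the first entry above can be sketched in a few lines of Python. This assumes the sum runs over all ordered pairs of distinct anchors, which is what the $1/(N(N-1))$ normalization suggests but the quoted passage does not spell out; the function names and toy anchors are illustrative.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon in the denominator."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v + eps)

def dispersion_loss(anchors):
    """Mean pairwise cosine similarity over ordered pairs (i, j) with i != j."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += cosine(anchors[i], anchors[j])
    return total / (n * (n - 1))

orthogonal = [[1.0, 0.0], [0.0, 1.0]]            # well-spread anchors
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # all anchors identical

print(dispersion_loss(orthogonal), dispersion_loss(collapsed))
```

Minimizing this quantity pushes anchors apart, which is consistent with the orthogonal initialization mentioned in the second entry: orthogonal anchors start at a loss of zero, while fully collapsed anchors sit near the maximum of one.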
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Tom Brown et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [2] Jiaao Chi et al. Xmoe: Scaling mixture-of-experts with adaptive routing for multitask learning. arXiv preprint arXiv:2305.14704, 2023.
- [3] Aakanksha Chowdhery et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [4] Kevin Clark et al. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP, 2019.
- [5] Nelson Elhage et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- [6] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research (JMLR), 2022.
- [7] Suchin Gururangan et al. Demix layers: Disentangling domains for modular language modeling. In Proceedings of NAACL, 2022.
- [8] Kaiming He et al. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
- [9] Robert A Jacobs et al. Adaptive mixtures of local experts. Neural Computation, 1991.
- [10] Dmitry Lepikhin et al. GShard: Scaling giant models with conditional computation and automatic sharding. In Proceedings of ICLR, 2020.
- [11] Stephen Merity et al. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- [12] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019.
- [13] Noam Shazeer et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of ICLR, 2017.
- [14] Jianlin Su et al. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
- [15] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, 2019.
- [16] Ashish Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [17] Barret Zoph et al. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.
Appendix A.1 Interpretability Case Study: Detailed Routing. To illustrate the interpretability of SRA, Table 6 provides a detailed look at the routing decisions for the sentence: "The film was released in December 1995 and received po...