Cosine-Similarity Routing with Semantic Anchors for Interpretable Mixture-of-Experts Language Models

Ivan Ternovtsii; Yurii Bilak

arXiv: 2509.14255 · v2 · submitted 2025-09-12 · cs.CL · cs.AI

Pith reviewed 2026-05-18 17:44 UTC · model grok-4.3

keywords: mixture of experts · routing · interpretability · language models · cosine similarity · semantic anchors · expert utilization

The pith

Cosine-similarity routing with learnable semantic anchors matches linear routing performance in MoE models while providing direct interpretability of routing decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a routing method for Mixture-of-Experts language models that uses cosine similarity between token embeddings and learnable semantic anchors instead of traditional linear projections. This approach aims to make routing decisions traceable to specific similarity scores, addressing the black-box nature of standard gating. Experiments on WikiText-103 show that this cosine routing achieves similar perplexity scores to linear routing across multiple configurations and seeds. Additionally, a new bandpass routing loss is proposed to improve expert utilization by reducing the number of dead experts.

Core claim

The central discovery is that routing in MoE models via cosine similarity to semantic anchors delivers language modeling performance on par with standard linear routing ($12.57 \pm 0.03$ vs $12.45 \pm 0.03$ perplexity for the $K=1 \to 4$ configuration, and $12.52 \pm 0.02$ vs $12.57 \pm 0.02$ for $K=2 \to 4$), while inherently supporting inspectability and showing advantages in routing stability and subtoken coherence in deeper layers.

What carries the argument

The Semantic Resonance Architecture (SRA), which computes routing scores as cosine similarities between token representations and a set of learnable semantic anchors, one per expert.
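
As a rough illustration of the mechanism described above, the following minimal sketch scores one token against a handful of anchors and picks the top-k experts. It is not the authors' implementation: the function names `cosine` and `route`, the toy 2-D vectors, and the epsilon value are assumptions made for this example.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon added to the denominator."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def route(token, anchors, k=1):
    """Return (expert_index, score) pairs for the k most similar anchors."""
    scores = [cosine(token, a) for a in anchors]
    ranked = sorted(range(len(anchors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy example: three hypothetical anchors (one per expert) in a 2-D space.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
token = [0.9, 0.1]
top = route(token, anchors, k=2)
for expert, score in top:
    print(f"expert {expert}: cosine score {score:.3f}")
```

Each routing decision here is just a ranked list of similarity scores bounded in [-1, 1], which is the property the inspectability claim relies on.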

If this is right

Cosine routing offers built-in interpretability because each routing decision can be directly attributed to anchor-token similarity scores.
The bandpass routing loss reduces dead experts from 30-45% to 0-6% and works for both routing types.
Cosine routing shows significantly better word-level subtoken coherence in deeper layers compared to linear routing.
Using k=5 at inference provides a free perplexity improvement of 0.08-0.16 over k=4.
Cross-dataset validation on OpenWebText confirms similar performance and preserved specialization patterns.
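
The bandpass loss is described only as a floor-and-ceiling corridor on expert utilization, so the exact formula is not available here. A minimal sketch of one way such a corridor could be written, assuming a quadratic penalty outside the band and hypothetical thresholds `floor` and `ceiling`, might look like this:

```python
def utilization(assignments, num_experts):
    """Fraction of routed tokens that each expert receives."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    total = len(assignments)
    return [c / total for c in counts]

def bandpass_penalty(util, floor, ceiling):
    """Mean quadratic penalty for utilization outside [floor, ceiling].

    Assumption: the quadratic form and the thresholds are illustrative,
    not taken from the paper.
    """
    loss = 0.0
    for u in util:
        loss += max(0.0, floor - u) ** 2 + max(0.0, u - ceiling) ** 2
    return loss / len(util)

def dead_fraction(util):
    """Share of experts that received no tokens at all."""
    return sum(1 for u in util if u == 0.0) / len(util)

# Toy batch of 8 top-1 assignments over 4 experts; experts 2 and 3 are dead.
assignments = [0, 0, 0, 0, 0, 0, 1, 1]
util = utilization(assignments, num_experts=4)
penalty = bandpass_penalty(util, floor=0.1, ceiling=0.5)
dead = dead_fraction(util)
print(util, penalty, dead)
```

Hard assignment counts are not differentiable, so a trainable version would presumably compute utilization from soft routing probabilities instead; the sketch only shows the shape of the corridor and how the dead-expert fraction is counted.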

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Because the training recipe appears to drive specialization quality more than the choice of routing function, efforts to improve expert diversity might yield larger gains by targeting optimization schedules rather than router architecture.
The bounded output range of cosine similarity could produce more predictable saturation behavior when scaling to models with thousands of experts.
Direct inspection of the learned anchors might reveal whether individual experts capture syntactic categories or semantic clusters, offering a new lens for analyzing what sparse activation actually learns.

Load-bearing premise

That learnable semantic anchors remain stable and semantically meaningful across layers and datasets so that cosine similarity scores reliably reflect the intended routing logic rather than collapsing to arbitrary directions during training.

What would settle it

If a controlled multi-seed run on a larger model or different dataset showed cosine routing producing higher perplexity or more dead experts than linear routing even after applying the bandpass loss, the claim of competitive performance and structural advantages would be refuted.

Figures

Figures reproduced from arXiv: 2509.14255 by Ivan Ternovtsii, Yurii Bilak.

**Figure 2.** Expert Utilization on the validation dataset (final SRA model).

**Figure 3.** Expert Utilization during the Top-1 phase (Epoch 5).

**Figure 4.** t-SNE projections of semantic anchors.

**Figure 5.** Expert Utilization on the validation dataset (baseline MoE).

Original abstract

Mixture-of-Experts (MoE) models improve efficiency through sparse activation, but their learned gating functions provide limited insight into routing decisions. This work introduces the Semantic Resonance Architecture (SRA), which routes tokens to experts via cosine similarity between token representations and learnable semantic anchors, making every routing decision directly traceable to anchor-token similarity scores. We evaluate SRA on WikiText-103 across 17 configurations. In a controlled multi-seed comparison (3 seeds x 4 configurations, 256 experts, $D_{ff}=256$), cosine routing achieves competitive perplexity with standard linear routing ($12.57 \pm 0.03$ vs $12.45 \pm 0.03$ for $K=1 \to 4$; $12.52 \pm 0.02$ vs $12.57 \pm 0.02$ for $K=2 \to 4$). The training recipe -- not the routing function -- drives specialization quality, while cosine routing provides inherent inspectability. We introduce a bandpass routing loss -- a floor-and-ceiling corridor on expert utilization -- that reduces dead experts from 30-45% to 0-6% and transfers to both routing types. Routing-space evaluation shows cosine routing provides significantly better word-level subtoken coherence in deeper layers ($p < 0.001$), with 44-54% of expert specialization being syntactic rather than semantic. Extended analysis reveals cosine routing maintains more stable router saturation and tighter per-expert vocabulary distributions -- structural advantages from the bounded cosine similarity range. An inference-time $k$-sweep shows that $k=5$ yields a free 0.08-0.16 perplexity gain over $k=4$. Cross-dataset validation on OpenWebText confirms generalization: cosine routing achieves comparable perplexity (44.88 vs 45.44), the bandpass loss eliminates dead experts, and specialization patterns are preserved.
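
The inference-time $k$-sweep mentioned in the abstract amounts to changing how many experts each token activates without retraining. The sketch below shows what that change looks like for a single token, assuming gate weights are a softmax renormalized over the selected experts; that gating rule, the function name `topk_gates`, and the score values are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def topk_gates(scores, k):
    """Softmax over the k highest scores; unselected experts get no weight."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in ranked)  # subtract the max for stability
    exps = {i: math.exp(scores[i] - peak) for i in ranked}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# Hypothetical cosine scores of one token against six anchors.
scores = [0.82, 0.10, 0.55, -0.30, 0.47, 0.51]

# Sweep k at inference time: same scores, different number of active experts.
gates_by_k = {k: topk_gates(scores, k) for k in (4, 5)}
for k, gates in gates_by_k.items():
    print(k, sorted(gates))
```

Moving from k=4 to k=5 activates one more expert per token, so the reported gain costs no retraining but does add inference compute.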

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper swaps linear MoE routing for cosine similarity to learned anchors plus a bandpass utilization loss, delivering competitive perplexity and fewer dead experts, but the semantic grounding of those anchors is not directly tested.

The main takeaway is that this work replaces the usual linear gate in Mixture-of-Experts with cosine similarity to a set of learned semantic anchors and adds a bandpass loss that keeps expert usage inside a floor-and-ceiling corridor. On WikiText-103 it reaches perplexity close to the linear baseline across multi-seed runs, and the bandpass loss cuts dead experts from 30-45% down to 0-6% while also working on the standard router. Cross-dataset checks on OpenWebText show the pattern holds, and the authors report tighter per-expert vocabulary distributions plus more stable router saturation, which they attribute to the bounded range of cosine scores. The training recipe, not the router itself, appears to drive most specialization quality, which is a useful clarification. The subtoken coherence result in deeper layers reaches p<0.001, giving that metric some weight.

What is actually new is the specific pairing of cosine routing to explicit anchors with the bandpass corridor; prior MoE gating papers do not combine these elements in the same way. The controlled 3-seed by 4-configuration setup with reported standard deviations is a clear strength and avoids the usual single-run noise.

The soft spot is the interpretability story. The claim rests on routing decisions being traceable to anchor-token cosine scores, yet the paper supplies no anchor visualizations, no stability checks across layers, and no test that the anchors encode genuine semantics rather than directions that simply minimize the loss. If the anchors collapse or drift into arbitrary vectors, the traceability stays mechanical and the contrast to opaque linear routing weakens. The 44-54% syntactic specialization figure is interesting but does not resolve this gap.

This paper is for researchers who build or tune sparse language models and want practical routing changes plus better expert utilization. Readers focused on incremental MoE improvements with reproducible multi-seed numbers will extract usable pieces, especially the bandpass loss. The empirical comparisons are direct and on held-out data, so the work is grounded enough to deserve a serious referee. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Semantic Resonance Architecture (SRA) for Mixture-of-Experts language models, replacing the standard linear router with cosine-similarity routing to a set of learnable semantic anchors. On WikiText-103 with 256 experts, it reports competitive perplexity to linear routing across K=1-to-4 and K=2-to-4 configurations in a 3-seed controlled comparison, introduces a bandpass routing loss that reduces dead experts from 30-45% to 0-6%, and claims that the training recipe—not the routing function—primarily drives specialization. Cosine routing is presented as providing inherent inspectability via direct traceability to anchor-token similarity scores, with additional reported advantages in subtoken coherence (p<0.001 in deeper layers), router saturation stability, and tighter per-expert vocabulary distributions. Cross-dataset results on OpenWebText are included to support generalization.

Significance. If the learnable anchors remain stable and semantically coherent, the approach could offer a more directly inspectable routing mechanism for MoE models while preserving performance parity. The multi-seed design with reported standard deviations and a statistical test for coherence provides a reasonable empirical basis for the performance and stability claims; the bandpass loss is shown to transfer across routing types. These elements constitute a clear strength in experimental control and reproducibility of the quantitative results.

major comments (2)

[§3.2] (Semantic Anchors and Routing): The central interpretability claim, that routing decisions are 'directly traceable to anchor-token similarity scores' and therefore inherently more inspectable, requires that the learned anchors encode stable, meaningful semantics rather than collapsing to arbitrary directions. No analysis (e.g., top-k nearest tokens per anchor, cosine similarity heatmaps across layers, or correlation with linguistic categories) is provided to verify this; the reported subtoken coherence and vocabulary tightness measure routing outcomes but do not test anchor semantics directly. This is load-bearing for the contrast to linear routing.
[§4.3] (Routing-Space Evaluation): The claim of 'significantly better word-level subtoken coherence' (p<0.001) and '44-54% of expert specialization being syntactic' is presented as evidence of structural advantages, yet these metrics are computed on routing assignments rather than on the semantic content of the anchors themselves. Without a direct probe of whether anchor directions align with human-interpretable concepts, the inspectability benefit remains mechanical rather than semantic.
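
The first probe the referee asks for, top-k nearest tokens per anchor, is simple to express. The sketch below is one possible form of it, assuming anchors can be compared directly against a table of token vectors; the vocabulary, the 2-D vectors, and the helper name `nearest_tokens` are all hypothetical, and a real probe would likely use the layer-specific hidden states that the router actually sees.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon added to the denominator."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def nearest_tokens(anchor, embeddings, top_n=3):
    """Rank vocabulary items by cosine similarity to a single anchor."""
    scored = [(tok, cosine(vec, anchor)) for tok, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical 2-D token vectors and one hypothetical anchor.
embeddings = {
    "1995": [0.9, 0.1],
    "December": [0.8, 0.3],
    "film": [0.1, 0.9],
    "released": [-0.2, 0.8],
}
anchor = [1.0, 0.0]
probe = nearest_tokens(anchor, embeddings, top_n=2)
for tok, score in probe:
    print(f"{tok}: {score:.3f}")
```

If the nearest tokens for most anchors form recognizable groups (dates, verbs, punctuation), that would support the semantic reading; if they look arbitrary, the traceability is only mechanical, which is the referee's concern.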

minor comments (2)

[§4.1] The experimental setup omits specifics on semantic-anchor initialization, learning-rate schedules, and the precise weighting of the bandpass loss term; these details are needed for full reproducibility even if the central performance trends hold.
[Figure 4] Figure captions and axis labels in the routing-saturation and vocabulary-distribution plots could be expanded to explicitly note the number of experts and layers shown, improving immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight important ways to strengthen the interpretability claims. We address each major comment below and agree to incorporate additional direct analyses of the semantic anchors in the revised manuscript.

Referee: [§3.2] (Semantic Anchors and Routing): The central interpretability claim, that routing decisions are 'directly traceable to anchor-token similarity scores' and therefore inherently more inspectable, requires that the learned anchors encode stable, meaningful semantics rather than collapsing to arbitrary directions. No analysis (e.g., top-k nearest tokens per anchor, cosine similarity heatmaps across layers, or correlation with linguistic categories) is provided to verify this; the reported subtoken coherence and vocabulary tightness measure routing outcomes but do not test anchor semantics directly. This is load-bearing for the contrast to linear routing.

Authors: We agree that direct verification of anchor semantics would strengthen the interpretability argument. The manuscript's core claim is that cosine routing supplies mechanical traceability (every decision is an explicit similarity score to a specific anchor), unlike opaque linear weights. To address the referee's point, we will add an analysis of the top-k nearest tokens per anchor (and selected layer-wise similarity patterns) in the revision. This will show whether anchors align with coherent semantic directions rather than arbitrary vectors, providing the missing direct evidence. Revision: yes.

Referee: [§4.3] (Routing-Space Evaluation): The claim of 'significantly better word-level subtoken coherence' (p<0.001) and '44-54% of expert specialization being syntactic' is presented as evidence of structural advantages, yet these metrics are computed on routing assignments rather than on the semantic content of the anchors themselves. Without a direct probe of whether anchor directions align with human-interpretable concepts, the inspectability benefit remains mechanical rather than semantic.

Authors: The subtoken coherence and syntactic-specialization statistics are intentionally computed on routing outcomes because they quantify the downstream structural benefits (stability, tighter vocabulary distributions) that arise from the bounded cosine space. We view these as complementary to, rather than substitutes for, anchor semantics. We accept that a direct probe is needed to move from mechanical to semantic interpretability and will include one (e.g., nearest-token inspection and correlation with linguistic categories) in the revised manuscript. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity in empirical results or claims

The paper's core results consist of direct empirical comparisons of perplexity, expert utilization, subtoken coherence, and vocabulary distributions between cosine-similarity routing and standard linear routing, obtained via controlled multi-seed training on WikiText-103 and validated on OpenWebText. These quantities are measured on held-out data and are not defined in terms of other fitted parameters that are then repurposed as predictions. The stated inspectability follows immediately from the architectural definition of routing as cosine similarity to learnable anchors rather than from any theorem or derivation that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes imported from prior work are used to support the central claims. The bandpass loss is introduced and evaluated empirically as a regularization technique. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claims rest on standard assumptions of gradient-based optimization and the existence of meaningful directions in embedding space; no new physical or mathematical axioms are introduced.

free parameters (2)

number of semantic anchors: Chosen to match the number of experts (one anchor per expert; 256 experts in the controlled comparison); controls routing granularity.
bandpass loss floor and ceiling thresholds: Hand-tuned corridor on expert utilization to reduce dead experts.

axioms (1)

domain assumption: Cosine similarity is a valid measure of semantic relatedness between token embeddings and anchors. Invoked when defining the routing decision.

invented entities (1)

semantic anchors (no independent evidence)
purpose: Provide fixed reference points for cosine-based routing decisions.
note: New learnable vectors introduced to replace opaque gating; no independent evidence outside the training objective is given.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: $r_i = \cos(h, a_i) = \frac{h \cdot a_i}{\|h\|_2 \, \|a_i\|_2 + \epsilon}$; $L_{\text{dispersion}} = \frac{1}{N(N-1)} \sum_{i \neq j} \cos(a_i, a_j)$

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: orthogonal initialization ... Dispersion Loss ... semantic anchors
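
The dispersion term quoted in the first passage above averages the pairwise cosine similarity between anchors, so it is small when anchors point in different directions and approaches 1 when they collapse onto one direction. A minimal sketch follows, assuming the sum runs over all ordered pairs with i ≠ j (the quoted formula does not show its index range) and using hypothetical 3-D anchors:

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity with epsilon in the denominator, as in the quoted r_i."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def dispersion_loss(anchors):
    """Mean cosine similarity over all ordered anchor pairs (i != j)."""
    n = len(anchors)
    total = sum(
        cosine(anchors[i], anchors[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))

# Orthogonal anchors (as after an orthogonal initialization) vs. collapsed ones.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
loss_orthogonal = dispersion_loss(orthogonal)
loss_collapsed = dispersion_loss(collapsed)
print(loss_orthogonal, loss_collapsed)
```

Minimizing a term of this shape pushes anchors apart, which is one way to guard against the anchor collapse that the referee report treats as the main risk to interpretability.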

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 3 internal anchors

[1] Tom Brown et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020.
[2] Jiaao Chi et al. Xmoe: Scaling mixture-of-experts with adaptive routing for multitask learning. arXiv preprint arXiv:2305.14704, 2023.
[3] Aakanksha Chowdhery et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[4] Kevin Clark et al. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP, 2019.
[5] Nelson Elhage et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
[6] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research (JMLR), 2022.
[7] Suchin Gururangan et al. Demix layers: Disentangling domains for modular language modeling. In Proceedings of NAACL, 2022.
[8] Kaiming He et al. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[9] Robert A Jacobs et al. Adaptive mixtures of local experts. Neural Computation, 1991.
[10] Dmitry Lepikhin et al. GShard: Scaling giant models with conditional computation and automatic sharding. In Proceedings of ICLR, 2020.
[11] Stephen Merity et al. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
[12] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019.
[13] Noam Shazeer et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of ICLR, 2017.
[14] Jianlin Su et al. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
[15] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, 2019.
[16] Ashish Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 2017.
[17] Barret Zoph et al. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

Appendix A.1 Interpretability Case Study: Detailed Routing. To illustrate the interpretability of SRA, Table 6 provides a detailed look at the routing decisions for the sentence: "The film was released in December 1995 and received po...
