
arxiv: 2509.14255 · v2 · submitted 2025-09-12 · 💻 cs.CL · cs.AI

Cosine-Similarity Routing with Semantic Anchors for Interpretable Mixture-of-Experts Language Models

Pith reviewed 2026-05-18 17:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: mixture of experts, routing, interpretability, language models, cosine similarity, semantic anchors, expert utilization

The pith

Cosine-similarity routing with learnable semantic anchors matches linear routing performance in MoE models while providing direct interpretability of routing decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a routing method for Mixture-of-Experts language models that uses cosine similarity between token embeddings and learnable semantic anchors instead of traditional linear projections. This approach aims to make routing decisions traceable to specific similarity scores, addressing the black-box nature of standard gating. Experiments on WikiText-103 show that this cosine routing achieves similar perplexity scores to linear routing across multiple configurations and seeds. Additionally, a new bandpass routing loss is proposed to improve expert utilization by reducing the number of dead experts.

Core claim

Routing in MoE models via cosine similarity to semantic anchors delivers language modeling performance on par with standard linear routing (e.g., 12.57 vs 12.45 perplexity for the $K=1 \to 4$ configuration), while making routing decisions inspectable by construction and showing advantages in routing stability and subtoken coherence in deeper layers.

What carries the argument

The Semantic Resonance Architecture (SRA), which computes routing scores as cosine similarities between token representations and a set of learnable semantic anchors, one per expert.
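
As a rough illustration of the mechanism described above, the sketch below scores one token against one anchor per expert and keeps the top-k experts. It is a minimal sketch, not the authors' implementation: the helper names `cosine` and `route_token`, the 2-D toy vectors, and the plain top-k selection without any gating weights are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def route_token(token, anchors, k=2):
    """Return the k (expert_index, score) pairs with the highest cosine score."""
    scores = [cosine(token, anchor) for anchor in anchors]
    ranked = sorted(range(len(anchors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy example: three experts with hypothetical 2-D anchors.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
token = [3.0, 4.0]
selected = route_token(token, anchors, k=2)
```

Each returned pair carries the similarity score that caused the selection, which is the sense in which a routing decision can be read off directly.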

If this is right

  • Cosine routing offers built-in interpretability because each routing decision can be directly attributed to anchor-token similarity scores.
  • The bandpass routing loss reduces dead experts from 30-45% to 0-6% and works for both routing types.
  • Cosine routing shows significantly better word-level subtoken coherence in deeper layers compared to linear routing.
  • Using k=5 at inference, with no retraining, lowers perplexity by 0.08-0.16 relative to k=4.
  • Cross-dataset validation on OpenWebText confirms similar performance and preserved specialization patterns.
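
The bandpass routing loss is described only as a floor-and-ceiling corridor on expert utilization. One way such a corridor could be written is sketched below, assuming a quadratic penalty on per-expert utilization shares that fall outside the corridor. The function name `bandpass_penalty`, the quadratic form, and the thresholds 0.10 and 0.40 are hypothetical; the paper's exact formulation and weighting are not given on this page.

```python
def bandpass_penalty(utilization, floor, ceiling):
    """Mean quadratic penalty for utilization outside [floor, ceiling]; zero inside."""
    penalty = 0.0
    for u in utilization:
        if u < floor:
            penalty += (floor - u) ** 2      # under-used (possibly dead) expert
        elif u > ceiling:
            penalty += (u - ceiling) ** 2    # over-used expert
    return penalty / len(utilization)

# Four experts; each value is that expert's share of routed tokens (toy numbers).
collapsed = [0.70, 0.30, 0.00, 0.00]
balanced = [0.25, 0.25, 0.25, 0.25]
loss_collapsed = bandpass_penalty(collapsed, floor=0.10, ceiling=0.40)
loss_balanced = bandpass_penalty(balanced, floor=0.10, ceiling=0.40)
```

In a trainable version, the utilization values would have to come from differentiable routing probabilities rather than hard counts; that detail is omitted here.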

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because the training recipe appears to drive specialization quality more than the choice of routing function, efforts to improve expert diversity might yield larger gains by targeting optimization schedules rather than router architecture.
  • The bounded output range of cosine similarity could produce more predictable saturation behavior when scaling to models with thousands of experts.
  • Direct inspection of the learned anchors might reveal whether individual experts capture syntactic categories or semantic clusters, offering a new lens for analyzing what sparse activation actually learns.
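
The saturation point in the second bullet can be made concrete under one assumption that this page does not confirm: that routing probabilities are a softmax over cosine scores $s_i \in [-1, 1]$ divided by a temperature $\tau > 0$, with $N$ experts. The largest probability any single expert can then receive is capped, because the most extreme case is one score at $1$ and all others at $-1$:

$$p_{\max} = \max_i \frac{e^{s_i/\tau}}{\sum_{j=1}^{N} e^{s_j/\tau}} \le \frac{e^{1/\tau}}{e^{1/\tau} + (N-1)\,e^{-1/\tau}} = \frac{1}{1 + (N-1)\,e^{-2/\tau}}$$

A linear router has no comparable cap that holds independently of its weights, since its logits are unbounded.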

Load-bearing premise

That learnable semantic anchors remain stable and semantically meaningful across layers and datasets so that cosine similarity scores reliably reflect the intended routing logic rather than collapsing to arbitrary directions during training.

What would settle it

If a controlled multi-seed run on a larger model or different dataset showed cosine routing producing higher perplexity or more dead experts than linear routing even after applying the bandpass loss, the claim of competitive performance and structural advantages would be refuted.
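
The dead-expert side of this test needs an operational definition. The sketch below assumes an expert counts as dead when it receives no tokens at all on the evaluation set; the paper's actual threshold is not stated on this page, and `dead_expert_fraction` is a hypothetical helper name.

```python
def dead_expert_fraction(assignments, num_experts):
    """Fraction of experts that never appear in any token's routed expert list."""
    used = set()
    for experts in assignments:
        used.update(experts)
    return (num_experts - len(used)) / num_experts

# Toy top-2 assignments for four tokens over five experts; experts 3 and 4 are never chosen.
assignments = [[0, 1], [1, 2], [0, 2], [1, 0]]
frac = dead_expert_fraction(assignments, num_experts=5)
```

Running the same measurement on both routers, with identical data and seeds, would give the dead-expert comparison the paragraph above calls for.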

Figures

Figures reproduced from arXiv: 2509.14255 by Ivan Ternovtsii, Yurii Bilak.

Figure 1: Validation perplexity curves during training. The SRA model (b) shows a distinct pattern
Figure 2: Expert Utilization on the validation dataset (final SRA model).
Figure 3: Expert Utilization during the Top-1 phase (Epoch 5).
Figure 4: t-SNE projections of semantic anchors
Figure 5: Expert Utilization on the validation dataset (baseline MoE).

Original abstract

Mixture-of-Experts (MoE) models improve efficiency through sparse activation, but their learned gating functions provide limited insight into routing decisions. This work introduces the Semantic Resonance Architecture (SRA), which routes tokens to experts via cosine similarity between token representations and learnable semantic anchors, making every routing decision directly traceable to anchor-token similarity scores. We evaluate SRA on WikiText-103 across 17 configurations. In a controlled multi-seed comparison (3 seeds x 4 configurations, 256 experts, $D_{ff}=256$), cosine routing achieves competitive perplexity with standard linear routing ($12.57 \pm 0.03$ vs $12.45 \pm 0.03$ for $K=1 \to 4$; $12.52 \pm 0.02$ vs $12.57 \pm 0.02$ for $K=2 \to 4$). The training recipe -- not the routing function -- drives specialization quality, while cosine routing provides inherent inspectability. We introduce a bandpass routing loss -- a floor-and-ceiling corridor on expert utilization -- that reduces dead experts from 30-45% to 0-6% and transfers to both routing types. Routing-space evaluation shows cosine routing provides significantly better word-level subtoken coherence in deeper layers ($p < 0.001$), with 44-54% of expert specialization being syntactic rather than semantic. Extended analysis reveals cosine routing maintains more stable router saturation and tighter per-expert vocabulary distributions -- structural advantages from the bounded cosine similarity range. An inference-time $k$-sweep shows that $k=5$ yields a free 0.08-0.16 perplexity gain over $k=4$. Cross-dataset validation on OpenWebText confirms generalization: cosine routing achieves comparable perplexity (44.88 vs 45.44), the bandpass loss eliminates dead experts, and specialization patterns are preserved.
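
For readers less familiar with how figures such as $12.57 \pm 0.03$ are typically produced, the sketch below computes perplexity from per-token negative log-likelihoods and summarizes it across seeds. It assumes natural-log likelihoods and the sample standard deviation; the abstract does not say which deviation the authors report. All inputs are toy values, not the paper's measurements.

```python
import math
import statistics

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def summarize(per_seed_ppl):
    """Mean and sample standard deviation of perplexity across seeds."""
    return statistics.mean(per_seed_ppl), statistics.stdev(per_seed_ppl)

# A model that always splits probability evenly between two tokens has perplexity 2.
ppl = perplexity([math.log(2.0)] * 4)

# Three hypothetical seeds.
mean_ppl, std_ppl = summarize([10.0, 11.0, 12.0])
```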

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Semantic Resonance Architecture (SRA) for Mixture-of-Experts language models, replacing the standard linear router with cosine-similarity routing to a set of learnable semantic anchors. On WikiText-103 with 256 experts, it reports perplexity competitive with linear routing in the $K=1 \to 4$ and $K=2 \to 4$ configurations of a 3-seed controlled comparison, introduces a bandpass routing loss that reduces dead experts from 30-45% to 0-6%, and claims that the training recipe, not the routing function, primarily drives specialization. Cosine routing is presented as providing inherent inspectability via direct traceability to anchor-token similarity scores, with additional reported advantages in subtoken coherence (p<0.001 in deeper layers), router saturation stability, and tighter per-expert vocabulary distributions. Cross-dataset results on OpenWebText are included to support generalization.

Significance. If the learnable anchors remain stable and semantically coherent, the approach could offer a more directly inspectable routing mechanism for MoE models while preserving performance parity. The multi-seed design with reported standard deviations and a statistical test for coherence provides a reasonable empirical basis for the performance and stability claims; the bandpass loss is shown to transfer across routing types. These elements constitute a clear strength in experimental control and reproducibility of the quantitative results.

major comments (2)
  1. [§3.2] §3.2 (Semantic Anchors and Routing): The central interpretability claim—that routing decisions are 'directly traceable to anchor-token similarity scores' and therefore inherently more inspectable—requires that the learned anchors encode stable, meaningful semantics rather than collapsing to arbitrary directions. No analysis (e.g., top-k nearest tokens per anchor, cosine similarity heatmaps across layers, or correlation with linguistic categories) is provided to verify this; the reported subtoken coherence and vocabulary tightness measure routing outcomes but do not test anchor semantics directly. This is load-bearing for the contrast to linear routing.
  2. [§4.3] §4.3 (Routing-Space Evaluation): The claim of 'significantly better word-level subtoken coherence' (p<0.001) and '44-54% of expert specialization being syntactic' is presented as evidence of structural advantages, yet these metrics are computed on routing assignments rather than on the semantic content of the anchors themselves. Without a direct probe of whether anchor directions align with human-interpretable concepts, the inspectability benefit remains mechanical rather than semantic.
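
The nearest-token probe suggested in the first major comment could look roughly like the sketch below: rank vocabulary items by cosine similarity to a given anchor and inspect the top few. This is an illustrative sketch only. The names `cosine` and `nearest_tokens`, the toy 2-D vectors, and the choice to compare anchors against static token embeddings (rather than layer-specific hidden states) are assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_tokens(anchor, vocab_embeddings, top_k=3):
    """Return the top_k tokens whose vectors are most cosine-similar to the anchor."""
    scored = [(cosine(anchor, vec), tok) for tok, vec in vocab_embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tok for _, tok in scored[:top_k]]

# Hypothetical vocabulary with hand-picked 2-D vectors.
vocab = {
    "film": [0.9, 0.1],
    "movie": [0.8, 0.2],
    "1995": [0.1, 0.9],
    "December": [0.2, 0.8],
}
anchor = [1.0, 0.0]
top = nearest_tokens(anchor, vocab, top_k=2)
```

If the tokens returned for each anchor formed recognizable groups, that would be direct evidence about anchor semantics; if they looked arbitrary, the inspectability would remain mechanical only.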
minor comments (2)
  1. [§4.1] The experimental setup omits specifics on semantic-anchor initialization, learning-rate schedules, and the precise weighting of the bandpass loss term; these details are needed for full reproducibility even if the central performance trends hold.
  2. [Figure 4] Figure captions and axis labels in the routing-saturation and vocabulary-distribution plots could be expanded to explicitly note the number of experts and layers shown, improving immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight important ways to strengthen the interpretability claims. We address each major comment below and agree to incorporate additional direct analyses of the semantic anchors in the revised manuscript.

  1. Referee: [§3.2] §3.2 (Semantic Anchors and Routing): The central interpretability claim—that routing decisions are 'directly traceable to anchor-token similarity scores' and therefore inherently more inspectable—requires that the learned anchors encode stable, meaningful semantics rather than collapsing to arbitrary directions. No analysis (e.g., top-k nearest tokens per anchor, cosine similarity heatmaps across layers, or correlation with linguistic categories) is provided to verify this; the reported subtoken coherence and vocabulary tightness measure routing outcomes but do not test anchor semantics directly. This is load-bearing for the contrast to linear routing.

    Authors: We agree that direct verification of anchor semantics would strengthen the interpretability argument. The manuscript's core claim is that cosine routing supplies mechanical traceability—every decision is an explicit similarity score to a specific anchor—unlike opaque linear weights. To address the referee's point, we will add an analysis of the top-k nearest tokens per anchor (and selected layer-wise similarity patterns) in the revision. This will show whether anchors align with coherent semantic directions rather than arbitrary vectors, providing the missing direct evidence. revision: yes

  2. Referee: [§4.3] §4.3 (Routing-Space Evaluation): The claim of 'significantly better word-level subtoken coherence' (p<0.001) and '44-54% of expert specialization being syntactic' is presented as evidence of structural advantages, yet these metrics are computed on routing assignments rather than on the semantic content of the anchors themselves. Without a direct probe of whether anchor directions align with human-interpretable concepts, the inspectability benefit remains mechanical rather than semantic.

    Authors: The subtoken coherence and syntactic-specialization statistics are intentionally computed on routing outcomes because they quantify the downstream structural benefits (stability, tighter vocabulary distributions) that arise from the bounded cosine space. We view these as complementary to, rather than substitutes for, anchor semantics. We accept that a direct probe is needed to move from mechanical to semantic interpretability and will include one—e.g., nearest-token inspection and correlation with linguistic categories—in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical results or claims


The paper's core results consist of direct empirical comparisons of perplexity, expert utilization, subtoken coherence, and vocabulary distributions between cosine-similarity routing and standard linear routing, obtained via controlled multi-seed training on WikiText-103 and validated on OpenWebText. These quantities are measured on held-out data and are not defined in terms of other fitted parameters that are then repurposed as predictions. The stated inspectability follows immediately from the architectural definition of routing as cosine similarity to learnable anchors rather than from any theorem or derivation that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes imported from prior work are used to support the central claims. The bandpass loss is introduced and evaluated empirically as a regularization technique. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claims rest on standard assumptions of gradient-based optimization and the existence of meaningful directions in embedding space; no new physical or mathematical axioms are introduced.

free parameters (2)
  • number of semantic anchors
    Set equal to the number of experts (one anchor per expert; 256 in the controlled comparison); controls routing granularity.
  • bandpass loss floor and ceiling thresholds
    Hand-tuned corridor on expert utilization to reduce dead experts.
axioms (1)
  • domain assumption: Cosine similarity is a valid measure of semantic relatedness between token embeddings and anchors.
    Invoked when defining the routing decision.
invented entities (1)
  • semantic anchors (no independent evidence)
    purpose: Provide learnable reference points for cosine-based routing decisions.
    New learnable vectors introduced to replace opaque gating; no independent evidence outside the training objective is given.




Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 3 internal anchors

  [1] Tom Brown et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020
  [2] Jiaao Chi et al. Xmoe: Scaling mixture-of-experts with adaptive routing for multitask learning. arXiv preprint arXiv:2305.14704, 2023
  [3] Aakanksha Chowdhery et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
  [4] Kevin Clark et al. What does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP, 2019
  [5] Nelson Elhage et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021
  [6] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research (JMLR), 2022
  [7] Suchin Gururangan et al. Demix layers: Disentangling domains for modular language modeling. In Proceedings of NAACL, 2022
  [8] Kaiming He et al. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (ICCV), 2015
  [9] Robert A Jacobs et al. Adaptive mixtures of local experts. Neural Computation, 1991
  [10] Dmitry Lepikhin et al. Gshard: Scaling giant models with conditional computation and automatic sharding. In Proceedings of ICLR, 2020
  [11] Stephen Merity et al. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016
  [12] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019
  [13] Noam Shazeer et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of ICLR, 2017
  [14] Jianlin Su et al. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024
  [15] Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. In Proceedings of ACL, 2019
  [16] Ashish Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 2017
  [17] Barret Zoph et al. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022

Appendix A.1 Interpretability Case Study: Detailed Routing

To illustrate the interpretability of SRA, Table 6 provides a detailed look at the routing decisions for the sentence: "The film was released in December 1995 and received po...