arxiv: 2601.21349 · v2 · submitted 2026-01-29 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

Minghao Yang , Ren Togo , Guang Li , Takahiro Ogawa , Miki Haseyama

Authors on Pith no claims yet

Pith reviewed 2026-05-16 09:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords mixture-of-expertsrouting mechanismlow-rank projectionLipschitz continuityexpert specializationneural network scalinglanguage modelsvision models

0 comments

The pith

Projecting mixture-of-experts routing to a low-rank latent space with saturated inner-product scoring yields smoother geometry and stronger expert specialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard linear routers in high-dimensional spaces suffer from mismatch, angular concentration, and scale sensitivity that weaken expert discrimination in mixture-of-experts models. L2R counters this by mapping routing decisions into a shared low-rank latent space and replacing raw inner products with Saturated Inner-Product Scoring that enforces Lipschitz bounds for stability. The same framework adds a parameter-efficient multi-anchor mechanism to keep routing expressive. Experiments on an OLMoE language model and an ImageNet vision MoE show consistent gains in routing quality and task performance. A reader cares because these changes address a practical bottleneck that appears whenever MoE scaling is attempted.

Core claim

L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Experiments on an OLMoE-based language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing geometry, expert discrimination, and overall model performance.

What carries the argument

Low-rank & Lipschitz-controlled Routing (L2R) framework that projects high-dimensional inputs into a shared low-rank latent space and applies Saturated Inner-Product Scoring (SIPS) to bound the Lipschitz constant of the routing function.

If this is right

Routing geometry becomes smoother and less sensitive to input scale because the Lipschitz constant is explicitly bounded.
Expert discrimination increases because decisions occur in a compact latent space rather than the raw high-dimensional representation.
Overall model performance rises on both language modeling and image classification without increasing parameter count.
The multi-anchor component adds expressiveness while remaining parameter-efficient.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same low-rank projection plus Lipschitz control pattern could be applied to gating or attention modules outside MoE architectures.
Lipschitz-controlled routing may reduce training variance when MoE models are scaled to larger expert counts or sequence lengths.
Testing whether the low-rank dimension can be chosen adaptively rather than fixed would reveal further efficiency gains.

Load-bearing premise

Projecting to a low-rank latent space plus saturated inner-product scoring preserves enough expressiveness to maintain or improve specialization and generalization.

What would settle it

If replacing a standard linear router with L2R produces no gain or a loss in validation perplexity on the OLMoE language model or top-1 accuracy on the ImageNet vision MoE, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2601.21349 by Guang Li, Miki Haseyama, Minghao Yang, Ren Togo, Takahiro Ogawa.

**Figure 1.** Figure 1: Comparison between the linear router (left) and the proposed L2R framework (right). (a) Routing space: Tokens are projected from high-dimensional raw representations into a lowrank routing space as q, where experts are represented by learnable anchors k. (b) Scoring mode: Compared to dot-product scoring, SIPS reshapes the score landscape into a bounded and smoother geometry, improving routing stability an… view at source ↗

**Figure 2.** Figure 2: Variance of pairwise cosine similarity in routing spaces. We compare the layer-averaged variance of pairwise token cosine similarities in the latent routing space for Linear, XMoE (Chi et al., 2022), and L2R-SIPS applied to OLMoE (Muennighoff et al., 2025). For L2R-SIPS, we use rank r=2, hence the isotropic reference corresponds to the 2D value (0.5). While Linear and X-MoE exhibit near-zero variance, L… view at source ↗

**Figure 3.** Figure 3: Score landscapes under fixed expert anchor. Heatmaps visualize routing logits as a function of query location q = (Qx, Qy) with a fixed anchor k = [2, 2]. Standard dot-product yields a linear half-space separation with unbounded magnitude effects, while SIPS reshapes the landscape into a bounded, anglesensitive geometry that is more amenable to stable routing. lines, and the score grows linearly with ∥q∥,… view at source ↗

**Figure 4.** Figure 4: OLMoE training dynamics. Curves show MMLU and HellaSwag accuracies, C4 (Raffel et al., 2020) validation cross-entropy (CE), training CE, and load-balance loss. L2R exhibits clear convergence over 10B tokens and consistently improves MMLU/HellaSwag. objective combines the task loss with auxiliary regularizers for routing stability and expert utilization: L = Ltask + λbal Lbal + λz Lz, (14) where Ltask is th… view at source ↗

**Figure 6.** Figure 6: Variance of pairwise cosine similarity in routing spaces. We compare the layer-averaged variance of pairwise token cosine similarities in the routing space for Linear, X-MoE (Chi et al., 2022), and L2R-SIPS with different rank settings [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 5.** Figure 5: Training dynamics (train cross-entropy loss) for ablations. Both lower rank and more heads yield faster convergence over 10B tokens. Ablation on Routing Rank r [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Representation-space visualization at a middle layer (Layer 8). PCA projections of token representations in the raw backbone space (x) and the low-rank routing space (q). Points are colored by the top-1 routed expert. Routing-Space Geometry [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Expert-usage patterns across layers. From top to bottom: top-1 routing frequency, top-k routing frequency, and importance-based routing weights. Each row is a Transformer layer, and each column is an expert. Expert Usage and Cooperation [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Effect of query-magnitude saturation strength (β) under a fixed anchor. We fix the expert anchor at k = [−2, 0] and visualize routing logits over query locations q = (Qx, Qy) in a 2D routing plane (rank r = 2). Dot-product scoring produces an unbounded, nearly half-space-like field dominated by radial growth (a), whereas cosine scoring removes magnitude effects entirely (b). SIPS interpolates between these… view at source ↗

**Figure 10.** Figure 10: Visualizations of OLMoE training dynamics. The panels report MMLU accuracy, HellaSwag accuracy, C4 validation cross-entropy (CE), training CE, and load-balance loss across router variants. From top to bottom, we compare alternative methods, scoring modes, and ablations over head and rank settings. The x-axis denotes the number of trained tokens. I.2. Routing Geometry Visualizations in OLMoE 20 0 20 Dim 1 … view at source ↗

**Figure 11.** Figure 11: Linear-router baselines. PCA scatter plots in backbone representation space (x-space) for the same layers as [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Routing geometry under L2R. PCA scatter plots for three representative MoE FFN layers in OLMoE. Points are token representations colored by the top-1 expert. The first row shows backbone representations (x-space), and the second row shows routing latent queries (q-space). Compared to x-space, the q-space geometry typically exhibits clearer expert-aligned separation and reduced selection ambiguity. 19 [PI… view at source ↗

**Figure 13.** Figure 13: Routing geometry under the X-MoE router. PCA projections of token representations at three FFN routing sites (blocks 0/8/15; left to right). Top: backbone activations in the raw representation space x. Bottom: router features q used for gating at the corresponding blocks. Each point denotes a token from the same evaluation batch (aggregated across distributed ranks), and colors indicate the selected top-1… view at source ↗

**Figure 14.** Figure 14: Expert usage heatmaps for L2R-Cosine, L2R-Dot and L2R-SIPS. From left to right: Top-1 routing frequency, Top-k routing frequency, and importance-based routing weights. Each row corresponds to a transformer layer and each column corresponds to an expert. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Expert usage heatmaps for Linear routing and X-MoE. From left to right: Top-1 routing frequency, Top-k routing frequency, and importance-based routing weights. Each row corresponds to a transformer layer and each column corresponds to an expert. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 5… view at source ↗

**Figure 16.** Figure 16: Expert usage heatmaps under different head settings in L2R-SIPS (Rank=2). From left to right: Top-1 routing frequency, Top-k routing frequency, and importance-based routing weights. Each row corresponds to a transformer layer, and each column corresponds to an expert. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49… view at source ↗

**Figure 17.** Figure 17: Expert usage heatmaps under different rank settings in L2R-SIPS (Head=4). From left to right: Top-1 routing frequency, Top-k routing frequency, and importance-based routing weights. Each row corresponds to a transformer layer, and each column corresponds to an expert. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗

read the original abstract

Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on an OLMoE-based language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing geometry, expert discrimination, and overall model performance. Code will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

L2R's low-rank latent space plus saturated inner-product scoring offers a clean way to stabilize MoE routers, but the abstract gives no evidence that the changes preserve enough routing capacity.

read the letter

The paper's main contribution is a routing framework that projects inputs into a shared low-rank latent space, applies saturated inner-product scoring to bound the Lipschitz constant, and adds a parameter-efficient multi-anchor mechanism for extra expressiveness. This combination directly targets angular concentration and scale sensitivity in standard high-dimensional linear routers, which is a practical pain point when scaling MoE models. The multi-anchor part in particular looks like a lightweight way to avoid losing too much specialization while keeping parameter count down. That framing is new enough relative to the usual router baselines cited in the abstract. The description of the problems is clear and the proposed fixes follow logically from them. Experiments are claimed on an OLMoE language setup and an ImageNet vision MoE, with reported gains in routing geometry, expert discrimination, and end performance. Those are the right benchmarks for the claim. The soft spot is the missing verification. No numbers, no ablation tables, and no controls appear in the provided text, so it is impossible to check whether the low-rank projection actually keeps the mutual information between hidden states and expert choices or whether it collapses directions that matter for fine specialization. The stress-test concern about hidden capacity loss therefore stands, because nothing shown rules it out. The paper is aimed at people already working on MoE scaling who need routing tweaks. A reader could borrow the low-rank plus SIPS idea as a starting point, but the work is too preliminary to stand alone. It deserves peer review only if the full version supplies the ablations and quantitative breakdowns that isolate each component; otherwise the claims rest on unshown results.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Low-rank and Lipschitz-Controlled Routing (L2R) for Mixture-of-Experts models. It identifies problems with standard linear routers in high-dimensional spaces (representation mismatch, angular concentration, scale sensitivity) and introduces a shared low-rank latent routing space, Saturated Inner-Product Scoring (SIPS) to enforce Lipschitz control, and a parameter-efficient multi-anchor mechanism. Experiments on an OLMoE-based language model and an ImageNet vision MoE setting are reported to show consistent gains in routing geometry, expert discrimination, and end-task performance.

Significance. If the central claims hold, L2R offers a practical, parameter-efficient way to stabilize MoE routing geometry without retraining the entire model. The dual evaluation on language and vision tasks plus the commitment to release code are positive for reproducibility and generality.

major comments (2)

[§3.2] §3.2 (Low-rank projection and SIPS definition): the claim that the low-rank bottleneck plus saturated inner-product scoring preserves sufficient expressiveness for expert specialization is not accompanied by an information-theoretic bound or mutual-information measurement between inputs and assignments; without this, the reported discrimination gains could be driven by regularization rather than the proposed geometry change.
[Table 2] Table 2 and §4.3 (ablation rows): the performance deltas are shown only for the full L2R versus baseline; no row isolates the low-rank projection alone versus SIPS alone, so it is impossible to verify that the low-rank step does not incur a hidden capacity cost in regimes where fine-grained expert selection is required.

minor comments (2)

[§2.1] §2.1: the notation for the latent dimension k is introduced without an explicit statement of how it is chosen relative to the original hidden dimension d.
[Figure 3] Figure 3: axis labels on the routing-geometry plots are too small for print; consider increasing font size or adding a supplementary high-resolution version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Low-rank projection and SIPS definition): the claim that the low-rank bottleneck plus saturated inner-product scoring preserves sufficient expressiveness for expert specialization is not accompanied by an information-theoretic bound or mutual-information measurement between inputs and assignments; without this, the reported discrimination gains could be driven by regularization rather than the proposed geometry change.

Authors: We appreciate the referee's point regarding the need for stronger theoretical grounding. The low-rank projection is learned end-to-end to retain task-relevant routing information, and SIPS is constructed to be strictly monotonic in the inner-product scores, thereby preserving relative expert preferences without introducing additional regularization beyond the explicit Lipschitz bound. While the current version relies on empirical evidence from routing metrics and downstream performance, we acknowledge that an explicit information-theoretic analysis would be valuable. In the revised manuscript we will add mutual-information measurements between the original input representations and the expert assignment distributions to quantify information retention under the low-rank + SIPS transformation. revision: partial
Referee: [Table 2] Table 2 and §4.3 (ablation rows): the performance deltas are shown only for the full L2R versus baseline; no row isolates the low-rank projection alone versus SIPS alone, so it is impossible to verify that the low-rank step does not incur a hidden capacity cost in regimes where fine-grained expert selection is required.

Authors: We agree that the current ablation table does not fully disentangle the contributions of the low-rank projection and SIPS. In the revised version we will expand Table 2 (and the corresponding discussion in §4.3) with two additional rows: one applying only the low-rank projection with standard inner-product scoring, and one applying SIPS on the original high-dimensional space. These results will allow direct assessment of any capacity trade-offs introduced by the low-rank bottleneck and will confirm that the full L2R combination yields synergistic gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes L2R as a new routing method using low-rank projection and Saturated Inner-Product Scoring, then reports empirical gains on OLMoE language models and ImageNet vision MoE. No equations or derivations are shown that reduce the claimed improvements in routing geometry or performance to quantities fitted or defined inside the same paper. The central claims rest on experimental validation rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work by the same authors. This is a standard method-proposal structure with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The low-rank dimension, saturation threshold, and number of anchors are likely chosen or fitted but cannot be audited from the given text.

pith-pipeline@v0.9.0 · 5505 in / 1034 out tokens · 22129 ms · 2026-05-16T09:53:56.842045+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SIPS ... explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

low-rank latent routing space ... r≪d

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 10 internal anchors

[1]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions.arXiv preprint arXiv:1905.10044,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y ., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts lan- guage models.arXiv preprint arXiv:2401.06066,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Toy Models of Superposition

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. Toy models of superposition.arXiv preprint arXiv:2209.10652,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Mixtral of Experts

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Lepikhin, D., Lee, H., Xu, Y ., Chen, D., Firat, O., Huang, Y ., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[7]

Router upcycling: Lever- aging mixture-of-routers in mixture-of-experts upcycling

Ran, J., Zhao, G., Wu, Y ., Zhu, D., Wu, L., Zhao, Y ., Yang, T., Sun, L., Zhang, X., and Li, S. Router upcycling: Lever- aging mixture-of-routers in mixture-of-experts upcycling. arXiv preprint arXiv:2509.00679,

work page arXiv
[8]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Crowdsourcing Multiple Choice Science Questions

Welbl, J., Liu, N. F., and Gardner, M. Crowdsourc- ing multiple choice science questions.arXiv preprint arXiv:1707.06209,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Yuan 2.0-m32: Mixture of experts with attention router.arXiv preprint arXiv:2405.17976, 2024a

Wu, S., Luo, J., Chen, X., Li, L., Zhao, X., Yu, T., Wang, C., Wang, Y ., Wang, F., Qiao, W., et al. Yuan 2.0-m32: Mixture of experts with attention router.arXiv preprint arXiv:2405.17976, 2024a. Wu, X., Huang, S., Wang, W., Ma, S., Dong, L., and Wei, F. Multi-head mixture-of-experts. InProc. NeurIPS, pp. 94073–94096, 2024b. Yang, A., Yang, B., Hui, B., e...

work page arXiv
[11]

Adaptive shared experts with LoRA-based mix- ture of experts for multi-task learning.arXiv preprint arXiv:2510.00570,

Yang, M., Togo, R., Li, G., Ogawa, T., and Haseyama, M. Adaptive shared experts with LoRA-based mix- ture of experts for multi-task learning.arXiv preprint arXiv:2510.00570,

work page arXiv
[12]

HellaSwag: Can a Machine Really Finish Your Sentence?

Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[13]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y ., Dean, J., Shazeer, N., and Fedus, W. St-moe: Designing stable and transferable sparse expert models.arXiv preprint arXiv:2202.08906,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

origin sharpness

10 L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts A. Additional Geometry of Query-Magnitude Saturation Q_x 4 2 0 2 4 Q_y 4 2 0 2 4 Score 10 5 0 5 10 (a)Linear (Dot Product) Q_x 4 2 0 2 4 Q_y 4 2 0 2 4 Score 0.5 0.0 0.5 (b)Cosine Scoring Q_x 4 2 0 2 4 Q_y 4 2 0 2 4 Score 1.0 0.5 0.0 0.5 1.0 (c)SIPS (β= 0.0) Q_x 4 2 0 2 4 Q_y 4 2 0 2 ...

work page 2048
[15]

to stabilize gating logits in large-scale language model training. 13 L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts Load-Balancing Loss.Following prior MoE work, we employ a load-balancing loss to prevent expert under-utilization and collapse. Let st,i denote the routing probability assigned to expert i for token t, and let I[i∈ T ...

work page 2025
[16]

Table 8.Key hyperparameters for OLMoE training

settings and keep the backbone architecture, optimization recipe, and MoE auxiliary losses fixed across all router variants, modifying only router-specific components (e.g., projection rank, scoring, or head design) when applicable. Table 8.Key hyperparameters for OLMoE training. Item Value Item Value Item Value dmodel 2,048 Max sequence length 4,096 Expe...

work page 2048
[17]

However, Megablocks currently does not support Blackwell sm 100 image configuration, which limits its compatibility on B200

for MoE kernel acceleration. However, Megablocks currently does not support Blackwell sm 100 image configuration, which limits its compatibility on B200. To ensure correct execution and efficient expert dispatch, we replace theMegablocks MoE module with Tutel (Hwang et al., 2023), which offers architecture-agnostic kernels and stable performance under mod...

work page 2023
[18]

Specifically, we evaluate a fixed set of commonsense and knowledge benchmarks at checkpoints along training and report the average score (Overall) across all included tasks

for reporting model quality. Specifically, we evaluate a fixed set of commonsense and knowledge benchmarks at checkpoints along training and report the average score (Overall) across all included tasks. All evaluations are performed using the same prompt format and normalization conventions as in the OLMoE/OLMES setup. H.1. Tasks and Metrics Table 11 summ...

work page 2021
[19]

Including such tasks at 10B tokens can introduce noise that obscures routing-induced differences

and ARC-Challenge (Clark et al., 2018)) tend to exhibit high variance and unstable ranking across checkpoints, because the model has not yet formed sufficiently robust linguistic and commonsense representations. Including such tasks at 10B tokens can introduce noise that obscures routing-induced differences. Therefore, we focus on a compact set of tasks (Table

work page 2018