Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
Recognition: 2 theorem links
Pith reviewed 2026-05-14 19:14 UTC · model grok-4.3
The pith
Mixture-of-experts models beat dense baselines when active parameters are matched but fall short when total stored capacity is equalized, in sub-25M-parameter pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In this sub-25M-parameter regime, the MoE model improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity. Across three seeds the matched-active gap favors MoE by 0.0758 while the matched-total gap favors dense by 0.0180, with the active advantage widening and the total advantage shrinking over the course of training.
What carries the argument
Active-parameter versus total-parameter matching, which counts only the weights used in a forward pass for the active budget and all stored expert weights for the total budget.
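To make the two budgets concrete, here is a minimal Python sketch of the counts for one Mixtral-style routed FFN block with SwiGLU experts. The dimensions, the three-matrices-per-expert parameterization, and the choice to count the router toward the active budget are illustrative assumptions, not values taken from the paper.

```python
def moe_ffn_params(d_model, d_ff, n_experts=4, top_k=2, count_router=True):
    """Active vs. total parameter counts for one Mixtral-style routed FFN block.

    Assumes SwiGLU experts (gate, up, down projections), i.e. roughly
    3 * d_model * d_ff weights per expert. Whether the router belongs in the
    active budget is a modelling choice (see the referee's first minor comment).
    """
    expert = 3 * d_model * d_ff                        # weights in one expert
    router = d_model * n_experts if count_router else 0
    active = top_k * expert + router                   # weights touched per token
    total = n_experts * expert + router                # weights actually stored
    return active, total

# Hypothetical tiny-scale widths, not the paper's configuration.
active, total = moe_ffn_params(d_model=256, d_ff=1024)
print(f"active={active:,}  total={total:,}  ratio={total / active:.2f}")
```

With these toy numbers the stored budget is roughly twice the active one (four experts kept, two consulted per token), which is the asymmetry the two dense baselines are resized to absorb.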
If this is right
- The MoE advantage over active-matched dense models grows steadily across training steps.
- The dense advantage over MoE when total parameters are matched narrows sharply but remains positive at the end of training.
- The reported gaps hold with the given error bars across three independent seeds.
- MoE therefore provides a computational benefit only when capacity is measured by active weights rather than total stored weights.
Where Pith is reading between the lines
- If the same ordering appears at larger scales, it would imply that MoE gains are mainly computational savings rather than raw parameter efficiency.
- Varying the number of experts or the routing function while keeping total capacity fixed could test whether the observed total-match deficit is tied to this particular four-expert configuration.
- The narrowing total-match gap during training suggests that extended schedules or larger data volumes might eventually close or reverse the ordering.
Load-bearing premise
That the specific choice of four experts with top-2 routing and Switch-style balancing captures the essential difference between sparse and dense models rather than depending on untested details of the tiny-scale setup.
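For readers who want that machinery spelled out, the sketch below implements the three named ingredients in NumPy: top-2 selection, the Switch-style load-balancing auxiliary loss (expert count times the dot product of dispatch fractions and mean router probabilities), and the router z-loss (mean squared log-partition of the raw logits). The loss coefficients, batch shapes, and exact reductions used by the paper are assumptions.

```python
import numpy as np

def topk_router_aux_losses(router_logits, top_k=2):
    """Switch-style load-balancing loss and router z-loss for top-k routing.

    router_logits: array of shape [n_tokens, n_experts] of raw router outputs.
    A sketch of the standard formulations; coefficients are not the paper's.
    """
    n_tokens, n_experts = router_logits.shape

    # numerically stable softmax over the expert axis
    m = router_logits.max(axis=-1, keepdims=True)
    exp = np.exp(router_logits - m)
    probs = exp / exp.sum(axis=-1, keepdims=True)

    # top-k expert assignment per token
    top_idx = np.argsort(probs, axis=-1)[:, -top_k:]

    # f_i: fraction of routed assignments dispatched to expert i
    frac = np.bincount(top_idx.ravel(), minlength=n_experts) / top_idx.size
    # P_i: mean router probability mass placed on expert i
    mean_prob = probs.mean(axis=0)
    load_balance_loss = n_experts * np.sum(frac * mean_prob)

    # router z-loss: mean squared log-partition function of the raw logits
    logsumexp = np.log(exp.sum(axis=-1)) + m[:, 0]
    z_loss = np.mean(logsumexp ** 2)

    return load_balance_loss, z_loss

# Example with random logits for a 4-expert router.
rng = np.random.default_rng(0)
print(topk_router_aux_losses(rng.normal(size=(512, 4))))
```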
What would settle it
Repeating the exact experiment with eight experts instead of four and checking whether the MoE still underperforms the total-matched dense model would show whether the result depends on the chosen sparsity level.
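One way to set that experiment up is to hold the stored expert capacity fixed and shrink the per-expert width as the expert count grows. The sketch below does this for a SwiGLU expert parameterization; the reference budget and dimensions are hypothetical, not drawn from the paper.

```python
def expert_width_for_fixed_total(total_expert_params, d_model, n_experts, multiple_of=8):
    """Per-expert SwiGLU hidden width that keeps total stored expert capacity
    roughly constant as the expert count varies. Assumes 3 * d_model * d_ff
    weights per expert; all numbers below are illustrative."""
    d_ff = total_expert_params // (3 * d_model * n_experts)
    return max(multiple_of, (d_ff // multiple_of) * multiple_of)

total_budget = 4 * 3 * 256 * 1024   # hypothetical 4-expert reference capacity
for n in (4, 8, 16):
    print(n, expert_width_for_fixed_total(total_budget, d_model=256, n_experts=n))
```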
Original abstract
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model's favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
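The abstract's "modestly width-resized" baselines can be read as choosing a dense FFN hidden width whose parameter count lands closest to the MoE's active or total FFN budget. Below is a minimal sketch of one such matching procedure; the SwiGLU parameterization, the rounding granularity, and the restriction to FFN weights are assumptions rather than the paper's documented recipe.

```python
def dense_swiglu_ffn_params(d_model, d_ff):
    # gate, up, and down projections of a LLaMA-style dense FFN
    return 3 * d_model * d_ff

def match_dense_ffn_width(target_ffn_params, d_model, multiple_of=64):
    """Dense hidden width whose FFN parameter count is closest to a target
    budget (e.g. the MoE's active or total FFN count), snapped to a multiple."""
    raw = target_ffn_params / (3 * d_model)
    lo = max(multiple_of, int(raw // multiple_of) * multiple_of)
    hi = lo + multiple_of
    return min((lo, hi),
               key=lambda w: abs(dense_swiglu_ffn_params(d_model, w) - target_ffn_params))

# Hypothetical budgets for a 4-expert, top-2 MoE at d_model=256, d_ff=1024 per expert.
active_budget = 2 * 3 * 256 * 1024 + 256 * 4   # two experts used per token + router
total_budget = 4 * 3 * 256 * 1024 + 256 * 4    # four experts stored + router
print(match_dense_ffn_width(active_budget, d_model=256))   # 2048
print(match_dense_ffn_width(total_budget, d_model=256))    # 4096
```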
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically investigates dense versus mixture-of-experts (MoE) pretraining at tiny scales below 25M parameters. Under a fixed LLaMA-style training setup, dense models are width-adjusted to match either the active or total parameter count of a four-expert top-2 MoE model. With three seeds, the MoE achieves a validation loss of 1.5788 ± 0.0020 compared to 1.6545 ± 0.0012 for active-matched dense and 1.5608 ± 0.0025 for total-matched dense, indicating an advantage for MoE in active matching but not in total matching.
Significance. If these findings hold, the work offers valuable insights into the benefits of sparsity in low-parameter regimes, demonstrating that MoE improves performance when computation is matched but not when total capacity is equalized. The controlled experimental design with reported variances enhances the credibility of the conclusions for understanding scaling behaviors in sparse architectures.
Minor comments (2)
- [Methods] Explicitly state whether the router parameters are counted within the active-parameter budget for the MoE models, since they remain active during inference and could introduce a minor asymmetry in the active-matching comparison.
- [Abstract] Include the precise active and total parameter counts for the dense and MoE configurations to allow readers to verify the tightness of the matching.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. No specific major comments were raised in the report, so we have no individual points requiring rebuttal or revision at this stage. We are pleased that the controlled experimental design and reported variances were viewed as enhancing credibility.
Circularity Check
No circularity: direct empirical loss measurements
Full rationale
The paper reports direct empirical measurements of validation loss from training runs under controlled active- and total-parameter matching. No equations, derivations, or predictive models are present that reduce reported gaps to fitted parameters, self-citations, or definitions by construction. All claims rest on observed losses (e.g., 1.5788 vs 1.6545) with three-seed statistics, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (2)
- Number of experts = 4
- Routing top-k = 2
Axioms (1)
- Domain assumption: LLaMA-style decoder training recipe, with tokenizer, data, optimizer, schedule, depth, context, and normalization held constant across models.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Dense baselines are modestly width-resized to tightly match either active or total parameter budgets."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.