Pith Number

pith:RS6XGWCW

pith:2026:RS6XGWCWJNK7PNWCHDEWVB4F5E
not attested · not anchored · not stored · refs resolved

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Alessandro Breccia, Leena Chennuru Vankadara, Luke Hayward, Moritz Haas, Sebastian Bordt

Mixture-of-Experts models require a Maximally Scale-Stable Parameterization to restore learning-rate transfer and monotonic gains at scale.

arxiv:2605.14200 v1 · 2026-05-13 · cs.LG · stat.ML

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{RS6XGWCWJNK7PNWCHDEWVB4F5E}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

C2 · weakest assumption

The analysis assumes that the DMFT description of the limiting training dynamics accurately captures the scale-dependent observables in the aggregation dynamics of MoE models in all three regimes, and that the maximal scale-stability desiderata are the right refinement of muP.

C3 · one-line summary

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

References

300 extracted · 300 resolved · 13 Pith anchors


Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-17T23:39:11.056526Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

8cbd7358564b55f7b6c238c96a8785e91f7f88c99329040b80b5cd2156f0acde

Aliases

arxiv: 2605.14200 · arxiv_version: 2605.14200v1 · doi: 10.48550/arxiv.2605.14200 · pith_short_12: RS6XGWCWJNK7 · pith_short_16: RS6XGWCWJNK7PNWC · pith_short_8: RS6XGWCW
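
The short aliases above are prefixes of the full Pith Number. For this record, the full identifier also coincides with the unpadded Base32 encoding of the first 16 bytes of the canonical hash listed above. That relationship is an observation about this one record, not a documented part of the schema, so the sketch below treats it as an assumption. The function name `pith_number_from_hash` and the 16-byte default are hypothetical.

```python
import base64

# Canonical hash as printed on this record page.
CANONICAL_HASH = "8cbd7358564b55f7b6c238c96a8785e91f7f88c99329040b80b5cd2156f0acde"


def pith_number_from_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Return the unpadded Base32 encoding of the first n_bytes of a hex digest.

    Assumption: this mirrors how the identifier on this page relates to its
    canonical hash. The page does not state the derivation rule.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


full = pith_number_from_hash(CANONICAL_HASH)
print(full)       # RS6XGWCWJNK7PNWCHDEWVB4F5E
print(full[:8])   # RS6XGWCW          (pith_short_8)
print(full[:12])  # RS6XGWCWJNK7      (pith_short_12)
print(full[:16])  # RS6XGWCWJNK7PNWC  (pith_short_16)
```

If the match holds for other records as well, a client could sanity-check an identifier against a hash offline. Until the schema confirms the rule, treat a mismatch as inconclusive rather than as proof of tampering.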
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/RS6XGWCWJNK7PNWCHDEWVB4F5E \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 8cbd7358564b55f7b6c238c96a8785e91f7f88c99329040b80b5cd2156f0acde
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "5eb369bff6059693137a7555603f25b25cb1343bbf17526f994424c320d540f6",
    "cross_cats_sorted": [
      "stat.ML"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T23:32:00Z",
    "title_canon_sha256": "ee0649dd4644fe53526dbe57ada7cc84065acbc9ac94140d0bbeea51224cadc8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14200",
    "kind": "arxiv",
    "version": 1
  }
}
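
The verification one-liner under "Agent API" can be unpacked into a small function for readability. This is a minimal sketch that applies the same canonicalization as that command (sorted keys, compact separators, non-ASCII characters preserved, UTF-8 bytes) to the record shown above. The function name `canonical_sha256` is hypothetical, and whether this exact JSON reproduces the published hash depends on the service returning the same record, so no expected digest is asserted here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the page's verification command does."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "5eb369bff6059693137a7555603f25b25cb1343bbf17526f994424c320d540f6",
        "cross_cats_sorted": ["stat.ML"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T23:32:00Z",
        "title_canon_sha256": "ee0649dd4644fe53526dbe57ada7cc84065acbc9ac94140d0bbeea51224cadc8",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14200", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare against the "Canonical hash" field on this page

# Reordering top-level keys leaves the digest unchanged, which is the point
# of canonicalizing before hashing.
reordered = {key: record[key] for key in reversed(list(record))}
print(canonical_sha256(reordered) == digest)  # True
```

Sorting keys and fixing the separators makes the byte string independent of how a given mirror serializes the JSON, which is what lets any host recompute the same hash.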