We adopt their counting conventions for active parameters and FLOPs when fitting/reading off exponents

provides fitted coefficients, per-E reduced exponents, a compute-optimal analysis under the training-FLOPs proxy F = 6 N_act D · 2020
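
As a rough illustration of the proxy quoted above, the following Python sketch evaluates F = 6 N_act D. The function name and the example values are hypothetical and not taken from the cited work; it assumes N_act is the number of active parameters per token and D the number of training tokens.

```python
def training_flops(n_active: float, n_tokens: float) -> float:
    """Training-FLOPs proxy F = 6 * N_act * D.

    Assumption: n_active is the active parameter count per token and
    n_tokens is the number of training tokens (both hypothetical inputs).
    """
    return 6.0 * n_active * n_tokens


# Hypothetical example: 1e9 active parameters trained on 2e10 tokens.
flops = training_flops(1e9, 2e10)
print(f"{flops:.3e}")
```

Under this proxy only the active parameters enter the estimate, so a sparse model with many experts but few active ones per token is costed like a dense model of the active size.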

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Generalization and Scaling Laws for Mixture-of-Experts Transformers

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

A covering-number bound and ERM analysis for MoE Transformers under manifold data produces generalization and scaling laws that treat active capacity like dense networks while adding routing overhead.

citing papers explorer

Showing 1 of 1 citing paper.

Generalization and Scaling Laws for Mixture-of-Experts Transformers cs.LG · 2026-04-10 · unverdicted · none · ref 4

