Proceedings of Machine Learning and Systems

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

cs.CL · 2026-05-13 · accept · novelty 5.0

At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.

citing papers explorer

Showing 1 of 1 citing paper.

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
cs.CL · 2026-05-13 · accept · none · ref 14
