On the SDEs and scaling rules for adaptive gradient algorithms. Advances in Neural Information Processing Systems, 35:7697–7711, 2022

Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora · 2022

1 Pith paper cites this work. Polarity classification is still being indexed.


representative citing papers

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

cs.LG · 2026-05-22 · unverdicted · novelty 6.0 · ref 11

Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.

