Recognition: 2 theorem links
Lean Theorem · Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Pith reviewed 2026-05-15 12:50 UTC · model grok-4.3
The pith
Mixture-of-Experts models reach higher performance when expert load balance is maintained without auxiliary-loss gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Loss-Free Balancing maintains expert load balance in MoE models by applying dynamically updated expert-wise biases to routing scores before top-K selection, eliminating the need for auxiliary losses that introduce unwanted gradients and limit model performance.
What carries the argument
Expert-wise bias applied to routing scores before top-K routing, updated from recent load statistics to enforce balance.
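As a concrete reading of that mechanism, the sketch below shows bias-corrected top-K selection: the bias shifts only which experts are picked, while the unbiased scores still weight the selected experts' outputs. This is an illustrative sketch under those assumptions, not the paper's implementation; the function and variable names are hypothetical.

```python
# Minimal sketch of an expert-wise bias applied before top-K routing.
# Assumption: the bias influences selection only; gate weights come from
# the original (unbiased) scores, so no extra gradients are introduced.
import numpy as np

def route_with_bias(scores, bias, k):
    """scores: (tokens, experts) router scores; bias: (experts,) balancing bias."""
    biased = scores + bias                               # shift selection only
    topk_idx = np.argsort(-biased, axis=1)[:, :k]        # pick top-K per token
    gate_weights = np.take_along_axis(scores, topk_idx, axis=1)
    return topk_idx, gate_weights
```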
If this is right
- Models achieve better final performance because no interference gradients from auxiliary losses are produced.
- Load distribution stays balanced without manual tuning of auxiliary loss weights.
- Training remains stable across long runs of up to 200B tokens.
- The approach works on MoE models with up to 3B parameters.
Where Pith is reading between the lines
- Removing auxiliary losses could make MoE architectures easier to scale to more experts or larger models.
- This bias mechanism might generalize to other sparse activation methods beyond MoE.
- Practitioners could see reduced hyperparameter search effort since no auxiliary loss coefficient needs tuning.
Load-bearing premise
Dynamically updating per-expert biases from recent load statistics will keep expert loads balanced throughout training without causing instability or changing routing behavior in harmful ways.
What would settle it
An experiment showing that after many training steps the expert load becomes imbalanced or model accuracy falls below the auxiliary-loss baseline when using the proposed bias updates.
read the original abstract
For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
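The abstract's "dynamically updating the bias of each expert according to its recent load" can be made concrete with the sign-style rule quoted in the theorem-link section below (b_i = b_i + u * sign(e_i)). The sketch defines the load error e_i as mean load minus observed load, which is an assumption about the convention; the window over which load is counted is also assumed.

```python
import numpy as np

def update_bias(bias, expert_load, u=1e-3):
    """One bias-update step from recent load counts (illustrative sketch).

    bias:        (experts,) current expert-wise biases.
    expert_load: (experts,) tokens routed to each expert over a recent window.
    u:           bias update rate, the free parameter listed in the ledger below.
    """
    error = expert_load.mean() - expert_load   # positive => expert is underloaded
    return bias + u * np.sign(error)           # raise underloaded, lower overloaded
```

Because the bias is adjusted outside backpropagation, this step contributes no gradient to the router, which is the property the core claim rests on.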
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Loss-Free Balancing, an auxiliary-loss-free load balancing method for Mixture-of-Experts models. Before the top-K routing step, it applies a dynamically updated expert-wise bias to the routing scores; the bias for each expert is adjusted according to its recent observed load. The central claim is that this approach maintains balanced expert utilization without introducing interference gradients from an auxiliary loss, thereby improving both load balance and final model performance relative to conventional auxiliary-loss methods. Validation is reported on MoE models up to 3B parameters trained on up to 200B tokens.
Significance. If the method is shown to be stable and reproducible, it would remove a known source of optimization interference in MoE training and could raise the achievable performance of sparse models without additional compute overhead. The absence of auxiliary-loss gradients is a potentially valuable property for large-scale training.
major comments (2)
- [Method description (abstract and §3)] The bias-update rule is described only qualitatively (“dynamically updating the bias of each expert according to its recent load”). No equation, update rate, window length, smoothing factor, or initialization is supplied, rendering the central algorithmic claim impossible to reproduce or to verify for stability across long training runs.
- [Experiments (abstract and §4)] The experimental section reports that Loss-Free Balancing “achieves both better performance and better load balance” but supplies no numerical metrics, baseline values, standard deviations, or statistical tests. Without these quantities the performance claim cannot be evaluated.
minor comments (1)
- [Method] Notation for the bias term and the load statistic should be defined explicitly with symbols and updated in every relevant equation.
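One plausible explicit notation, consistent with the sign-style update quoted further down this page but not taken verbatim from the manuscript, could read:

```latex
% Illustrative notation only; the symbols and the definition of e_i are
% assumptions, not quotations from the paper.
\[
  \tilde{s}_{i,t} = s_{i,t} + b_i, \qquad
  \mathcal{E}_t = \operatorname{TopK}_i\!\left(\tilde{s}_{i,t}\right),
\]
\[
  e_i = \bar{c} - c_i, \qquad
  b_i \leftarrow b_i + u \cdot \operatorname{sign}(e_i),
\]
% where s_{i,t} is the routing score of expert i for token t, b_i its bias,
% c_i the number of tokens routed to expert i over the recent window,
% \bar{c} the mean load, and u the bias update rate.
```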
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving reproducibility and the strength of our empirical claims. We address each major comment below and will incorporate the requested details in the revised manuscript.
read point-by-point responses
- Referee: [Method description (abstract and §3)] The bias-update rule is described only qualitatively (“dynamically updating the bias of each expert according to its recent load”). No equation, update rate, window length, smoothing factor, or initialization is supplied, rendering the central algorithmic claim impossible to reproduce or to verify for stability across long training runs.
  Authors: We agree that the description provided in the abstract and Section 3 is qualitative and insufficient for full reproducibility. In the revised manuscript we will add the exact bias-update equation, the precise update rate, the window length over which recent load is measured, any smoothing factor applied, and the initialization procedure. These additions will enable independent verification of stability over long training runs. Revision: yes.
- Referee: [Experiments (abstract and §4)] The experimental section reports that Loss-Free Balancing “achieves both better performance and better load balance” but supplies no numerical metrics, baseline values, standard deviations, or statistical tests. Without these quantities the performance claim cannot be evaluated.
  Authors: We acknowledge that the current experimental reporting lacks concrete numerical values, baseline numbers, standard deviations, and statistical tests. In the revision we will include detailed tables with exact performance metrics (e.g., perplexity or downstream accuracy), load-balance statistics, direct comparisons against auxiliary-loss baselines, standard deviations across multiple runs, and appropriate statistical significance tests to substantiate the claims. Revision: yes.
Circularity Check
No circularity: explicit algorithmic update rule, not a self-referential derivation
full rationale
The paper proposes a constructive procedure: apply per-expert biases before top-K routing and update those biases from recent load counts. This is an explicit algorithm whose output (balanced load) follows directly from the stated update rule by design. No equations reduce to fitted parameters renamed as predictions, no self-citation chain supplies a uniqueness theorem, and no ansatz is smuggled in. The method is self-contained as a practical balancing heuristic; any performance claims rest on empirical validation rather than tautological re-derivation of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- bias update rate (u)
axioms (1)
- domain assumption: Dynamic per-expert bias adjustment from recent load will produce stable, balanced routing without destabilizing the main training dynamics.
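A cheap way to probe both this domain assumption and the sensitivity to the update rate is a toy simulation: induce a skewed expert preference, route with biased top-K selection, apply the sign-style update, and watch an imbalance ratio over steps. The setup below (16 experts, Gaussian preference skew, max load over mean load as the metric) is entirely illustrative and not the paper's experimental protocol.

```python
# Toy probe of the stability assumption; all settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
num_experts, k, u, tokens = 16, 2, 1e-2, 1024
expert_pref = rng.normal(scale=2.0, size=num_experts)    # makes some experts "hot"
bias = np.zeros(num_experts)

for step in range(2001):
    scores = rng.normal(size=(tokens, num_experts)) + expert_pref
    topk = np.argsort(-(scores + bias), axis=1)[:, :k]   # biased selection
    load = np.bincount(topk.ravel(), minlength=num_experts)
    bias += u * np.sign(load.mean() - load)               # sign-style bias update
    if step % 500 == 0:
        # Imbalance ratio: 1.0 means perfectly even load across experts.
        print(step, round(load.max() / load.mean(), 3))
```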
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "b_i = b_i + u * sign(e_i)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
EMO: Frustratingly Easy Progressive Training of Extendable MoE
EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.
-
Conditional Memory Enhanced Item Representation for Generative Recommendation
ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning
SPHERE applies a Parseval penalty to MoE policies in continual RL to maintain spectral plasticity, yielding 133% and 50% higher average success on MetaWorld and HumanoidBench versus unregularized MoE baselines.
-
SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning
SPHERE applies a Parseval penalty derived from a Neural Tangent Kernel proxy for spectral plasticity to Mixture-of-Experts policies, raising average success rates by 133% on MetaWorld and 50% on HumanoidBench in conti...
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.
-
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
-
mHC: Manifold-Constrained Hyper-Connections
mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.
-
E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
A dimensionless parameter E = T*H/(O+B) >= 0.5 is claimed to guarantee zero dead experts in Mixture-of-Experts models, eliminating the need for auxiliary load-balancing losses.
-
Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA...
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
EMO: Frustratingly Easy Progressive Training of Extendable MoE
EMO progressively expands the expert pool in MoE models using scaling-law-derived token budgets per stage, matching fixed-expert performance while cutting wall-clock time and GPU cost.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
-
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
Reference graph
Works this paper leans on
- [1]
Damai Dai, Chengqi Deng, Chenggang Zhao, Runxin Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. ArXiv, abs/2401.06066, 2024. URL https://api.semanticsch...
- [2]
DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bing-Li Wang, Jun-Mei Song, Deli Chen, Xin Xie, Kang Guan, Yu mei You, Aixin Liu, Qiushi Du, Wenjun Gao, ...
- [3]
William Fedus, Barret Zoph, and Noam M. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23: 0 120:1--120:39, 2021. URL https://api.semanticscholar.org/CorpusID:231573431
- [4]
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2020. URL https://api.semanticscholar.org/CorpusID:220265858
- [5]
Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv: Learning, 2016. URL https://api.semanticscholar.org/CorpusID:14337532
- [6]
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. ArXiv, abs/1508.07909, 2015. URL https://api.semanticscholar.org/CorpusID:1114678
- [7]
Zhihong Shao, Damai Dai, Daya Guo, Bo Liu, and Zihan Wang. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. ArXiv, abs/2405.04434, 2024. URL https://api.semanticscholar.org/CorpusID:269613809
- [8]
Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ArXiv, abs/1701.06538, 2017. URL https://api.semanticscholar.org/CorpusID:12462234
- [9]
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:13756489
- [10]
Yan-Quan Zhou, Tao Lei, Han-Chu Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon. Mixture-of-experts with expert choice routing. ArXiv, abs/2202.09368, 2022. URL https://api.semanticscholar.org/CorpusID:247011948
- [11] Yoshua Bengio and Yann LeCun. Scaling Learning Algorithms Towards
- [12] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A Fast Learning Algorithm for Deep Belief Nets.
- [13]
- [14] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ArXiv.
- [15] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res.
- [16] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ArXiv.
- [17] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ArXiv.
- [18]
- [19] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. ArXiv.
- [20] Stabilized feedback amplifiers. Bell System Technical Journal.
- [21] Reducing Activation Recomputation in Large Transformer Models. ArXiv.
- [22] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis.
- [23] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. ArXiv.
- [24] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
- [25] PipeDream: Fast and Efficient Pipeline Parallel DNN Training. ArXiv.
- [26] Neural Machine Translation of Rare Words with Subword Units. ArXiv.
- [27] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. ArXiv.
- [28] Unified Scaling Laws for Routed Language Models. International Conference on Machine Learning.
- [29] SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv: Learning.
- [30] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. ArXiv.
- [31] Attention Is All You Need. Neural Information Processing Systems.
- [32] DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. ArXiv.