{"paper":{"title":"Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Mixture-of-Experts models reach higher performance with load balancing that avoids auxiliary loss gradients.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"Chenggang Zhao, Damai Dai, Huazuo Gao, Lean Wang, Xu Sun","submitted_at":"2024-08-28T09:31:09Z","abstract_excerpt":"For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy. To be specific, before the top-K routing decision, Loss-Free Balancing will f"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That dynamically updating per-expert biases from recent load statistics will maintain balance across training without introducing instability or unintended routing dynamics.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better 
performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Mixture-of-Experts models reach higher performance with load balancing that avoids auxiliary loss gradients.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c306008b14ccbd0298b01962067596270de956a35d5f098978a086983d20b104"},"source":{"id":"2408.15664","kind":"arxiv","version":1},"verdict":{"id":"9ed35d07-9958-4ddc-9949-761036dfa418","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:50:11.218526Z","strongest_claim":"Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.","one_line_summary":"Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That dynamically updating per-expert biases from recent load statistics will maintain balance across training without introducing instability or unintended routing dynamics.","pith_extraction_headline":"Mixture-of-Experts models reach higher performance with load balancing that avoids auxiliary loss gradients."},"references":{"count":32,"sample":[{"doi":"","year":2024,"title":"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models","work_id":"a9888d6d-bf47-4324-9834-7cc12ac3a78c","ref_index":1,"cited_arxiv_id":"2401.06066","is_internal_anchor":true},{"doi":"","year":2024,"title":"DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence","work_id":"713a39be-8c4e-4ee3-8671-11cf9379262f","ref_index":2,"cited_arxiv_id":"2406.11931","is_internal_anchor":true},{"doi":"","year":2021,"title":"William Fedus, Barret Zoph, 
and Noam M. Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23:120:1--120:39, 2021. URL http","work_id":"2f29f1ca-bd51-4381-acc7-95e946cde35f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding","work_id":"52b3c9a6-2a27-45a7-ba2b-ebe4b5bb5a5f","ref_index":4,"cited_arxiv_id":"2006.16668","is_internal_anchor":true},{"doi":"","year":2016,"title":"SGDR: Stochastic gradient descent with warm restarts","work_id":"21d4473d-9c67-4464-b3ea-96588e1d01e1","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":32,"snapshot_sha256":"45ecc5e36e60cebbf268394ae511828f42a1187c8166048c5f0e7f95a94d51e0","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ae8508b9f31a7ff7425541690bf1452fc90bde2e6779ab251d817b0d51cca3d1"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}