pith. machine review for the scientific record.

arxiv: 2605.08639 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of experts · load balancing · reinforcement learning · expert routing · moe training · training throughput · rl workflows · expert parallelism

The pith

ReLibra uses replay of rollout routing decisions to reorder and replicate MoE experts at inter- and intra-batch scales for higher training throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models trained with reinforcement learning experience rapid shifts in expert loads across micro-batches, which standard systems using only historical data cannot predict well. ReLibra observes that the rollout phase and training phase process identical tokens with identical parameters, so the token-to-expert routing decisions are known before training begins. It therefore applies expert reordering at the inter-batch timescale to balance work across nodes over slower network links and dynamic expert replication at the intra-batch timescale to absorb local fluctuations over faster links. This produces up to 1.6 times the throughput of Megatron-LM and 1.2 times the throughput of EPLB (even with oracle loads), while staying within 6-10 percent of an idealized balanced baseline. Readers care because the approach turns a structural property of RL workflows into concrete efficiency gains for large-scale MoE training without altering the model or the RL algorithm.

Core claim

ReLibra exploits the rollout-training workflow in RL, where the same tokens and MoE parameters are used in both phases, to obtain exact advance knowledge of routing decisions. It then performs expert reordering at inter-batch granularity for cross-node balancing and expert replication at intra-batch granularity for micro-batch balancing, matching each mechanism to the available network bandwidth hierarchy.
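The replay idea is mechanically simple: because the rollout phase already logged which experts each token was routed to, exact per-expert loads for every micro-batch can be tallied before training starts. A minimal sketch with hypothetical record and function names (the paper does not specify this interface):

```python
from collections import Counter

def replayed_expert_loads(routing_records, num_experts):
    """Tally exact per-expert token counts from rollout routing records.

    routing_records: list of micro-batches; each micro-batch is a list of
    (token_id, [expert_ids]) pairs recorded during the rollout phase.
    Because training reuses the same tokens and MoE parameters, these
    counts are the true training-phase loads (assuming the parameters do
    not change mid-batch). Returns, per micro-batch, a list of token
    counts indexed by expert id.
    """
    loads = []
    for micro_batch in routing_records:
        c = Counter()
        for _token_id, expert_ids in micro_batch:
            c.update(expert_ids)  # top-k routing: k experts per token
        loads.append([c.get(e, 0) for e in range(num_experts)])
    return loads
```

These precomputed load vectors are exactly what a history-based predictor can only approximate, which is the gap ReLibra exploits.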

What carries the argument

Routing-replay-guided load balancing that performs inter-batch expert reordering and intra-batch expert replication.
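Inter-batch reordering amounts to placing experts on GPUs so that summed batch-level loads come out even. The paper describes a swap-based simulated-annealing search for this plan; the greedy longest-processing-time heuristic below is a much simpler stand-in (hypothetical names) that illustrates the same objective:

```python
import heapq

def reorder_experts(batch_loads, num_gpus):
    """Greedy longest-processing-time placement of experts onto GPUs.

    batch_loads[e] is expert e's total token count over the training batch,
    known in advance via routing replay. Returns gpu_of_expert, mapping each
    expert to a GPU so that per-GPU summed loads are roughly balanced.
    A sketch standing in for ReLibra's search-based solver, which also
    optimizes data locality.
    """
    order = sorted(range(len(batch_loads)), key=lambda e: -batch_loads[e])
    heap = [(0, g) for g in range(num_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    gpu_of_expert = [0] * len(batch_loads)
    for e in order:
        load, g = heapq.heappop(heap)  # least-loaded GPU so far
        gpu_of_expert[e] = g
        heapq.heappush(heap, (load + batch_loads[e], g))
    return gpu_of_expert
```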

If this is right

  • Training throughput reaches up to 1.6 times that of Megatron-LM on diverse MoE LLMs and RL workloads.
  • Throughput reaches up to 1.2 times that of EPLB, even when EPLB receives oracle load information.
  • Achieved throughput stays within 6-10 percent of an idealized perfectly balanced baseline.
  • Balancing operates at micro-batch granularity, directly addressing the frequent expert shifts that characterize RL training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The replay technique could be adapted to supervised fine-tuning of MoE models if a cheap way to predict routing in advance can be found.
  • Lower load imbalance may allow practitioners to train larger MoE models with fewer total experts or on hardware with less over-provisioning.
  • Similar advance-knowledge mechanisms could be explored for other dynamic properties such as activation sparsity or memory access patterns in future training systems.
  • Combining routing replay with adaptive routing policies might further reduce how often load shifts occur.

Load-bearing premise

The token-to-expert routing decisions recorded during rollout remain representative of the loads that will actually occur when the same tokens are processed during training, and the overhead of reordering and replication stays low enough to produce net gains.
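The second half of that premise can be phrased as an arithmetic condition: the time saved per training batch by balancing must exceed the combined cost of executing the reordering and replication plans. A back-of-envelope check, all names hypothetical:

```python
def balancing_pays_off(step_time_unbalanced, step_time_balanced,
                       reorder_cost, replication_sync_cost):
    """Net-gain condition for replay-guided balancing (sketch).

    All quantities are per-training-batch times in seconds:
    the straggler-bound step time without balancing, the step time with
    balanced loads, and the two overheads the mechanisms introduce
    (moving experts between nodes; synchronizing replica parameters and
    gradients). Balancing is a net win only if savings exceed overhead.
    """
    savings = step_time_unbalanced - step_time_balanced
    overhead = reorder_cost + replication_sync_cost
    return savings > overhead
```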

What would settle it

Measure whether the reported throughput gains disappear when routing patterns are deliberately altered between rollout and training phases, or when the separate time cost of reordering and replication is subtracted from the observed balance savings.
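One concrete probe is the diagnostic the paper itself plots in Figure 5: the intersection ratio of the top-k highest-load experts between two routing snapshots. Computed between rollout-recorded and training-observed loads, rather than between adjacent micro-batches, it would directly measure how representative the replayed routing remains. A sketch:

```python
def topk_intersection_ratio(loads_a, loads_b, k):
    """Fraction of the k highest-load experts shared between two load
    snapshots (the metric in the paper's Figure 5). Low values mean the
    hot-expert set churns, which is exactly the regime where replayed
    routing should beat history-based prediction.
    """
    def top(loads):
        return set(sorted(range(len(loads)), key=lambda e: -loads[e])[:k])
    return len(top(loads_a) & top(loads_b)) / k
```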

Figures

Figures reproduced from arXiv: 2605.08639 by Bingyang Wu, Chao Jin, Chengxu Yang, Ruidong Zhu, Xin Jin, Xinming Wei, Yinmin Zhong, Yuliang Liu, Zili Zhang.

Figure 1. MoE RL training workflow and the opportunity for routing replay.

Figure 3. MoE layer and expert parallelism.

Figure 5. Average intersection ratio of top-k highest-load experts between adjacent micro-batches. Lower values indicate more diverse hot experts and therefore more fluctuating load imbalance.

Figure 4. Dynamics of expert loads for one MoE layer across micro-batches.

Figure 7. Overview of ReLibra. Inter-batch expert reordering balances GPU loads across the EP group over an entire training batch; intra-batch expert replication determines the replica count, placement, and token distribution among replicas for each hot expert. The control plane forwards the resulting plans to the data plane for execution.

Figure 8. Rail-optimized topology for MoE training.

Figure 9. Layer-shared replica buffer for fine-grained intra-batch expert replication.

Figure 10. Training throughput on the mixed dataset with different …

Figure 11. Training throughput on different datasets with Qwen3-…

Figure 14. Rank-level skewness under different EP sizes and …

Figure 16. Replica synchronization overhead of dynamic expert replication: in the forward pass each GPU pushes local experts to the GPUs hosting their replicas; in the backward pass it sends the replica gradients back.
Original abstract

Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL's rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6× over Megatron-LM and by up to 1.2× over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReLibra, an MoE training system for RL workloads that exploits the rollout-training workflow to perform routing replay. Because rollout and training use identical tokens and MoE parameters, precomputed token-to-expert assignments enable inter-batch expert reordering for cross-node balance and intra-batch expert replication for micro-batch fluctuations. Experiments report up to 1.6× throughput over Megatron-LM and 1.2× over oracle EPLB, remaining within 6-10% of an idealized balanced baseline across diverse MoE LLMs and RL tasks.

Significance. If the empirical results and the routing-replay assumption hold under realistic RL training loops, ReLibra would provide a practical, low-overhead mechanism for mitigating load imbalance in large-scale MoE RL, a setting where expert hotness shifts rapidly. The work's strength lies in matching communication patterns to hierarchical network bandwidths and delivering concrete speedups against strong baselines including an oracle prior method.

major comments (2)
  1. Abstract and §3 (Routing Replay and Load Balancing Mechanisms): The central claim that 'rollout and training process the same tokens with the same MoE parameters' is load-bearing for the reported speedups. In standard PPO-style RL loops, the training phase on a rollout batch performs multiple gradient steps that update router and expert weights; after the first update the precomputed token-to-expert assignments become stale for subsequent micro-batches. The manuscript must clarify whether it assumes a single gradient step, frozen parameters during training, or a non-standard RL regime, and must demonstrate that the 1.2× gain versus oracle EPLB survives under multi-step updates.
  2. §4 (Experimental Setup) and Table 2: The idealized balanced baseline and the oracle-EPLB comparison are not fully specified with respect to data-exclusion rules, number of independent runs, or variance reporting. Without these details it is impossible to determine whether the 6-10% gap to ideal and the 1.2× gain are robust or sensitive to particular workload characteristics.
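The staleness worry in the first major comment is easy to make concrete: once router weights move, some tokens' top-k expert sets change, and the replayed assignments no longer match. A toy pure-Python linear router (all names hypothetical) that measures this drift:

```python
def routing_drift(tokens, router_w, delta_w, k=1):
    """Fraction of tokens whose top-k expert set changes after the router
    weights shift by delta_w (a toy stand-in for one gradient step).

    tokens: list of token feature vectors; router_w[e] is the weight
    vector for expert e; logits are plain dot products. Replayed
    assignments are exact only while the drift is zero, i.e. while
    router_w is unchanged during training.
    """
    def assign(w):
        out = []
        for x in tokens:
            logits = [sum(xi * wi for xi, wi in zip(x, col)) for col in w]
            out.append(set(sorted(range(len(logits)),
                                  key=lambda e: -logits[e])[:k]))
        return out
    shifted = [[wi + di for wi, di in zip(col, dcol)]
               for col, dcol in zip(router_w, delta_w)]
    before, after = assign(router_w), assign(shifted)
    return sum(b != a for b, a in zip(before, after)) / len(tokens)
```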
minor comments (2)
  1. Figure 3 and §3.2: The overhead of expert reordering and replication (communication volume, synchronization cost) should be quantified separately from the net throughput numbers to allow readers to assess when the technique remains beneficial.
  2. §2 (Related Work): The comparison to prior MoE load-balancing systems would benefit from an explicit statement of how ReLibra differs from history-based predictors in the presence of the RL-specific replay opportunity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We appreciate the referee's insightful comments on our paper. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: Abstract and §3 (Routing Replay and Load Balancing Mechanisms): The central claim that 'rollout and training process the same tokens with the same MoE parameters' is load-bearing for the reported speedups. In standard PPO-style RL loops, the training phase on a rollout batch performs multiple gradient steps that update router and expert weights; after the first update the precomputed token-to-expert assignments become stale for subsequent micro-batches. The manuscript must clarify whether it assumes a single gradient step, frozen parameters during training, or a non-standard RL regime, and must demonstrate that the 1.2× gain versus oracle EPLB survives under multi-step updates.

    Authors: We thank the referee for highlighting this critical point. The ReLibra design relies on the routing decisions being valid for the training phase, which holds when the MoE parameters remain unchanged during the processing of a given rollout batch. In our experimental setup and the targeted RL workloads, we perform a single gradient update per rollout batch. This avoids staleness within the batch while still allowing the benefits of routing replay. We will revise the abstract and §3 to explicitly state this single-gradient-step assumption and provide a brief discussion on how it relates to standard multi-step PPO. We note that demonstrating the exact 1.2× speedup under multi-step updates would require additional experiments that are outside the scope of the current manuscript; however, the inter- and intra-batch balancing mechanisms can still mitigate imbalance even if routing is only approximately accurate. revision: partial

  2. Referee: §4 (Experimental Setup) and Table 2: The idealized balanced baseline and the oracle-EPLB comparison are not fully specified with respect to data-exclusion rules, number of independent runs, or variance reporting. Without these details it is impossible to determine whether the 6-10% gap to ideal and the 1.2× gain are robust or sensitive to particular workload characteristics.

    Authors: We agree that more details are needed for reproducibility and robustness assessment. In the revised version, we will expand §4 to specify: (1) the idealized balanced baseline assumes zero expert imbalance and perfect communication scheduling with no overhead; (2) oracle-EPLB is provided with the exact token-to-expert assignments from the rollout phase as input for its load prediction; (3) all throughput numbers are averaged over 5 independent runs using different random seeds for data sampling and initialization; (4) variance is reported as standard deviation in Table 2 and associated figures; (5) data-exclusion rules involve discarding the initial 10% of batches as warmup to stabilize measurements. These additions will clarify that the reported gaps are consistent across runs. revision: yes

standing simulated objections not resolved
  • Providing empirical demonstration of the 1.2× gain under multi-step gradient updates, as this would necessitate new experiments not present in the current work.

Circularity Check

0 steps flagged

No significant circularity; empirical systems evaluation

full rationale

The paper describes a concrete scheduling system (expert reordering at inter-batch scale and replication at intra-batch scale) that exploits the RL rollout-training property that the same tokens and parameters are processed in both phases. All throughput claims (1.6× vs Megatron-LM, 1.2× vs oracle EPLB) are presented as direct measurements against external baselines rather than as outputs of any fitted parameter, self-defined quantity, or self-citation chain. No equations, uniqueness theorems, or ansatzes are introduced that reduce by construction to the inputs; the central result therefore remains independent of the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical systems contribution; its central claim rests on one domain assumption about the RL workflow and on the existence of the described scheduling mechanisms rather than on new mathematical axioms or invented physical entities.

axioms (1)
  • domain assumption: Rollout and training phases process identical tokens with identical MoE parameters, so routing decisions are known before training begins.
    This identity is invoked to justify the use of routing replay for load prediction.

pith-pipeline@v0.9.0 · 5589 in / 1507 out tokens · 60263 ms · 2026-05-12T01:33:57.181565+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 13 internal anchors

  1. [1]

    Outrageously large neural net- works: The sparsely-gated mixture-of-experts layer,

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural net- works: The sparsely-gated mixture-of-experts layer,”In- ternational Conference on Learning Representations (ICLR), 2017

  2. [2]

    Gshard: Scaling giant models with conditional computation and auto- matic sharding,

    D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and auto- matic sharding,”International Conference on Learning Representations (ICLR), 2021

  3. [3]

    Switch transform- ers: Scaling to trillion parameter models with simple and efficient sparsity,

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transform- ers: Scaling to trillion parameter models with simple and efficient sparsity,”Journal of Machine Learning Research, 2022

  4. [4]

    Introducing DBRX: A New State-of-the-Art Open LLM,

    “Introducing DBRX: A New State-of-the-Art Open LLM,” 2024. https://www.databricks.com/blog/ introducing-dbrx-new-state-art-open-llm

  5. [5]

    Mixtral of Experts

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand,et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024

  6. [6]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan,et al., “Deepseek-v3 tech- nical report,”arXiv preprint arXiv:2412.19437, 2024

  7. [7]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv,et al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

  8. [8]

    Kimi K2: Open Agentic Intelligence

    K. Team, Y . Bai, Y . Bao, Y . Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen,et al., “Kimi k2: Open agentic intelligence,”arXiv preprint arXiv:2507.20534, 2025

  9. [9]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu,et al., “Minimax- m1: Scaling test-time compute efficiently with lightning attention,”arXiv preprint arXiv:2506.13585, 2025

  10. [10]

    Glam: Efficient scaling of language models with mixture-of- experts,

    N. Du, Y . Huang, A. M. Dai, S. Tong, D. Lepikhin, Y . Xu, M. Krikun, Y . Zhou, A. W. Yu, O. Firat,et al., “Glam: Efficient scaling of language models with mixture-of- experts,” inInternational Conference on Machine Learn- ing (ICML), 2022

  11. [11]

    Deepspeed- moe: Advancing mixture-of-experts inference and train- ing to power next-generation ai scale,

    S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y . Am- inabadi, A. A. Awan, J. Rasley, and Y . He, “Deepspeed- moe: Advancing mixture-of-experts inference and train- ing to power next-generation ai scale,” inInternational Conference on Machine Learning (ICML), 2022

  12. [12]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,”Advances in Neural Information Processing Systems, 2022

  13. [13]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi,et al., “Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,”Nature, 2025

  14. [14]

    Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement,

    X. Nie, X. Miao, Z. Wang, Z. Yang, J. Xue, L. Ma, G. Cao, and B. Cui, “Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement,”ACM SIGMOD, 2023

  15. [15]

    {SmartMoE}: Efficiently training {Sparsely- Activated} models through combining offline and online parallelization,

    M. Zhai, J. He, Z. Ma, Z. Zong, R. Zhang, and J. Zhai, “{SmartMoE}: Efficiently training {Sparsely- Activated} models through combining offline and online parallelization,” inUSENIX ATC, 2023

  16. [16]

    Micromoe: Fine- grained load balancing for mixture-of-experts with token scheduling,

    C. Zhao, W. Wu, L. Song, and Y . Xu, “Micromoe: Fine- grained load balancing for mixture-of-experts with token scheduling,”arXiv preprint arXiv:2511.16947, 2025

  17. [17]

    {PopFetcher}: Towards acceler- ated {Mixture-of-Experts} training via popularity based {Expert-Wise}prefetch,

    J. Zhang, C. Ma, X. Wang, Y . Nie, Y . Li, Y . Xu, X. Liao, B. Li, and H. Jin, “ {PopFetcher}: Towards acceler- ated {Mixture-of-Experts} training via popularity based {Expert-Wise}prefetch,” inUSENIX ATC, 2025

  18. [18]

    Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training,

    X. Liu, Y . Wang, F. Fu, X. Xiao, H. Li, J. Li, and B. Cui, “Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training,” inACM ASPLOS, 2026

  19. [19]

    Moe par- allel folding: Heterogeneous parallelism mappings for efficient large-scale moe model training with megatron core,

    D. Liu, Z. Yan, X. Yao, T. Liu, V . Korthikanti, E. Wu, S. Fan, G. Deng, H. Bai, J. Chang,et al., “Moe par- allel folding: Heterogeneous parallelism mappings for efficient large-scale moe model training with megatron core,”arXiv preprint arXiv:2504.14960, 2025. 13

  20. [20]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong,et al., “Deepseek-v3.2: Push- ing the frontier of open large language models,”arXiv preprint arXiv:2512.02556, 2025

  21. [21]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Q. Yu, Z. Zhang, R. Zhu, Y . Yuan, X. Zuo, Y . Yue, W. Dai, T. Fan, G. Liu, L. Liu,et al., “Dapo: An open- source llm reinforcement learning system at scale,” arXiv preprint arXiv:2503.14476, 2025

  22. [22]

    Competition-level code generation with alphacode,

    Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrit- twieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago,et al., “Competition-level code generation with alphacode,”Science, 2022

  23. [23]

    Generalizing verifiable instruction following.arXiv preprint arXiv:2507.02833,

    V . Pyatkin, S. Malik, V . Graf, H. Ivison, S. Huang, P. Dasigi, N. Lambert, and H. Hajishirzi, “Generaliz- ing verifiable instruction following,”arXiv preprint arXiv:2507.02833, 2025

  24. [24]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu,et al., “Deepseek- math: Pushing the limits of mathematical reason- ing in open language models,”arXiv preprint arXiv:2402.03300, 2024

  25. [25]

    Expert Parallelism Load Balancer (EPLB),

    “Expert Parallelism Load Balancer (EPLB),” 2025. https://github.com/deepseek-ai/eplb

  26. [26]

    Human-level control through deep reinforcement learning,

    V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Ve- ness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,et al., “Human-level control through deep reinforcement learning,”Nature, 2015

  27. [27]

    Soft actor-critic: Off-policy maximum entropy deep rein- forcement learning with a stochastic actor,

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep rein- forcement learning with a stochastic actor,” inInterna- tional Conference on Machine Learning (ICML), 2018

  28. [28]

    Stabilizing MoE reinforcement learning by aligning training and inference routers.arXiv preprint arXiv:2510.11370,

    W. Ma, H. Zhang, L. Zhao, Y . Song, Y . Wang, Z. Sui, and F. Luo, “Stabilizing moe reinforcement learning by aligning training and inference routers,”arXiv preprint arXiv:2510.11370, 2025

  29. [29]

    arXiv preprint arXiv:2512.01374 , year=

    C. Zheng, K. Dang, B. Yu, M. Li, H. Jiang, J. Lin, Y . Liu, H. Lin, C. Wu, F. Hu,et al., “Stabilizing reinforcement learning with llms: Formulation and practices,”arXiv preprint arXiv:2512.01374, 2025

  30. [30]

    Optimization and approximation in determinis- tic sequencing and scheduling: a survey,

    R. L. Graham, E. L. Lawler, J. K. Lenstra, and A. R. Kan, “Optimization and approximation in determinis- tic sequencing and scheduling: a survey,” inAnnals of discrete mathematics, 1979

  31. [31]

    At- tention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “At- tention is all you need,”Advances in Neural Information Processing Systems, 2017

  32. [32]

    Zero: Memory optimizations toward training trillion parame- ter models,

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y . He, “Zero: Memory optimizations toward training trillion parame- ter models,” inInternational Conference for High Perfor- mance Computing, Networking, Storage and Analysis, 2020

  33. [33]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019

  34. [34]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism,

    Y . Huang, Y . Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V . Le, Y . Wu,et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,”Advances in Neural Information Processing Systems, 2019

  35. [35]

    Pipedream: Generalized pipeline parallelism for dnn training,

    D. Narayanan, A. Harlap, A. Phanishayee, V . Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Za- haria, “Pipedream: Generalized pipeline parallelism for dnn training,” inACM SOSP, 2019

  36. [36]

    Gpqa: A graduate-level google-proof q&a benchmark,

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof q&a benchmark,” inFirst conference on language modeling, 2024

  37. [37]

    Orca: A distributed serving system for {Transformer-Based} generative models,

    G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.- G. Chun, “Orca: A distributed serving system for {Transformer-Based} generative models,” inUSENIX OSDI, 2022

  38. [38]

    Fast distributed inference serving for large language models,

    B. Wu, Y . Zhong, Z. Zhang, S. Liu, F. Liu, Y . Sun, G. Huang, X. Liu, and X. Jin, “Fast distributed infer- ence serving for large language models,”arXiv preprint arXiv:2305.05920, 2023

  39. [39]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  40. [40]

    Taming the long-tail: Efficient reasoning rl training with adaptive drafter,

    Q. Hu, S. Yang, J. Guo, X. Yao, Y . Lin, Y . Gu, H. Cai, C. Gan, A. Klimovic, and S. Han, “Taming the long-tail: Efficient reasoning rl training with adaptive drafter,” in ACM ASPLOS, 2026

  41. [41]

    Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

    R. Qin, W. He, W. Huang, Y . Zhang, Y . Zhao, B. Pang, X. Xu, Y . Shan, Y . Wu, and M. Zhang, “Seer: Online context learning for fast synchronous llm reinforcement learning,”arXiv preprint arXiv:2511.14617, 2025

  42. [42]

    Optimizing {RLHF} training for large language models with stage fusion,

    Y . Zhong, Z. Zhang, B. Wu, S. Liu, Y . Chen, C. Wan, H. Hu, L. Xia, R. Ming, Y . Zhu,et al., “Optimizing {RLHF} training for large language models with stage fusion,” inUSENIX NSDI, 2025. 14

  43. [43]

    Towards efficient reward service for rlvr with request- level flexibility and batch-level constraint,

    R. Zhu, M. Han, Y . Zhong, W. Xiao, X. Liu, and X. Jin, “Towards efficient reward service for rlvr with request- level flexibility and batch-level constraint,” inUSENIX NSDI, 2026

  44. [44]

    On the uncapacitated location problem,

    G. Cornuejols, M. Fisher, and G. L. Nemhauser, “On the uncapacitated location problem,” inAnnals of Discrete Mathematics, 1977

  45. [45]

    Linear-Programming-Based Load Balancer (LPLB),

    “Linear-Programming-Based Load Balancer (LPLB),” 2025.https://github.com/deepseek-ai/LPLB

  46. [46]

    Alibaba hpn: A data center network for large language model training,

    K. Qian, Y . Xi, J. Cao, J. Gao, Y . Xu, Y . Guan, B. Fu, X. Shi, F. Zhu, R. Miao,et al., “Alibaba hpn: A data center network for large language model training,” in ACM SIGCOMM, 2024

  47. [47]

    NVIDIA GTC: Accelerating Mixture of Experts Train- ing With Rail-Optimized InfiniBand Networking in Cru- soe Cloud,

    “NVIDIA GTC: Accelerating Mixture of Experts Train- ing With Rail-Optimized InfiniBand Networking in Cru- soe Cloud,” 2024.https://www.nvidia.com/en-us/ on-demand/session/gtc24-s63014/

  48. [48]

    GPUDirect RDMA,

    “GPUDirect RDMA,” 2026. https://developer. nvidia.com/gpudirect

  49. [49]

    Optimized primitives for inter-GPU communication,

    “Optimized primitives for inter-GPU communication,” 2026.https://github.com/NVIDIA/nccl

  50. [50]

    DeepEP: an efficient expert-parallel communication library,

    “DeepEP: an efficient expert-parallel communication library,” 2025. https://github.com/deepseek-ai/ DeepEP

  51. [51]

    {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving,

    Y . Zhong, S. Liu, J. Chen, J. Hu, Y . Zhu, X. Liu, X. Jin, and H. Zhang, “{DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving,” inUSENIX OSDI, 2024

  52. [52]

    Megascale- infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism,

    R. Zhu, Z. Jiang, C. Jin, P. Wu, C. A. Stuardo, D. Wang, X. Zhang, H. Zhou, H. Wei, Y . Cheng,et al., “Megascale- infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism,” inACM SIGCOMM, 2025

  53. [53]

    Disttrain: Addressing model and data heterogeneity with disaggregated train- ing for multimodal large language models,

    Z. Zhang, Y . Zhong, Y . Jiang, H. Hu, J. Sun, Z. Ge, Y . Zhu, D. Jiang, and X. Jin, “Disttrain: Addressing model and data heterogeneity with disaggregated train- ing for multimodal large language models,” inACM SIGCOMM, 2025

  54. [54]

    Heddle: A distributed orches- tration system for agentic rl rollout,

    Z. Zhang, Y . Zhong, C. Yang, C. Jin, B. Wu, X. Wei, Y . Liu, and X. Jin, “Heddle: A distributed orches- tration system for agentic rl rollout,”arXiv preprint arXiv:2603.28101, 2026

  55. [55]

    Bounds on multiprocessing timing anomalies,

    R. L. Graham, “Bounds on multiprocessing timing anomalies,”SIAM journal on Applied Mathematics, 1969

  56. [56]

    The SCIP optimization suite 9.0

    S. Bolusani, M. Besançon, K. Bestuzheva, A. Chmiela, J. Dionísio, T. Donkiewicz, J. van Doornmalen, L. Eifler, M. Ghannam, A. Gleixner,et al., “The scip optimization suite 9.0,”arXiv preprint arXiv:2402.17702, 2024

  57. [57]

    “Slime: An LLM post-training framework for RL scaling,” 2025. https://github.com/THUDM/slime/

  58. [58]

    “GPU optimized techniques for training transformer models at-scale,” 2025. https://github.com/NVIDIA/Megatron-LM

  59. [59]

    L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al., “SGLang: Efficient execution of structured language model programs,” Advances in Neural Information Processing Systems, 2024

  60. [60]

    “Open Multi-Processing,” 2026. https://www.openmp.org/

  61. [61]

    “NVIDIA Hopper Architecture In-Depth.” https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

  63. [63]

    J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao, “FlashAttention-3: Fast and accurate attention with asynchrony and low-precision,” Advances in Neural Information Processing Systems, 2024

  64. [64]

    “A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference,” 2026. https://github.com/NVIDIA/TransformerEngine

  65. [65]

    P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al., “Ray: A distributed framework for emerging AI applications,” in USENIX OSDI, 2018

  66. [66]

    A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, H. Zeng, J. Zhang, et al., “GLM-4.5: Agentic, Reasoning, and Coding (ARC) foundation models,” arXiv preprint arXiv:2508.06471, 2025

  67. [67]

    K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al., “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv:2501.12599, 2025

  68. [68]

    Y. Zhong, Z. Zhang, X. Song, H. Hu, C. Jin, B. Wu, N. Chen, Y. Chen, Y. Zhou, C. Wan, et al., “StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation,” arXiv preprint arXiv:2504.15930, 2025

  69. [69]

    J. Li, Y. Jiang, Y. Zhu, C. Wang, and H. Xu, “Accelerating distributed MoE training and inference with Lina,” in USENIX ATC, 2023

  70. [70]

    C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ram, et al., “Tutel: Adaptive mixture-of-experts at scale,” Proceedings of Machine Learning and Systems, 2023

  71. [71]

    X. Nie, P. Zhao, X. Miao, T. Zhao, and B. Cui, “HetuMoE: An efficient trillion-scale mixture-of-expert distributed training system,” arXiv preprint arXiv:2203.14685, 2022

  72. [72]

    C. Jin, Z. Jiang, Z. Bai, Z. Zhong, J. Liu, X. Li, N. Zheng, X. Wang, C. Xie, Q. Huang, et al., “MegaScale-MoE: Large-scale communication-efficient training of mixture-of-experts models in production,” arXiv preprint arXiv:2505.11432, 2025

  73. [73]

    C. Chen, X. Li, Q. Zhu, J. Duan, P. Sun, X. Zhang, and C. Yang, “Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning,” in ACM ASPLOS, 2024

  74. [74]

    Z. Jiang, H. Lin, Y. Zhong, Q. Huang, Y. Chen, Z. Zhang, Y. Peng, X. Li, C. Xie, S. Nong, et al., “MegaScale: Scaling large language model training to more than 10,000 GPUs,” in USENIX NSDI, 2024

  75. [75]

    S. Zhang, N. Zheng, H. Lin, Z. Jiang, W. Bao, C. Jiang, Q. Hou, W. Cui, S. Zheng, L.-W. Chang, et al., “Comet: Fine-grained computation-communication overlapping for mixture-of-experts,” Proceedings of Machine Learning and Systems, 2025

  76. [76]

    L.-W. Chang, W. Bao, Q. Hou, C. Jiang, N. Zheng, Y. Zhong, X. Zhang, Z. Song, C. Yao, Z. Jiang, et al., “Flux: Fast software-based communication overlap on GPUs through kernel fusion,” arXiv preprint arXiv:2406.06858, 2024

  77. [77]

    J. Liu, J. H. Wang, and Y. Jiang, “Janus: A unified distributed training framework for sparse mixture-of-experts models,” in ACM SIGCOMM, 2023

  78. [78]

    G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu, “HybridFlow: A flexible and efficient RLHF framework,” in EuroSys, 2025

  79. [79]

    W. Gao, Y. Zhao, T. Wu, S. Xiong, W. Wang, D. An, L. Cao, D. Muhtar, Z. Liu, H. Zhao, et al., “Rollart: Scaling agentic RL training via disaggregated infrastructure,” arXiv preprint arXiv:2512.22560, 2025

  80. [80]

    Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in International Conference on Machine Learning (ICML), 2023

Showing first 80 references.