pith. machine review for the scientific record.

arxiv: 2604.09107 · v1 · submitted 2026-04-10 · 💻 cs.DC · cs.AI


TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training


Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords weight transfer · LLM reinforcement learning · reference-oriented storage · distributed GPU training · elastic scaling · RDMA · fault tolerance · scalable rollout

The pith

By tracking replicated model weights on GPUs instead of copying them, TensorHub enables scalable elastic weight transfer for LLM reinforcement learning across heterogeneous clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM RL training demands frequent weight updates between training and inference workers, but moving large models creates long GPU stalls, especially at scale or across datacenters. The paper proposes Reference-Oriented Storage, an abstraction that creates the appearance of stored weight versions while simply recording which workers already hold copies in GPU memory. On a read request, the system serves the read by pointing directly at those existing copies. TensorHub realizes this idea, adding topology-aware transfers, consistency guarantees, and recovery logic. The result is that training clusters can grow or shrink dynamically with far less idle time on the GPUs doing rollouts.

Core claim

The paper establishes that weight transfer overhead in distributed LLM RL can be largely eliminated by replacing physical data movement with a reference mechanism: Reference-Oriented Storage maintains the fiction that specific weight versions exist in a shared store, yet stores nothing itself. Instead, it records the current GPU locations of replicated weights across inference workers and serves any read by direct reference to one of those locations.

What carries the argument

Reference-Oriented Storage (ROS): an abstraction that tracks workers holding model weights in GPU memory and serves reads via direct reference rather than creating or moving physical copies.
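The mechanism can be made concrete with a minimal sketch. This is an illustrative, in-process model, not the paper's API: class and method names (`Registry`, `publish`, `unpublish`, `read_ref`) are ours, chosen to mirror the publish/unpublish contract the paper describes. The registry stores no tensors, only which workers hold each published version; a read returns a reference to one holder, and the bulk transfer happens client-to-client.

```python
# Hypothetical sketch of ROS bookkeeping; stores references, never weights.
class Registry:
    def __init__(self):
        self.replicas = {}  # version -> set of worker ids holding it on GPU

    def publish(self, worker, version):
        """Worker commits that `version` is resident and immutable on it."""
        self.replicas.setdefault(version, set()).add(worker)

    def unpublish(self, worker, version):
        """Worker withdraws its copy before mutating or freeing it."""
        self.replicas.get(version, set()).discard(worker)

    def read_ref(self, version):
        """Serve a read by reference: point at any existing replica."""
        holders = self.replicas.get(version)
        if not holders:
            raise KeyError(f"no replica holds version {version}")
        return next(iter(holders))  # bulk transfer is then client-to-client


reg = Registry()
reg.publish("trainer-0", 3)
src = reg.read_ref(3)        # -> "trainer-0": fetch weights directly from it
reg.publish("rollout-2", 3)  # the fetcher becomes a replica itself
```

Note the property the paper leans on: once a rollout worker has fetched a version, it publishes itself as an additional replica, so replication fans out without the trainer serving every read.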

If this is right

  • RDMA bandwidth becomes fully utilized during weight transfers because no extra copies are generated.
  • Standalone rollout GPU stall time drops by up to 6.7x.
  • Weight updates for elastic rollouts accelerate by 4.8x.
  • Cross-datacenter rollout stall time falls by 19x.
  • The same system handles three distinct rollout patterns with only minor configuration changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reference approach could extend to any distributed workload where the same large data objects are already resident on many compute nodes.
  • Training clusters could incorporate a wider mix of GPU generations without rewriting transfer logic.
  • Energy and network costs of RL training may fall because the dominant data-movement traffic is removed.

Load-bearing premise

Model weights stay highly replicated across inference workers so that a reference can always locate a suitable copy, and direct GPU memory references can maintain strong consistency and fault tolerance without introducing hidden costs in heterogeneous or cross-datacenter settings.

What would settle it

Run the same rollout workloads with deliberately low weight replication (few copies per version) or with injected worker failures and measure whether stall-time reductions disappear or consistency overhead becomes measurable.
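The low-replication half of that experiment can be dry-run in miniature. The toy model below is our assumption, not the paper's methodology: each of `replicas` holders fails independently with probability `p_fail`, and a lookup succeeds if any holder survives. It only shows why replication factor is load-bearing for the reference approach.

```python
import random


def lookup_success_rate(replicas, p_fail, trials=100_000, seed=0):
    """Fraction of lookups where at least one of `replicas` holders is alive."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() >= p_fail for _ in range(replicas))
        for _ in range(trials)
    )
    return hits / trials


# With heavy replication, a worker failure barely matters; with one or two
# copies, the fallback path (re-replication or blocking) is exercised often.
for r in (1, 2, 8):
    print(r, round(lookup_success_rate(r, p_fail=0.1), 3))
```

If stall-time reductions survive even in the one-or-two-copy regime, the fallback paths are cheap; if not, the headline numbers depend on the high-replication premise.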

Figures

Figures reproduced from arXiv: 2604.09107 by Andrea C. Arpaci-Dusseau, Baoquan Zhong, Chenhao Ye, He Sun, Huaizheng Zhang, Kaihua Jiang, Mingcong Han, Qixiang Chen, Remzi H. Arpaci-Dusseau, Wang Zhang, Weidong Zhang, Wencong Xiao, Xiang Li, Xinyi Zhang.

Figure 1. RL Workload with Diverse Rollout Strategies.

Figure 2. Reference-Oriented Storage Workflow. The server only operates on lightweight references; the bulk weight transfer is directly between clients.

Figure 3. TensorHub Naming Scheme.

Figure 4. TensorHub Examples.

Figure 5. Pipeline Replication. Worker-0 is the only source, while both Worker-1 and Worker-2 are requesting data. To scale throughput, TensorHub schedules a pipeline where Worker-2 reads partially replicated data on Worker-1.

Figure 6. Sharding Consistency Example. At T0, shard:0 of replica-0 requests to replicate the latest version and observes version 12. At T1 and T2, replica-1 publishes version 13. Consequently, when shard:1 of replica-0 makes the same request at T3, it observes version 13 as the latest. Without coordination, this leads to divergence. TensorHub's transactional semantics ensure replica-0's requests both see version 12.

Figure 7. Microbenchmark Results.

Figure 8. Weight Transfer Flows with Standalone. TensorHub does not require the Ray driver to orchestrate weight transfer; each standalone rollout pulls weights on demand. (Panels: 9B, 36B, 260B, 1T; y-axis: total GPU stall in seconds.)

Figure 9. Standalone Rollout Results. Ideally, only standalone rollouts need to stall due to their dependence on weights; trainers can proceed without waiting. TensorHub not only eliminates trainer stalls, but also keeps standalone stall time consistently lower than alternatives.

Figure 10. Weight Transfer Flows with Standalone and …

Figure 11. Elastic Rollout Results.

Figure 12. Cross-Datacenter Rollout Results. Note that the left figure shows only the standalone stall time; UCX incurs additional trainer stall time that is omitted here.
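The pipeline in Figure 5 can be sketched in a few lines. This is a toy, single-process model (function and variable names are ours): Worker-1 pulls chunk t from Worker-0 at step t, while Worker-2 concurrently pulls chunk t-1 from Worker-1, so both replicas finish in roughly n+1 steps rather than the 2n of two sequential full copies from the lone source.

```python
# Toy model of pipelined replication: w0 is the sole source; w2 reads
# partially replicated chunks from w1 instead of queuing behind it at w0.
def pipeline_replicate(source_chunks):
    w1, w2 = [], []
    schedule = []  # (step, transfer) log
    for step, chunk in enumerate(source_chunks):
        # step t: Worker-1 pulls chunk t from Worker-0
        w1.append(chunk)
        schedule.append((step, f"w0->w1:{chunk}"))
        if step > 0:
            # concurrently, Worker-2 pulls the chunk Worker-1 already holds
            w2.append(w1[step - 1])
            schedule.append((step, f"w1->w2:{w1[step - 1]}"))
    # one extra step to drain the final chunk to Worker-2
    w2.append(w1[-1])
    schedule.append((len(source_chunks), f"w1->w2:{w1[-1]}"))
    return w1, w2, schedule


chunks = ["c0", "c1", "c2", "c3"]
w1, w2, sched = pipeline_replicate(chunks)
# both replicas end up complete in ~n+1 steps rather than 2n
```

The real system additionally has to pick the relay order topology-aware (which worker shares a NIC or a datacenter with which), but the chaining idea is the same.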
Original abstract

Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Reference-Oriented Storage (ROS), a storage abstraction that tracks GPU-resident LLM weights without physical copies and serves reads via direct reference, and builds TensorHub, a production system extending ROS with topology-optimized transfers, strong consistency, and fault tolerance. It claims TensorHub saturates RDMA bandwidth, adapts to three rollout workloads, reduces GPU stall time by up to 6.7x for standalone rollouts, accelerates elastic rollout weight updates by 4.8x, and cuts cross-datacenter rollout stall time by 19x, with deployment in production RL training.

Significance. If the performance multipliers and fault-tolerance properties hold under realistic churn and heterogeneity, the work could meaningfully advance scalable RL training for LLMs by minimizing data movement and stall time across distributed, elastic clusters. The production deployment and claimed adaptability to multiple workloads strengthen the case for practical impact in the field.

major comments (2)
  1. [ROS design and TensorHub implementation sections] The central performance claims rest on direct GPU referencing delivering strong consistency and fault tolerance with negligible overhead. However, the description does not quantify the costs of fallback paths (replication logic or blocking waits) under worker failure, memory eviction, or cross-datacenter network partitions; without this, the 19x cross-datacenter stall reduction cannot be assessed as general rather than failure-free.
  2. [Evaluation section] The reported speedups (6.7x stall reduction, 4.8x weight-update acceleration, 19x cross-datacenter) are presented without explicit baselines, error bars, workload parameter details, or measurements of overhead under injected failures or partitions. This makes it impossible to verify whether the gains are attributable to the ROS abstraction or are artifacts of idealized intra-cluster runs.
minor comments (2)
  1. [Abstract and Evaluation] The abstract states concrete multipliers but the full methods/results should include a table or figure breaking down per-workload contributions to the aggregate speedups.
  2. [ROS abstraction] Notation for ROS tracking of replicas and versioned weights could be clarified with a small diagram or pseudocode to avoid ambiguity in how 'direct reference' is implemented without copies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the thorough review of our manuscript. We value the referee's emphasis on ensuring the robustness of our performance claims under realistic failure conditions. Below, we provide point-by-point responses to the major comments and specify the changes we will make in the revised version.

Point-by-point responses
  1. Referee: [ROS design and TensorHub implementation sections] The central performance claims rest on direct GPU referencing delivering strong consistency and fault tolerance with negligible overhead. However, the description does not quantify the costs of fallback paths (replication logic or blocking waits) under worker failure, memory eviction, or cross-datacenter network partitions; without this, the 19x cross-datacenter stall reduction cannot be assessed as general rather than failure-free.

    Authors: We agree that quantifying the overhead of fallback paths is important for assessing the generality of our results. The current manuscript focuses on the common case of direct referencing, but we will expand the ROS design section to describe the fallback logic in more detail, including how replication is triggered and the associated costs. In the evaluation, we will add data on fallback frequency and overhead from our production deployment logs, which include instances of worker churn and network variability. This will help demonstrate that the 19x reduction holds under realistic conditions rather than only in failure-free runs.
    revision: partial

  2. Referee: [Evaluation section] The reported speedups (6.7x stall reduction, 4.8x weight-update acceleration, 19x cross-datacenter) are presented without explicit baselines, error bars, workload parameter details, or measurements of overhead under injected failures or partitions. This makes it impossible to verify whether the gains are attributable to the ROS abstraction or are artifacts of idealized intra-cluster runs.

    Authors: The evaluation does compare against standard approaches such as direct GPU-to-GPU transfers without ROS and traditional parameter servers, but we will make the baselines more explicit in the revised paper. We will include error bars from repeated experiments, detailed workload parameters (e.g., LLM sizes from 7B to 70B parameters, rollout batch sizes), and new results from experiments with injected failures and cross-datacenter partitions. These additions will confirm that the speedups are attributable to the ROS abstraction and hold under non-idealized conditions.
    revision: yes

Circularity Check

0 steps flagged

No circularity: system implementation and empirical measurements only

full rationale

The paper introduces Reference-Oriented Storage (ROS) as a new abstraction and TensorHub as its implementation for weight transfer in LLM RL training. All central claims (6.7x stall reduction, 4.8x elastic update, 19x cross-datacenter improvement) are presented as outcomes of system design choices and reported benchmark measurements rather than any derivation, equation, fitted parameter, or theorem. No mathematical predictions, self-referential definitions, or load-bearing self-citations appear in the abstract or described structure. The claims are grounded in direct performance evaluation on rollout workloads against external baselines, so no result reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central idea rests on domain assumptions about workload replication patterns and hardware capabilities rather than new mathematical axioms or fitted parameters.

axioms (2)
  • domain assumption Model weights are highly replicated across inference workers holding them on GPUs.
    Explicitly invoked as the foundation for avoiding physical storage in ROS.
  • domain assumption RDMA networks and GPU memory can be directly referenced for reads while preserving consistency.
    Required for the claimed saturation of bandwidth and stall reductions.
invented entities (2)
  • Reference-Oriented Storage (ROS) no independent evidence
    purpose: Abstraction that presents the illusion of stored weight versions while tracking and directly using existing GPU replicas.
    Core new concept introduced to eliminate data-movement overhead.
  • TensorHub no independent evidence
    purpose: Production system extending ROS with topology optimization, strong consistency, and fault tolerance.
    The implemented artifact delivering the reported performance.

pith-pipeline@v0.9.0 · 5578 in / 1481 out tokens · 76264 ms · 2026-05-10T17:13:22.598390+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

Reference graph

Works this paper leans on

51 extracted references · 29 canonical work pages · cited by 1 Pith paper · 9 internal anchors
