ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

Bo Zheng; Dakai An; Dilxat Muhtar; Jiamang Wang; Ju Huang; Lin Qu; Lunxi Cao; Shaopan Xiong; Siran Yang; Teng Ma

arxiv: 2605.06534 · v2 · pith:3DPOXTQ2new · submitted 2026-05-07 · 💻 cs.DC

Wei Gao, Yuheng Zhao, Dilxat Muhtar, Dakai An, Xuchun Shang, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Weixun Wang, Ju Huang, Teng Ma, Siran Yang, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang

Pith reviewed 2026-05-21 08:35 UTC · model grok-4.3

classification 💻 cs.DC

keywords agentic reinforcement learning · cooperative elasticity · GPU co-location · LLM serving · rollout scheduling · resource sharing · distributed training

The pith

Agentic RL training can borrow idle GPUs from serving clusters to increase throughput by 1.3 to 3.3 times without violating service level objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that rollout phases in agentic reinforcement learning, which vary sharply in compute demand, can draw on spare capacity from already-running serving clusters instead of waiting for new GPU allocations or sticking to fixed resources. Fixed systems leave GPUs idle during low-demand steps while elastic systems pay high costs for on-demand provisioning and availability limits. ROSE solves the sharing problem with safe co-location of models, quick weight synchronization across clusters, and dynamic scheduling that routes tasks to both dedicated and opportunistic GPUs. If this works, training time shrinks because the dominant rollout bottleneck receives elastic capacity from infrastructure that already exists for inference.

Core claim

ROSE realizes cooperative elasticity by co-locating heterogeneous serving and rollout models on the same GPUs through an SLO-safe executor that dynamically shares memory and compute, a weight transfer engine that uses shard-aware routing and sparsity for fast synchronization, and an elastic scheduler that routes rollouts across dedicated and opportunistic GPUs. Experiments across model sizes and cluster scales report end-to-end throughput gains of 1.3-3.3x over resource-fixed baselines and rollout time reductions of 1.2-1.5x over resource-elastic baselines, all without serving SLO violations.
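
To make the third component concrete, here is a minimal sketch of routing rollouts across dedicated and opportunistic capacity. It is an illustration under stated assumptions, not the scheduler described in the paper: the `GpuPool` type, the notion of "free slots", and the dedicated-first greedy policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuPool:
    """Hypothetical model of a GPU pool: a name and a count of free rollout slots."""
    name: str
    free_slots: int


def route_rollouts(num_rollouts: int, dedicated: GpuPool, opportunistic: GpuPool) -> dict[str, int]:
    """Greedy policy (an assumption): fill dedicated GPUs first, then spill to serving GPUs.

    Rollouts that fit in neither pool are reported under the key "queued".
    """
    if num_rollouts < 0:
        raise ValueError("num_rollouts must be non-negative")
    to_dedicated = min(num_rollouts, dedicated.free_slots)
    remaining = num_rollouts - to_dedicated
    to_opportunistic = min(remaining, opportunistic.free_slots)
    queued = remaining - to_opportunistic
    return {dedicated.name: to_dedicated, opportunistic.name: to_opportunistic, "queued": queued}


# Illustrative numbers only: 100 rollouts, 64 dedicated slots, 24 borrowed slots.
plan = route_rollouts(100, GpuPool("dedicated", 64), GpuPool("opportunistic", 24))
print(plan)  # → {'dedicated': 64, 'opportunistic': 24, 'queued': 12}
```

A real scheduler would also have to react when serving traffic reclaims borrowed capacity; this sketch only shows the static split.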

What carries the argument

The SLO-safe co-serving executor that dynamically shares memory and compute between serving and rollout models on the same GPUs while preserving latency guarantees.

If this is right

Rollout phases complete faster because they access on-demand capacity without allocation delays.
Overall post-training time for agentic RL decreases as the variable compute demand is met from existing serving pools.
Serving clusters support additional training workloads without requiring extra dedicated hardware.
Resource utilization rises because idle capacity in production inference fleets becomes available for training steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same co-location pattern could apply to other bursty workloads such as online fine-tuning or evaluation jobs that run alongside serving.
Cloud operators might redesign GPU fleets to treat serving and training as co-located rather than separate resource pools.
If weight transfer overhead stays low at larger scales, the approach could extend to multi-tenant environments with more frequent model updates.

Load-bearing premise

Serving clusters consistently leave substantial GPU compute and memory idle and can co-locate heterogeneous models dynamically while preserving serving SLOs under bursty traffic.

What would settle it

Deploying the system on a cluster with consistently high serving load and measuring either no throughput gain or any increase in serving latency violations.

Figures

Figures reproduced from arXiv: 2605.06534 by Bo Zheng, Dakai An, Dilxat Muhtar, Jiamang Wang, Ju Huang, Lin Qu, Lunxi Cao, Shaopan Xiong, Siran Yang, Teng Ma, Tianyuan Wu, Wei Gao, Wei Wang, Weixun Wang, Xuchun Shang, Yuheng Zhao.

**Figure 1.** Characterization of agentic RL: (a) The breakdown of end-to-end training time; (b) The long-tail distribution of rollout execution time; (c) The impact of prefill on rollouts; (d) The demand for resource elasticity.

**Figure 3.** Characterization of serving clusters and workloads: (a) Fluctuating serving traffic; (b) Serving GPU underutilization; (c) High allocation overhead; (d) Substantial communication overhead.

**Figure 4.** Scheme of Datacenter Infrastructure. … load and redirecting freed GPUs to rollouts. However, bidirectional autoscaling is fundamentally limited: reclaiming GPUs from rollouts back to serving requires evicting in-flight rollouts and reloading models, taking tens of seconds (Figure 3c) and far exceeding typical SLO budgets. Because serving traffic is bursty at second-level granularity, frequent mode switching…

**Figure 4.** Scheme of Datacenter Infrastructure. … scale rollout capacity on demand. Because these GPUs lie outside the steady-state deployment, each provisioning event requires model loading and runtime initialization, which can take tens of seconds (Figure 3c). Spot preemption and serverless lease expiration further trigger repeated teardown-and-reinitialize cycles, turning allocation overhead into a persistent throu…

**Figure 5.** System Architecture of ROSE. … node in the RL cluster to a GPU node in the serving cluster using Mooncake Store [45], over TCP (200 Gbps Ethernet) and RDMA (400 Gbps InfiniBand), shown in Figure 3d. Even with InfiniBand (which is uncommon across datacenters), it can take up to 145 s and grows quickly with model size, becoming a bottleneck for frequent weight synchronization. 4 System Design. System Overview. To address the above challenges, we introduce ROSE, the architecture of which is illustrated in …

**Figure 6.** Layer-wise sparsity ratio at 10th step. Shard-aware Weight Transfer. Training and serving clusters adopt heterogeneous parallelism strategies (e.g., training with TP8×PP2 and serving with TP4), requiring automatic shard mapping across configurations. Naive approaches require manual resharding or full model aggregation before transfer. ROSE automatically infers each parameter’s sharding rule by identifyi…

**Figure 7.** ROSE’s end-to-end throughput improvements. The data are normalized to the baseline’s first step. [Plot residue from adjacent panels: (a) FrozenLake-8B-GRPO and (b) ALFWorld-32B-GRPO, Score vs. Steps, legend ROLL, ROSE.]

**Figure 7.** (a)-(c) ROSE’s end-to-end throughput improvements compared with baselines; for each baseline we run 8B and 32B models. The data are normalized to the baseline’s first step. (d) End-to-end critic scores for 8B and 32B models using GRPO. [Plot residue from adjacent panels: (a) Elastic Baselines, Norm. Time vs. Model Size (8B, 32B), legend RL, RLBoost+, CoRL; a second panel plots Ratio (%) vs. Model Size.]

**Figure 8.** End-to-end critic scores for (a) 8B and (b) 32B models using the GRPO algorithm. [Plot residue from adjacent panels: (a) Micro Benchmark, Norm. Time vs. Model Size (8B, 32B), legend ROLL, RL, RLBoost, CoRL; (b) Scalability [8B, GRPO], Time (s) vs. Available Serving GPUs.]

**Figure 8.** End-to-end evaluation. (a) Rollout time and (b) Allocation overhead compared with elastic baselines. … 1.44× and 2.69× higher throughput on average (see Figure 7c). Although AReaL eliminates GPU idle time by continuously generating trajectories without waiting for training to complete, by expanding effective GPU capacity through cooperative elasticity, ROSE provides gains orthogonal to asynchronous execut…

**Figure 9.** End-to-end evaluation. (a) Rollout time compared with baselines. (b) Scalability of ROSE on Qwen3-8B with GRPO as Serving GPUs increase. Allocation Overhead. We further analyze the allocation overhead of elastic resource management schemes. We quantify the total preempted GPU time as the product of the number of preempted GPUs and the per-preemption overhead, and normalize it by the total GPU time. As sh…
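
The allocation-overhead metric defined in the Figure 9 excerpt can be written out directly. The sketch below follows that one-sentence definition; the function name and the example numbers (8 preempted GPUs, 30 s per preemption, 16 GPUs running for 600 s) are hypothetical and chosen only to show the arithmetic.

```python
def allocation_overhead_ratio(
    num_preempted_gpus: int,
    per_preemption_overhead_s: float,
    total_gpu_time_s: float,
) -> float:
    """Preempted GPU time (GPUs x per-preemption overhead) divided by total GPU time."""
    if total_gpu_time_s <= 0:
        raise ValueError("total_gpu_time_s must be positive")
    preempted_gpu_time_s = num_preempted_gpus * per_preemption_overhead_s
    return preempted_gpu_time_s / total_gpu_time_s


# Hypothetical inputs, not values from the paper.
ratio = allocation_overhead_ratio(
    num_preempted_gpus=8,
    per_preemption_overhead_s=30.0,
    total_gpu_time_s=16 * 600.0,
)
print(f"{ratio:.1%}")  # → 2.5%
```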

**Figure 10.** [Transfer Engine] (a) Cross-cluster weight transfer time under different optimizations; each optimization is additive over the previous one. (b) Timeline breakdown of shard-aware and sparsity-aware transfer for Qwen3-32B. D2S denotes the dense-to-sparse conversion, and S2D denotes the sparse-to-dense conversion. (c) Sensitivity of shard-aware and sparsity-aware transfer of different LLMs to cross-cluster …

**Figure 12.** [Analysis of Sparsity]. (a) The sparsity of weight differentials across steps for Qwen3-8B. (b) The sensitivity of transfer engine to sparsity. … only the shards it hosts. This further reduces communication time by 1.8× (Qwen3-8B) and 1.3× (Qwen3-32B). Moreover, Figure 10b (top) illustrates the Qwen3-32B timeline. On the sender side, each training worker streams ∼60 buckets (64 MB each); each bucket take…

**Figure 11.** [Analysis of Sparsity]. (a) The sparsity of weight differentials across steps for Qwen3-8B. (b) The sensitivity of transfer engine to sparsity. … diminishes. Beyond ∼20%, sparse-format metadata (e.g., indices) and (de)sparsification overhead begin to offset the reduction in transmitted weights. In our workloads, the measured non-zero fraction remains well below this threshold, enabling consistently effic…

**Figure 13.** ROSE under fully asynchronous RL training workloads. We monitor the average throughput between consecutive RL steps. 6.4 Effectiveness of Rollout Scheduler. We follow the end-to-end setups and evaluate the elastic rollout scheduler using Qwen3-8B and Qwen3-32B with GRPO algorithm for the first five RL steps

**Figure 12.** The system throughput with different per-device batch sizes. [Qwen3-8B/32K] B Spot instance trace. We extract the spot-instance traces for the 8B model from Seg.B in …

**Figure 14.** The system throughput with different per-device batch sizes. [Qwen3-8B/32K] B Spot instance trace. We extract the spot-instance traces for the 8B model from Seg.B in …

**Figure 14.** Sensitivity Analysis of Serving GPU Availability. F Timeline Breakdown of Weight Transfer

**Figure 15.** Figure 15 provides a detailed timeline breakdown of shard-aware and sparsity-aware weight transfer for Qwen3-32B. The top timeline illustrates shard-aware transfer: on the sender side, each training worker streams ∼60 buckets (64 MB each); each bucket takes 0.2–0.4 s to push, for a total of 65 seconds. On the receiver side, serving workers pull the corresponding weight buckets from the relay and load them into GPU…
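
The figure excerpts mention mapping shards between different parallelism layouts (training with TP8×PP2, serving with TP4) so that each receiver fetches only the shards it hosts. A minimal sketch of that overlap computation follows. It assumes a parameter split evenly and contiguously along a single dimension, ignores pipeline stages, and uses hypothetical function names; it does not reproduce how ROSE infers each parameter's sharding rule.

```python
def shard_range(dim: int, world: int, rank: int) -> tuple[int, int]:
    """Half-open [start, end) slice owned by `rank` when `dim` is split evenly over `world` ranks."""
    if dim % world != 0:
        raise ValueError("dim must be divisible by the number of ranks")
    size = dim // world
    return rank * size, (rank + 1) * size


def shard_overlaps(dim: int, src_tp: int, dst_tp: int, dst_rank: int) -> list[tuple[int, int, int]]:
    """Return (src_rank, start, end) pieces, in global indices, that dst_rank must fetch."""
    dst_start, dst_end = shard_range(dim, dst_tp, dst_rank)
    pieces = []
    for src_rank in range(src_tp):
        src_start, src_end = shard_range(dim, src_tp, src_rank)
        start, end = max(src_start, dst_start), min(src_end, dst_end)
        if start < end:  # keep only non-empty intersections
            pieces.append((src_rank, start, end))
    return pieces


# Serving rank 1 of TP4 needs the slices held by training ranks 2 and 3 of TP8.
print(shard_overlaps(dim=4096, src_tp=8, dst_tp=4, dst_rank=1))
# → [(2, 1024, 1536), (3, 1536, 2048)]
```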

Original abstract

Agentic reinforcement learning (RL) is reshaping LLM post-training, but end-to-end training time is dominated by compute-intensive, multi-turn rollouts whose resource demand varies significantly across training steps. Resource-fixed systems cannot adapt to this variation, while resource-elastic approaches that provision external GPUs on demand suffer from high allocation overhead and limited availability. We observe that serving clusters leave substantial GPU compute and memory idle, and propose cooperative elasticity: sharing already-deployed serving GPUs with rollout workloads to provide on-demand elastic capacity. Realizing this is non-trivial, as it must preserve serving SLOs under bursty traffic while minimizing cross-cluster communication overhead. We present ROSE, a system that realizes cooperative elasticity for agentic RL post-training, comprising three components: (1) an SLO-safe co-serving executor that co-locates heterogeneous serving and rollout models on the same GPUs, dynamically sharing memory and compute while preserving serving SLOs; (2) a cross-cluster weight transfer engine that leverages shard-aware routing and weight sparsity for fast synchronization; and (3) an elastic rollout scheduler that dynamically routes rollouts across dedicated and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves end-to-end throughput by 1.3–3.3× over resource-fixed baselines and reduces rollout time by 1.2–1.5× over resource-elastic baselines, with no serving SLO violations.
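
The abstract's second component relies on weight sparsity to shrink cross-cluster transfers. The sketch below shows why a sparse encoding only pays off below some non-zero fraction. The encoding is an assumption made for illustration (2-byte values plus one 4-byte index per non-zero entry), and it counts payload bytes only, so it is not expected to reproduce the ∼20% threshold quoted in the figure excerpts, which also reflects (de)sparsification overhead.

```python
def dense_payload_bytes(num_params: int, value_bytes: int = 2) -> int:
    """Bytes needed to send every weight (value_bytes=2 assumes a 16-bit format)."""
    return num_params * value_bytes


def sparse_payload_bytes(
    num_params: int,
    nonzero_fraction: float,
    value_bytes: int = 2,
    index_bytes: int = 4,
) -> int:
    """Bytes needed to send only non-zero entries, each with one index (assumed encoding)."""
    if not 0.0 <= nonzero_fraction <= 1.0:
        raise ValueError("nonzero_fraction must be in [0, 1]")
    nnz = round(num_params * nonzero_fraction)
    return nnz * (value_bytes + index_bytes)


def break_even_fraction(value_bytes: int = 2, index_bytes: int = 4) -> float:
    """Non-zero fraction at which sparse and dense payloads are the same size."""
    return value_bytes / (value_bytes + index_bytes)


# Hypothetical example: 8e9 parameters with 5% non-zero weight deltas.
n = 8_000_000_000
print(sparse_payload_bytes(n, 0.05) / dense_payload_bytes(n))
print(break_even_fraction())
```

Under these assumptions the payload-only break-even sits at one third; any conversion cost on the sender or receiver pushes the practical threshold lower.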

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents ROSE, a system realizing cooperative elasticity for agentic RL post-training. It co-locates rollout workloads on already-deployed serving GPUs via an SLO-safe co-serving executor, a shard-aware cross-cluster weight transfer engine, and an elastic rollout scheduler. The central empirical claim is that this yields 1.3–3.3× end-to-end throughput gains over resource-fixed baselines and 1.2–1.5× rollout-time reductions over resource-elastic baselines across model sizes and cluster scales, with no serving SLO violations.

Significance. If the reported speedups and SLO preservation hold under production burst patterns, ROSE would demonstrate a practical way to harvest idle serving capacity for variable-demand RL rollouts, reducing the need for dedicated elastic provisioning. The three-component design and cross-cluster synchronization techniques are concrete contributions to systems for heterogeneous co-location.

major comments (3)

[§5] §5 (Experiments): The headline 1.3–3.3× throughput and 1.2–1.5× rollout-time numbers are presented without reported variance, number of runs, or precise definition of how serving SLOs (latency, throughput) were measured under the simulated bursty traffic; this makes it impossible to judge whether the gains are robust or sensitive to post-hoc tuning.
[§2, §3.1] §2 and §3.1: The enabling premise that serving clusters consistently leave substantial GPU compute and memory idle under bursty traffic is stated as an observation but is not backed by any production traces, utilization histograms, or worst-case analysis of co-location feasibility for heterogeneous models; if sustained utilization is higher than assumed, the opportunistic capacity and therefore the reported speedups disappear.
[§4.3] §4.3 (SLO-safe co-serving executor): The dynamic memory and compute sharing mechanism is described at a high level, yet no formal bound or micro-benchmark isolates the latency impact on the serving model when rollout jobs are co-located at varying intensities; the claim of “no SLO violations” therefore rests entirely on the specific experimental traffic rather than a general guarantee.

minor comments (3)

[Table 1, §4.1] Table 1 and §4.1: Model-size notation (e.g., “7B”, “70B”) is used inconsistently with the text; align the table headers with the exact parameter counts reported in the experimental setup.
[Figure 4] Figure 4: Axis labels and legend text are too small to read at standard print size; increase font size or split into two panels.
[§6] §6 (Related Work): The discussion of prior elastic scheduling and co-location systems omits several recent papers on GPU sharing for inference; add citations to complete the positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor, motivation, and guarantees that we will address to improve the manuscript. We respond to each major comment below and indicate the planned revisions.

Referee: [§5] §5 (Experiments): The headline 1.3–3.3× throughput and 1.2–1.5× rollout-time numbers are presented without reported variance, number of runs, or precise definition of how serving SLOs (latency, throughput) were measured under the simulated bursty traffic; this makes it impossible to judge whether the gains are robust or sensitive to post-hoc tuning.

Authors: We agree that reporting statistical details is essential for assessing robustness. In the revised manuscript we will add the number of runs performed for each configuration (five independent runs), include error bars or standard deviations in the relevant figures, and provide an explicit description of the SLO measurement methodology. This will include the precise latency percentile (99th), throughput threshold, and how bursty traffic was generated and monitored to ensure no violations occurred. revision: yes
Referee: [§2, §3.1] §2 and §3.1: The enabling premise that serving clusters consistently leave substantial GPU compute and memory idle under bursty traffic is stated as an observation but is not backed by any production traces, utilization histograms, or worst-case analysis of co-location feasibility for heterogeneous models; if sustained utilization is higher than assumed, the opportunistic capacity and therefore the reported speedups disappear.

Authors: We acknowledge that the current motivation section relies on general observations rather than public production traces. We will expand §2 with utilization histograms generated from our bursty-traffic simulator across a range of arrival rates and model sizes, plus a new worst-case analysis subsection that quantifies the minimum idle capacity needed for net gains and shows how speedups degrade under higher sustained utilization. While we cannot release proprietary production traces, these additions will make the feasibility argument more concrete and transparent. revision: partial
Referee: [§4.3] §4.3 (SLO-safe co-serving executor): The dynamic memory and compute sharing mechanism is described at a high level, yet no formal bound or micro-benchmark isolates the latency impact on the serving model when rollout jobs are co-located at varying intensities; the claim of “no SLO violations” therefore rests entirely on the specific experimental traffic rather than a general guarantee.

Authors: We will revise §4.3 to include dedicated micro-benchmarks that isolate serving-model latency under controlled rollout intensities, varying both compute and memory sharing ratios while holding serving traffic fixed. These experiments will report latency distributions and the maximum rollout intensity at which the 99th-percentile SLO remains satisfied. Although deriving a tight formal latency bound is difficult given nondeterministic GPU scheduling, the added micro-benchmarks will provide empirical evidence beyond the end-to-end traffic scenarios and clarify the operating regime in which SLOs are preserved. revision: yes
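
The responses above refer to a 99th-percentile latency SLO. A minimal sketch of such a check is given below, assuming the nearest-rank percentile definition and a hypothetical latency budget in milliseconds; the paper's actual measurement methodology is not specified in the text reproduced here.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def slo_satisfied(latencies_ms: list[float], slo_ms: float, pct: float = 99.0) -> bool:
    """True if the pct-th percentile latency is within the SLO budget."""
    return percentile_nearest_rank(latencies_ms, pct) <= slo_ms


# Hypothetical latencies of 1..100 ms: the nearest-rank p99 is 99 ms.
latencies = [float(x) for x in range(1, 101)]
print(percentile_nearest_rank(latencies, 99.0))  # → 99.0
print(slo_satisfied(latencies, slo_ms=99.0))     # → True
print(slo_satisfied(latencies, slo_ms=98.0))     # → False
```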

Circularity Check

0 steps flagged

No circularity in ROSE derivation chain

The paper is a systems description of ROSE for cooperative elasticity, with three engineering components (SLO-safe co-serving executor, cross-cluster weight transfer engine, elastic rollout scheduler) and performance claims supported solely by experimental measurements across model sizes and cluster scales. No mathematical derivations, equations, fitted parameters presented as predictions, or first-principles results appear in the provided text. The idle-capacity observation is an empirical premise, not a derived quantity, and the speedups are direct experimental outcomes rather than reductions to inputs by construction. The derivation chain is therefore self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that serving clusters have substantial idle capacity and that co-location can be engineered to protect SLOs; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Serving clusters leave substantial GPU compute and memory idle under normal operation.
Stated in the abstract as the observation enabling cooperative elasticity.
domain assumption Co-location of heterogeneous serving and rollout models can preserve serving SLOs under bursty traffic.
Core premise required for the SLO-safe co-serving executor to be viable.

pith-pipeline@v0.9.0 · 5842 in / 1451 out tokens · 27384 ms · 2026-05-21T08:35:03.552733+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We present ROSE, a system that realizes cooperative elasticity for agentic RL post-training, comprising three components: (1) an SLO-safe co-serving executor that co-locates heterogeneous serving and rollout models on the same GPUs, dynamically sharing memory and compute while preserving serving SLOs; (2) a cross-cluster weight transfer engine that leverages shard-aware routing and weight sparsity for fast synchronization; and (3) an elastic rollout scheduler that dynamically routes rollouts across dedicated and opportunistic serving GPUs.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Experiments across multiple model sizes and cluster scales show that ROSE improves end-to-end throughput by 1.3–3.3× over resource-fixed baselines

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 17 internal anchors

[1]

Alibaba Cloud. 2026. Creating a GPU function.https://www.alibabac loud.com/help/en/functioncompute/fc/user-guide/creating-a-gpu- function/. (2026). Accessed: 2026-04

work page 2026
[2]

Li, Ryota Tomioka, and Milan Vojnovic

Dan Alistarh, Demjan Grubic, Jerry Z. Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: communication-efficient SGD via gradient quantization and encoding. InProceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 1707–1718

work page 2017
[3]

Romil Bhardwaj, Zhengxu Xia, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Nikolaos Karianakis, Kevin Hsieh, Paramvir Bahl, and Ion Stoica. 2022. Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers. In19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 119–135.http...

work page 2022
[4]

Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E

Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica

work page
[5]

arXiv preprint arXiv:2511.16108(2025)

SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent. arXiv preprint arXiv:2511.16108(2025)

work page arXiv 2025
[6]

Rongxin Cheng, Kai Zhou, Xingda Wei, Siyuan Liu, Mingcong Han, Mingjing Ai, Yeju Zhou, Baoquan Zhong, Wencong Xiao, Rong Chen, and Haibo Chen. 2025. Fast LLM Post-training via Decoupled and Best-of-N Speculation.arXiv preprint arXiv:2511.16193(2025)

work page arXiv 2025
[7]

Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang

work page
[8]

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference.arXiv preprint arXiv:2510.09665(2025)

work page arXiv 2025
[9]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. InICML

work page 2024
[10]

Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. Check-N-Run: A check- pointing system for training deep learning recommendation models. In19th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 22). 929–943

work page 2022
[11]

Farama Foundation. 2024. Gymnasium - FrozenLake Environment. https://gymnasium.farama.org/environments/toy_text/frozen_lake/. (2024). Accessed: 2025-09

work page 2024
[12]

Jiawei Fei, Chen-Yu Ho, Atal N Sahu, Marco Canini, and Amedeo Sapio

work page
[13]

InProceedings of the 2021 ACM SIGCOMM 2021 Conference

Efficient sparse collective communication and its application to accelerate distributed deep learning. InProceedings of the 2021 ACM SIGCOMM 2021 Conference. 676–691

work page 2021
[16]

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. 2025. AReaL: A Large-Scale Asynchronous Rein- forcement Learning System for Language Reasoning.arXiv preprint arXiv:2505.10978(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low- Latency Serverless Inference for Large Language Models. InOSDI’24

work page 2024
[18]

Wei Gao, Zhuoyuan Ouyang, Peng Sun, Tianwei Zhang, and Yonggang Wen. 2025. IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters.IEEE Transactions on Parallel and Distributed Systems36, 6 (2025), 1071–1086.https://doi.org/10.1109/ TPDS.2025.3553137

work page arXiv 2025
[19]

Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. 2025. RollPacker: Mitigating Long- Tail Rollouts for Fast, Synchronous RL Post-Training.arXiv preprint arXiv:2509.21009(2025)

work page arXiv 2025
[20]

Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. 2025. RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure.arXiv preprint arXiv:2512.22560(2025)

work page arXiv 2025
[21]

Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen

work page
[22]

In16th USENIX Symposium on Oper- ating Systems Design and Implementation (OSDI 22)

Microsecond-scale preemption for concurrent {GPU- accelerated} {DNN} inferences. In16th USENIX Symposium on Oper- ating Systems Design and Implementation (OSDI 22). 539–558

work page
[23]

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. 2025. AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post- Training.arXiv preprint arXiv:2507.01...

work page arXiv 2025
[24]

Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, and Chenyi Zhuang. 2025. Exploring Superior Func- tion Calls via Reinforcement Learning.arXiv preprint arXiv:2508.05118 13 (2025)

work page arXiv 2025
[25]

Squillante

Mor Harchol-Balter, Cuihong Li, Takayuki Osogami, Alan Scheller- Wolf, and Mark S. Squillante. 2003. Cycle stealing under immediate dispatch task assignment. InProceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’03). As- sociation for Computing Machinery, New York, NY, USA, 274–285. https://doi.org/10.1145/777...

work page doi:10.1145/777412.777462 2003
[26]

Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, Evelina Bakhturina, Vahid Noroozi, Sandeep Subramanian, Koluguri Nithin, Huang Jocelyn, Fei Jia, Jagadeesh Balam, Xuesong Yang, Micha Livne, Yi Dong, Sean Naren, and Boris Ginsburg. 2025. NeMo: a toolkit for Conversational AI and Large Language Models. (2025).https://github.com/NVIDIA/NeMo

work page 2025
[27]

Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. 2025. History Rhymes: Accelerating LLM Rein- forcement Learning with RhymeRL.arXiv preprint arXiv:2508.18588 (2025)

work page arXiv 2025
[28]

Jian Hu, Xibin Wu, Weixun Wang, Dehao Zhang, Yu Cao, et al. 2024. OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework.arXiv preprint arXiv:2405.11143(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

work page 2024
[30]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?arXiv preprint arXiv:2310.06770(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings. In Proceedings of the 50th annual international symposium on computer architecture. 1–14

work page 2023
[32]

Gonzalez, Hao Zhang, and Ion Sto- ica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

work page 2023
[33]

Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, and Cong Wang. 2023. Lyra: Elastic Scheduling for Deep Learning Clusters. In Proceedings of the Eighteenth European Conference on Computer Systems. Association for Computing Machinery, New York, NY, USA, 835–850. https://doi.org/10.1145/3552326.3587445

[34]

Yufei Li, Zexin Li, Yinglun Zhu, and Cong Liu. 2025. Lemix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems. In 2025 IEEE Real-Time Systems Symposium (RTSS)

[35]

Zhiwei Li, Yong Hu, and Wenqing Wang. 2025. Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning. arXiv preprint arXiv:2508.19598 (2025)

[36]

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 663–679

[37]

Hwijoon Lim, Juncheol Ye, Sangeetha Abdu Jyothi, and Dongsu Han. 2024. Accelerating model training in multi-cluster environments with consumer-grade GPUs. In Proceedings of the ACM SIGCOMM 2024 Conference. 707–720

[39]

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. 2025. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239 (2025)

[40]

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng. 2025. Part II: ROLL Flash – Accelerating RLVR and Agentic Training with Asynchrony. arXiv preprint arXiv:251...

[41]

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, and Hongsheng Li. 2025. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning. arXiv preprint arXiv:2503.21620 (2025)

[42]

Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14...

[43]

Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. 2025. GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv preprint arXiv:2504.10458 (2025)

[44]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerging AI applications. In 13th USENIX symposium on operating systems design and implementation (OSDI 18). 561–577

[45]

Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, and Virginia Smith. 2024. Grass: Compute efficient low-memory llm training with structured sparse gradients. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 14978–15003

[46]

OpenPipe. 2025. Serverless RL. (2025). https://openpipe.ai/blog/serverless-rl

[47]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2025. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA ’24). IEEE Press, 118–132. https://doi.org/10.1109/ISCA59077.2024.00019

[48]

Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, and Harry Xu. 2025. ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving. arXiv preprint arXiv:2410.01228 (2025)

[49]

Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. 2025. Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning. arXiv preprint arXiv:2511.14617 (2025)

[51]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage (2024)

[52]

Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, and Rodrigo Fonseca. 2025. ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving. In Proceedings of the 2025 ACM Symposium on Cloud Computing (SoCC 2025). ...

[53]

Mrinal Rawat, Ambuje Gupta, Rushil Goomer, Alessandro Di Bari, Neha Gupta, and Roberto Pieraccini. 2025. Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents. arXiv preprint arXiv:2505.09970 (2025)

[54]

Amir Sarfi, Benjamin Thérien, Joel Lidin, and Eugene Belilovsky. 2025. Communication Efficient LLM Pre-training with SparseLoCo. (2025). arXiv:cs.LG/2508.15706 https://arxiv.org/abs/2508.15706

[55]

Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)

[56]

SGLang Team. 2025. SGLang: Fast Serving Framework for Large Language Models. https://github.com/sgl-project/sglang. (2025). Version 0.4

[57]

Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junxiong Wang. 2025. Beat the long tail: Distribution-Aware Speculative Decoding for RL Training. arXiv preprint arXiv:2511.13841 (2025)

[58]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

[59]

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. 2025. Laminar: A Scalable Asynchronous RL Post-Training Framework. arXiv preprint arXiv:2510.12633 (2025)

[60]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A Flexible and Efficient RLHF Framework. arXiv preprint arXiv:2409.19256 (2024)

[61]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. verl: Volcano Engine Reinforcement Learning for LLM. https://github.com/volcengine/verl. (2024)

[62]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

[63]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv preprint arXiv:2010.03768 (2021)

[64]

Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. 2025. Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning. arXiv preprint arXiv:2505.01441 (2025)

[66]

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362

[67]

The Terminal-Bench Team. 2025. Terminal-Bench: A Benchmark for AI Agents in Terminal Environments. (2025). https://github.com/laude-institute/terminal-bench

[68]

Thinking Machines AI. 2025. Tinker. https://thinkingmachines.ai/tinker/. (2025). Accessed: 2026-02

[69]

Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. KVCache cache in the wild: characterizing and optimizing KVCache cache at a large cloud provider. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’25). USENIX Association, USA, Article 28...

[70]

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Si...

[71]

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qip...

[72]

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems. arXiv preprint arXiv:2401.17644 (2025)

[73]

Zhuang Wang, Zhaozhuo Xu, Jingyi Xi, Yuke Wang, Anshumali Shrivastava, and TS Eugene Ng. 2025. ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 537–556

[74]

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. 2025. Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 28489–28503

[75]

Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, and Wei Wang. 2025. RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training. arXiv preprint arXiv:2512.11306 (2025)

[76]

Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, and Ion Stoica. 2025. RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs. arXiv preprint arXiv:2510.19225 (2025)

[77]

Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongsh...

[78]

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU pooling for concurrent LLM serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 1030–1045

[79]

Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic scaling on GPU clusters for deep learning. In USENIX OSDI

[80]

Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, and Ion Stoica. 2025. Towards Efficient and Practical GPU Multitasking in the Era of LLM. arXiv preprint arXiv:2508.08448 (2025)

[81]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025)


[1]

Alibaba Cloud. 2026. Creating a GPU function. https://www.alibabacloud.com/help/en/functioncompute/fc/user-guide/creating-a-gpu-function/. (2026). Accessed: 2026-04

[2]

Dan Alistarh, Demjan Grubic, Jerry Z. Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: communication-efficient SGD via gradient quantization and encoding. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 1707–1718

[3]

Romil Bhardwaj, Zhengxu Xia, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Nikolaos Karianakis, Kevin Hsieh, Paramvir Bahl, and Ion Stoica. 2022. Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 119–135. http...

[4]

Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2025. SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent. arXiv preprint arXiv:2511.16108 (2025)

[6]

Rongxin Cheng, Kai Zhou, Xingda Wei, Siyuan Liu, Mingcong Han, Mingjing Ai, Yeju Zhou, Baoquan Zhong, Wencong Xiao, Rong Chen, and Haibo Chen. 2025. Fast LLM Post-training via Decoupled and Best-of-N Speculation. arXiv preprint arXiv:2511.16193 (2025)

[7]

Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang. 2025. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. arXiv preprint arXiv:2510.09665 (2025)

[9]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. In ICML

[10]

Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. Check-N-Run: A checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 929–943

[11]

Farama Foundation. 2024. Gymnasium - FrozenLake Environment. https://gymnasium.farama.org/environments/toy_text/frozen_lake/. (2024). Accessed: 2025-09

[12]

Jiawei Fei, Chen-Yu Ho, Atal N Sahu, Marco Canini, and Amedeo Sapio. 2021. Efficient sparse collective communication and its application to accelerate distributed deep learning. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference. 676–691

[16]

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. 2025. AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. arXiv preprint arXiv:2505.10978 (2025)

[17]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In OSDI’24

[18]

Wei Gao, Zhuoyuan Ouyang, Peng Sun, Tianwei Zhang, and Yonggang Wen. 2025. IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters. IEEE Transactions on Parallel and Distributed Systems 36, 6 (2025), 1071–1086. https://doi.org/10.1109/TPDS.2025.3553137

[19]

Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. 2025. RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training. arXiv preprint arXiv:2509.21009 (2025)

[20]

Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. 2025. RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure. arXiv preprint arXiv:2512.22560 (2025)

[21]

Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 539–558

[23]

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. 2025. AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training. arXiv preprint arXiv:2507.01...

[24]

Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, and Chenyi Zhuang. 2025. Exploring Superior Function Calls via Reinforcement Learning. arXiv preprint arXiv:2508.05118 (2025)

[25]

Mor Harchol-Balter, Cuihong Li, Takayuki Osogami, Alan Scheller-Wolf, and Mark S. Squillante. 2003. Cycle stealing under immediate dispatch task assignment. In Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’03). Association for Computing Machinery, New York, NY, USA, 274–285. https://doi.org/10.1145/777...

[26]

Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, Evelina Bakhturina, Vahid Noroozi, Sandeep Subramanian, Koluguri Nithin, Huang Jocelyn, Fei Jia, Jagadeesh Balam, Xuesong Yang, Micha Livne, Yi Dong, Sean Naren, and Boris Ginsburg. 2025. NeMo: a toolkit for Conversational AI and Large Language Models. (2025). https://github.com/NVIDIA/NeMo

[27]

Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. 2025. History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL. arXiv preprint arXiv:2508.18588 (2025)

[26] [28]

Jian Hu, Xibin Wu, Weixun Wang, Dehao Zhang, Yu Cao, et al. 2024. OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework.arXiv preprint arXiv:2405.11143(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [29]

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

work page 2024

[28] [30]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?arXiv preprint arXiv:2310.06770(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [31]

Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings. In Proceedings of the 50th annual international symposium on computer architecture. 1–14

work page 2023

[30] [32]

Gonzalez, Hao Zhang, and Ion Sto- ica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

work page 2023

[31] [33]

Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, and Cong Wang. 2023. Lyra: Elastic Scheduling for Deep Learning Clusters. In Proceedings of the Eighteenth European Conference on Computer Systems. Association for Computing Machinery, New York, NY, USA, 835–850. https://doi.org/10.1145/3552326.3587445

work page doi:10.1145/3552326.3587445 2023

[32] [34]

Yufei Li, Zexin Li, Yinglun Zhu, and Cong Liu. 2025. Lemix: Unified Scheduling for Llm Training and Inference on Multi-Gpu Systems. In 2025 IEEE Real-Time Systems Symposium (RTSS)

work page 2025

[33] [35]

Zhiwei Li, Yong Hu, and Wenqing Wang. 2025. Encouraging Good Pro- cesses Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning.arXiv preprint arXiv:2508.19598(2025)

work page arXiv 2025

[34] [36]

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gon- zalez, et al. 2023. {AlpaServe}: Statistical multiplexing with model parallelism for deep learning serving. In17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 663–679

work page 2023

[35] [37]

Hwijoon Lim, Juncheol Ye, Sangeetha Abdu Jyothi, and Dongsu Han

work page

[36] [38]

InProceedings of the ACM SIGCOMM 2024 Con- ference

Accelerating model training in multi-cluster environments with consumer-grade gpus. InProceedings of the ACM SIGCOMM 2024 Con- ference. 707–720

work page 2024

[37] [39]

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. 2025. Infigui-r1: Ad- vancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [40]

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng. 2025. Part II: ROLL Flash – Accelerating RLVR and Agentic Training with Asynchrony.arXiv preprint arXiv:251...

work page arXiv 2025

[39] [41]

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, and Hongsheng Li. 2025. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning.arXiv preprint arXiv:2503.21620(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [42]

Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice We- ber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level.https: //pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source- 14B-Coder-at-O3-mini-Level-1cf81902c14...

work page 2025

[41] [43]

Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. 2025. GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv preprint arXiv:2504.10458(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [44]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerg- ing {AI} applications. In13th USENIX symposium on operating systems design and implementation (OSDI 18). 561–577

work page 2018

[43] [45]

Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, and Virginia Smith. 2024. Grass: Compute efficient low-memory llm training with structured sparse gradients. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 14978–15003

work page 2024

[44] [46]

OpenPipe. 2025. Serverless RL. (2025).https://openpipe.ai/blog/serve rless-rl

work page 2025

[45] [47]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2025. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA ’24). IEEE Press, 118–132.https://doi.org/10.1109/ISCA59077.2024.000 19

work page doi:10.1109/isca59077.2024.000 2025

[46] [48]

Gon- zalez, Ion Stoica, and Harry Xu

Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gon- zalez, Ion Stoica, and Harry Xu. 2025. ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving.arXiv preprint arXiv:2410.01228(2025)

work page arXiv 2025

[47] [49]

Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang

work page

[48] [50]

Seer: Online Context Learning for Fast Synchronous LLM Rein- forcement Learning.arXiv preprint arXiv:2511.14617(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [51]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serv- ing.ACM Transactions on Storage(2024)

work page 2024

[50] [52]

Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, and Rodrigo Fonseca. 2025. ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Mul- timodal Model Serving. InProceedings of the 2025 ACM Symposium on Cloud Computing (SoCC 2025). ...

work page 2025

[51] [53]

Mrinal Rawat, Ambuje Gupta, Rushil Goomer, Alessandro Di Bari, Neha Gupta, and Roberto Pieraccini. 2025. Pre-Act: Multi-Step Plan- ning and Reasoning Improves Acting in LLM Agents.arXiv preprint arXiv:2505.09970(2025)

work page arXiv 2025

[52] [54]

Amir Sarfi, Benjamin Thérien, Joel Lidin, and Eugene Belilovsky. 2025. Communication Efficient LLM Pre-training with SparseLoCo. (2025). arXiv:cs.LG/2508.15706https://arxiv.org/abs/2508.15706

work page arXiv 2025

[53] [55]

Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow.arXiv preprint arXiv:1802.05799(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[54] [56]

SGLang Team. 2025. SGLang: Fast Serving Framework for Large Language Models.https://github.com/sgl-project/sglang. (2025). Version 0.4

work page 2025

[55] [57]

Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Al- pay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junx- iong Wang. 2025. Beat the long tail: Distribution-Aware Speculative Decoding for RL Training.arXiv preprint arXiv:2511.13841(2025)

work page arXiv 2025

[56] [58]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [59]

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. 2025. Laminar: A Scalable Asynchronous RL Post-Training Framework.arXiv preprint arXiv:2510.12633(2025)

work page arXiv 2025

[58] [60]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. Hy- bridFlow: A Flexible and Efficient RLHF Framework.arXiv preprint arXiv:2409.19256(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[59] [61]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. verl: Volcano Engine Reinforcement Learning for LLM.https://github.com /volcengine/verl. (2024)

work page 2024

[60] [62]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi- billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053(2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[61] [63]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning.arXiv preprint arXiv:2010.03768(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[62] [64]

Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi

work page

[63] [65]

Agentic Reasoning and Tool Integration for LLMs via Reinforce- ment Learning.arXiv preprint arXiv:2505.01441(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[64] [66]

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. Dynamollm: Designing llm inference clusters for per- formance and energy efficiency. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362

work page 2025

[65] [67]

The Terminal-Bench Team. 2025. Terminal-Bench: A Benchmark for AI Agents in Terminal Environments. (2025).https://github.com/laude- institute/terminal-bench

work page 2025

[66] [68]

Thinking Machines AI. 2025. Tinker.https://thinkingmachines.ai/ti nker/. (2025). Accessed: 2026-02

work page 2025

[67] [69]

Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. KVCache cache in the wild: characterizing and optimizing KVCache cache at a large cloud provider. InProceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’25). USENIX Association, USA, Article 28...

work page 2025

[68] [70]

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Si...

[69] [71]

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qip...

[70] [72]

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems. arXiv preprint arXiv:2401.17644 (2025)

[71] [73]

Zhuang Wang, Zhaozhuo Xu, Jingyi Xi, Yuke Wang, Anshumali Shrivastava, and TS Eugene Ng. 2025. ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 537–556

[72] [74]

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. 2025. Agentic reasoning: A streamlined framework for enhancing LLM reasoning with agentic tools. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 28489–28503

[73] [75]

Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, and Wei Wang. 2025. RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training. arXiv preprint arXiv:2512.11306 (2025)

[74] [76]

Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, and Ion Stoica. 2025. RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs. arXiv preprint arXiv:2510.19225 (2025)

[75] [77]

Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongsh...

[76] [78]

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU pooling for concurrent LLM serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 1030–1045

[77] [79]

Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic scaling on GPU clusters for deep learning. In USENIX OSDI

[78] [80]

Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, and Ion Stoica. 2025. Towards Efficient and Practical GPU Multitasking in the Era of LLM. arXiv preprint arXiv:2508.08448 (2025)

[79] [81]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[80] [82]

Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025)
