arxiv: 2605.06534 · v2 · pith:3DPOXTQ2new · submitted 2026-05-07 · 💻 cs.DC

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

Pith reviewed 2026-05-21 08:35 UTC · model grok-4.3

classification 💻 cs.DC
keywords agentic reinforcement learning · cooperative elasticity · GPU co-location · LLM serving · rollout scheduling · resource sharing · distributed training

The pith

Agentic RL training can borrow idle GPUs from serving clusters to increase throughput by 1.3 to 3.3 times without violating service level objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that rollout phases in agentic reinforcement learning, which vary sharply in compute demand, can draw on spare capacity from already-running serving clusters instead of waiting for new GPU allocations or sticking to fixed resources. Fixed systems leave GPUs idle during low-demand steps while elastic systems pay high costs for on-demand provisioning and availability limits. ROSE solves the sharing problem with safe co-location of models, quick weight synchronization across clusters, and dynamic scheduling that routes tasks to both dedicated and opportunistic GPUs. If this works, training time shrinks because the dominant rollout bottleneck receives elastic capacity from infrastructure that already exists for inference.

Core claim

ROSE realizes cooperative elasticity by co-locating heterogeneous serving and rollout models on the same GPUs through an SLO-safe executor that dynamically shares memory and compute, a weight transfer engine that uses shard-aware routing and sparsity for fast synchronization, and an elastic scheduler that routes rollouts across dedicated and opportunistic GPUs. Experiments across model sizes and cluster scales report end-to-end throughput gains of 1.3-3.3x over resource-fixed baselines and rollout time reductions of 1.2-1.5x over resource-elastic baselines, all without serving SLO violations.
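
The third component, routing rollouts across dedicated and opportunistic GPUs, can be illustrated with a small Python sketch. This is not ROSE's scheduler: the `Worker` class, the `speed` factor that discounts a shared serving GPU, and the greedy "earliest finish" rule are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    speed: float             # hypothetical: 1.0 = dedicated GPU, <1.0 = share left over by serving
    busy_until: float = 0.0  # time at which this worker becomes free


def route_rollouts(costs, workers):
    """Greedily place each rollout on the worker that would finish it earliest.

    `costs` are rollout durations on a dedicated GPU. Workers are mutated in place.
    Returns (assignment, makespan).
    """
    assignment = {}
    # Place the longest rollouts first so the long tail is scheduled early.
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        best = min(workers, key=lambda w: w.busy_until + costs[idx] / w.speed)
        best.busy_until += costs[idx] / best.speed
        assignment[idx] = best.name
    makespan = max(w.busy_until for w in workers)
    return assignment, makespan


if __name__ == "__main__":
    costs = [8, 4, 4, 2]
    _, alone = route_rollouts(costs, [Worker("dedicated-0", 1.0)])
    _, shared = route_rollouts(
        costs, [Worker("dedicated-0", 1.0), Worker("serving-0", 0.5)]
    )
    print(alone, shared)
```

In this toy setting, adding one half-speed opportunistic worker shortens the rollout phase because short rollouts move off the dedicated GPU. A real scheduler would also have to react when serving traffic reclaims the shared capacity, which this sketch does not model.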

What carries the argument

The SLO-safe co-serving executor that dynamically shares memory and compute between serving and rollout models on the same GPUs while preserving latency guarantees.
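
This page does not describe how the executor decides when rollout work may share a GPU, so the following is only a sketch of the general idea under stated assumptions. The function name, its parameters, and the `headroom` safety factor are hypothetical; the check simply admits rollout work when the observed serving P99 latency leaves a margin below the SLO and enough memory is free.

```python
def admit_rollout(p99_latency_ms, slo_ms, free_mem_gb, rollout_mem_gb, headroom=0.8):
    """Return True if a rollout batch may be co-located on a serving GPU right now.

    Hypothetical rule: require the observed serving P99 latency to stay below
    `headroom * slo_ms`, and require the rollout's memory to fit in what is free.
    """
    latency_ok = p99_latency_ms <= headroom * slo_ms
    memory_ok = rollout_mem_gb <= free_mem_gb
    return latency_ok and memory_ok


if __name__ == "__main__":
    print(admit_rollout(p99_latency_ms=120, slo_ms=200, free_mem_gb=10, rollout_mem_gb=4))
    print(admit_rollout(p99_latency_ms=170, slo_ms=200, free_mem_gb=10, rollout_mem_gb=4))
```

Keeping a margin below the SLO, rather than admitting right up to it, is one simple way to absorb a traffic burst that arrives after the decision was made.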

If this is right

  • Rollout phases complete faster because they access on-demand capacity without allocation delays.
  • Overall post-training time for agentic RL decreases as the variable compute demand is met from existing serving pools.
  • Serving clusters support additional training workloads without requiring extra dedicated hardware.
  • Resource utilization rises because idle capacity in production inference fleets becomes available for training steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-location pattern could apply to other bursty workloads such as online fine-tuning or evaluation jobs that run alongside serving.
  • Cloud operators might redesign GPU fleets to treat serving and training as co-located rather than separate resource pools.
  • If weight transfer overhead stays low at larger scales, the approach could extend to multi-tenant environments with more frequent model updates.

Load-bearing premise

Serving clusters consistently leave substantial GPU compute and memory idle and can co-locate heterogeneous models dynamically while preserving serving SLOs under bursty traffic.

What would settle it

Deploying the system on a cluster with consistently high serving load and measuring either no throughput gain or any increase in serving latency violations.
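
A test like this needs a concrete way to count latency violations. The sketch below, in Python, computes a nearest-rank P99 and a violation rate from a list of serving latencies. The function names, the choice of the nearest-rank method, and the sample values are assumptions for illustration and are not taken from the paper.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(latencies_ms, slo_ms):
    """Summarize serving latency against an SLO threshold (both in milliseconds)."""
    p99 = percentile(latencies_ms, 99)
    violations = sum(1 for x in latencies_ms if x > slo_ms)
    return {
        "p99_ms": p99,
        "violation_rate": violations / len(latencies_ms),
        "p99_within_slo": p99 <= slo_ms,
    }


if __name__ == "__main__":
    # Made-up latencies of 1..100 ms, only to exercise the functions.
    print(slo_report(list(range(1, 101)), slo_ms=99))
```

Running the same report with and without co-located rollouts, at the same serving load, would show whether co-location moves the P99 or the violation rate.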

Figures

Figures reproduced from arXiv: 2605.06534 by Bo Zheng, Dakai An, Dilxat Muhtar, Jiamang Wang, Ju Huang, Lin Qu, Lunxi Cao, Shaopan Xiong, Siran Yang, Teng Ma, Tianyuan Wu, Wei Gao, Wei Wang, Weixun Wang, Xuchun Shang, Yuheng Zhao.

Figure 1
Figure 1: Characterization of agentic RL: (a) The breakdown of end-to-end training time; (b) The long-tail distribution of rollout execution time; (c) The impact of prefill on rollouts; (d) The demand for resource elasticity.
    Adjacent diagram labels: Train, Agentic LLM, Environment, Action, Observation, Weight Sync., Trajectory, Rollout.
Figure 3
Figure 3: Characterization of serving clusters and workloads: (a) Fluctuating serving traffic; (b) Serving GPU underutilization; (c) High allocation overhead; (d) Substantial communication overhead.
    Adjacent diagram labels: Datacenter, IB/RoCe, TCP/IP, NVLink, Rollout Cluster, Training Cluster, Serving Cluster.
Figure 4
Figure 4: Scheme of Datacenter Infrastructure.
    … load and redirecting freed GPUs to rollouts. However, bidirectional autoscaling is fundamentally limited: reclaiming GPUs from rollouts back to serving requires evicting in-flight rollouts and reloading models, taking tens of seconds (Figure 3c) and far exceeding typical SLO budgets. Because serving traffic is bursty at second-level granularity, frequent mode switching…
    … scale rollout capacity on demand. Because these GPUs lie outside the steady-state deployment, each provisioning event requires model loading and runtime initialization, which can take tens of seconds (Figure 3c). Spot preemption and serverless lease expiration further trigger repeated teardown-and-reinitialize cycles, turning allocation overhead into a persistent throu…
Figure 5
Figure 5: System Architecture of ROSE.
    … node in the RL cluster to a GPU node in the serving cluster using Mooncake Store [45], over TCP (200 Gbps Ethernet) and RDMA (400 Gbps InfiniBand), shown in Figure 3d. Even with InfiniBand (which is uncommon across datacenters), it can take up to 145 s and grow quickly with model size, becoming a bottleneck for frequent weight synchronization. 4 System Design. System Overview. To address the above challenges, we introduce ROSE, the architecture of which is illustrated in…
Figure 6
Figure 6: Layer-wise sparsity ratio at 10th step.
    Shard-aware Weight Transfer. Training and serving clusters adopt heterogeneous parallelism strategies (e.g., training with TP8×PP2 and serving with TP4), requiring automatic shard mapping across configurations. Naive approaches require manual resharding or full model aggregation before transfer. ROSE automatically infers each parameter's sharding rule by identifyi…
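
The shard-mapping problem mentioned with Figure 6 (for example, training with TP8 and serving with TP4) can be sketched in Python for a single sharded dimension. This is an illustration under simplifying assumptions, not ROSE's rule-inference method, which is truncated in the text above: the function names are hypothetical, shards are assumed to be equal and contiguous along one axis, and pipeline parallelism is ignored.

```python
def shard_ranges(dim, parts):
    """Split [0, dim) into `parts` equal contiguous ranges (assumes parts divides dim)."""
    if dim % parts:
        raise ValueError("dim must be divisible by parts")
    step = dim // parts
    return [(r * step, (r + 1) * step) for r in range(parts)]


def shard_mapping(dim, src_tp, dst_tp):
    """For each destination rank, list the (src_rank, start, end) slices it must receive."""
    src = shard_ranges(dim, src_tp)
    plan = {}
    for d, (d_lo, d_hi) in enumerate(shard_ranges(dim, dst_tp)):
        pieces = []
        for s, (s_lo, s_hi) in enumerate(src):
            # Keep only the overlap between this source shard and the destination shard.
            lo, hi = max(d_lo, s_lo), min(d_hi, s_hi)
            if lo < hi:
                pieces.append((s, lo, hi))
        plan[d] = pieces
    return plan


if __name__ == "__main__":
    # A 4096-wide dimension sharded TP8 on the training side and TP4 on the serving side.
    for rank, pieces in shard_mapping(4096, src_tp=8, dst_tp=4).items():
        print(rank, pieces)
```

With such a plan, each serving worker pulls only the slices it hosts, which avoids aggregating the full model before transfer.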
Figure 7
Figure 7: ROSE's end-to-end throughput improvements. The data are normalized to the baseline's first step.
    Adjacent plot panels: (a) FrozenLake-8B-GRPO; (b) ALFWorld-32B-GRPO. Axes: Steps (x), Score (y). Legend text: ROLL, ROSE.
Figure 7
Figure 7: (a)-(c) ROSE's end-to-end throughput improvements compared with baselines; for each baseline we run 8B and 32B models. The data are normalized to the baseline's first step. (d) End-to-end critic scores for 8B and 32B models using GRPO.
    Adjacent plot panels: (a) Elastic Baselines, with Model Size (8B, 32B) on the x-axis and Norm. Time on the y-axis; a further panel with Model Size (x) and Ratio (%) (y). Legend text: RL, RLBoost+, CoRL.
Figure 8
Figure 8: End-to-end critic scores for (a) 8B and (b) 32B models using the GRPO algorithm.
    Adjacent plot panels: (a) Micro Benchmark, with Model Size (8B, 32B) on the x-axis and Norm. Time on the y-axis; (b) Scalability [8B, GRPO], with Available Serving GPUs (x) and Time (s) (y). Legend text: ROLL, RL, RLBoost, CoRL.
Figure 8
Figure 8: End-to-end evaluation. (a) Rollout time and (b) Allocation overhead compared with elastic baselines.
    … 1.44× and 2.69× higher throughput on average (see Figure 7c). Although AReaL eliminates GPU idle time by continuously generating trajectories without waiting for training to complete, by expanding effective GPU capacity through cooperative elasticity, ROSE provides gains orthogonal to asynchronous execut…
Figure 9
Figure 9: End-to-end evaluation. (a) Rollout time compared with baselines. (b) Scalability of ROSE on Qwen3-8B with GRPO as Serving GPUs increase.
    Allocation Overhead. We further analyze the allocation overhead of elastic resource management schemes. We quantify the total preempted GPU time as the product of the number of preempted GPUs and the per-preemption overhead, and normalize it by the total GPU time. As sh…
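
The allocation-overhead metric quoted with Figure 9 is the number of preempted GPUs times the per-preemption overhead, normalized by the total GPU time. A minimal Python sketch of that arithmetic follows. The function name and the example numbers are made up for illustration and are not measurements from the paper.

```python
def allocation_overhead_ratio(num_preempted_gpus, per_preemption_overhead_s, total_gpu_time_s):
    """Fraction of total GPU time lost to preemption (teardown and re-initialization)."""
    if total_gpu_time_s <= 0:
        raise ValueError("total GPU time must be positive")
    preempted_gpu_time_s = num_preempted_gpus * per_preemption_overhead_s
    return preempted_gpu_time_s / total_gpu_time_s


if __name__ == "__main__":
    # Hypothetical values: 4 preempted GPUs, 30 s lost per preemption, 1200 GPU-seconds in total.
    ratio = allocation_overhead_ratio(4, 30.0, 1200.0)
    print(f"{ratio:.1%}")
```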
Figure 10
Figure 10: [Transfer Engine] (a) Cross-cluster weight transfer time under different optimizations; each optimization is additive over the previous one. (b) Timeline breakdown of shard-aware and sparsity-aware transfer for Qwen3-32B. D2S denotes the dense-to-sparse conversion, and S2D denotes the sparse-to-dense conversion. (c) Sensitivity of shard-aware and sparsity-aware transfer of different LLMs to cross-cluster …
Figure 10
Figure 10: [Transfer Engine] (a) Cross-cluster weight transfer time under different optimizations; each optimization is additive over the previous one. (b) Sensitivity of shard-aware and sparsity-aware transfer of different LLMs to cross-cluster bandwidth.
    … effectively limits tail-latency inflation, but without explicit SLO-aware scheduling, P99 latency still misses our target. Dual-SLO Admission Controller. This c…
Figure 12
Figure 12: [Analysis of Sparsity] (a) The sparsity of weight differentials across steps for Qwen3-8B. (b) The sensitivity of transfer engine to sparsity.
    … only the shards it hosts. This further reduces communication time by 1.8× (Qwen3-8B) and 1.3× (Qwen3-32B). Moreover, Figure 10b (top) illustrates the Qwen3-32B timeline. On the sender side, each training worker streams ∼60 buckets (64 MB each); each bucket take…
Figure 11
Figure 11: [Analysis of Sparsity] (a) The sparsity of weight differentials across steps for Qwen3-8B. (b) The sensitivity of transfer engine to sparsity.
    … diminishes. Beyond ∼20%, sparse-format metadata (e.g., indices) and (de)sparsification overhead begin to offset the reduction in transmitted weights. In our workloads, the measured non-zero fraction remains well below this threshold, enabling consistently effic…
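
The sparsity-aware transfer described in these captions relies on a dense-to-sparse (D2S) conversion on the sender and a sparse-to-dense (S2D) conversion on the receiver. The Python sketch below shows one simple way such a round trip could work, together with a payload estimate. It is an assumption-laden illustration: the function names are hypothetical, changed weights are sent as (index, new value) pairs rather than in whatever format ROSE uses, and the byte sizes (2-byte values, 4-byte indices) are assumed.

```python
def dense_to_sparse(old, new):
    """D2S: keep only (index, new_value) pairs where a weight changed since the last sync."""
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if n != o]


def sparse_to_dense(old, sparse_update):
    """S2D: apply the sparse update to the receiver's copy of the old weights."""
    out = list(old)
    for i, value in sparse_update:
        out[i] = value
    return out


def payload_bytes(num_weights, nonzero_fraction, value_bytes=2, index_bytes=4):
    """Return (dense_bytes, sparse_bytes) for one sync, counting values and indices only."""
    dense = num_weights * value_bytes
    changed = round(num_weights * nonzero_fraction)
    sparse = changed * (value_bytes + index_bytes)
    return dense, sparse


if __name__ == "__main__":
    old = [1.0, 2.0, 3.0, 4.0]
    new = [1.0, 2.5, 3.0, 4.0]
    update = dense_to_sparse(old, new)
    assert sparse_to_dense(old, update) == new
    print(update, payload_bytes(1000, 0.1))
```

Under these assumed byte sizes, the sparse payload alone is smaller than the dense one whenever fewer than one third of the weights change. The ∼20% threshold quoted above also accounts for (de)sparsification overhead, which this sketch does not model.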
Figure 13
Figure 13: ROSE under fully asynchronous RL training workloads. We monitor the average throughput between consecutive RL steps.
    6.4 Effectiveness of Rollout Scheduler. We follow the end-to-end setups and evaluate the elastic rollout scheduler using Qwen3-8B and Qwen3-32B with GRPO algorithm for the first five RL steps
Figure 12
Figure 12: The system throughput with different per-device batch sizes. [Qwen3-8B/32K]
Figure 14
Figure 14: The system throughput with different per-device batch sizes. [Qwen3-8B/32K]
    B Spot instance trace. We extract the spot-instance traces for the 8B model from Seg.B in…
Figure 14
Figure 14: Sensitive Analysis of Serving GPU Availability.
    F Timeline Breakdown of Weight Transfer
Figure 15
    Figure 15 provides a detailed timeline breakdown of shard-aware and sparsity-aware weight transfer for Qwen3-32B. The top timeline illustrates shard-aware transfer: on the sender side, each training worker streams ∼60 buckets (64 MB each); each bucket takes 0.2–0.4 s to push, for a total of 65 seconds. On the receiver side, serving workers pull the corresponding weight buckets from the relay and load them into GPU…
Original abstract

Agentic reinforcement learning (RL) is reshaping LLM post-training, but end-to-end training time is dominated by compute-intensive, multi-turn rollouts whose resource demand varies significantly across training steps. Resource-fixed systems cannot adapt to this variation, while resource-elastic approaches that provision external GPUs on demand suffer from high allocation overhead and limited availability. We observe that serving clusters leave substantial GPU compute and memory idle, and propose cooperative elasticity: sharing already-deployed serving GPUs with rollout workloads to provide on-demand elastic capacity. Realizing this is non-trivial, as it must preserve serving SLOs under bursty traffic while minimizing cross-cluster communication overhead. We present ROSE, a system that realizes cooperative elasticity for agentic RL post-training, comprising three components: (1) an SLO-safe co-serving executor that co-locates heterogeneous serving and rollout models on the same GPUs, dynamically sharing memory and compute while preserving serving SLOs; (2) a cross-cluster weight transfer engine that leverages shard-aware routing and weight sparsity for fast synchronization; and (3) an elastic rollout scheduler that dynamically routes rollouts across dedicated and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves end-to-end throughput by 1.3–3.3× over resource-fixed baselines and reduces rollout time by 1.2–1.5× over resource-elastic baselines, with no serving SLO violations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents ROSE, a system realizing cooperative elasticity for agentic RL post-training. It co-locates rollout workloads on already-deployed serving GPUs via an SLO-safe co-serving executor, a shard-aware cross-cluster weight transfer engine, and an elastic rollout scheduler. The central empirical claim is that this yields 1.3–3.3× end-to-end throughput gains over resource-fixed baselines and 1.2–1.5× rollout-time reductions over resource-elastic baselines across model sizes and cluster scales, with no serving SLO violations.

Significance. If the reported speedups and SLO preservation hold under production burst patterns, ROSE would demonstrate a practical way to harvest idle serving capacity for variable-demand RL rollouts, reducing the need for dedicated elastic provisioning. The three-component design and cross-cluster synchronization techniques are concrete contributions to systems for heterogeneous co-location.

major comments (3)
  1. [§5] §5 (Experiments): The headline 1.3–3.3× throughput and 1.2–1.5× rollout-time numbers are presented without reported variance, number of runs, or precise definition of how serving SLOs (latency, throughput) were measured under the simulated bursty traffic; this makes it impossible to judge whether the gains are robust or sensitive to post-hoc tuning.
  2. [§2, §3.1] §2 and §3.1: The enabling premise that serving clusters consistently leave substantial GPU compute and memory idle under bursty traffic is stated as an observation but is not backed by any production traces, utilization histograms, or worst-case analysis of co-location feasibility for heterogeneous models; if sustained utilization is higher than assumed, the opportunistic capacity and therefore the reported speedups disappear.
  3. [§4.3] §4.3 (SLO-safe co-serving executor): The dynamic memory and compute sharing mechanism is described at a high level, yet no formal bound or micro-benchmark isolates the latency impact on the serving model when rollout jobs are co-located at varying intensities; the claim of “no SLO violations” therefore rests entirely on the specific experimental traffic rather than a general guarantee.
minor comments (3)
  1. [Table 1, §4.1] Table 1 and §4.1: Model-size notation (e.g., “7B”, “70B”) is used inconsistently with the text; align the table headers with the exact parameter counts reported in the experimental setup.
  2. [Figure 4] Figure 4: Axis labels and legend text are too small to read at standard print size; increase font size or split into two panels.
  3. [§6] §6 (Related Work): The discussion of prior elastic scheduling and co-location systems omits several recent papers on GPU sharing for inference; add citations to complete the positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor, motivation, and guarantees that we will address to improve the manuscript. We respond to each major comment below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [§5] §5 (Experiments): The headline 1.3–3.3× throughput and 1.2–1.5× rollout-time numbers are presented without reported variance, number of runs, or precise definition of how serving SLOs (latency, throughput) were measured under the simulated bursty traffic; this makes it impossible to judge whether the gains are robust or sensitive to post-hoc tuning.

    Authors: We agree that reporting statistical details is essential for assessing robustness. In the revised manuscript we will add the number of runs performed for each configuration (five independent runs), include error bars or standard deviations in the relevant figures, and provide an explicit description of the SLO measurement methodology. This will include the precise latency percentile (99th), throughput threshold, and how bursty traffic was generated and monitored to ensure no violations occurred. revision: yes

  2. Referee: [§2, §3.1] §2 and §3.1: The enabling premise that serving clusters consistently leave substantial GPU compute and memory idle under bursty traffic is stated as an observation but is not backed by any production traces, utilization histograms, or worst-case analysis of co-location feasibility for heterogeneous models; if sustained utilization is higher than assumed, the opportunistic capacity and therefore the reported speedups disappear.

    Authors: We acknowledge that the current motivation section relies on general observations rather than public production traces. We will expand §2 with utilization histograms generated from our bursty-traffic simulator across a range of arrival rates and model sizes, plus a new worst-case analysis subsection that quantifies the minimum idle capacity needed for net gains and shows how speedups degrade under higher sustained utilization. While we cannot release proprietary production traces, these additions will make the feasibility argument more concrete and transparent. revision: partial

  3. Referee: [§4.3] §4.3 (SLO-safe co-serving executor): The dynamic memory and compute sharing mechanism is described at a high level, yet no formal bound or micro-benchmark isolates the latency impact on the serving model when rollout jobs are co-located at varying intensities; the claim of “no SLO violations” therefore rests entirely on the specific experimental traffic rather than a general guarantee.

    Authors: We will revise §4.3 to include dedicated micro-benchmarks that isolate serving-model latency under controlled rollout intensities, varying both compute and memory sharing ratios while holding serving traffic fixed. These experiments will report latency distributions and the maximum rollout intensity at which the 99th-percentile SLO remains satisfied. Although deriving a tight formal latency bound is difficult given nondeterministic GPU scheduling, the added micro-benchmarks will provide empirical evidence beyond the end-to-end traffic scenarios and clarify the operating regime in which SLOs are preserved. revision: yes
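
The first response promises per-configuration statistics over five independent runs. A minimal Python sketch of how a speedup and its spread could be reported is shown below. The function name and the timing values are hypothetical placeholders, not results from the paper; pairing runs by index is also an assumption.

```python
import statistics


def speedup_summary(baseline_times, system_times):
    """Return (mean, sample standard deviation) of per-run speedups baseline/system."""
    if len(baseline_times) != len(system_times) or len(baseline_times) < 2:
        raise ValueError("need at least two paired runs")
    ratios = [b / s for b, s in zip(baseline_times, system_times)]
    return statistics.mean(ratios), statistics.stdev(ratios)


if __name__ == "__main__":
    # Placeholder timings in seconds for five paired runs (made up for this example).
    baseline = [100.0, 104.0, 98.0, 101.0, 99.0]
    system = [50.0, 52.0, 49.0, 51.0, 50.0]
    mean, std = speedup_summary(baseline, system)
    print(f"speedup {mean:.2f} ± {std:.2f}")
```

Reporting the spread next to the mean is what lets a reader judge whether a range such as 1.3–3.3× reflects differences between configurations or noise between runs.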

Circularity Check

0 steps flagged

No circularity in ROSE derivation chain

full rationale

The paper is a systems description of ROSE for cooperative elasticity, with three engineering components (SLO-safe co-serving executor, cross-cluster weight transfer engine, elastic rollout scheduler) and performance claims supported solely by experimental measurements across model sizes and cluster scales. No mathematical derivations, equations, fitted parameters presented as predictions, or first-principles results appear in the provided text. The idle-capacity observation is an empirical premise, not a derived quantity, and the speedups are direct experimental outcomes rather than reductions to inputs by construction. The derivation chain is therefore self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that serving clusters have substantial idle capacity and that co-location can be engineered to protect SLOs; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Serving clusters leave substantial GPU compute and memory idle under normal operation.
    Stated in the abstract as the observation enabling cooperative elasticity.
  • domain assumption: Co-location of heterogeneous serving and rollout models can preserve serving SLOs under bursty traffic.
    Core premise required for the SLO-safe co-serving executor to be viable.

pith-pipeline@v0.9.0 · 5842 in / 1451 out tokens · 27384 ms · 2026-05-21T08:35:03.552733+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    We present ROSE, a system that realizes cooperative elasticity for agentic RL post-training, comprising three components: (1) an SLO-safe co-serving executor that co-locates heterogeneous serving and rollout models on the same GPUs, dynamically sharing memory and compute while preserving serving SLOs; (2) a cross-cluster weight transfer engine that leverages shard-aware routing and weight sparsity for fast synchronization; and (3) an elastic rollout scheduler that dynamically routes rollouts across dedicated and opportunistic serving GPUs.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Experiments across multiple model sizes and cluster scales show that ROSE improves end-to-end throughput by 1.3–3.3× over resource-fixed baselines

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 17 internal anchors

  1. [1]

    Alibaba Cloud. 2026. Creating a GPU function. https://www.alibabacloud.com/help/en/functioncompute/fc/user-guide/creating-a-gpu-function/. (2026). Accessed: 2026-04

  2. [2]

    Dan Alistarh, Demjan Grubic, Jerry Z. Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: communication-efficient SGD via gradient quantization and encoding. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 1707–1718

  3. [3]

    Romil Bhardwaj, Zhengxu Xia, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Nikolaos Karianakis, Kevin Hsieh, Paramvir Bahl, and Ion Stoica. 2022. Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 119–135. http...

  4. [4]

    Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica

  5. [5]

    SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent. arXiv preprint arXiv:2511.16108 (2025)

  6. [6]

    Rongxin Cheng, Kai Zhou, Xingda Wei, Siyuan Liu, Mingcong Han, Mingjing Ai, Yeju Zhou, Baoquan Zhong, Wencong Xiao, Rong Chen, and Haibo Chen. 2025. Fast LLM Post-training via Decoupled and Best-of-N Speculation. arXiv preprint arXiv:2511.16193 (2025)

  7. [7]

    Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang

  8. [8]

    LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. arXiv preprint arXiv:2510.09665 (2025)

  9. [9]

    Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. In ICML

  10. [10]

    Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. Check-N-Run: A checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 929–943

  11. [11]

    Farama Foundation. 2024. Gymnasium - FrozenLake Environment. https://gymnasium.farama.org/environments/toy_text/frozen_lake/. (2024). Accessed: 2025-09

  12. [12]

    Jiawei Fei, Chen-Yu Ho, Atal N Sahu, Marco Canini, and Amedeo Sapio

  13. [13]

    Efficient sparse collective communication and its application to accelerate distributed deep learning. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference. 676–691

  14. [16]

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. 2025. AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. arXiv preprint arXiv:2505.10978 (2025)

  15. [17]

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In OSDI'24

  16. [18]

    Wei Gao, Zhuoyuan Ouyang, Peng Sun, Tianwei Zhang, and Yonggang Wen. 2025. IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters. IEEE Transactions on Parallel and Distributed Systems 36, 6 (2025), 1071–1086. https://doi.org/10.1109/TPDS.2025.3553137

  17. [19]

    Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. 2025. RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training. arXiv preprint arXiv:2509.21009 (2025)

  18. [20]

    Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. 2025. RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure. arXiv preprint arXiv:2512.22560 (2025)

  19. [21]

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen

  20. [22]

    Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 539–558

  21. [23]

    Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. 2025. AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training. arXiv preprint arXiv:2507.01...

  22. [24]

    Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, and Chenyi Zhuang. 2025. Exploring Superior Function Calls via Reinforcement Learning. arXiv preprint arXiv:2508.05118 (2025)

  23. [25]

    Mor Harchol-Balter, Cuihong Li, Takayuki Osogami, Alan Scheller-Wolf, and Mark S. Squillante. 2003. Cycle stealing under immediate dispatch task assignment. In Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '03). Association for Computing Machinery, New York, NY, USA, 274–285. https://doi.org/10.1145/777...

  24. [26]

    Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, Evelina Bakhturina, Vahid Noroozi, Sandeep Subramanian, Koluguri Nithin, Huang Jocelyn, Fei Jia, Jagadeesh Balam, Xuesong Yang, Micha Livne, Yi Dong, Sean Naren, and Boris Ginsburg. 2025. NeMo: a toolkit for Conversational AI and Large Language Models. (2025). https://github.com/NVIDIA/NeMo

  25. [27]

    Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. 2025. History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL. arXiv preprint arXiv:2508.18588 (2025)

  26. [28]

    Jian Hu, Xibin Wu, Weixun Wang, Dehao Zhang, Yu Cao, et al. 2024. OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. arXiv preprint arXiv:2405.11143 (2024)

  27. [29]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

  28. [30]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770 (2024)

  29. [31]

    Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th annual international symposium on computer architecture. 1–14

  30. [32]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  31. [33]

    Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, and Cong Wang. 2023. Lyra: Elastic Scheduling for Deep Learning Clusters. In Proceedings of the Eighteenth European Conference on Computer Systems. Association for Computing Machinery, New York, NY, USA, 835–850. https://doi.org/10.1145/3552326.3587445

  32. [34]

    Yufei Li, Zexin Li, Yinglun Zhu, and Cong Liu. 2025. Lemix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems. In 2025 IEEE Real-Time Systems Symposium (RTSS)

  33. [35]

    Zhiwei Li, Yong Hu, and Wenqing Wang. 2025. Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning. arXiv preprint arXiv:2508.19598 (2025)

  34. [36]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 663–679

  35. [37]

    Hwijoon Lim, Juncheol Ye, Sangeetha Abdu Jyothi, and Dongsu Han. 2024. Accelerating model training in multi-cluster environments with consumer-grade GPUs. In Proceedings of the ACM SIGCOMM 2024 Conference. 707–720

  37. [39]

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. 2025. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239 (2025)

  38. [40]

    Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng. 2025. Part II: ROLL Flash – Accelerating RLVR and Agentic Training with Asynchrony. arXiv preprint arXiv:251...

  39. [41]

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, and Hongsheng Li. 2025. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning. arXiv preprint arXiv:2503.21620 (2025)

  40. [42]

    Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14...

  41. [43]

    Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. 2025. GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv preprint arXiv:2504.10458 (2025)

  42. [44]

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerging AI applications. In 13th USENIX symposium on operating systems design and implementation (OSDI 18). 561–577

  43. [45]

    Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, and Virginia Smith. 2024. Grass: Compute efficient low-memory llm training with structured sparse gradients. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 14978–15003

  44. [46]

    OpenPipe. 2025. Serverless RL. (2025). https://openpipe.ai/blog/serverless-rl

  45. [47]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2025. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA ’24). IEEE Press, 118–132. https://doi.org/10.1109/ISCA59077.2024.00019

  46. [48]

    Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, and Harry Xu. 2025. ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving. arXiv preprint arXiv:2410.01228 (2025)

  47. [49]

    Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. 2025. Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning. arXiv preprint arXiv:2511.14617 (2025)

  49. [51]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage (2024)

  50. [52]

    Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, and Rodrigo Fonseca. 2025. ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving. In Proceedings of the 2025 ACM Symposium on Cloud Computing (SoCC 2025). ...

  51. [53]

    Mrinal Rawat, Ambuje Gupta, Rushil Goomer, Alessandro Di Bari, Neha Gupta, and Roberto Pieraccini. 2025. Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents. arXiv preprint arXiv:2505.09970 (2025)

  52. [54]

    Amir Sarfi, Benjamin Thérien, Joel Lidin, and Eugene Belilovsky. 2025. Communication Efficient LLM Pre-training with SparseLoCo. (2025). arXiv:cs.LG/2508.15706 https://arxiv.org/abs/2508.15706

  53. [55]

    Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)

  54. [56]

    SGLang Team. 2025. SGLang: Fast Serving Framework for Large Language Models. https://github.com/sgl-project/sglang. (2025). Version 0.4

  55. [57]

    Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junxiong Wang. 2025. Beat the long tail: Distribution-Aware Speculative Decoding for RL Training. arXiv preprint arXiv:2511.13841 (2025)

  56. [58]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  57. [59]

    Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. 2025. Laminar: A Scalable Asynchronous RL Post-Training Framework. arXiv preprint arXiv:2510.12633 (2025)

  58. [60]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A Flexible and Efficient RLHF Framework. arXiv preprint arXiv:2409.19256 (2024)

  59. [61]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. verl: Volcano Engine Reinforcement Learning for LLM. https://github.com/volcengine/verl. (2024)

  60. [62]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  61. [63]

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv preprint arXiv:2010.03768 (2021)

  62. [64]

    Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. 2025. Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning. arXiv preprint arXiv:2505.01441 (2025)

  64. [66]

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362

  65. [67]

    The Terminal-Bench Team. 2025. Terminal-Bench: A Benchmark for AI Agents in Terminal Environments. (2025). https://github.com/laude-institute/terminal-bench

  66. [68]

    Thinking Machines AI. 2025. Tinker. https://thinkingmachines.ai/tinker/. (2025). Accessed: 2026-02

  67. [69]

    Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. KVCache cache in the wild: characterizing and optimizing KVCache cache at a large cloud provider. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’25). USENIX Association, USA, Article 28...

  68. [70]

    Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Si...

  69. [71]

    Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qip...

  70. [72]

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems. arXiv preprint arXiv:2401.17644 (2025)

  71. [73]

    Zhuang Wang, Zhaozhuo Xu, Jingyi Xi, Yuke Wang, Anshumali Shrivastava, and TS Eugene Ng. 2025. ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 537–556

  72. [74]

    Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. 2025. Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 28489–28503

  73. [75]

    Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, and Wei Wang. 2025. RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training. arXiv preprint arXiv:2512.11306 (2025)

  74. [76]

    Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, and Ion Stoica. 2025. RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs. arXiv preprint arXiv:2510.19225 (2025)

  75. [77]

    Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongsh...

  76. [78]

    Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU pooling for concurrent LLM serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 1030–1045

  77. [79]

    Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic scaling on GPU clusters for deep learning. In USENIX OSDI

  78. [80]

    Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, and Ion Stoica. 2025. Towards Efficient and Practical GPU Multitasking in the Era of LLM. arXiv preprint arXiv:2508.08448 (2025)

  79. [81]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  80. [82]

    Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025)

Showing first 80 references.