arxiv: 2511.14617 · v3 · submitted 2025-11-18 · 💻 cs.DC · cs.LG

Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

Pith reviewed 2026-05-17 20:34 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords LLM reinforcement learning · synchronous rollout · context learning · load balancing · speculative decoding · long-tail latency · distributed systems

The pith

Seer raises rollout throughput in synchronous LLM RL by up to 2.04× by exploiting the similar output lengths and response patterns of requests that share a prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing synchronous systems for reinforcement learning with large language models suffer from long-tail latency and low resource utilization during the rollout phase, which dominates iteration time. Seer rests on the observation that requests sharing the same prompt tend to produce outputs of similar length and pattern. It exploits this similarity through divided rollout to balance load dynamically, context-aware scheduling to shorten tail delays, and adaptive grouped speculative decoding to speed up generation. These mechanisms operate together to raise efficiency without changing the synchronous RL loop. Evaluations on production-grade RL workloads report up to 2.04× higher rollout throughput and 72-94% lower long-tail latency.

Core claim

Seer is a context learning RL system that addresses rollout bottlenecks by using the fact that requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. It coordinates three techniques: divided rollout for dynamic load balancing, context-aware scheduling to mitigate long-tail request delays, and adaptive grouped speculative decoding to accelerate generation. On production-grade RL workloads this yields up to 2.04 times higher end-to-end rollout throughput than prior synchronous systems while cutting long-tail latency by 72-94 percent.
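
To put the two headline numbers on a common scale, the arithmetic below converts them into time ratios. It is a reading aid only, under the assumption that each rollout iteration generates a fixed number of output tokens N; the symbols T, N, and r are introduced here and are not the paper's notation. Both figures are best-case ("up to") values.

```latex
% Assumption: a fixed number of output tokens N per rollout iteration.
\[
T = \frac{N}{\text{throughput}}
\quad\Longrightarrow\quad
\frac{T_{\text{Seer}}}{T_{\text{baseline}}} = \frac{1}{2.04} \approx 0.49
\]
% A long-tail latency reduction of r leaves a fraction 1 - r of the original latency.
\[
r \in [0.72,\ 0.94]
\quad\Longrightarrow\quad
1 - r \in [0.06,\ 0.28],
\qquad
3.6 \lesssim \frac{1}{1 - r} \lesssim 16.7
\]
```

Under that assumption, a 2.04× throughput gain roughly halves rollout time, while the reported tail-latency reduction corresponds to a tail that is about 3.6 to 16.7 times shorter.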

What carries the argument

Online context learning that detects output-length and pattern similarities among same-prompt requests and feeds them into coordinated load balancing, scheduling, and grouped speculative decoding.

If this is right

  • Divided rollout spreads work across workers according to predicted output lengths to reduce idle time.
  • Context-aware scheduling detects and prioritizes long-tail requests to shrink overall iteration time.
  • Adaptive grouped speculative decoding generates tokens faster for batches that share prompt-derived patterns.
  • The combined effect raises rollout throughput while preserving the synchronous structure required by RL training.
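
The scheduling bullet above can be made concrete with a small sketch. It follows the length-estimation rule quoted in the excerpt captured alongside Figure 6: a group's estimate is the maximum generation length among its completed requests, and a group with no completions has no estimate yet. Everything else is an assumption made for illustration, not the paper's implementation: the `PromptGroup` class, the `schedule_order` function, the reading of no-estimate groups as potential long-tail candidates, and the policy of serving those groups first and then the longest estimates first.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PromptGroup:
    """Hypothetical bookkeeping for all requests that share one prompt."""
    group_id: str
    completed_lengths: List[int] = field(default_factory=list)

    def record_completion(self, output_length: int) -> None:
        """Register the output length of a request that just finished."""
        self.completed_lengths.append(output_length)

    @property
    def estimated_length(self) -> Optional[int]:
        # Rule from the excerpt: the estimate is the maximum generation
        # length among completed requests; None means nothing finished yet.
        return max(self.completed_lengths) if self.completed_lengths else None

    @property
    def is_potential_long_tail(self) -> bool:
        # Assumption: a group with no completions is treated as a
        # potential long-tail candidate.
        return self.estimated_length is None


def schedule_order(groups: List[PromptGroup]) -> List[str]:
    """Return group ids in an assumed priority order.

    Groups without an estimate come first, then groups by descending
    estimated length. Ties keep their input order (sorted() is stable).
    """
    def priority(group: PromptGroup):
        estimate = group.estimated_length
        return (0, 0) if estimate is None else (1, -estimate)

    return [g.group_id for g in sorted(groups, key=priority)]
```

Starting the suspected long requests early is a common way to keep them from dominating the end of an iteration; whether Seer uses exactly this ordering is not stated in the text reproduced here.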

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-similarity grouping could be tested in non-RL LLM inference servers that already batch repeated prompts.
  • Performance sensitivity to prompt diversity suggests a follow-up experiment that mixes prompt families within a single rollout batch.

Load-bearing premise

Requests sharing the same prompt exhibit strong similarities in output lengths and response patterns.

What would settle it

Run the system on workloads where prompts within each batch are deliberately varied and dissimilar, then check whether the reported throughput gains and latency reductions largely vanish.

Figures

Figures reproduced from arXiv: 2511.14617 by Bo Pang, Mingxing Zhang, Ruoyu Qin, Weiran He, Weixiao Huang, Xinran Xu, Yangkun Zhang, Yikai Zhao, Yingdi Shan, Yongwei Wu.

Figure 1: Challenges and Seer’s solution for long-generation
Figure 2: Distribution of output lengths during rollout across
Figure 3: KVCache utilization, number of running requests,
Figure 4: Length correlation within response groups. Each
Figure 5: The overview of Seer.
    … simulated n-gram speculative decoding using compressed suffix trees (CSTs). Unlike conventional lookahead decoding, which relies solely on each request’s own generation history as the n-gram dictionary, our approach incorporates other responses from the same group as reference patterns to measure cross-request similarity
Figure 6: The distributed grouped draft server.
    … and reveal themselves as potential long-tail candidates. 2) Length Estimation Update: The Context Manager maintains an estimated output length $\hat{L}_g$ for each prompt group g. This value is dynamically updated as the maximum generation length among all completed requests in the group. If none of the requests in a group have completed yet, the group is classified as a pot…
Figure 7: Rollout throughput (output tokens per second) and completion time comparison across three tasks. The dashed lines
Figure 8: Tail latency and total time of three RL tasks.
Figure 10: Impact of length context on improving throughput
Figure 11: Normalized throughput on the 5th rollout iteration of the Qwen2-VL-72B task for different SD strategies. No-SD refers to disabling speculative decoding; No-Adapt refers to disabling adaptive scheduling; No-Context refers to disabling pattern context sharing within groups.
    [Plot axes: Time (s); Mean Acceptance Length; Running Requests]
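
The excerpt captured with Figure 5 contrasts drafting from a request's own history with drafting from other responses in the same prompt group. The sketch below illustrates that contrast under loud simplifications: it uses a plain dictionary of n-grams instead of the compressed suffix trees the paper mentions, tokens are bare integers, and the names `build_ngram_index`, `propose_draft`, and `accepted_length` are hypothetical. It is not Seer's draft server.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Token = int
NgramIndex = Dict[Tuple[Token, ...], List[Token]]


def build_ngram_index(references: Sequence[Sequence[Token]], n: int) -> NgramIndex:
    """Map every n-token context in the references to the tokens that followed it."""
    index: NgramIndex = defaultdict(list)
    for seq in references:
        for i in range(len(seq) - n):
            index[tuple(seq[i:i + n])].append(seq[i + n])
    return index


def propose_draft(history: Sequence[Token], index: NgramIndex,
                  n: int, max_draft: int) -> List[Token]:
    """Extend the last n tokens of history with the most frequent continuations."""
    draft: List[Token] = []
    context = list(history[-n:])
    while len(draft) < max_draft and len(context) == n:
        followers = index.get(tuple(context))
        if not followers:
            break  # no reference has seen this context
        next_token = Counter(followers).most_common(1)[0][0]
        draft.append(next_token)
        context = context[1:] + [next_token]
    return draft


def accepted_length(draft: Sequence[Token], actual: Sequence[Token]) -> int:
    """Count the leading draft tokens that match the tokens actually generated."""
    count = 0
    for drafted, real in zip(draft, actual):
        if drafted != real:
            break
        count += 1
    return count
```

With an index built only from a short request's own history, `propose_draft` often returns nothing because the current context has not been seen before; adding sibling responses from the same group gives it patterns to match. How many drafted tokens survive verification is what a mean-acceptance-length plot such as the one whose axes appear above would track, although the exact metric definition is not given in the text reproduced here.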

Original abstract

Reinforcement Learning (RL) has emerged as a critical technique for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel context learning RL system that addresses these challenges through a key observation: requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. Leveraging this insight, Seer introduces three coordinated techniques: (1) divided rollout for dynamic load balancing, (2) context-aware scheduling to mitigate long-tail request delays, and (3) adaptive grouped speculative decoding to accelerate generation. These mechanisms work in concert to markedly reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer achieves up to 2.04$\times$ end-to-end rollout throughput improvement compared to the state-of-the-art synchronous RL systems, while notably reducing long-tail latency by 72-94%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Seer, a context-learning system for synchronous LLM reinforcement learning. It rests on the observation that requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. Seer proposes three coordinated techniques—divided rollout for dynamic load balancing, context-aware scheduling to reduce long-tail delays, and adaptive grouped speculative decoding—and reports up to 2.04× end-to-end rollout throughput gains together with 72–94% reductions in long-tail latency on production-grade RL workloads relative to prior synchronous RL systems.

Significance. If the empirical results and the underlying similarity assumption are robustly validated, the work would offer a practical advance in mitigating rollout imbalance and improving resource utilization during LLM RL training. The emphasis on production workloads and the integration of scheduling with speculative decoding are positive aspects; however, the absence of quantified support for the core assumption and baseline details reduces the immediate strength of the contribution.

major comments (2)
  1. [§5 (Evaluation)] The reported 2.04× throughput and 72–94% latency improvements are stated without specifying the exact baselines, workload characteristics (prompt distributions, model sizes, generation parameters), number of trials, statistical significance, or controls for hardware or software confounding factors, leaving the central performance claim only partially supported.
  2. [§3 (Design / Observation)] All three techniques (divided rollout, context-aware scheduling, adaptive grouped speculative decoding) are predicated on strong similarities in output lengths and response patterns for identical prompts, yet the manuscript supplies no quantitative evidence such as length variance, correlation coefficients, or ablation results when the assumption is relaxed or under varying temperature/top-p settings.
minor comments (2)
  1. [Figures] Figure captions and axis labels should explicitly state the number of runs and error bars used for throughput and latency measurements.
  2. [Related Work] Add a short related-work subsection contrasting Seer with recent speculative-decoding and RLHF scheduling papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [§5 (Evaluation)] The reported 2.04× throughput and 72–94% latency improvements are stated without specifying the exact baselines, workload characteristics (prompt distributions, model sizes, generation parameters), number of trials, statistical significance, or controls for hardware or software confounding factors, leaving the central performance claim only partially supported.

    Authors: We agree that additional experimental details will improve clarity and reproducibility. In the revised version we will expand §5 to explicitly list the baselines (including the precise synchronous RL systems and versions compared against), workload characteristics such as prompt length distributions, model sizes, and sampling parameters (temperature, top-p), the number of independent trials performed, and any statistical significance tests applied. We will also document hardware configuration, software stack, and controls used to isolate confounding factors. revision: yes

  2. Referee: [§3 (Design / Observation)] All three techniques (divided rollout, context-aware scheduling, adaptive grouped speculative decoding) are predicated on strong similarities in output lengths and response patterns for identical prompts, yet the manuscript supplies no quantitative evidence such as length variance, correlation coefficients, or ablation results when the assumption is relaxed or under varying temperature/top-p settings.

    Authors: The similarity observation underpins the design, and we acknowledge that the current manuscript presents it primarily qualitatively. We will add quantitative support in a revised §3, including measured output-length variance and Pearson correlation coefficients across requests sharing identical prompts, together with sensitivity results under different temperature and top-p values. We will also include an ablation that relaxes the similarity assumption to quantify its impact on the three techniques. revision: yes
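
One way the statistics promised in the second response could be computed is sketched below, assuming output lengths have already been logged per prompt. The function names are hypothetical, and the symmetrised pairwise Pearson estimate is one reasonable choice for "correlation across requests sharing identical prompts", not a method the authors commit to.

```python
import math
from itertools import combinations
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def group_length_stats(lengths_by_prompt: Dict[str, List[int]]) -> Dict[str, Tuple[float, float]]:
    """Per prompt, return (mean output length, population variance)."""
    return {prompt: (mean(lengths), pvariance(lengths))
            for prompt, lengths in lengths_by_prompt.items()}


def within_group_pairs(lengths_by_prompt: Dict[str, List[int]]) -> Tuple[List[int], List[int]]:
    """Collect every ordered pair of lengths that come from the same prompt."""
    xs: List[int] = []
    ys: List[int] = []
    for lengths in lengths_by_prompt.values():
        for a, b in combinations(lengths, 2):
            # Add both orders so the estimate does not depend on request order.
            xs.extend((a, b))
            ys.extend((b, a))
    return xs, ys


def pearson(xs: List[int], ys: List[int]) -> float:
    """Plain Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


def same_prompt_length_correlation(lengths_by_prompt: Dict[str, List[int]]) -> float:
    """Correlation between the lengths of two requests drawn from the same prompt."""
    xs, ys = within_group_pairs(lengths_by_prompt)
    return pearson(xs, ys)
```

A value near 1 would mean that knowing one response's length says a lot about its siblings, which is what the load-bearing premise requires; rerunning the same computation at several temperature and top-p settings would give the sensitivity results the referee asks for.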

Circularity Check

0 steps flagged

No circularity; empirical system design validated by measurements

full rationale

The paper presents an empirical observation (same-prompt requests exhibit similar output lengths and response patterns) as the motivating insight, then describes three engineering techniques built on it and reports measured improvements (up to 2.04× throughput and 72-94% long-tail latency reduction) from evaluations on production workloads. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. There are no self-citations, uniqueness theorems, or ansatzes that reduce any claim to its own inputs by construction. The logic chain is self-contained: observation → system design → external benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about prompt similarity and introduces no free parameters or new entities in the abstract description.

axioms (1)
  • domain assumption: Requests sharing the same prompt exhibit strong similarities in output lengths and response patterns.
    This observation is explicitly stated as the foundation for the three techniques.

pith-pipeline@v0.9.0 · 5509 in / 1192 out tokens · 42700 ms · 2026-05-17T20:34:32.938082+00:00 · methodology

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  2. Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

  3. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  4. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  5. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

  6. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

  7. FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

    cs.LG 2026-01 unverdicted novelty 6.0

    FP8-RL delivers up to 44% faster rollouts in LLM RL by using blockwise FP8 quantization, KV-cache recalibration, and importance-sampling corrections while keeping learning behavior close to BF16 baselines.
