Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

Bo Pang; Mingxing Zhang; Ruoyu Qin; Weiran He; Weixiao Huang; Xinran Xu; Yangkun Zhang; Yikai Zhao; Yingdi Shan; Yongwei Wu

arxiv: 2511.14617 · v3 · submitted 2025-11-18 · 💻 cs.DC · cs.LG

Pith reviewed 2026-05-17 20:34 UTC · model grok-4.3

keywords LLM reinforcement learning · synchronous rollout · context learning · load balancing · speculative decoding · long-tail latency · distributed systems

The pith

Seer improves synchronous LLM RL rollout throughput by up to 2.04 times by learning output similarities from shared prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing synchronous systems for reinforcement learning with large language models suffer from long-tail latency and low resource utilization during the rollout phase that dominates iteration time. Seer rests on the observation that requests sharing the same prompt tend to produce outputs of similar length and pattern. It exploits this pattern through divided rollout to balance load dynamically, context-aware scheduling to shorten tail delays, and adaptive grouped speculative decoding to speed generation. These mechanisms operate together to raise efficiency without changing the synchronous RL loop. Production evaluations confirm the gains in both speed and tail latency.

Core claim

Seer is a context learning RL system that addresses rollout bottlenecks by using the fact that requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. It coordinates three techniques: divided rollout for dynamic load balancing, context-aware scheduling to mitigate long-tail request delays, and adaptive grouped speculative decoding to accelerate generation. On production-grade RL workloads this yields up to 2.04 times higher end-to-end rollout throughput than prior synchronous systems while cutting long-tail latency by 72-94 percent.
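As a rough reading aid, the headline numbers can be converted into time terms. The conversion below assumes a fixed number of output tokens per rollout and takes the best-case figures at face value, although the source states them only as upper bounds ("up to"). The symbols $T_{\text{Seer}}$ and $T_{\text{baseline}}$ are notation introduced here, not the paper's.

```latex
\[
\frac{T_{\text{Seer}}}{T_{\text{baseline}}} = \frac{1}{2.04} \approx 0.49
\quad\Rightarrow\quad \text{rollout time reduced by up to about } 51\%
\]
\[
\text{remaining tail latency: } 1 - 0.72 = 0.28 \ \text{ to } \ 1 - 0.94 = 0.06
\quad\Rightarrow\quad \frac{1}{0.28} \approx 3.6\times \ \text{ to } \ \frac{1}{0.06} \approx 16.7\times \ \text{shorter tail}
\]
```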

What carries the argument

Online context learning that detects output-length and pattern similarities among same-prompt requests and feeds them into coordinated load balancing, scheduling, and grouped speculative decoding.

If this is right

Divided rollout spreads work across workers according to predicted output lengths to reduce idle time.
Context-aware scheduling detects and prioritizes long-tail requests to shrink overall iteration time.
Adaptive grouped speculative decoding generates tokens faster for batches that share prompt-derived patterns.
The combined effect raises rollout throughput while preserving the synchronous structure required by RL training.
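The first two points above can be illustrated with a small sketch. It is not Seer's implementation: the class and function names are hypothetical, the length estimate follows the paper text quoted with Figure 6 on this page (the longest completed response in a prompt group), and the choice to rank groups with no completed request first is an assumption made only for this example. The worker assignment is a plain greedy longest-first heuristic standing in for whatever balancing policy the system actually uses.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Group:
    """Requests that share one prompt (hypothetical structure)."""
    prompt_id: str
    pending: int                                   # requests not yet dispatched
    completed_lengths: list = field(default_factory=list)

    def estimated_length(self):
        # Estimate = longest completed response in the group, if any.
        return max(self.completed_lengths) if self.completed_lengths else None


def dispatch_order(groups, unknown_length=float("inf")):
    """Return prompt ids of groups with pending work, longest estimate first.

    Assumption for this sketch: a group with no completed request is
    ranked as if it were the longest, so possible long-tail work starts early.
    """
    def key(g):
        est = g.estimated_length()
        return unknown_length if est is None else est

    ordered = sorted(groups, key=key, reverse=True)
    return [g.prompt_id for g in ordered if g.pending > 0]


def assign(requests, num_workers):
    """Greedy balancing over (request_id, predicted_length) pairs.

    The longest predicted request is placed first, always on the worker
    with the smallest predicted load so far.
    """
    heap = [(0, w) for w in range(num_workers)]    # (predicted load, worker id)
    heapq.heapify(heap)
    placement = {w: [] for w in range(num_workers)}
    for req_id, predicted in sorted(requests, key=lambda r: r[1], reverse=True):
        load, w = heapq.heappop(heap)
        placement[w].append(req_id)
        heapq.heappush(heap, (load + predicted, w))
    return placement
```

Under these assumptions, `dispatch_order` decides which prompt groups to start first and `assign` decides where each request runs. A real system would refresh the estimates as requests finish, which this static sketch leaves out.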

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prompt-similarity grouping could be tested in non-RL LLM inference servers that already batch repeated prompts.
Performance sensitivity to prompt diversity suggests a follow-up experiment that mixes prompt families within a single rollout batch.

Load-bearing premise

Requests sharing the same prompt exhibit strong similarities in output lengths and response patterns.

What would settle it

Run the system on workloads where prompts within each batch are deliberately varied and dissimilar, then check whether the reported throughput gains and latency reductions largely vanish.

Figures

Figures reproduced from arXiv: 2511.14617 by Bo Pang, Mingxing Zhang, Ruoyu Qin, Weiran He, Weixiao Huang, Xinran Xu, Yangkun Zhang, Yikai Zhao, Yingdi Shan, Yongwei Wu.

**Figure 2.** Distribution of output lengths during rollout across …

**Figure 3.** KVCache utilization, number of running requests, …

**Figure 4.** Length correlation within response groups. Each …

**Figure 5.** The overview of Seer.

Paper text captured alongside the figure: "… simulated n-gram speculative decoding using compressed suffix trees (CSTs). Unlike conventional lookahead decoding, which relies solely on each request’s own generation history as the n-gram dictionary, our approach incorporates other responses from the same group as reference patterns to measure cross-request similarity …"

**Figure 6.** The distributed grouped draft server.

Paper text captured alongside the figure: "… and reveal themselves as potential long-tail candidates. 2) Length Estimation Update: The Context Manager maintains an estimated output length $\hat{L}_g$ for each prompt group $g$. This value is dynamically updated as the maximum generation length among all completed requests in the group. If none of the requests in a group have completed yet, the group is classified as a pot…"

**Figure 7.** Rollout throughput (output tokens per second) and completion time comparison across three tasks. The dashed lines …

**Figure 8.** Tail latency and total time of three RL tasks.

**Figure 11.** Normalized throughput on the 5th rollout iteration of the Qwen2-VL-72B task for different SD strategies. No-SD refers to disabling speculative decoding; NoAdapt refers to disabling adaptive scheduling; NoContext refers to disabling pattern context sharing within groups. (Axis labels captured with the figure: Time (s), Mean Acceptance Length, Running Requests.)

**Figure 10.** Impact of length context on improving throughput
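The paper text captured with Figure 5 describes n-gram drafting that draws on other responses from the same prompt group, not only on a request's own history. The sketch below illustrates that idea in a deliberately simplified form. It swaps the compressed suffix trees mentioned in the source for a plain dictionary of n-grams, omits the verification step in which the target model accepts or rejects draft tokens, and uses hypothetical function names throughout.

```python
from collections import defaultdict


def build_ngram_index(sequences, n=3):
    """Map each (n-1)-token context to counts of the tokens seen after it."""
    index = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            context = tuple(seq[i:i + n - 1])
            index[context][seq[i + n - 1]] += 1
    return index


def draft_tokens(own_history, group_responses, n=3, max_draft=4):
    """Propose up to max_draft tokens by following the most frequent continuation.

    The dictionary is built from the request's own history plus the other
    responses in its prompt group (the grouped part of the idea).
    """
    index = build_ngram_index([own_history, *group_responses], n)
    context = list(own_history[-(n - 1):])
    draft = []
    for _ in range(max_draft):
        candidates = index.get(tuple(context))
        if not candidates:
            break                                  # no known continuation
        token = max(candidates, key=candidates.get)
        draft.append(token)
        context = context[1:] + [token]            # slide the context window
    return draft
```

With an empty `group_responses` list the function falls back to the request's own history, which corresponds to the conventional lookahead-style dictionary that the quoted passage contrasts against.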

Original abstract

Reinforcement Learning (RL) has emerged as a critical technique for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel context learning RL system that addresses these challenges through a key observation: requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. Leveraging this insight, Seer introduces three coordinated techniques: (1) divided rollout for dynamic load balancing, (2) context-aware scheduling to mitigate long-tail request delays, and (3) adaptive grouped speculative decoding to accelerate generation. These mechanisms work in concert to markedly reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer achieves up to 2.04$\times$ end-to-end rollout throughput improvement compared to the state-of-the-art synchronous RL systems, while notably reducing long-tail latency by 72-94%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Seer claims up to a 2.04× rollout speedup in synchronous LLM RL by exploiting output-length similarity among requests that share a prompt, but the evidence for how strong or general that similarity is remains thin.

The central claim is that Seer achieves up to 2.04× end-to-end rollout throughput and cuts long-tail latency by 72-94% in synchronous LLM RL, relying on the observation that requests sharing the same prompt produce outputs of similar length and pattern. The system combines divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding to improve resource utilization during the rollout phase, which dominates iteration time.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Seer, a context-learning system for synchronous LLM reinforcement learning. It rests on the observation that requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. Seer proposes three coordinated techniques—divided rollout for dynamic load balancing, context-aware scheduling to reduce long-tail delays, and adaptive grouped speculative decoding—and reports up to 2.04× end-to-end rollout throughput gains together with 72–94% reductions in long-tail latency on production-grade RL workloads relative to prior synchronous RL systems.

Significance. If the empirical results and the underlying similarity assumption are robustly validated, the work would offer a practical advance in mitigating rollout imbalance and improving resource utilization during LLM RL training. The emphasis on production workloads and the integration of scheduling with speculative decoding are positive aspects; however, the absence of quantified support for the core assumption and baseline details reduces the immediate strength of the contribution.

major comments (2)

[§5 (Evaluation)] The reported 2.04× throughput and 72–94% latency improvements are stated without specifying the exact baselines, workload characteristics (prompt distributions, model sizes, generation parameters), number of trials, statistical significance, or controls for hardware or software confounding factors, leaving the central performance claim only partially supported.
[§3 (Design / Observation)] All three techniques (divided rollout, context-aware scheduling, adaptive grouped speculative decoding) are predicated on strong similarities in output lengths and response patterns for identical prompts, yet the manuscript supplies no quantitative evidence such as length variance, correlation coefficients, or ablation results when the assumption is relaxed or under varying temperature/top-p settings.

minor comments (2)

[Figures] Figure captions and axis labels should explicitly state the number of runs and error bars used for throughput and latency measurements.
[Related Work] Add a short related-work subsection contrasting Seer with recent speculative-decoding and RLHF scheduling papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will strengthen the manuscript accordingly.

Referee: [§5 (Evaluation)] The reported 2.04× throughput and 72–94% latency improvements are stated without specifying the exact baselines, workload characteristics (prompt distributions, model sizes, generation parameters), number of trials, statistical significance, or controls for hardware or software confounding factors, leaving the central performance claim only partially supported.

Authors: We agree that additional experimental details will improve clarity and reproducibility. In the revised version we will expand §5 to explicitly list the baselines (including the precise synchronous RL systems and versions compared against), workload characteristics such as prompt length distributions, model sizes, and sampling parameters (temperature, top-p), the number of independent trials performed, and any statistical significance tests applied. We will also document hardware configuration, software stack, and controls used to isolate confounding factors.

Revision: yes

Referee: [§3 (Design / Observation)] All three techniques (divided rollout, context-aware scheduling, adaptive grouped speculative decoding) are predicated on strong similarities in output lengths and response patterns for identical prompts, yet the manuscript supplies no quantitative evidence such as length variance, correlation coefficients, or ablation results when the assumption is relaxed or under varying temperature/top-p settings.

Authors: The similarity observation underpins the design, and we acknowledge that the current manuscript presents it primarily qualitatively. We will add quantitative support in a revised §3, including measured output-length variance and Pearson correlation coefficients across requests sharing identical prompts, together with sensitivity results under different temperature and top-p values. We will also include an ablation that relaxes the similarity assumption to quantify its impact on the three techniques.

Revision: yes
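The quantities named in this exchange (within-group output-length variance and Pearson correlation across requests sharing a prompt) can be computed along the following lines. This is a minimal sketch of one way to run such a check, not the authors' analysis. It assumes the rollout log is available as a mapping from a prompt identifier to the list of output lengths for that prompt, and all function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def within_group_pairs(groups):
    """Pair up output lengths of requests that share a prompt."""
    xs, ys = [], []
    for lengths in groups.values():
        for a, b in combinations(lengths, 2):
            # Add both orderings so the statistic does not depend on order.
            xs += [a, b]
            ys += [b, a]
    return xs, ys


def variance_ratio(groups):
    """Mean within-group variance divided by the variance of all lengths."""
    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    all_lengths = [v for lengths in groups.values() for v in lengths]
    within = sum(var(lengths) for lengths in groups.values()) / len(groups)
    return within / var(all_lengths)
```

A correlation near 1 from `pearson(*within_group_pairs(groups))` together with a small `variance_ratio(groups)` would be consistent with the similarity premise, while values near 0 and 1 respectively would count against it. Both functions divide by a variance, so they are undefined when every length is identical.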

Circularity Check

0 steps flagged

No circularity; empirical system design validated by measurements

The paper presents an empirical observation (same-prompt requests exhibit similar output lengths and response patterns) as the motivating insight, then describes three engineering techniques built on it and reports measured improvements (up to 2.04× throughput and 72-94% long-tail latency reduction) from evaluations on production workloads. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. There are no self-citations, uniqueness theorems, or ansatzes that reduce any claim to its own inputs by construction. The logic chain is self-contained: observation → system design → external benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about prompt similarity and introduces no free parameters or new entities in the abstract description.

axioms (1)

domain assumption: Requests sharing the same prompt exhibit strong similarities in output lengths and response patterns.
This observation is explicitly stated as the foundation for the three techniques.

pith-pipeline@v0.9.0 · 5509 in / 1192 out tokens · 42700 ms · 2026-05-17T20:34:32.938082+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "requests sharing the same prompt exhibit strong similarities in output lengths and response patterns... divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Seer achieves up to 2.04× end-to-end rollout throughput improvement... reducing long-tail latency by 72-94%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
cs.LG 2026-05 conditional novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
cs.LG 2026-05 unverdicted novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
cs.DC 2026-05 unverdicted novelty 6.0

ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
cs.DC 2026-05 unverdicted novelty 6.0

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
cs.LG 2026-04 unverdicted novelty 6.0

JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
cs.LG 2026-01 unverdicted novelty 6.0

FP8-RL delivers up to 44% faster rollouts in LLM RL by using blockwise FP8 quantization, KV-cache recalibration, and importance-sampling corrections while keeping learning behavior close to BF16 baselines.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 6 Pith papers · 18 internal anchors

[1]

Checkpoint engine

Moontshot AI. Checkpoint engine. https://github. com/MoonshotAI/checkpoint-engine, 2025

work page 2025
[2]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sang- hai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Sim- ple llm inference acceleration framework with multi- ple decoding heads.arXiv preprint arXiv:2401.10774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Evaluating Large Language Models Trained on Code

Mark Chen. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Enabling efficient batch serving for lmaas via generation length prediction

Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, and Sheng Zhang. Enabling efficient batch serving for lmaas via generation length prediction. In2024 IEEE In- ternational Conference on Web Services (ICWS), pages 853–864. IEEE, 2024

work page 2024
[6]

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Ji- ashu Wang, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Break the sequential dependency of llm inference using lookahead decoding

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

work page arXiv 2024
[8]

Efficient llm scheduling by learning to rank.Advances in Neural Information Processing Systems, 37:59006–59029, 2024

Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Sto- ica, and Hao Zhang. Efficient llm scheduling by learning to rank.Advances in Neural Information Processing Systems, 37:59006–59029, 2024

work page 2024
[9]

Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training.arXiv preprint arXiv:2509.21009, 2025

Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, et al. Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training.arXiv preprint arXiv:2509.21009, 2025

work page arXiv 2025
[10]

Deepseek-r1 incentivizes rea- soning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shi- rong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes rea- soning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025
[11]

Asyncflow: An asynchronous streaming rl framework for efficient llm post-training.arXiv preprint arXiv:2507.01663, 2025

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training.arXiv preprint arXiv:2507.01663, 2025

work page arXiv 2025
[12]

De- feating nondeterminism in llm inference.Think- ing Machines Lab: Connectionism, 2025

Horace He and Thinking Machines Lab. De- feating nondeterminism in llm inference.Think- ing Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/defeating- nondeterminism-in-llm-inference/

work page 2025
[13]

History rhymes: Accelerating llm reinforcement learning with rhymerl.arXiv preprint arXiv:2508.18588, 2025

Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl. arXiv preprint arXiv:2508.18588, 2025

work page arXiv 2025
[14]

Brorl: Scaling reinforcement learning via broadened exploration.arXiv preprint arXiv:2510.01180, 2025

Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Har- chaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, et al. Brorl: Scaling reinforcement learning via broadened exploration.arXiv preprint arXiv:2510.01180, 2025

work page arXiv 2025
[15]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu, Xibin Wu, Zilin Zhu, Weixun Wang, Dehao Zhang, Yu Cao, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Sam decoding: Speculative decoding via suffix automaton

Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, and Jing Zhang. Sam decoding: Speculative decoding via suffix automaton. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 12187–12204, 2025

work page 2025
[17]

Gonza- lez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonza- lez, Hao Zhang, and Ion Stoica. Efficient memory man- agement for large language model serving with page- dattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[18]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023
[19]

Eagle-2: Faster inference of language models with dynamic draft trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees.arXiv preprint arXiv:2406.16858, 2024

work page arXiv 2024
[20]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Knapsack rl: Unlocking exploration of llms via optimizing budget allocation

Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, and Zhi-Quan Luo. Knapsack rl: Unlocking exploration of llms via optimizing budget allocation.arXiv preprint arXiv:2509.25849, 2025. 14

work page arXiv 2025
[23]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, econom- ical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 techni- cal report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts.arXiv preprint arXiv:2509.23232, 2025a

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, and Jinsong Su. Spec- rl: Accelerating on-policy reinforcement learning via speculative rollouts.arXiv preprint arXiv:2509.23232, 2025

work page arXiv 2025
[26]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Part ii: Roll flash–accelerating rlvr and agentic training with asynchrony.arXiv preprint arXiv:2510.11345, 2025

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, et al. Part ii: Roll flash–accelerating rlvr and agentic training with asynchrony.arXiv preprint arXiv:2510.11345, 2025

work page arXiv 2025
[28]

Realhf: Optimized rlhf training for large language models through parameter reallocation.arXiv preprint arXiv:2406.14088, 2024

Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf train- ing of large language models with parameter realloca- tion.arXiv preprint arXiv:2406.14088, 2024

work page arXiv 2024
[29]

Suffixdecoding: Extreme speculative decod- ing for emerging ai applications

Gabriele Oliaro, Zhihao Jia, Daniel F Campos, and Au- rick Qiao. Suffixdecoding: Extreme speculative decod- ing for emerging ai applications. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[30]

Pipelinerl: Faster on-policy reinforcement learning for long sequence gen- eration.arXiv preprint arXiv:2509.19128, 2025

Alexandre Piché, Ehsan Kamalloo, Rafael Pardinas, Xi- aoyin Chen, and Dzmitry Bahdanau. Pipelinerl: Faster on-policy reinforcement learning for long sequence gen- eration.arXiv preprint arXiv:2509.19128, 2025

work page arXiv 2025
[31]

Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot. In23rd USENIX Confer- ence on File and Storage Technologies (F AST 25), pages 155–170, 2025

work page 2025
[32]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Nemo-aligner: Scalable toolkit for efficient model alignment.arXiv preprint arXiv:2405.01481, 2024

Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, et al. Nemo-aligner: Scal- able toolkit for efficient model alignment.arXiv preprint arXiv:2405.01481, 2024

work page arXiv 2024
[34]

Laminar: A scalable asyn- chronous rl post-training framework.arXiv preprint arXiv:2510.12633, 2025

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, et al. Laminar: A scalable asyn- chronous rl post-training framework.arXiv preprint arXiv:2510.12633, 2025

work page arXiv 2025
[35]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

work page 2025
[36]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[37]

Llm-as-a-judge & reward model: What they can and cannot do.arXiv preprint arXiv:2409.11239, 2024

Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, and Seunghyeok Hong. Llm-as-a-judge & reward model: What they can and cannot do.arXiv preprint arXiv:2409.11239, 2024

work page arXiv 2024
[38]

Llumnix: Dynamic scheduling for large language model serving

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 173–191, 2024

work page 2024
[39]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Ji- ahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agen- tic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scal- ing reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Introducing LongCat-flash-thinking: A technical report

Meituan LongCat Team. Introducing longcat- flash-thinking: A technical report.arXiv preprint arXiv:2509.18883, 2025

work page arXiv 2025
[42]

vllm: A high-throughput and memory- efficient inference and serving engine for llms

vLLM Team. vllm: A high-throughput and memory- efficient inference and serving engine for llms. https: //github.com/vllm-project/vllm, 2025

[43] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

[44] Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. Opt-tree: Speculative decoding with adaptive draft tree structure. Transactions of the Association for Computational Linguistics, 13:188–199, 2025

[45] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[46] Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122, 2025

[47] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[48] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (swat 1973), pages 1–11. IEEE, 1973

[49] Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z Morley Mao, Arvind Krishnamurthy, and Ion Stoica. Rlboost: Harvesting preemptible resources for cost-efficient reinforcement learning on llms. arXiv preprint arXiv:2510.19225, 2025

[50] Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, et al. Deepspeed-chat: Easy, fast and affordable rlhf training of chatgpt-like models at all scales. arXiv preprint arXiv:2308.01320, 2023

[51] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[52] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

[53] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024

[54] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025

[55] Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. Optimizing RLHF training for large language models with stage fusion. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 489–503, 2025

[56] Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, et al. April: Active partial rollouts in reinforcement learning to tame long-tail generation. arXiv preprint arXiv:2509.18521, 2025

[57] Zilin Zhu, Chengxing Xie, Xin Lv, and slime Contributors. slime: An llm post-training framework for rl scaling. https://github.com/THUDM/slime, 2025. GitHub repository. Corresponding author: Xin Lv.


