Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
Pith reviewed 2026-05-17 20:34 UTC · model grok-4.3
The pith
Seer improves synchronous LLM RL rollout throughput by up to 2.04 times by learning output similarities from shared prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seer is a context learning RL system that addresses rollout bottlenecks by using the fact that requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. It coordinates three techniques: divided rollout for dynamic load balancing, context-aware scheduling to mitigate long-tail request delays, and adaptive grouped speculative decoding to accelerate generation. On production-grade RL workloads this yields up to 2.04 times higher end-to-end rollout throughput than prior synchronous systems while cutting long-tail latency by 72-94 percent.
What carries the argument
Online context learning that detects output-length and pattern similarities among same-prompt requests and feeds them into coordinated load balancing, scheduling, and grouped speculative decoding.
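As a rough illustration of what learning from same-prompt context could look like, the sketch below records finished output lengths per prompt group and uses them to estimate the length of siblings that are still running. The class name `GroupLengthEstimator`, the max-based estimate, and the fallback default are assumptions made for illustration; this page does not say how Seer forms its estimates.

```python
from collections import defaultdict


class GroupLengthEstimator:
    """Hypothetical online estimator: finished output lengths kept per prompt group."""

    def __init__(self, default_length=1024):
        # Assumed fallback for groups in which no request has finished yet.
        self.default_length = default_length
        self.finished = defaultdict(list)

    def observe(self, prompt_id, output_length):
        """Record the output length of a request that has just finished."""
        self.finished[prompt_id].append(output_length)

    def estimate(self, prompt_id):
        """Estimate the output length of an unfinished request in this group.

        Uses the longest finished sibling as a conservative guess (an assumption).
        """
        lengths = self.finished.get(prompt_id)
        if not lengths:
            return self.default_length
        return max(lengths)


estimator = GroupLengthEstimator(default_length=1024)
estimator.observe("prompt-a", 300)
estimator.observe("prompt-a", 420)
print(estimator.estimate("prompt-a"))  # 420
print(estimator.estimate("prompt-b"))  # 1024
```

Taking the maximum rather than the mean biases the estimate toward the long tail, which is the case a scheduler most needs to see early; other summaries would be equally plausible here.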
If this is right
- Divided rollout spreads work across workers according to predicted output lengths to reduce idle time.
- Context-aware scheduling detects and prioritizes long-tail requests to shrink overall iteration time.
- Adaptive grouped speculative decoding generates tokens faster for batches that share prompt-derived patterns.
- The combined effect raises rollout throughput while preserving the synchronous structure required by RL training.
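The first two implications can be pictured with a classic greedy heuristic: place the requests with the longest predicted outputs first, each on the currently least-loaded worker. This is a one-shot, static sketch under assumed inputs (the function name `assign_longest_first` and the per-request predictions are hypothetical); the divided rollout described for Seer is dynamic, and its actual policy is not given on this page.

```python
import heapq


def assign_longest_first(predicted_lengths, num_workers):
    """Greedy longest-processing-time-first assignment.

    predicted_lengths: dict mapping request id -> predicted output length (tokens).
    Returns a list of (total_predicted_load, request_ids), one entry per worker.
    """
    # Min-heap of (current load, worker index): the least-loaded worker pops first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_workers)]

    # Longest predicted requests go first; ties are broken by request id.
    ordered = sorted(predicted_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for req_id, length in ordered:
        load, worker = heapq.heappop(heap)
        assignments[worker].append(req_id)
        heapq.heappush(heap, (load + length, worker))

    loads = [0] * num_workers
    for load, worker in heap:
        loads[worker] = load
    return list(zip(loads, assignments))


predicted = {"r1": 900, "r2": 100, "r3": 400, "r4": 350, "r5": 150}
for load, reqs in assign_longest_first(predicted, num_workers=2):
    print(load, reqs)
# 1000 ['r1', 'r2']
# 900 ['r3', 'r4', 'r5']
```

Scheduling the predicted long-tail request `r1` first lets the shorter requests fill in around it, so neither worker sits idle while the other finishes a late-started long generation.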
Where Pith is reading between the lines
- The same prompt-similarity grouping could be tested in non-RL LLM inference servers that already batch repeated prompts.
- Performance sensitivity to prompt diversity suggests a follow-up experiment that mixes prompt families within a single rollout batch.
Load-bearing premise
Requests sharing the same prompt exhibit strong similarities in output lengths and response patterns.
What would settle it
Run the system on workloads where prompts within each batch are deliberately varied and dissimilar, then check whether the reported throughput gains and latency reductions largely vanish.
Original abstract
Reinforcement Learning (RL) has emerged as a critical technique for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel context learning RL system that addresses these challenges through a key observation: requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. Leveraging this insight, Seer introduces three coordinated techniques: (1) divided rollout for dynamic load balancing, (2) context-aware scheduling to mitigate long-tail request delays, and (3) adaptive grouped speculative decoding to accelerate generation. These mechanisms work in concert to markedly reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer achieves up to 2.04$\times$ end-to-end rollout throughput improvement compared to the state-of-the-art synchronous RL systems, while notably reducing long-tail latency by 72-94%.
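One way to picture how shared response patterns could feed speculative decoding is an n-gram lookup over sibling outputs: if the tokens just generated also appear in another response to the same prompt, the tokens that followed there become the draft. The function `propose_draft` and its parameters are hypothetical, and the sketch covers only draft proposal; in speculative decoding the target model must still verify every drafted token. Seer's adaptive grouped mechanism is not specified in the abstract and may differ substantially.

```python
def propose_draft(current_tokens, sibling_outputs, ngram=3, max_draft=4):
    """Propose draft tokens by matching the current suffix against sibling outputs.

    current_tokens: token ids generated so far for this request.
    sibling_outputs: token-id lists from other requests with the same prompt.
    Returns up to max_draft tokens that followed the same n-gram in a sibling,
    or an empty list when no sibling contains the suffix.
    """
    if len(current_tokens) < ngram:
        return []
    suffix = current_tokens[-ngram:]
    for tokens in sibling_outputs:
        # Stop early enough that at least one token follows the matched n-gram.
        for start in range(len(tokens) - ngram):
            if tokens[start:start + ngram] == suffix:
                follow = start + ngram
                return tokens[follow:follow + max_draft]
    return []


siblings = [[5, 8, 2, 9, 4, 7, 1], [5, 8, 3, 6]]
print(propose_draft([0, 8, 2, 9], siblings))  # [4, 7, 1]
print(propose_draft([1, 1, 1], siblings))     # []
```

A linear scan is used here for clarity; a production system would more likely index sibling outputs with a suffix structure so that lookups stay cheap as responses grow.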
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Seer, a context-learning system for synchronous LLM reinforcement learning. It rests on the observation that requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. Seer proposes three coordinated techniques—divided rollout for dynamic load balancing, context-aware scheduling to reduce long-tail delays, and adaptive grouped speculative decoding—and reports up to 2.04× end-to-end rollout throughput gains together with 72–94% reductions in long-tail latency on production-grade RL workloads relative to prior synchronous RL systems.
Significance. If the empirical results and the underlying similarity assumption are robustly validated, the work would offer a practical advance in mitigating rollout imbalance and improving resource utilization during LLM RL training. The emphasis on production workloads and the integration of scheduling with speculative decoding are positive aspects; however, the absence of quantified support for the core assumption and baseline details reduces the immediate strength of the contribution.
major comments (2)
- [§5 (Evaluation)] The reported 2.04× throughput and 72–94% latency improvements are stated without specifying the exact baselines, workload characteristics (prompt distributions, model sizes, generation parameters), number of trials, statistical significance, or controls for hardware or software confounding factors, leaving the central performance claim only partially supported.
- [§3 (Design / Observation)] All three techniques (divided rollout, context-aware scheduling, adaptive grouped speculative decoding) are predicated on strong similarities in output lengths and response patterns for identical prompts, yet the manuscript supplies no quantitative evidence such as length variance, correlation coefficients, or ablation results when the assumption is relaxed or under varying temperature/top-p settings.
minor comments (2)
- [Figures] Figure captions and axis labels should explicitly state the number of runs and error bars used for throughput and latency measurements.
- [Related Work] Add a short related-work subsection contrasting Seer with recent speculative-decoding and RLHF scheduling papers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will strengthen the manuscript accordingly.
- Referee: [§5 (Evaluation)] The reported 2.04× throughput and 72–94% latency improvements are stated without specifying the exact baselines, workload characteristics (prompt distributions, model sizes, generation parameters), number of trials, statistical significance, or controls for hardware or software confounding factors, leaving the central performance claim only partially supported.
  Authors: We agree that additional experimental details will improve clarity and reproducibility. In the revised version we will expand §5 to explicitly list the baselines (including the precise synchronous RL systems and versions compared against), workload characteristics such as prompt length distributions, model sizes, and sampling parameters (temperature, top-p), the number of independent trials performed, and any statistical significance tests applied. We will also document hardware configuration, software stack, and controls used to isolate confounding factors.
  Revision: yes
- Referee: [§3 (Design / Observation)] All three techniques (divided rollout, context-aware scheduling, adaptive grouped speculative decoding) are predicated on strong similarities in output lengths and response patterns for identical prompts, yet the manuscript supplies no quantitative evidence such as length variance, correlation coefficients, or ablation results when the assumption is relaxed or under varying temperature/top-p settings.
  Authors: The similarity observation underpins the design, and we acknowledge that the current manuscript presents it primarily qualitatively. We will add quantitative support in a revised §3, including measured output-length variance and Pearson correlation coefficients across requests sharing identical prompts, together with sensitivity results under different temperature and top-p values. We will also include an ablation that relaxes the similarity assumption to quantify its impact on the three techniques.
  Revision: yes
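The quantitative support discussed in this exchange could take a form like the sketch below, which computes two statistics from per-prompt output lengths: the share of total length variance that lies within prompt groups, and the Pearson correlation between lengths of sibling pairs. The function names and the toy data are hypothetical and are not results from the paper; the sketch also assumes at least one group has two or more samples and that lengths are not all identical.

```python
from itertools import combinations
from statistics import mean, pvariance


def within_group_variance_share(groups):
    """Fraction of total output-length variance that lies within prompt groups.

    groups: dict mapping prompt id -> list of output lengths.
    Values near 0 mean same-prompt outputs have similar lengths.
    """
    all_lengths = [x for lengths in groups.values() for x in lengths]
    total_var = pvariance(all_lengths)
    if total_var == 0:
        return 0.0
    # Size-weighted average of the per-group population variances.
    within = sum(len(v) * pvariance(v) for v in groups.values()) / len(all_lengths)
    return within / total_var


def sibling_pearson(groups):
    """Pearson correlation over all pairs of lengths drawn from the same prompt."""
    xs, ys = [], []
    for lengths in groups.values():
        for a, b in combinations(lengths, 2):
            # Add both orderings so the statistic is symmetric in the pair.
            xs += [a, b]
            ys += [b, a]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data for illustration only: two prompts with tightly clustered lengths.
toy = {"p1": [100, 110, 90], "p2": [1000, 1050, 950]}
print(within_group_variance_share(toy))
print(sibling_pearson(toy))
```

Repeating the same computation at several temperature and top-p settings would show how quickly the similarity erodes as sampling becomes more diverse, which is the sensitivity the referee asks about.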
Circularity Check
No circularity; empirical system design validated by measurements
The paper presents an empirical observation (same-prompt requests exhibit similar output lengths and response patterns) as the motivating insight, then describes three engineering techniques built on it and reports measured improvements (up to 2.04× throughput and 72-94% long-tail latency reduction) from evaluations on production workloads. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. There are no self-citations, uniqueness theorems, or ansatzes that reduce any claim to its own inputs by construction. The logic chain is self-contained: observation → system design → external benchmark results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Requests sharing the same prompt exhibit strong similarities in output lengths and response patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: requests sharing the same prompt exhibit strong similarities in output lengths and response patterns... divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Seer achieves up to 2.04× end-to-end rollout throughput improvement... reducing long-tail latency by 72-94%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
- ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
  ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
- Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
  Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
  ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
- ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
  ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
- JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
  JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
- FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
  FP8-RL delivers up to 44% faster rollouts in LLM RL by using blockwise FP8 quantization, KV-cache recalibration, and importance-sampling corrections while keeping learning behavior close to BF16 baselines.