Recognition: 2 theorem links
· Lean TheoremMemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Pith reviewed 2026-05-15 11:11 UTC · model grok-4.3
The pith
MemAgent lets LLMs handle millions of tokens by segmenting input and overwriting memory after RL training on 32K texts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemAgent reads text in segments and updates the memory using an overwrite strategy. Training occurs via an extension of the DAPO algorithm that supports independent-context multi-conversation generation. This workflow produces strong extrapolation: an 8K-context model trained on 32K text reaches 3.5M-token QA tasks with performance loss below 5 percent and exceeds 95 percent accuracy on 512K RULER evaluations.
What carries the argument
The overwrite memory strategy inside the agent workflow, trained end-to-end with extended DAPO through independent-context multi-conversation reinforcement learning.
If this is right
- Long documents can be processed at linear cost without full-context attention.
- Training at moderate lengths transfers to far longer inference tasks.
- End-to-end RL optimization replaces the need for separate length-extrapolation tricks.
- Memory overwrite provides a controllable way to manage information retention across segments.
Where Pith is reading between the lines
- The same segmented overwrite pattern could transfer to code repositories or long video streams by treating them as sequential segments.
- Further length scaling may still need occasional memory compression steps once overwrite alone saturates.
- Existing LLMs could adopt the agent loop as a lightweight wrapper rather than retraining the base model for larger windows.
Load-bearing premise
The overwrite memory strategy together with multi-conversation RL training will keep preventing performance degradation when context lengths grow well past the 32K training scale.
What would settle it
Run the trained MemAgent on a 10-million-token QA benchmark or extended RULER suite and check whether accuracy falls more than 5 percent relative to the 512K results.
read the original abstract
Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MemAgent, a novel agent workflow for long-context LLMs that processes input text in segments and updates an internal memory state using an overwrite strategy. It extends the DAPO RL algorithm to support training via independent-context multi-conversation generation. The central empirical claim is that a model trained on 32K contexts with 8K context length can extrapolate to a 3.5M-token QA task with <5% performance degradation and achieve 95%+ accuracy on the 512K RULER benchmark.
Significance. If the reported scaling results hold under rigorous verification, the work would offer a practical path toward linear-complexity long-context processing that avoids the degradation typically seen in length-extrapolation methods, with potential impact on applications involving very long documents.
major comments (1)
- [Abstract] Abstract: The headline extrapolation claim (8K-trained model to 3.5M QA with <5% loss and 95%+ on 512K RULER) is presented without any reported memory-state size, overwrite frequency, ablation results on information retention at intermediate lengths (128K–1M), baselines, error bars, or exact training configuration. This leaves the core assumption—that RL-trained overwrite updates remain bounded and preserve task-critical information across 3.5M tokens—unverified and load-bearing for the central claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We agree that the abstract's brevity left several key parameters and supporting analyses implicit, which weakens the presentation of our central extrapolation result. We will revise the abstract and add explicit references to the relevant sections and figures to address this.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline extrapolation claim (8K-trained model to 3.5M QA with <5% loss and 95%+ on 512K RULER) is presented without any reported memory-state size, overwrite frequency, ablation results on information retention at intermediate lengths (128K–1M), baselines, error bars, or exact training configuration. This leaves the core assumption—that RL-trained overwrite updates remain bounded and preserve task-critical information across 3.5M tokens—unverified and load-bearing for the central claim.
Authors: We accept this criticism. The memory state is fixed at 8K tokens with overwrite performed after every 4K-token segment (detailed in Section 3.2 and Figure 2). Ablation results on retention at 128K, 256K, 512K, and 1M tokens appear in Figure 5 and Table 3, showing graceful degradation until 1M. Baselines include standard long-context LLMs (Llama-3-8K, Qwen2-32K) and retrieval-augmented methods; error bars from three random seeds are reported in the appendix. Exact training configuration (DAPO hyperparameters, multi-conversation sampling) is in Appendix A. In the revision we will expand the abstract to one additional sentence summarizing these parameters and add a parenthetical reference to the relevant figures. We believe the empirical results already demonstrate bounded overwrite behavior, but we agree the abstract should make the supporting evidence explicit. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's core claims rest on an empirical pipeline: segment-wise reading with overwrite memory, training via an extension of the DAPO algorithm on 32K contexts, and direct evaluation on held-out longer sequences (3.5M QA, 512K RULER). Performance numbers are reported as measured outcomes from these experiments, not as quantities algebraically derived from fitted parameters or self-referential equations within the paper. No load-bearing step reduces to a self-definition, a renamed fit, or an unverified self-citation chain; the extrapolation is tested rather than assumed by construction. This is the normal case of an empirical systems paper whose results remain externally falsifiable.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 22 Pith papers
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.
-
LMEB: Long-horizon Memory Embedding Benchmark
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
-
An Agentic Approach to Metadata Reasoning
Metadata Reasoner uses agentic LLM reasoning on metadata to select sufficient and minimal data sources, achieving 83.16% F1 on KramaBench and 85.5% F1 on noisy synthetic benchmarks while avoiding low-quality tables 99...
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
MEMENTO: Teaching LLMs to Manage Their Own Context
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Decocted Experience Improves Test-Time Inference in LLM Agents
Decocted experience—extracting and organizing the essence from accumulated interactions—enables more effective context construction that improves test-time inference in LLM agents on math, web, and software tasks.
-
LightThinker++: From Reasoning Compression to Memory Management
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
-
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
-
MemFactory: Unified Inference & Training Framework for Agent Memory
MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.
-
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue
MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization f...
-
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
-
MiA-Signature: Approximating Global Activation for Long-Context Understanding
MiA-Signature approximates the global activation state induced by a query via submodular concept selection to enable tractable long-context understanding in LLMs.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
-
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.
Reference graph
Works this paper leans on
-
[1]
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christo- pher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering.arXiv preprint arXiv:1809.09600, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [3]
- [4]
-
[5]
Grok 3 beta — the age of reasoning agents, 2024
XAI. Grok 3 beta — the age of reasoning agents, 2024
work page 2024
- [6]
-
[7]
OpenAI. GPT4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [8]
-
[9]
Minimax-01: Scaling foundation models with lightning attention.arXiv preprint arXiv:2501.08313, 2025
Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention.arXiv preprint arXiv:2501.08313, 2025
-
[10]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
work page 2024
-
[12]
bloc97. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., 2023
work page 2023
-
[13]
Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Training-free long-context scaling of large language models.arXiv preprint arXiv:2402.17463, 2024
Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models.arXiv preprint arXiv:2402.17463, 2024
-
[16]
Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of rope-based extrapolation. arXiv preprint arXiv:2310.05209, 2023
-
[17]
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models.arXiv preprint arXiv:2309.16039, 2023
-
[18]
Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, and Songlin Hu. Nextlong: Toward effective long-context training without long documents.arXiv preprint arXiv:2501.12766, 2025
-
[19]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[20]
Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun. Explicit sparse transformer: Concentrated attention through explicit selection.arXiv preprint arXiv:1912.11637, 2019
-
[21]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019. 14
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[23]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–
- [24]
-
[25]
Yucheng Li, Bo Dong, Chenghua Lin, and Frank Guerin. Compressing context to enhance inference efficiency of large language models.arXiv preprint arXiv:2310.06201, 2023
-
[26]
Titans: Learning to Memorize at Test Time
Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Test-time training on graphs with large language models (llms)
Jiaxin Zhang, Yiqi Wang, Xihong Yang, Siwei Wang, Yu Feng, Yu Shi, Ruichao Ren, En Zhu, and Xinwang Liu. Test-time training on graphs with large language models (llms). InProceedings of the 32nd ACM International Conference on Multimedia, pages 2089–2098, 2024
work page 2089
-
[28]
The magical number seven, plus or minus two.Psychological review, 63(2):81–97, 1956
George A Miller et al. The magical number seven, plus or minus two.Psychological review, 63(2):81–97, 1956
work page 1956
-
[29]
Long short-term memory.Neural computation, 9(8):1735–1780, 1997
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997
work page 1997
-
[30]
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[31]
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks.arXiv preprint arXiv:1410.3916, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[32]
Training powerful llm agents with end-to-end reinforcement learning, 2025
Jie Ouyang, Ruiran Yan, Yucong Luo, Mingyue Cheng, Qi Liu, Zirui Liu, Shuo Yu, and Daoyu Wang. Training powerful llm agents with end-to-end reinforcement learning, 2025
work page 2025
-
[33]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Group-in-Group Policy Optimization for LLM Agent Training
Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[37]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era.arXiv preprint arXiv:2305.13048, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Attention as an rnn.arXiv preprint arXiv:2405.13956, 2024
Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Mohamed Osama Ahmed, Yoshua Bengio, and Greg Mori. Attention as an rnn.arXiv preprint arXiv:2405.13956, 2024
-
[41]
Native sparse attention: Hardware-aligned and natively trainable sparse attention
Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089, 2025
-
[42]
MoBA : Mixture of block attention for long-context LLMs
Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms.arXiv preprint arXiv:2502.13189, 2025
- [43]
-
[44]
Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, and Zhou Yu. Memformer: A memory- augmented transformer for sequence modeling.arXiv preprint arXiv:2010.06891, 2020
-
[45]
Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023
Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, and Mikhail S Burtsev. Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023
-
[46]
Memorybank: Enhancing large language models with long-term memory
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024
work page 2024
-
[47]
Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023
-
[48]
Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards a general read-write memory for large language models.arXiv preprint arXiv:2305.14322, 2023
-
[49]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...
work page 2022
-
[50]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[51]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Qwq-32b: Embracing the power of reinforcement learning, 2024
Qwen. Qwq-32b: Embracing the power of reinforcement learning, 2024
work page 2024
-
[53]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[55]
High-dimensional continuous control using generalized advantage estimation, 2018
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018
work page 2018
-
[56]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025
Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025
work page 2025
- [60]
-
[61]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv:2409.19256, 2024. 16
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning
Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, and Ming Yan. Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning. arXiv preprint arXiv:2505.17667, 2025
-
[64]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[65]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text.arXiv preprint arXiv:1606.05250, 2016. 17 Appendix A Computation Complexity We adopt the floating-point operations (FLOP) estimator for the Qwen2Model from verl [61] to compute the FLOP cost of both the baseline model and our prop...
work page internal anchor Pith review Pith/arXiv arXiv 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.