pith. machine review for the scientific record. sign in

arxiv: 2507.02259 · v1 · submitted 2025-07-03 · 💻 cs.CL · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Authors on Pith no claims yet

Pith reviewed 2026-05-15 11:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords long-context LLMmemory agentoverwrite strategyreinforcement learninglength extrapolationmulti-conversation trainingDAPO algorithm
0
0 comments X

The pith

MemAgent lets LLMs handle millions of tokens by segmenting input and overwriting memory after RL training on 32K texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the difficulty of processing infinitely long documents without performance drops or quadratic costs. MemAgent works by breaking text into segments, maintaining memory through an overwrite update rule, and training the whole system end-to-end with an extended DAPO reinforcement learning method that uses multiple independent conversations. A model trained at 8K context on 32K-length data then extrapolates to 3.5M-token question answering with under 5 percent loss and scores above 95 percent on 512K RULER benchmarks. If the approach holds, long-context tasks become feasible with linear scaling and no need for ever-larger attention windows.

Core claim

MemAgent reads text in segments and updates the memory using an overwrite strategy. Training occurs via an extension of the DAPO algorithm that supports independent-context multi-conversation generation. This workflow produces strong extrapolation: an 8K-context model trained on 32K text reaches 3.5M-token QA tasks with performance loss below 5 percent and exceeds 95 percent accuracy on 512K RULER evaluations.

What carries the argument

The overwrite memory strategy inside the agent workflow, trained end-to-end with extended DAPO through independent-context multi-conversation reinforcement learning.

If this is right

  • Long documents can be processed at linear cost without full-context attention.
  • Training at moderate lengths transfers to far longer inference tasks.
  • End-to-end RL optimization replaces the need for separate length-extrapolation tricks.
  • Memory overwrite provides a controllable way to manage information retention across segments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same segmented overwrite pattern could transfer to code repositories or long video streams by treating them as sequential segments.
  • Further length scaling may still need occasional memory compression steps once overwrite alone saturates.
  • Existing LLMs could adopt the agent loop as a lightweight wrapper rather than retraining the base model for larger windows.

Load-bearing premise

The overwrite memory strategy together with multi-conversation RL training will keep preventing performance degradation when context lengths grow well past the 32K training scale.

What would settle it

Run the trained MemAgent on a 10-million-token QA benchmark or extended RULER suite and check whether accuracy falls more than 5 percent relative to the 512K results.

read the original abstract

Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces MemAgent, a novel agent workflow for long-context LLMs that processes input text in segments and updates an internal memory state using an overwrite strategy. It extends the DAPO RL algorithm to support training via independent-context multi-conversation generation. The central empirical claim is that a model trained on 32K contexts with 8K context length can extrapolate to a 3.5M-token QA task with <5% performance degradation and achieve 95%+ accuracy on the 512K RULER benchmark.

Significance. If the reported scaling results hold under rigorous verification, the work would offer a practical path toward linear-complexity long-context processing that avoids the degradation typically seen in length-extrapolation methods, with potential impact on applications involving very long documents.

major comments (1)
  1. [Abstract] Abstract: The headline extrapolation claim (8K-trained model to 3.5M QA with <5% loss and 95%+ on 512K RULER) is presented without any reported memory-state size, overwrite frequency, ablation results on information retention at intermediate lengths (128K–1M), baselines, error bars, or exact training configuration. This leaves the core assumption—that RL-trained overwrite updates remain bounded and preserve task-critical information across 3.5M tokens—unverified and load-bearing for the central claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We agree that the abstract's brevity left several key parameters and supporting analyses implicit, which weakens the presentation of our central extrapolation result. We will revise the abstract and add explicit references to the relevant sections and figures to address this.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline extrapolation claim (8K-trained model to 3.5M QA with <5% loss and 95%+ on 512K RULER) is presented without any reported memory-state size, overwrite frequency, ablation results on information retention at intermediate lengths (128K–1M), baselines, error bars, or exact training configuration. This leaves the core assumption—that RL-trained overwrite updates remain bounded and preserve task-critical information across 3.5M tokens—unverified and load-bearing for the central claim.

    Authors: We accept this criticism. The memory state is fixed at 8K tokens with overwrite performed after every 4K-token segment (detailed in Section 3.2 and Figure 2). Ablation results on retention at 128K, 256K, 512K, and 1M tokens appear in Figure 5 and Table 3, showing graceful degradation until 1M. Baselines include standard long-context LLMs (Llama-3-8K, Qwen2-32K) and retrieval-augmented methods; error bars from three random seeds are reported in the appendix. Exact training configuration (DAPO hyperparameters, multi-conversation sampling) is in Appendix A. In the revision we will expand the abstract to one additional sentence summarizing these parameters and add a parenthetical reference to the relevant figures. We believe the empirical results already demonstrate bounded overwrite behavior, but we agree the abstract should make the supporting evidence explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core claims rest on an empirical pipeline: segment-wise reading with overwrite memory, training via an extension of the DAPO algorithm on 32K contexts, and direct evaluation on held-out longer sequences (3.5M QA, 512K RULER). Performance numbers are reported as measured outcomes from these experiments, not as quantities algebraically derived from fitted parameters or self-referential equations within the paper. No load-bearing step reduces to a self-definition, a renamed fit, or an unverified self-citation chain; the extrapolation is tested rather than assumed by construction. This is the normal case of an empirical systems paper whose results remain externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5458 in / 1038 out tokens · 46124 ms · 2026-05-15T11:11:50.392302+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  2. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  3. SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

    cs.CL 2026-05 unverdicted novelty 7.0

    SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.

  4. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  5. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  6. CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

    cs.CL 2026-03 unverdicted novelty 7.0

    CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.

  7. LMEB: Long-horizon Memory Embedding Benchmark

    cs.CL 2026-03 unverdicted novelty 7.0

    LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

  8. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  9. LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.

  10. An Agentic Approach to Metadata Reasoning

    cs.DB 2026-04 unverdicted novelty 6.0

    Metadata Reasoner uses agentic LLM reasoning on metadata to select sufficient and minimal data sources, achieving 83.16% F1 on KramaBench and 85.5% F1 on noisy synthetic benchmarks while avoiding low-quality tables 99...

  11. POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.

  12. MEMENTO: Teaching LLMs to Manage Their Own Context

    cs.AI 2026-04 unverdicted novelty 6.0

    MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

  13. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  14. Decocted Experience Improves Test-Time Inference in LLM Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Decocted experience—extracting and organizing the essence from accumulated interactions—enables more effective context construction that improves test-time inference in LLM agents on math, web, and software tasks.

  15. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  16. GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.

  17. MemFactory: Unified Inference & Training Framework for Agent Memory

    cs.CL 2026-03 unverdicted novelty 6.0

    MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.

  18. MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue

    cs.CL 2026-03 unverdicted novelty 6.0

    MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization f...

  19. MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

    cs.CL 2026-03 unverdicted novelty 6.0

    MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.

  20. MiA-Signature: Approximating Global Activation for Long-Context Understanding

    cs.CL 2026-05 unverdicted novelty 5.0

    MiA-Signature approximates the global activation state induced by a query via submodular concept selection to enable tractable long-context understanding in LLMs.

  21. LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

  22. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 22 Pith papers · 30 internal anchors

  1. [1]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  2. [2]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christo- pher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering.arXiv preprint arXiv:1809.09600, 2018

  3. [3]

    Learning to reason with llms, 2024

    OpenAI. Learning to reason with llms, 2024

  4. [4]

    Gemini 2.0 flash thinking, 2024

    Google DeepMind. Gemini 2.0 flash thinking, 2024

  5. [5]

    Grok 3 beta — the age of reasoning agents, 2024

    XAI. Grok 3 beta — the age of reasoning agents, 2024

  6. [6]

    Claude 3.5 sonnet, 2024

    Anthropic. Claude 3.5 sonnet, 2024

  7. [7]

    GPT-4 Technical Report

    OpenAI. GPT4 technical report.arXiv preprint arXiv:2303.08774, 2023

  8. [8]

    Introducing claude 4, 2025

    Anthropic. Introducing claude 4, 2025

  9. [9]

    Minimax-01: Scaling foundation models with lightning attention.arXiv preprint arXiv:2501.08313, 2025

    Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention.arXiv preprint arXiv:2501.08313, 2025

  10. [10]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  11. [11]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  12. [12]

    NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., 2023

    bloc97. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., 2023

  13. [13]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023

  14. [14]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071, 2023

  15. [15]

    Training-free long-context scaling of large language models.arXiv preprint arXiv:2402.17463, 2024

    Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models.arXiv preprint arXiv:2402.17463, 2024

  16. [16]

    CoRR , volume =

    Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of rope-based extrapolation. arXiv preprint arXiv:2310.05209, 2023

  17. [17]

    CoRR , volume =

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models.arXiv preprint arXiv:2309.16039, 2023

  18. [18]

    Nextlong: Toward effective long-context training without long documents.arXiv preprint arXiv:2501.12766, 2025

    Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, and Songlin Hu. Nextlong: Toward effective long-context training without long documents.arXiv preprint arXiv:2501.12766, 2025

  19. [19]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020

  20. [20]

    Explicit sparse transformer: Concentrated attention through explicit selection.arXiv preprint arXiv:1912.11637, 2019

    Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun. Explicit sparse transformer: Concentrated attention through explicit selection.arXiv preprint arXiv:1912.11637, 2019

  21. [21]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

  22. [22]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019. 14

  23. [23]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–

  24. [24]

    & Qiu, L

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736, 2023

  25. [25]

    Compressing context to enhance inference efficiency of large language models.arXiv preprint arXiv:2310.06201, 2023

    Yucheng Li, Bo Dong, Chenghua Lin, and Frank Guerin. Compressing context to enhance inference efficiency of large language models.arXiv preprint arXiv:2310.06201, 2023

  26. [26]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

  27. [27]

    Test-time training on graphs with large language models (llms)

    Jiaxin Zhang, Yiqi Wang, Xihong Yang, Siwei Wang, Yu Feng, Yu Shi, Ruichao Ren, En Zhu, and Xinwang Liu. Test-time training on graphs with large language models (llms). InProceedings of the 32nd ACM International Conference on Multimedia, pages 2089–2098, 2024

  28. [28]

    The magical number seven, plus or minus two.Psychological review, 63(2):81–97, 1956

    George A Miller et al. The magical number seven, plus or minus two.Psychological review, 63(2):81–97, 1956

  29. [29]

    Long short-term memory.Neural computation, 9(8):1735–1780, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997

  30. [30]

    Neural Turing Machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401, 2014

  31. [31]

    Memory Networks

    Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks.arXiv preprint arXiv:1410.3916, 2014

  32. [32]

    Training powerful llm agents with end-to-end reinforcement learning, 2025

    Jie Ouyang, Ruiran Yan, Yucong Luo, Mingyue Cheng, Qi Liu, Zirui Liu, Shuo Yu, and Daoyu Wang. Training powerful llm agents with end-to-end reinforcement learning, 2025

  33. [33]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

  34. [34]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025

  35. [35]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  36. [36]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

  37. [37]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  38. [38]

    RWKV: Reinventing RNNs for the Transformer Era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era.arXiv preprint arXiv:2305.13048, 2023

  39. [39]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

  40. [40]

    Attention as an rnn.arXiv preprint arXiv:2405.13956, 2024

    Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Mohamed Osama Ahmed, Yoshua Bengio, and Greg Mori. Attention as an rnn.arXiv preprint arXiv:2405.13956, 2024

  41. [41]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089, 2025

  42. [42]

    MoBA : Mixture of block attention for long-context LLMs

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms.arXiv preprint arXiv:2502.13189, 2025

  43. [43]

    Pedro Henrique Martins, Zita Marinho, and André FT Martins.∞-former: Infinite memory transformer.arXiv preprint arXiv:2109.00301, 2021. 15

  44. [44]

    Memformer: A memory- augmented transformer for sequence modeling.arXiv preprint arXiv:2010.06891, 2020

    Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, and Zhou Yu. Memformer: A memory- augmented transformer for sequence modeling.arXiv preprint arXiv:2010.06891, 2020

  45. [45]

    Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

    Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, and Mikhail S Burtsev. Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

  46. [46]

    Memorybank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024

  47. [47]

    Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

    Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

  48. [48]

    Ret-llm: Towards a general read-write memory for large language models.arXiv preprint arXiv:2305.14322, 2023

    Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards a general read-write memory for large language models.arXiv preprint arXiv:2305.14322, 2023

  49. [49]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  50. [50]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  51. [51]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  52. [52]

    Qwq-32b: Embracing the power of reinforcement learning, 2024

    Qwen. Qwq-32b: Embracing the power of reinforcement learning, 2024

  53. [53]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  54. [54]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  55. [55]

    High-dimensional continuous control using generalized advantage estimation, 2018

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018

  56. [56]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  57. [57]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

  58. [58]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  59. [59]

    Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025

  60. [60]

    ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, et al. Seed1. 5-thinking: Advancing superb reasoning models with reinforcement learning. arXiv preprint arXiv:2504.13914, 2025

  61. [61]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv:2409.19256, 2024. 16

  62. [62]

    Qwen2.5-1M Technical Report

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

  63. [63]

    Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning

    Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, and Ming Yan. Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning. arXiv preprint arXiv:2505.17667, 2025

  64. [64]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024

  65. [65]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text.arXiv preprint arXiv:1606.05250, 2016. 17 Appendix A Computation Complexity We adopt the floating-point operations (FLOP) estimator for the Qwen2Model from verl [61] to compute the FLOP cost of both the baseline model and our prop...