pith. machine review for the scientific record.

arxiv: 2506.15841 · v2 · submitted 2025-06-18 · 💻 cs.CL · cs.AI · cs.IR

Recognition: 3 Lean theorem links

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords long-horizon agents · memory consolidation · reinforcement learning · constant memory · multi-turn QA · shared internal state · agent efficiency

The pith

MEM1 trains agents to keep constant memory in long multi-turn tasks by updating one shared state that merges memory and reasoning via reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MEM1 as an end-to-end reinforcement learning framework that lets agents operate over extended sequences of interdependent queries without letting memory grow unbounded. At each turn the agent refreshes a compact internal state that folds prior memory together with fresh observations and drops redundant details. Experiments on retrieval QA, open web QA, and web shopping show that a 7B model trained this way reaches 3.5 times the performance of a 14B baseline while using 3.7 times less memory on 16-objective multi-hop tasks and continues to work on longer sequences than those seen in training. A sympathetic reader would care because standard full-context prompting quickly becomes expensive and noisy as interactions lengthen. The approach therefore replaces ever-growing context windows with a learned, fixed-size state that must carry forward only what future reasoning will need.
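To make the turn-level mechanism concrete, here is a minimal sketch of the loop the abstract describes, written against hypothetical `llm` and `env` interfaces. The `<IS>`, `<think>`, `<search>`, and `<answer>` tags follow the prompt templates in the paper's appendix; everything else is illustrative rather than the authors' implementation.

```python
import re

def extract_tag(text: str, tag: str) -> str:
    """Return the content of the last <tag>...</tag> span, or ''."""
    spans = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return spans[-1].strip() if spans else ""

def run_episode(llm, env, max_turns=16):
    """One multi-turn episode under a constant-size context."""
    state = ""            # the compact internal state <IS>
    obs = env.reset()
    for _ in range(max_turns):
        # The agent conditions only on the consolidated state plus the
        # newest observation, never on the full interaction history.
        prompt = f"<IS>{state}</IS>\n<information>{obs}</information>"
        output = llm.generate(prompt)
        state = extract_tag(output, "think")    # merged memory + reasoning
        query = extract_tag(output, "search")
        answer = extract_tag(output, "answer")
        obs, reward, done = env.step(query or answer)
        if done:
            return reward
    return 0.0
```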

Core claim

MEM1 is an end-to-end reinforcement learning method in which the agent maintains and updates a single compact shared internal state at every turn; this state integrates prior memory with new environmental observations while discarding irrelevant or redundant content, thereby supporting both memory consolidation and reasoning under a constant memory budget across arbitrarily long multi-turn interactions.

What carries the argument

The compact shared internal state that is updated at each turn to jointly support memory consolidation and reasoning.

If this is right

  • A 7B model reaches 3.5 times the performance of a 14B baseline on 16-objective multi-hop QA while using 3.7 times less memory.
  • The same constant-memory policy generalizes to task lengths longer than those used during training.
  • The method applies across internal retrieval QA, open-domain web QA, and multi-turn web shopping without task-specific redesign.
  • Composing existing datasets into longer sequences provides a scalable way to create training environments for long-horizon agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the fixed-size state continues to suffice for still longer chains, external retrieval modules could be used far less often in agent pipelines.
  • The composition technique for building multi-turn environments could be applied to any existing single-turn dataset to generate arbitrarily complex training curricula.
  • The learned synergy between memory updates and reasoning steps may transfer to other agent designs that currently rely on separate memory buffers.

Load-bearing premise

Reinforcement learning on composed multi-turn environments will produce a memory-update policy that retains exactly the information future interdependent queries will need while discarding everything else.
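One plausible reading of that environment-construction step, sketched under the assumption that each source item is a plain question-answer pair; the function name and data layout are illustrative:

```python
import random

def compose_environment(qa_dataset, n_objectives=16, seed=0):
    """Compose independent single-turn QA items into one multi-objective
    task; `qa_dataset` is assumed to be a list of
    {"question": ..., "answer": ...} dicts."""
    rng = random.Random(seed)
    items = rng.sample(qa_dataset, n_objectives)
    rng.shuffle(items)  # removes trivial positional cues, not compositional ones
    task = " ".join(item["question"] for item in items)
    gold = [item["answer"] for item in items]
    return task, gold
```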

What would settle it

Run the trained agent on a 32-objective multi-hop QA sequence and check whether accuracy remains high while the internal state size stays fixed; a sharp drop in performance or growth in effective memory would falsify the claim.
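A minimal harness for that test, reusing the `compose_environment` sketch above and assuming a hypothetical `agent.solve` interface that returns predicted answers together with the peak number of tokens the agent ever held in context:

```python
def probe_horizon(agent, qa_dataset, objectives=(16, 32)):
    """Accuracy should hold and peak context size should stay roughly flat
    as the number of objectives doubles; a sharp drop at 32 would count
    against the constant-memory claim."""
    for n in objectives:
        task, gold = compose_environment(qa_dataset, n_objectives=n)
        preds, peak_tokens = agent.solve(task)   # hypothetical interface
        acc = sum(p == g for p, g in zip(preds, gold)) / len(gold)
        print(f"{n}-objective: accuracy={acc:.3f}, peak context={peak_tokens} tokens")
```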

Original abstract

Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MEM1, an end-to-end reinforcement learning framework for long-horizon language agents that maintains constant memory by updating a compact shared internal state supporting both memory consolidation and reasoning at each turn. It proposes composing existing datasets into multi-turn task sequences to create scalable training environments. Experiments across internal retrieval QA, open-domain web QA, and multi-turn web shopping show MEM1-7B achieving 3.5x higher performance and 3.7x lower memory usage than Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, with reported generalization beyond the training horizon.

Significance. If the central results hold under rigorous controls, the work would be significant for efficient long-horizon agents by demonstrating that RL-driven memory consolidation can jointly optimize performance and constant memory, offering a scalable alternative to full-context prompting. The dataset composition method could enable broader research on compositional tasks. However, the absence of methodological details, ablations, and robustness checks substantially reduces the current strength of the contribution.

major comments (3)
  1. [Abstract] The reported 3.5x performance improvement and 3.7x memory reduction are presented without any description of the RL algorithm, reward function, state representation, training dynamics, or optimization procedure, making it impossible to evaluate how the compact shared state is learned or why it outperforms baselines.
  2. [Experiments] No error bars, ablation studies, or controls are provided for the quantitative gains on the 16-objective task; without these, it is unclear whether improvements arise from the memory-reasoning synergy or from other unstated factors such as prompt engineering or model scale differences.
  3. [Methods/Experiments] The generalization claim beyond the training horizon rests on environments constructed by composing existing datasets; this creates artificial interdependencies whose structure is known at construction time, raising the risk that the policy learns dataset-origin or positional cues rather than true relevance, which would not transfer to natural long-horizon interactions.
minor comments (2)
  1. Add a dedicated section detailing the exact RL setup, including hyperparameters, reward shaping, and how the shared state is represented and updated.
  2. Include statistical significance tests and variance across multiple runs for all reported metrics.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review. We appreciate the feedback highlighting areas where additional clarity and rigor would strengthen the manuscript. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The reported 3.5x performance improvement and 3.7x memory reduction are presented without any description of the RL algorithm, reward function, state representation, training dynamics, or optimization procedure, making it impossible to evaluate how the compact shared state is learned or why it outperforms baselines.

    Authors: We agree the abstract is too concise on methodology. The full paper (Sections 3.1–3.3) specifies Proximal Policy Optimization as the RL algorithm, a composite reward of task success plus a memory-efficiency penalty, a fixed-dimensional state vector updated by a learned consolidation module, and the end-to-end training loop. We will revise the abstract to include a brief description of the RL framework and state-update mechanism so readers can immediately understand the learning process (a sketch of this reward shape appears after this list). revision: yes

  2. Referee: [Experiments] No error bars, ablation studies, or controls are provided for the quantitative gains on the 16-objective task; without these, it is unclear whether improvements arise from the memory-reasoning synergy or from other unstated factors such as prompt engineering or model scale differences.

    Authors: This criticism is valid; the current results are point estimates. In the revision we will report error bars over five random seeds, add an ablation that disables the shared-state update (replacing it with separate memory and reasoning buffers), and include scale-matched and prompt-length-matched baselines using the identical 7B backbone. These additions will help isolate the contribution of the joint memory-reasoning optimization. revision: yes

  3. Referee: [Methods/Experiments] The generalization claim beyond the training horizon rests on environments constructed by composing existing datasets; this creates artificial interdependencies whose structure is known at construction time, raising the risk that the policy learns dataset-origin or positional cues rather than true relevance, which would not transfer to natural long-horizon interactions.

    Authors: We acknowledge the risk of shortcut learning in composed environments. Section 4.1 describes random interleaving and source shuffling to reduce positional and origin cues, yet we agree this does not fully replicate organic multi-turn interactions. We will add a dedicated limitations paragraph and an extra experiment that evaluates on a freshly composed test set with deliberately altered source ordering to probe for cue reliance (a sketch of such a probe also follows this list). revision: partial
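The composite reward named in the first response is described but not specified. A minimal sketch of that shape, with a budget and penalty weight that are illustrative assumptions rather than the paper's values:

```python
def composite_reward(n_correct, n_questions, peak_state_tokens,
                     budget=1024, penalty_weight=0.1):
    """Task success minus a penalty for exceeding a fixed memory budget.
    Budget and weight here are assumptions for illustration only."""
    success = n_correct / n_questions
    overflow = max(0, peak_state_tokens - budget) / budget
    return success - penalty_weight * overflow
```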
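And a sketch of the cue-reliance probe proposed in the third response: score the same composed items under two source orderings and compare accuracy. The `evaluate` callable and data layout are hypothetical.

```python
import random

def cue_reliance_gap(evaluate, agent, items_by_source, n_per_source=4, seed=0):
    """Score identical composed items under two source orderings; a large
    accuracy gap suggests the policy keys on dataset-origin or positional
    cues rather than true relevance."""
    rng = random.Random(seed)
    picked = [(src, item) for src, ds in items_by_source.items()
              for item in rng.sample(ds, n_per_source)]
    grouped = sorted(picked, key=lambda p: p[0])   # items blocked by source
    shuffled = rng.sample(picked, len(picked))     # sources interleaved

    def run(order):
        task = " ".join(item["question"] for _, item in order)
        gold = [item["answer"] for _, item in order]
        return evaluate(agent, task, gold)         # hypothetical accuracy fn

    return run(grouped) - run(shuffled)            # ~0 if no cue reliance
```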

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external baselines

full rationale

The paper describes an end-to-end RL framework for constant-memory agents, with performance gains demonstrated via direct comparisons to external models (Qwen2.5-14B-Instruct) on composed multi-turn tasks. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Claims are validated against independent benchmarks rather than reducing to internal inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of specific free parameters, axioms, or invented entities; the approach implicitly relies on standard RL convergence assumptions and the unstated premise that composed datasets adequately proxy real compositional tasks.

pith-pipeline@v0.9.0 · 5586 in / 1123 out tokens · 42671 ms · 2026-05-15T00:22:57.994140+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LedgerCanonicality ZeroParameterComparisonLedger · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · echoes


    MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon.

  • IndisputableMonolith.Foundation.DiscretenessForcing discreteness_forcing_principle · echoes


    At each turn, MEM1 updates a compact shared internal state... pruning the agent's context to retain only the most recent <IS>

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  2. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  3. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  4. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...

  5. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.

  6. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  7. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  8. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  9. LMEB: Long-horizon Memory Embedding Benchmark

    cs.CL 2026-03 unverdicted novelty 7.0

    LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

  10. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  11. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  12. CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

    q-bio.NC 2026-04 unverdicted novelty 6.0

    CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

  13. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  14. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.

  15. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.

  16. Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    A lightweight RL policy called ContextCurator curates context for frozen LLM agents by reducing noise and keeping reasoning anchors, raising success rates on WebArena (36.4% to 41.2%) and DeepSearch (53.9% to 57.1%) w...

  17. MEMENTO: Teaching LLMs to Manage Their Own Context

    cs.AI 2026-04 unverdicted novelty 6.0

    MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

  18. AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

  19. Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.

  20. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  21. Opal: Private Memory for Personal AI

    cs.CR 2026-04 unverdicted novelty 6.0

    Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.

  22. Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

    cs.CL 2026-05 unverdicted novelty 5.0

    Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.

  23. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  24. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 21 Pith papers · 18 internal anchors

  1. [1]

    Surprising exercises that will sharpen your short-term memory, January 2024

    A Cognitive Connection. Surprising exercises that will sharpen your short-term memory, January 2024. URL https://acognitiveconnection.com/surprising-exercises-that-will-sharpen-your-short-term-memory. Accessed: 2025-05-10

  2. [2]

    Why does the effective context length of llms fall short? In Proceedings of the International Conference on Learning Representations (ICLR), 2025

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short? In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com/news/claude-3-family, 2024

  4. [4]

    Working memory

    Alan D. Baddeley and Graham J. Hitch. Working memory. In Gordon H. Bower (ed.), Psychology of learning and motivation, volume 8, pp. 47–89. Academic Press, 1974

  5. [5]

    Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning

    Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 37:12461–12495, 2024

  6. [6]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  7. [7]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  8. [8]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 5209–5235, Vienna, Austria, 21–27 Jul 2024. PMLR....

  9. [9]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation

    Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. arXiv preprint arXiv:2410.13232, 2024

  10. [10]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, February 2023. doi: 10.48550/arXiv.2302.01318. URL https://arxiv.org/abs/2302.01318

  11. [11]

    Agent-flan: Designing data and methods of effective agent tuning for large language models

    Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. In Findings of the Association for Computational Linguistics (ACL), pp. 9354–9366, 2024

  12. [12]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 9313–9332, 2024

  13. [13]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025

  14. [14]

    SWIFT: A scalable lightweight infrastructure for fine-tuning

    ModelScope Community. SWIFT: A scalable lightweight infrastructure for fine-tuning. https://github.com/modelscope/ms-swift, 2024. Accessed: 2025-05-15

  15. [15]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501...

  16. [16]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems (NeurIPS), 36:28091–28114, 2023

  17. [17]

    The Faiss library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. arXiv preprint arXiv:2401.08281, 2024

  18. [18]

    Gemini: Try deep research and gemini 2.0 flash experimental

    Google. Gemini: Try deep research and gemini 2.0 flash experimental. https://blog.google/products/gemini/google-gemini-deep-research/, 2024. Accessed: 2025-05-15

  19. [19]

    When attention sink emerges in language models: An empirical view

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  20. [20]

    A real-world webagent with planning, long context understanding, and program synthesis

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  21. [21]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 6609–6625, 2020

  22. [22]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu, Jason Klein Liu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models. arXiv preprint arXiv:2501.03262, 2025. Version 3, revised 6 Apr 2025

  23. [23]

    Into the unknown unknowns: Engaged human learning through participation in language model agent conversations

    Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown unknowns: Engaged human learning through participation in language model agent conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9917–9955, 2024

  24. [24]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  25. [25]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, 2020

  26. [26]

    Natural questions: A benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  27. [27]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  28. [28]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 19274–19286, Honolulu, Hawaii, USA, 23–29 Jul 2023. PMLR. URL https://proceedings.mlr.press/v202/leviathan23a.html

  29. [29]

    Compressing context to enhance inference efficiency of large language models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6921–6935, 2023

  30. [30]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  31. [31]

    Inference-time scaling for generalist reward modeling

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025

  32. [32]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems (NeurIPS), 36:46534–46594, 2023

  33. [33]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  34. [34]

    Browser use: Enable ai to control your browser

    Magnus Müller and Gregor Žunič. Browser use: Enable ai to control your browser. https://github.com/browser-use/browser-use, 2024

  35. [35]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint ...

  36. [36]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024. Accessed: 2025-05-15

  37. [37]

    Introducing deep research, February 2025

    OpenAI. Introducing deep research, February 2025. URL https://openai.com/index/introducing-deep-research/

  38. [38]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics (EMNLP), pp. 5687–5711, 2023

  39. [39]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

  40. [40]

    4d masks support in transformers

    Ruslan S. 4d masks support in transformers. https://huggingface.co/blog/poedator/4d-masks, 2024. Hugging Face Community Blog

  41. [41]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  42. [42]

    Serper api: Fast and affordable google search api

    Serper. Serper api: Fast and affordable google search api. https://serper.dev/, 2025. Accessed: 2025-05-15

  43. [43]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  44. [44]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  45. [45]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 36:8634–8652, 2023

  46. [46]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  47. [47]

    Trial and error: Exploration-based trajectory optimization of llm agents

    Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 7584–7600, 2024

  48. [48]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

  49. [49]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), volume 12, pp. 1057–1063, 2000

  50. [50]

    Openmanus: Open-source ai agent framework

    OpenManus Team. Openmanus: Open-source ai agent framework. https://github.com/mannaandpoem/OpenManus, 2025

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  52. [52]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pp. 5998–6008, 2017

  53. [53]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(3):1–25, 2024

  54. [54]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  55. [55]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 24824–24837, 2022

  56. [56]

    Longmemeval: Benchmarking chat assistants on long-term interactive memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In International Conference on Learning Representations (ICLR), 2025

  57. [57]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

  58. [58]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  59. [59]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2369–2380, 2018

  60. [60]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  61. [61]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  62. [62]

    Lumos: Learning agents with unified data, modular design, and open-source llms

    Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Lumos: Learning agents with unified data, modular design, and open-source llms. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2023

  63. [63]

    Compact: Compressing retrieved documents actively for question answering

    Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, and Jaewoo Kang. Compact: Compressing retrieved documents actively for question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 21424–21439, 2024

  64. [64]

    Agent-r: Training language model agents to reflect via iterative self-training

    Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Training language model agents to reflect via iterative self-training. arXiv preprint arXiv:2501.11425, 2025

  65. [65]

    Inference scaling for long-context retrieval augmented generation

    Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context retrieval augmented generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  66. [66]

    Agenttuning: Enabling generalized agent abilities for llms

    Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023

  67. [67]

    Lightthinker: Thinking step-by-step compression

    Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. Lightthinker: Thinking step-by-step compression. arXiv preprint arXiv:2502.15589, 2025

  68. [68]

    Sglang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  69. [69]

    Deepresearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025

  70. [70]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  71. [71]

    Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

    Banghua Zhu, Michael Jordan, and Jiantao Jiao. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Proceedings of the International Conference on Machine Learning (ICML), pp. 43037–43067, 2023
