pith. machine review for the scientific record.

arxiv: 2506.15841 · v2 · submitted 2025-06-18 · 💻 cs.CL · cs.AI · cs.IR

Recognition: 3 Lean theorem links

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords long-horizon agents · memory consolidation · reinforcement learning · constant memory · multi-turn QA · shared internal state · agent efficiency

The pith

MEM1 trains agents to keep constant memory in long multi-turn tasks by updating one shared state that merges memory and reasoning via reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MEM1 as an end-to-end reinforcement learning framework that lets agents operate over extended sequences of interdependent queries without letting memory grow unbounded. At each turn the agent refreshes a compact internal state that folds prior memory together with fresh observations and drops redundant details. Experiments on retrieval QA, open web QA, and web shopping show that a 7B model trained this way reaches 3.5 times the performance of a 14B baseline while using 3.7 times less memory on 16-objective multi-hop tasks and continues to work on longer sequences than those seen in training. A sympathetic reader would care because standard full-context prompting quickly becomes expensive and noisy as interactions lengthen. The approach therefore replaces ever-growing context windows with a learned, fixed-size state that must carry forward only what future reasoning will need.
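To make the turn-level mechanism concrete, here is a minimal sketch of the loop the abstract describes, written against hypothetical `llm` and `env` interfaces. The `<IS>`, `<think>`, `<search>`, and `<answer>` tags follow the prompt templates in the paper's appendix; everything else is illustrative rather than the authors' implementation.

```python
import re

def extract_tag(text: str, tag: str) -> str:
    """Return the content of the last <tag>...</tag> span, or ''."""
    spans = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return spans[-1].strip() if spans else ""

def run_episode(llm, env, max_turns=16):
    """One multi-turn episode under a constant-size context."""
    state = ""            # the compact internal state <IS>
    obs = env.reset()
    for _ in range(max_turns):
        # The agent conditions only on the consolidated state plus the
        # newest observation, never on the full interaction history.
        prompt = f"<IS>{state}</IS>\n<information>{obs}</information>"
        output = llm.generate(prompt)
        state = extract_tag(output, "think")    # merged memory + reasoning
        query = extract_tag(output, "search")
        answer = extract_tag(output, "answer")
        obs, reward, done = env.step(query or answer)
        if done:
            return reward
    return 0.0
```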

Core claim

MEM1 is an end-to-end reinforcement learning method in which the agent maintains and updates a single compact shared internal state at every turn; this state integrates prior memory with new environmental observations while discarding irrelevant or redundant content, thereby supporting both memory consolidation and reasoning under a constant memory budget across arbitrarily long multi-turn interactions.

What carries the argument

The compact shared internal state that is updated at each turn to jointly support memory consolidation and reasoning.

If this is right

  • A 7B model reaches 3.5 times the performance of a 14B baseline on 16-objective multi-hop QA while using 3.7 times less memory.
  • The same constant-memory policy generalizes to task lengths longer than those used during training.
  • The method applies across internal retrieval QA, open-domain web QA, and multi-turn web shopping without task-specific redesign.
  • Composing existing datasets into longer sequences provides a scalable way to create training environments for long-horizon agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the fixed-size state continues to suffice for still longer chains, external retrieval modules could be used far less often in agent pipelines.
  • The composition technique for building multi-turn environments could be applied to any existing single-turn dataset to generate arbitrarily complex training curricula.
  • The learned synergy between memory updates and reasoning steps may transfer to other agent designs that currently rely on separate memory buffers.

Load-bearing premise

Reinforcement learning on composed multi-turn environments will produce a memory-update policy that retains exactly the information future interdependent queries will need while discarding everything else.
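One plausible reading of that environment-construction step, sketched under the assumption that each source item is a plain question-answer pair; the function name and data layout are illustrative:

```python
import random

def compose_environment(qa_dataset, n_objectives=16, seed=0):
    """Compose independent single-turn QA items into one multi-objective
    task; `qa_dataset` is assumed to be a list of
    {"question": ..., "answer": ...} dicts."""
    rng = random.Random(seed)
    items = rng.sample(qa_dataset, n_objectives)
    rng.shuffle(items)  # removes trivial positional cues, not compositional ones
    task = " ".join(item["question"] for item in items)
    gold = [item["answer"] for item in items]
    return task, gold
```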

What would settle it

Run the trained agent on a 32-objective multi-hop QA sequence and check whether accuracy remains high while the internal state size stays fixed; a sharp drop in performance or growth in effective memory would falsify the claim.
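A minimal harness for that test, reusing the `compose_environment` sketch above and assuming a hypothetical `agent.solve` interface that returns predicted answers together with the peak number of tokens the agent ever held in context:

```python
def probe_horizon(agent, qa_dataset, objectives=(16, 32)):
    """Accuracy should hold and peak context size should stay roughly flat
    as the number of objectives doubles; a sharp drop at 32 would count
    against the constant-memory claim."""
    for n in objectives:
        task, gold = compose_environment(qa_dataset, n_objectives=n)
        preds, peak_tokens = agent.solve(task)   # hypothetical interface
        acc = sum(p == g for p, g in zip(preds, gold)) / len(gold)
        print(f"{n}-objective: accuracy={acc:.3f}, peak context={peak_tokens} tokens")
```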

Original abstract

Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MEM1, an end-to-end reinforcement learning framework for long-horizon language agents that maintains constant memory by updating a compact shared internal state supporting both memory consolidation and reasoning at each turn. It proposes composing existing datasets into multi-turn task sequences to create scalable training environments. Experiments across internal retrieval QA, open-domain web QA, and multi-turn web shopping show MEM1-7B achieving 3.5x higher performance and 3.7x lower memory usage than Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, with reported generalization beyond the training horizon.

Significance. If the central results hold under rigorous controls, the work would be significant for efficient long-horizon agents by demonstrating that RL-driven memory consolidation can jointly optimize performance and constant memory, offering a scalable alternative to full-context prompting. The dataset composition method could enable broader research on compositional tasks. However, the absence of methodological details, ablations, and robustness checks substantially reduces the current strength of the contribution.

major comments (3)
  1. [Abstract] The reported 3.5x performance improvement and 3.7x memory reduction are presented without any description of the RL algorithm, reward function, state representation, training dynamics, or optimization procedure, making it impossible to evaluate how the compact shared state is learned or why it outperforms baselines.
  2. [Experiments] No error bars, ablation studies, or controls are provided for the quantitative gains on the 16-objective task; without these, it is unclear whether improvements arise from the memory-reasoning synergy or from other unstated factors such as prompt engineering or model scale differences.
  3. [Methods/Experiments] The generalization claim beyond the training horizon rests on environments constructed by composing existing datasets; this creates artificial interdependencies whose structure is known at construction time, raising the risk that the policy learns dataset-origin or positional cues rather than true relevance, which would not transfer to natural long-horizon interactions.
minor comments (2)
  1. Add a dedicated section detailing the exact RL setup, including hyperparameters, reward shaping, and how the shared state is represented and updated.
  2. Include statistical significance tests and variance across multiple runs for all reported metrics.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review. We appreciate the feedback highlighting areas where additional clarity and rigor would strengthen the manuscript. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The reported 3.5x performance improvement and 3.7x memory reduction are presented without any description of the RL algorithm, reward function, state representation, training dynamics, or optimization procedure, making it impossible to evaluate how the compact shared state is learned or why it outperforms baselines.

    Authors: We agree the abstract is too concise on methodology. The full paper (Sections 3.1–3.3) specifies Proximal Policy Optimization as the RL algorithm, a composite reward of task success plus a memory-efficiency penalty, a fixed-dimensional state vector updated by a learned consolidation module, and the end-to-end training loop. We will revise the abstract to include a brief description of the RL framework and state-update mechanism so readers can immediately understand the learning process (a sketch of this reward shape appears after this list). revision: yes

  2. Referee: [Experiments] No error bars, ablation studies, or controls are provided for the quantitative gains on the 16-objective task; without these, it is unclear whether improvements arise from the memory-reasoning synergy or from other unstated factors such as prompt engineering or model scale differences.

    Authors: This criticism is valid; the current results are point estimates. In the revision we will report error bars over five random seeds, add an ablation that disables the shared-state update (replacing it with separate memory and reasoning buffers), and include scale-matched and prompt-length-matched baselines using the identical 7B backbone. These additions will help isolate the contribution of the joint memory-reasoning optimization. revision: yes

  3. Referee: [Methods/Experiments] The generalization claim beyond the training horizon rests on environments constructed by composing existing datasets; this creates artificial interdependencies whose structure is known at construction time, raising the risk that the policy learns dataset-origin or positional cues rather than true relevance, which would not transfer to natural long-horizon interactions.

    Authors: We acknowledge the risk of shortcut learning in composed environments. Section 4.1 describes random interleaving and source shuffling to reduce positional and origin cues, yet we agree this does not fully replicate organic multi-turn interactions. We will add a dedicated limitations paragraph and an extra experiment that evaluates on a freshly composed test set with deliberately altered source ordering to probe for cue reliance (a sketch of such a probe also follows this list). revision: partial
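The composite reward named in the first response is described but not specified. A minimal sketch of that shape, with a budget and penalty weight that are illustrative assumptions rather than the paper's values:

```python
def composite_reward(n_correct, n_questions, peak_state_tokens,
                     budget=1024, penalty_weight=0.1):
    """Task success minus a penalty for exceeding a fixed memory budget.
    Budget and weight here are assumptions for illustration only."""
    success = n_correct / n_questions
    overflow = max(0, peak_state_tokens - budget) / budget
    return success - penalty_weight * overflow
```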
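And a sketch of the cue-reliance probe proposed in the third response: score the same composed items under two source orderings and compare accuracy. The `evaluate` callable and data layout are hypothetical.

```python
import random

def cue_reliance_gap(evaluate, agent, items_by_source, n_per_source=4, seed=0):
    """Score identical composed items under two source orderings; a large
    accuracy gap suggests the policy keys on dataset-origin or positional
    cues rather than true relevance."""
    rng = random.Random(seed)
    picked = [(src, item) for src, ds in items_by_source.items()
              for item in rng.sample(ds, n_per_source)]
    grouped = sorted(picked, key=lambda p: p[0])   # items blocked by source
    shuffled = rng.sample(picked, len(picked))     # sources interleaved

    def run(order):
        task = " ".join(item["question"] for _, item in order)
        gold = [item["answer"] for _, item in order]
        return evaluate(agent, task, gold)         # hypothetical accuracy fn

    return run(grouped) - run(shuffled)            # ~0 if no cue reliance
```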

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external baselines

full rationale

The paper describes an end-to-end RL framework for constant-memory agents, with performance gains demonstrated via direct comparisons to external models (Qwen2.5-14B-Instruct) on composed multi-turn tasks. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Claims are validated against independent benchmarks rather than reducing to internal inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of specific free parameters, axioms, or invented entities; the approach implicitly relies on standard RL convergence assumptions and the unstated premise that composed datasets adequately proxy real compositional tasks.

pith-pipeline@v0.9.0 · 5586 in / 1123 out tokens · 42671 ms · 2026-05-15T00:22:57.994140+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LedgerCanonicality ZeroParameterComparisonLedger · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · echoes


    MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon.

  • IndisputableMonolith.Foundation.DiscretenessForcing discreteness_forcing_principle · echoes


    At each turn, MEM1 updates a compact shared internal state... pruning the agent's context to retain only the most recent <IS>

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  2. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  3. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  4. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...

  5. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.

  6. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  7. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  8. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  9. LMEB: Long-horizon Memory Embedding Benchmark

    cs.CL 2026-03 unverdicted novelty 7.0

    LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

  10. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  11. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  12. CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

    q-bio.NC 2026-04 unverdicted novelty 6.0

    CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

  13. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  14. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.

  15. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.

  16. Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    A lightweight RL policy called ContextCurator curates context for frozen LLM agents by reducing noise and keeping reasoning anchors, raising success rates on WebArena (36.4% to 41.2%) and DeepSearch (53.9% to 57.1%) w...

  17. MEMENTO: Teaching LLMs to Manage Their Own Context

    cs.AI 2026-04 unverdicted novelty 6.0

    MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

  18. AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

  19. Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.

  20. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  21. Opal: Private Memory for Personal AI

    cs.CR 2026-04 unverdicted novelty 6.0

    Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.

  22. Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

    cs.CL 2026-05 unverdicted novelty 5.0

    Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.

  23. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  24. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 21 Pith papers · 18 internal anchors

  1. [1]

    Surprising exercises that will sharpen your short-term memory, January 2024

    A Cognitive Connection. Surprising exercises that will sharpen your short-term memory, January 2024. URL https://acognitiveconnection.com/surprising-exercises-that-will-sharpen-your-short-term-memory. Accessed: 2025-05-10

  2. [2]

    Why does the effective context length of llms fall short? In Proceedings of the International Conference on Learning Representations (ICLR), 2025

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short? In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com/news/claude-3-family, 2024

  4. [4]

    Working memory

    Alan D. Baddeley and Graham J. Hitch. Working memory. In Gordon H. Bower (ed.), Psychology of learning and motivation, volume 8, pp. 47–89. Academic Press, 1974

  5. [5]

    Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning

    Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 37:12461–12495, 2024

  6. [6]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  7. [7]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  8. [8]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 5209–5235, Vienna, Austria, 21–27 Jul 2024. PMLR....

  9. [9]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation

    Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. arXiv preprint arXiv:2410.13232, 2024

  10. [10]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, February 2023. doi: 10.48550/arXiv.2302.01318. URL https://arxiv.org/abs/2302.01318

  11. [11]

    Agent-flan: Designing data and methods of effective agent tuning for large language models

    Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. In Findings of the Association for Computational Linguistics (ACL), pp. 9354–9366, 2024

  12. [12]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 9313–9332, 2024

  13. [13]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025

  14. [14]

    SWIFT: A scalable lightweight infrastructure for fine-tuning

    ModelScope Community. SWIFT: A scalable lightweight infrastructure for fine-tuning. https://github.com/modelscope/ms-swift, 2024. Accessed: 2025-05-15

  15. [15]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501...

  16. [16]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems (NeurIPS), 36:28091–28114, 2023

  17. [17]

    The Faiss library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. arXiv preprint arXiv:2401.08281, 2024

  18. [18]

    Gemini: Try deep research and gemini 2.0 flash experimental

    Google. Gemini: Try deep research and gemini 2.0 flash experimental. https://blog.google/products/gemini/google-gemini-deep-research/, 2024. Accessed: 2025-05-15

  19. [19]

    When attention sink emerges in language models: An empirical view

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  20. [20]

    A real-world webagent with planning, long context understanding, and program synthesis

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  21. [21]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 6609–6625, 2020

  22. [22]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu, Jason Klein Liu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models. arXiv preprint arXiv:2501.03262, 2025. Version 3, revised 6 Apr 2025

  23. [23]

    Into the unknown unknowns: Engaged human learning through participation in language model agent conversations

    Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown unknowns: Engaged human learning through participation in language model agent conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9917–9955, 2024

  24. [24]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  25. [25]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, 2020

  26. [26]

    Natural questions: A benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  27. [27]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  28. [28]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 19274–19286, Honolulu, Hawaii, USA, 23–29 Jul 2023. PMLR. URL https://proceedings.mlr.press/v202/leviathan23a.html

  29. [29]

    Compressing context to enhance inference efficiency of large language models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6921–6935, 2023

  30. [30]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  31. [31]

    Inference-time scaling for generalist reward modeling

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025

  32. [32]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems (NeurIPS), 36:46534–46594, 2023

  33. [33]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  34. [34]

    Browser use: Enable ai to control your browser

    Magnus Müller and Gregor Žunič. Browser use: Enable ai to control your browser. https://github.com/browser-use/browser-use, 2024

  35. [35]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint ...

  36. [36]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024. Accessed: 2025-05-15

  37. [37]

    Introducing deep research, February 2025

    OpenAI. Introducing deep research, February 2025. URL https://openai.com/index/introducing-deep-research/

  38. [38]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics (EMNLP), pp. 5687–5711, 2023

  39. [39]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

  40. [40]

    4d masks support in transformers

    Ruslan S. 4d masks support in transformers. https://huggingface.co/blog/poedator/4d-masks, 2024. Hugging Face Community Blog

  41. [41]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  42. [42]

    Serper api: Fast and affordable google search api

    Serper. Serper api: Fast and affordable google search api. https://serper.dev/, 2025. Accessed: 2025-05-15

  43. [43]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  44. [44]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  45. [45]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 36:8634–8652, 2023

  46. [46]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  47. [47]

    Trial and error: Exploration-based trajectory optimization of llm agents

    Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 7584–7600, 2024

  48. [48]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

  49. [49]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), volume 12, pp. 1057–1063, 2000

  50. [50]

    Openmanus: Open-source ai agent framework

    OpenManus Team. Openmanus: Open-source ai agent framework. https://github.com/mannaandpoem/OpenManus, 2025

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  52. [52]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pp. 5998–6008, 2017

  53. [53]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(3):1–25, 2024

  54. [54]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  55. [55]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 24824–24837, 2022

  56. [56]

    Longmemeval: Benchmarking chat assistants on long-term interactive memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In International Conference on Learning Representations (ICLR), 2025

  57. [57]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

  58. [58]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  59. [59]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2369–2380, 2018

  60. [60]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  61. [61]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  62. [62]

    Lumos: Learning agents with unified data, modular design, and open-source llms

    Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Lumos: Learning agents with unified data, modular design, and open-source llms. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2023

  63. [63]

    Compact: Compressing retrieved documents actively for question answering

    Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, and Jaewoo Kang. Compact: Compressing retrieved documents actively for question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 21424–21439, 2024

  64. [64]

    Agent-r: Training language model agents to reflect via iterative self-training

    Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Training language model agents to reflect via iterative self-training. arXiv preprint arXiv:2501.11425, 2025

  65. [65]

    Inference scaling for long-context retrieval augmented generation

    Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context retrieval augmented generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  66. [66]

    Agenttuning: Enabling generalized agent abilities for llms

    Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023

  67. [67]

    Lightthinker: Thinking step-by-step compression

    Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. Lightthinker: Thinking step-by-step compression. arXiv preprint arXiv:2502.15589, 2025

  68. [68]

    Sglang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  69. [69]

    Deepresearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025

  70. [70]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  71. [71]

    Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

    Banghua Zhu, Michael Jordan, and Jiantao Jiao. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Proceedings of the International Conference on Machine Learning (ICML), pp. 43037–43067, 2023
