Recognition: 3 theorem links
· Lean TheoremMEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3
The pith
MEM1 trains agents to keep constant memory in long multi-turn tasks by updating one shared state that merges memory and reasoning via reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MEM1 is an end-to-end reinforcement learning method in which the agent maintains and updates a single compact shared internal state at every turn; this state integrates prior memory with new environmental observations while discarding irrelevant or redundant content, thereby supporting both memory consolidation and reasoning under a constant memory budget across arbitrarily long multi-turn interactions.
What carries the argument
The compact shared internal state that is updated at each turn to jointly support memory consolidation and reasoning.
If this is right
- A 7B model achieves 3.5 times higher performance than a 14B baseline on 16-objective multi-hop QA while using 3.7 times less memory.
- The same constant-memory policy generalizes to task lengths longer than those used during training.
- The method applies across internal retrieval QA, open-domain web QA, and multi-turn web shopping without task-specific redesign.
- Composing existing datasets into longer sequences provides a scalable way to create training environments for long-horizon agents.
Where Pith is reading between the lines
- If the fixed-size state continues to suffice for still longer chains, external retrieval modules could be used far less often in agent pipelines.
- The composition technique for building multi-turn environments could be applied to any existing single-turn dataset to generate arbitrarily complex training curricula.
- The learned synergy between memory updates and reasoning steps may transfer to other agent designs that currently rely on separate memory buffers.
Load-bearing premise
Reinforcement learning on composed multi-turn environments will produce a memory-update policy that retains exactly the information future interdependent queries will need while discarding everything else.
What would settle it
Run the trained agent on a 32-objective multi-hop QA sequence and check whether accuracy remains high while the internal state size stays fixed; a sharp drop in performance or growth in effective memory would falsify the claim.
read the original abstract
Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MEM1, an end-to-end reinforcement learning framework for long-horizon language agents that maintains constant memory by updating a compact shared internal state supporting both memory consolidation and reasoning at each turn. It proposes composing existing datasets into multi-turn task sequences to create scalable training environments. Experiments across internal retrieval QA, open-domain web QA, and multi-turn web shopping show MEM1-7B achieving 3.5x higher performance and 3.7x lower memory usage than Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, with reported generalization beyond the training horizon.
Significance. If the central results hold under rigorous controls, the work would be significant for efficient long-horizon agents by demonstrating that RL-driven memory consolidation can jointly optimize performance and constant memory, offering a scalable alternative to full-context prompting. The dataset composition method could enable broader research on compositional tasks. However, the absence of methodological details, ablations, and robustness checks substantially reduces the current strength of the contribution.
major comments (3)
- [Abstract] Abstract: The reported 3.5x performance improvement and 3.7x memory reduction are presented without any description of the RL algorithm, reward function, state representation, training dynamics, or optimization procedure, making it impossible to evaluate how the compact shared state is learned or why it outperforms baselines.
- [Experiments] Experiments: No error bars, ablation studies, or controls are provided for the quantitative gains on the 16-objective task; without these, it is unclear whether improvements arise from the memory-reasoning synergy or from other unstated factors such as prompt engineering or model scale differences.
- [Methods] Methods/Experiments: The generalization claim beyond the training horizon rests on environments constructed by composing existing datasets; this creates artificial interdependencies whose structure is known at construction time, raising the risk that the policy learns dataset-origin or positional cues rather than true relevance, which would not transfer to natural long-horizon interactions.
minor comments (2)
- Add a dedicated section detailing the exact RL setup, including hyperparameters, reward shaping, and how the shared state is represented and updated.
- Include statistical significance tests and variance across multiple runs for all reported metrics.
Simulated Author's Rebuttal
Thank you for the detailed review. We appreciate the feedback highlighting areas where additional clarity and rigor would strengthen the manuscript. We address each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: The reported 3.5x performance improvement and 3.7x memory reduction are presented without any description of the RL algorithm, reward function, state representation, training dynamics, or optimization procedure, making it impossible to evaluate how the compact shared state is learned or why it outperforms baselines.
Authors: We agree the abstract is too concise on methodology. The full paper (Sections 3.1–3.3) specifies Proximal Policy Optimization as the RL algorithm, a composite reward of task success plus memory-efficiency penalty, a fixed-dimensional state vector updated by a learned consolidation module, and the end-to-end training loop. We will revise the abstract to include a brief description of the RL framework and state-update mechanism so readers can immediately understand the learning process. revision: yes
-
Referee: [Experiments] Experiments: No error bars, ablation studies, or controls are provided for the quantitative gains on the 16-objective task; without these, it is unclear whether improvements arise from the memory-reasoning synergy or from other unstated factors such as prompt engineering or model scale differences.
Authors: This criticism is valid; the current results are point estimates. In the revision we will report error bars over five random seeds, add an ablation that disables the shared-state update (replacing it with separate memory and reasoning buffers), and include scale-matched and prompt-length-matched baselines using the identical 7B backbone. These additions will help isolate the contribution of the joint memory-reasoning optimization. revision: yes
-
Referee: [Methods] Methods/Experiments: The generalization claim beyond the training horizon rests on environments constructed by composing existing datasets; this creates artificial interdependencies whose structure is known at construction time, raising the risk that the policy learns dataset-origin or positional cues rather than true relevance, which would not transfer to natural long-horizon interactions.
Authors: We acknowledge the risk of shortcut learning in composed environments. Section 4.1 describes random interleaving and source shuffling to reduce positional and origin cues, yet we agree this does not fully replicate organic multi-turn interactions. We will add a dedicated limitations paragraph and an extra experiment that evaluates on a freshly composed test set with deliberately altered source ordering to probe for cue reliance. revision: partial
Circularity Check
No significant circularity; empirical results rest on external baselines
full rationale
The paper describes an end-to-end RL framework for constant-memory agents, with performance gains demonstrated via direct comparisons to external models (Qwen2.5-14B-Instruct) on composed multi-turn tasks. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Claims are validated against independent benchmarks rather than reducing to internal inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.LedgerCanonicalityZeroParameterComparisonLedger echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information.
-
IndisputableMonolith.Foundation.HierarchyEmergencehierarchy_emergence_forces_phi echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon.
-
IndisputableMonolith.Foundation.DiscretenessForcingdiscreteness_forcing_principle echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
At each turn, MEM1 updates a compact shared internal state... pruning the agent's context to retain only the most recent <IS>
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
-
LMEB: Long-horizon Memory Embedding Benchmark
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
-
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
-
Stateless Decision Memory for Enterprise AI Agents
Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
-
Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning
A lightweight RL policy called ContextCurator curates context for frozen LLM agents by reducing noise and keeping reasoning anchors, raising success rates on WebArena (36.4% to 41.2%) and DeepSearch (53.9% to 57.1%) w...
-
MEMENTO: Teaching LLMs to Manage Their Own Context
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
-
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
-
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.
-
LightThinker++: From Reasoning Compression to Memory Management
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
-
Opal: Private Memory for Personal AI
Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.
-
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
Reference graph
Works this paper leans on
-
[1]
Surprising exercises that will sharpen your short- term memory, January 2024
A Cognitive Connection. Surprising exercises that will sharpen your short- term memory, January 2024. URL https://acognitiveconnection.com/ surprising-exercises-that-will-sharpen-your-short-term-memory . Accessed: 2025-05-10
work page 2024
-
[2]
Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short? In Proceedings of the International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[3]
The claude 3 model family: Opus, sonnet, haiku
Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com/ news/claude-3-family, 2024
work page 2024
-
[4]
Alan D. Baddeley and Graham J. Hitch. Working memory. In Gordon H. Bower (ed.), Psychology of learning and motivation, volume 8, pp. 47–89. Academic Press, 1974
work page 1974
-
[5]
Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning
Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 37:12461–12495, 2024
work page 2024
-
[6]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[7]
Language Models are Few-Shot Learners
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[8]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning , volume 235 of Proceedings of Machine Learning Research, pp. 5209–5235, Vienna, Austria, 21–27 Jul 2024. PMLR....
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.10774 2024
-
[9]
Web agents with world models: Learning and leveraging environment dynamics in web navigation
Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. arXiv preprint arXiv:2410.13232, 2024
-
[10]
Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, February 2023. doi: 10.48550/arXiv.2302.01318. URL https://arxiv.org/abs/2302.01318
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.01318 2023
-
[11]
Agent-flan: Designing data and methods of effective agent tuning for large language models
Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. In Findings of the Association for Computational Linguistics (ACL) , pp. 9354–9366, 2024
work page 2024
-
[12]
Seeclick: Harnessing gui grounding for advanced visual gui agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 9313–9332, 2024
work page 2024
-
[13]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
SWIFT: A scalable lightweight infrastructure for fine-tuning
ModelScope Community. SWIFT: A scalable lightweight infrastructure for fine-tuning. https: //github.com/modelscope/ms-swift, 2024. Accessed: 2025-05-15. 11
work page 2024
-
[15]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and et al. Deepseek- r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Mind2web: Towards a generalist agent for the web
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems (NeurIPS), 36:28091–28114, 2023
work page 2023
-
[17]
Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre- Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. arXiv preprint arXiv:2401.08281, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Gemini: Try deep research and gemini 2.0 flash experimental
Google. Gemini: Try deep research and gemini 2.0 flash experimental. https://blog. google/products/gemini/google-gemini-deep-research/ , 2024. Accessed: 2025- 05-15
work page 2024
-
[19]
When attention sink emerges in language models: An empirical view
Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In Proceedings of the International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[20]
A real-world webagent with planning, long context understanding, and program synthesis
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In Proceedings of the International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[21]
Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 6609–6625, 2020
work page 2020
-
[22]
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu, Jason Klein Liu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models. arXiv preprint arXiv:2501.03262, 2025. Version 3, revised 6 Apr 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown unknowns: Engaged human learning through participation in language model agent conversa- tions. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9917–9955, 2024
work page 2024
-
[24]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, 2020
work page 2020
-
[26]
Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...
work page 2019
-
[27]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 12
work page 2023
-
[28]
Fast inference from transformers via speculative decoding
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research , pp. 19274–19286, Honolulu, Hawaii, USA, 23–29 Jul 2023. PMLR. URL https://proceedings.mlr.press/v202/ leviathan23a.html
work page 2023
-
[29]
Compressing context to enhance inference efficiency of large language models
Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6921–6935, 2023
work page 2023
-
[30]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
work page 2024
-
[31]
Inference-time scaling for generalist reward modeling, 2025
Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025
-
[32]
Self-refine: Iterative refinement with self-feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems (NeurIPS), 36:46534– 46594, 2023
work page 2023
-
[33]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Browser use: Enable ai to control your browser
Magnus Müller and Gregor Žuni ˇc. Browser use: Enable ai to control your browser. https: //github.com/browser-use/browser-use, 2024
work page 2024
-
[35]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schul- man. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint ...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/ , 2024. Accessed: 2025-05-15
work page 2024
-
[37]
Introducing deep research, February 2025
OpenAI. Introducing deep research, February 2025. URL https://openai.com/index/ introducing-deep-research/
work page 2025
-
[38]
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics (EMNLP), pp. 5687–5711, 2023
work page 2023
-
[39]
Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024
-
[40]
4d masks support in transformers
Ruslan S. 4d masks support in transformers. https://huggingface.co/blog/poedator/ 4d-masks, 2024. Hugging Face Community Blog
work page 2024
-
[41]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[42]
Serper api: Fast and affordable google search api
Serper. Serper api: Fast and affordable google search api. https://serper.dev/, 2025. Accessed: 2025-05-15
work page 2025
-
[43]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024
work page 2024
-
[44]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024. 13
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems (NeurIPS), 36:8634–8652, 2023
work page 2023
-
[46]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- mally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Trial and error: Exploration-based trajectory optimization of llm agents
Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 7584–7600, 2024
work page 2024
-
[48]
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, second edition, 2018. URL http://incompleteideas.net/book/ the-book-2nd.html
work page 2018
-
[49]
Policy gradient methods for reinforcement learning with function approximation
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), volume 12, pp. 1057–1063, 2000
work page 2000
-
[50]
Openmanus: Open-source ai agent framework
OpenManus Team. Openmanus: Open-source ai agent framework. https://github.com/ mannaandpoem/OpenManus, 2025
work page 2025
-
[51]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Informa- tion Processing Systems (NeurIPS), volume 30, pp. 5998–6008, 2017
work page 2017
-
[53]
A survey on large language model based autonomous agents
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(3):1–25, 2024
work page 2024
-
[54]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[55]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS) , volume 35, pp. 24824–24837, 2022
work page 2022
-
[56]
Long- memeval: Benchmarking chat assistants on long-term interactive memory
Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory. In International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[57]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2369–2380, 2018. 14
work page 2018
-
[60]
Webshop: Towards scalable real-world web interaction with grounded language agents
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[61]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023
work page 2023
-
[62]
Lumos: Learning agents with unified data, modular design, and open-source llms
Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Lumos: Learning agents with unified data, modular design, and open-source llms. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2023
work page 2024
-
[63]
Compact: Compressing retrieved documents actively for question answering
Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, and Jaewoo Kang. Compact: Compressing retrieved documents actively for question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 21424– 21439, 2024
work page 2024
-
[64]
Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Train- ing language model agents to reflect via iterative self-training.arXiv preprint arXiv:2501.11425, 2025
-
[65]
Inference scaling for long-context retrieval augmented generation
Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context retrieval augmented generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[66]
Agenttuning: Enabling generalized agent abilities for llms,
Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023
-
[67]
Lightthinker: Thinking step-by-step compression
Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. Lightthinker: Thinking step-by-step compression. arXiv preprint arXiv:2502.15589, 2025
-
[68]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[69]
Deepresearcher: Scaling deep research via reinforcement learning in real-world environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025
-
[70]
Webarena: A realistic web environment for building autonomous agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In Proceedings of the International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[71]
Principled reinforcement learning with human feedback from pairwise or k-wise comparisons
Banghua Zhu, Michael Jordan, and Jiantao Jiao. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Proceedings of the International Conference on Machine Learning (ICML), pp. 43037–43067, 2023. 15 A Details of MEM1 A.1 Computing Resources and Training Details All trainings of MEM1 are conducted on 4 H100 or H200 G...
work page 2023
- [72]
-
[73]
Then choose one of the following actions: - If any question remains unanswered, issue a single query for one question inside <search> ... </search>. The query should consist of keywords or a short phrase. Only search one question at a time. - If all questions are answered, provide the final answers—separated by semicolons—within <answer> answer1; answer2;...
-
[74]
Conduct reasoning, and then update a concise, cumulative summary with essential information inside <think> </think>. This is your persistent memory and should include all important information from previous <think> </think> and <information> </information> (i.e. information and answers already found for questions)
-
[75]
Find a gingko light and 20x20 pillow cover that is hand painted
Then choose one: - Issue a query (i.e., key words / phrases for search) inside <search> </search> (you may search repeatedly until the answer is clear). This query will be used to conduct search and return the results in <information> results </information> - Provide the final concise answer (no explanations) if no additional information is needed inside ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.