MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Debing Zhang; Hongyu Lin; Jiawei Chen; Jie Lou; Le Sun; Qianhao Yuan; Xianpei Han; Yaojie Lu; Zichao Li

arXiv: 2511.02805 · v2 · submitted 2025-11-04 · cs.CL · cs.AI

Pith reviewed 2026-05-18 00:54 UTC · model grok-4.3

keywords: MemSearcher · memory management · reinforcement learning · LLM agents · search agents · multi-turn interactions · GRPO · context length

The pith

MemSearcher trains LLMs to keep context length stable by maintaining compact memory instead of full history concatenation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemSearcher, a framework in which an LLM agent maintains a compact memory containing only question-relevant information during multi-turn search interactions. This replaces the standard approach of concatenating the entire history, which produces long and noisy inputs that raise compute and memory costs. Training occurs end-to-end through multi-context GRPO, an extension of reinforcement learning that assigns trajectory-level advantages to every turn even though each turn occurs under a different LLM context. Experiments on public datasets show that the resulting agents outperform ReAct-style baselines while holding token counts nearly constant across turns.
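To make the contrast concrete, the sketch below compares how the prompt grows under history concatenation and under a compact memory. Everything in it is hypothetical: `fake_search`, `update_memory`, the whitespace tokenizer, and the three-item memory cap are stand-ins for the trained LLM and the real retriever, which the summary above does not describe at this level of detail.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Crude whitespace tokenizer; a real system would use the model's tokenizer.
    return len(text.split())


def fake_search(turn: int) -> str:
    # Hypothetical retriever: 30 words per call, only the first one is "relevant".
    return f"fact{turn} " + " ".join(f"noise{turn}_{i}" for i in range(29))


def update_memory(memory: str, observation: str, cap: int = 3) -> str:
    # Stand-in for the learned memory policy: keep only "fact" words, capped in size.
    kept = [w for w in (memory + " " + observation).split() if w.startswith("fact")]
    return " ".join(kept[-cap:])


def run(question: str, turns: int) -> Tuple[List[int], List[int]]:
    history: List[str] = [question]
    memory = ""
    concat_sizes: List[int] = []
    memory_sizes: List[int] = []
    for t in range(turns):
        # Context size seen by the model at the start of turn t under each strategy.
        concat_sizes.append(count_tokens(" ".join(history)))
        memory_sizes.append(count_tokens(question + " " + memory))
        observation = fake_search(t)
        history.append(observation)                  # history concatenation
        memory = update_memory(memory, observation)  # compact memory
    return concat_sizes, memory_sizes


if __name__ == "__main__":
    concat, compact = run("who wrote the book", turns=5)
    print("history concatenation:", concat)
    print("compact memory:       ", compact)
```

In this toy setting the concatenated context grows by a fixed amount every turn, while the memory-based context stops growing once the cap is reached. The real system replaces the keyword filter with a policy learned through reinforcement learning.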

Core claim

MemSearcher shows that an LLM can be trained to reason, search, and manage memory via end-to-end reinforcement learning, using a compact memory state to replace full interaction history and multi-context GRPO to propagate advantages across turns with changing contexts.

What carries the argument

Multi-context GRPO, which propagates complete-trajectory advantages to each individual turn so that the agent can be optimized end-to-end under varying LLM contexts.
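A minimal sketch of the advantage computation described here, assuming the standard GRPO group normalization (reward minus group mean, divided by group standard deviation). The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration; the paper's exact formulation may differ.

```python
from statistics import mean, pstdev
from typing import List


def multi_context_advantages(
    rewards: List[float],
    turns_per_trajectory: List[int],
    eps: float = 1e-6,
) -> List[List[float]]:
    """Return one advantage per turn for every trajectory in a rollout group.

    Each trajectory receives a single group-normalized advantage, and that
    same scalar is copied to all of its turns, even though each turn was
    generated under a different LLM context.
    """
    if len(rewards) != len(turns_per_trajectory):
        raise ValueError("need one turn count per trajectory reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    per_turn: List[List[float]] = []
    for reward, n_turns in zip(rewards, turns_per_trajectory):
        advantage = (reward - mu) / (sigma + eps)
        per_turn.append([advantage] * n_turns)
    return per_turn


if __name__ == "__main__":
    # Hypothetical group of four trajectories with binary rewards.
    group = multi_context_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
    for i, advantages in enumerate(group):
        print(f"trajectory {i}: {advantages}")
```

Each (context, output) pair then enters the policy loss as its own training sample, weighted by the advantage of the trajectory it came from.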

Load-bearing premise

Trajectory-level advantages can be propagated effectively to every turn without substantial signal loss or optimization bias even when the LLM context changes between turns.

What would settle it

Measure token count and task accuracy on a multi-turn search benchmark with increasing numbers of turns; if token count grows or accuracy fails to exceed history-concatenation baselines, the central claim does not hold.
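One way to operationalize the token-count half of this test is to fit a straight line to the per-turn context sizes and inspect the slope. The sketch below uses an ordinary least-squares slope in pure Python; the function name and the sample numbers are illustrative assumptions, not measurements from the paper.

```python
from typing import Sequence


def token_growth_slope(token_counts: Sequence[float]) -> float:
    """Least-squares slope of context size versus turn index (tokens per turn)."""
    n = len(token_counts)
    if n < 2:
        raise ValueError("need at least two turns to estimate a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(token_counts) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(token_counts))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative inputs only: a linearly growing context and a flat one.
    print(token_growth_slope([4, 34, 64, 94, 124]))
    print(token_growth_slope([7, 7, 7, 7]))
```

A slope near zero across increasing turn counts is consistent with the stable-context claim; a clearly positive slope would count against it. Accuracy against the history-concatenation baseline has to be checked separately.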

Figures

Figures reproduced from arXiv: 2511.02805 by Debing Zhang, Hongyu Lin, Jiawei Chen, Jie Lou, Le Sun, Qianhao Yuan, Xianpei Han, Yaojie Lu, Zichao Li.

**Figure 2.** Multi-context GRPO. In rollout, we sample a group of trajectories ...

**Figure 3.** Comparison of the average token number in the LLM context between MemSearcher and ...

**Figure 4.** Peak GPU memory usage (GB) comparison between MemSearcher and ReSearch.

**Figure 5.** Training and validation reward during training. The validation is conducted on a part of ...

**Figure 3.** Compared to ReSearch, which exhibits an almost linear increase in token consumption ...

Original abstract

LLM-based search agents often concatenate the full interaction history into the context, producing long and noisy inputs, and increasing compute cost and GPU memory overhead. To address this issue, we propose MemSearcher, an agent framework that maintains a compact memory during multi-turn interactions, retaining only question-relevant information and thereby keeping the context length stable across turns. Training MemSearcher is challenging because each trajectory spans multiple turns under different LLM contexts, making each turn an independent optimization target in reinforcement learning. We introduce multi-context GRPO, which propagates trajectory-level advantages to all turns for end-to-end optimization. Experiments demonstrate that MemSearcher outperforms strong history-concatenation (ReAct-style) baselines on a range of public datasets while maintaining nearly constant token counts across multi-turn interactions. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MemSearcher shows a workable RL setup for keeping agent contexts short via learned memory management, but the experiments are too light on details to judge if the gains are real or the optimization trick holds up.

The core idea here is training an LLM agent to keep a compact, question-relevant memory instead of dumping the whole history into every prompt. They use end-to-end RL with a multi-context GRPO variant that spreads trajectory-level advantages across turns even as the context changes with each memory update. That pairing is the actual new piece; the authors report that it beats standard ReAct-style concatenation baselines on public datasets while token counts stay roughly flat. The practical payoff is clear for anyone running multi-turn search agents where context bloat is the main cost driver. The paper promises a public release of code and models, which helps.

What is missing is any real look at the numbers. The abstract claims outperformance but gives no exact metrics, no significance tests, no ablation on the memory policy itself, and no breakdown of how much the GRPO change matters versus just having a memory module. The stress-test worry about credit assignment under shifting contexts is worth checking in the full experiments; if the advantage signal gets diluted across non-stationary observations, the end-to-end claim weakens.

Still, the method is straightforward enough that a reader can test it quickly. This is for people building agent systems who already know the context-length headache and want a concrete RL recipe rather than another prompting trick. It is not a foundational result, but it is solid enough engineering work to send out for review. I would take it as a referee.

Referee Report

2 major / 0 minor

Summary. The paper introduces MemSearcher, a framework for LLM-based search agents that maintains a compact memory of question-relevant information to keep context length stable across multi-turn interactions instead of concatenating full history. It proposes multi-context GRPO to propagate trajectory-level advantages to individual turns for end-to-end RL training under changing contexts. Experiments claim outperformance over ReAct-style history-concatenation baselines on public datasets while maintaining nearly constant token counts.

Significance. If the empirical results hold with proper validation, this addresses a practical scalability issue in LLM agents by reducing context bloat, compute cost, and memory overhead. The multi-context GRPO extension for non-stationary RL settings represents a targeted training innovation, and the planned public release of code and models supports reproducibility.

major comments (2)

Abstract and experimental section: The central claim of outperformance on public datasets is stated without reporting exact metrics, statistical significance tests, precise baseline implementations, or ablation studies on the memory component and GRPO. This leaves the key empirical result unverifiable and load-bearing for the paper's contribution.
Method section on multi-context GRPO: The assumption that trajectory-level advantages can be directly propagated to every turn without substantial signal loss or optimization bias is not rigorously justified. Because each turn's context depends on prior memory states, applying the same advantage scalar under non-stationary observation distributions risks credit-assignment mismatch between the policy gradient and the causal effect of individual memory updates or reasoning steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We have carefully addressed each major point below, providing clarifications and committing to revisions that strengthen the empirical reporting and methodological justification without altering the core contributions.

Referee: Abstract and experimental section: The central claim of outperformance on public datasets is stated without reporting exact metrics, statistical significance tests, precise baseline implementations, or ablation studies on the memory component and GRPO. This leaves the key empirical result unverifiable and load-bearing for the paper's contribution.

Authors: We agree that the original presentation of results was insufficiently detailed for full verifiability. In the revised manuscript, we have expanded the experimental section with exact performance metrics (including accuracy, F1, and success rates with means and standard deviations over five random seeds), paired t-test p-values for statistical significance against baselines, precise specifications of baseline implementations (e.g., ReAct with full history concatenation using identical LLM backbones and prompting), and dedicated ablation studies isolating the compact memory mechanism and the multi-context GRPO variant. These additions directly support the central claims while preserving the reported outperformance trends. revision: yes
Referee: Method section on multi-context GRPO: The assumption that trajectory-level advantages can be directly propagated to every turn without substantial signal loss or optimization bias is not rigorously justified. Because each turn's context depends on prior memory states, applying the same advantage scalar under non-stationary observation distributions risks credit-assignment mismatch between the policy gradient and the causal effect of individual memory updates or reasoning steps.

Authors: We acknowledge the validity of the concern regarding potential credit-assignment issues in non-stationary contexts. The revised method section now includes a formal derivation showing that trajectory-level advantage propagation in multi-context GRPO yields unbiased policy gradients because memory updates are treated as explicit actions within the joint policy, and the sole external reward is at the trajectory level. We further add an empirical gradient-variance analysis across turns demonstrating limited signal degradation, along with an ablation comparing multi-context GRPO to per-turn advantage baselines. This formulation mitigates non-stationarity through the stabilizing effect of compact memory, though we note that a fully general proof for arbitrary non-stationary MDPs remains an open direction. revision: partial
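The argument in this response rests on the standard score-function (REINFORCE) identity. The block below is a sketch of that identity under assumed notation, not the derivation from the revised manuscript: $q$ is the question, $c_t$ the LLM context at turn $t$, $y_t$ the model output at that turn (reasoning, action, and updated memory), $o_t$ the environment response, $f$ the deterministic rule that assembles the next context, and $R(\tau)$ the single trajectory-level reward.

```latex
\begin{align}
p_\theta(\tau \mid q) &= \prod_{t=1}^{T} \pi_\theta(y_t \mid c_t)\, P(o_t \mid y_t),
  \qquad c_1 = q, \quad c_{t+1} = f(q, y_t, o_t), \\
\nabla_\theta \log p_\theta(\tau \mid q) &= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid c_t), \\
\nabla_\theta J(\theta) &= \mathbb{E}_{q}\, \mathbb{E}_{\tau \sim p_\theta(\cdot \mid q)}
  \left[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid c_t) \right].
\end{align}
```

The environment term $P(o_t \mid y_t)$ does not depend on $\theta$, so it drops out of the gradient, and the same scalar $R(\tau)$ multiplies every turn's log-probability gradient regardless of how $c_t$ changes. Replacing $R(\tau)$ with a group-normalized advantage is a variance-reduction choice; the effect of mean subtraction and standard-deviation scaling on bias is not analyzed in this sketch.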

Circularity Check

0 steps flagged

No circularity: MemSearcher presents an empirical RL method without definitional reduction or load-bearing self-citation

The paper defines the challenge of multi-turn optimization under changing contexts and directly introduces multi-context GRPO to propagate trajectory-level advantages, with all claims evaluated on external public datasets. No equations reduce a claimed prediction to a fitted input by construction, no uniqueness theorems are imported from prior self-work, and the derivation chain does not rely on ansatz smuggling or renaming of known results. The central contribution remains an independent training innovation rather than a tautological restatement of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Framework depends on the domain assumption that question-relevant information can be identified and retained without performance loss; RL training involves typical hyperparameters whose specific values are not detailed in the abstract.

free parameters (1)

RL training hyperparameters
Learning rates, advantage scaling factors, and memory update thresholds are standard free parameters in such RL setups but unspecified here.

axioms (1)

domain assumption: Question-relevant information can be reliably identified and retained in compact memory without significant loss of task performance.
This premise underpins the entire memory-management approach; if false, compact memory would degrade results compared to full history.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
cs.CL · 2026-05 · unverdicted · novelty 7.0
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
cs.AI · 2026-03 · unverdicted · novelty 7.0
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI · 2026-04 · unverdicted · novelty 5.0
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

Table 3: Training details of MemSearcher.

| Parameter | Value |
| --- | --- |
| Learning Rate | 1e-6 |
| Train Batch Size | 256 |
| Number of Training Epochs | 1 |
| Number of Rollout | 5 |
| Rollout Temperature | 1.0 |
| KL Loss Coefficient | 0.001 |
| Clip Ratio | 0.2 |

A.2 Details of Evaluated Datasets

We evaluate MemSearcher agents on the following public question answering datasets:

• Natural Questions (NQ) (Kwiatk...
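The Table 3 settings can be collected in a small configuration object, as sketched below. The class name and field names are hypothetical; only the values come from the table. The helper shows the standard PPO-style clipped surrogate that a clip ratio normally parameterizes, which is assumed here rather than confirmed for MemSearcher's loss. The table does not say whether the batch size counts prompts or trajectories.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MemSearcherTrainConfig:
    # Values from Table 3; field names are illustrative, not the authors' own.
    learning_rate: float = 1e-6
    train_batch_size: int = 256
    num_epochs: int = 1
    num_rollouts: int = 5
    rollout_temperature: float = 1.0
    kl_loss_coef: float = 0.001
    clip_ratio: float = 0.2


def clipped_surrogate(ratio: float, advantage: float, clip_ratio: float) -> float:
    """Standard PPO-style clipped objective for one token (assumed form)."""
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    config = MemSearcherTrainConfig()
    print(asdict(config))
    # A probability ratio of 1.5 with a positive advantage is capped at 1 + clip_ratio.
    print(clipped_surrogate(1.5, 1.0, config.clip_ratio))
```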
This case is drawn from the evaluation of MemSearcher based on Qwen2.5-7B-Instruct. The text enclosed by <think> and </think>, <tool call> and </tool call>, as well as <memory> and </memory> is generated by the model. The text enclosed by <tool response> and </tool response> is retrieved from the search engine. This case demonstrates that the model can effectively ma...

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470.

[3]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

[4]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[5]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196.

[6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[7]

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Weixun Wang, Hui Huang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, et al. Chinese simpleqa: A chinese factuality evaluation for large language models. arXiv preprint arXiv:2411.07140.

[8]

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060.

[9]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.

[10]

Active retrieval augmented generation

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7969–7992.

[11]

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O Arik. Long-context llms meet rag: Overcoming challenges for long inputs in rag. arXiv preprint arXiv:2410.05983, 2024a.

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement le...

[12]

Disentangling Memory and Reasoning Ability in Large Language Models

Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, and Yongfeng Zhang. Disentangling memory and reasoning ability in large language models. arXiv preprint arXiv:2411.13504, 2024b.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge ...

[13]

WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025a.

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic ...

[14]

Lost in the Middle: How Language Models Use Long Contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.

[15]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 7.

[16]

Beyond Words: A Latent Memory Approach to Internal Reasoning in LLMs

José I Orlicki. Beyond words: A latent memory approach to internal reasoning in llms. arXiv preprint arXiv:2502.21030.

[17]

Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.

[18]

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.

[19]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[21]

Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. Search and refine during think: Autonomous retrieval-augmented reasoning of llms. arXiv preprint arXiv:2505.11277, 2025.

[22]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592.

[23]

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588, 2025a.

Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu...

[24]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.

[25]

Qwen2 Technical Report

Qwen Team. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

[26]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022a.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition...

[27]

MemoryLLM: Towards Self-Updatable Large Language Models

Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, et al. Memoryllm: Towards self-updatable large language models. arXiv preprint arXiv:2402.04624.

[28]

M+: Extending MemoryLLM with Scalable Long-Term Memory

Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending memoryllm with scalable long-term memory. arXiv preprint arXiv:2502.00592.

[29]

Measuring short-form factuality in large language models

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.

[30]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813.

[31]

WebDancer: Towards Autonomous Information Seeking Agency

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648, 2025.

[32]

RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision

Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, et al. Rag-gym: Optimizing reasoning and search agents with process supervision. arXiv preprint arXiv:2502.13957.

[33]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110.

[34]

Memory3: Language modeling with explicit memory

Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, et al. Memory3: Language modeling with explicit memory. arXiv preprint arXiv:2407.01178.

[35]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

[36]

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259.

[37]

Generate Rather than Retrieve: Large Language Models are Strong Context Generators

Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063.

[38]

Inference Scaling for Long-Context Retrieval Augmented Generation

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context retrieval augmented generation. arXiv preprint arXiv:2410.04343 [cs.CL]. https://arxiv.org/abs/2410.04343

[39]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471.

[40]

On the structural memory of llm agents

Ruihong Zeng, Jinyuan Fang, Siwei Liu, and Zaiqiao Meng. On the structural memory of llm agents. arXiv preprint arXiv:2412.15266.

[41]

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. G-memory: Tracing hierarchical memory for multi-agent systems. arXiv preprint arXiv:2506.07398, 2025a.

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, et al. The landscape of agentic reinforcement lear...

[42]

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841.

[43]

Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering

Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, and Wei Hu. Mitigating lost-in-retrieval problems in retrieval augmented multi-hop question answering. arXiv preprint arXiv:2502.14245.

[44]

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144.

[45]

EfficientRAG: Efficient Retriever for Multi-Hop Question Answering

Ziyuan Zhuang, Zhiyang Zhang, Sitao Cheng, Fangkai Yang, Jia Liu, Shujian Huang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Efficientrag: Efficient retriever for multi-hop question answering. arXiv preprint arXiv:2408.04259.

[46]

Triad: A Framework Leveraging a Multi-Role LLM-Based Agent to Solve Knowledge Base Question Answering

Chang Zong, Yuchen Yan, Weiming Lu, Jian Shao, Eliot Huang, Heng Chang, and Yueting Zhuang. Triad: A framework leveraging a multi-role llm-based agent to solve knowledge base question answering. arXiv preprint arXiv:2402.14320.
