MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
Pith reviewed 2026-05-18 00:54 UTC · model grok-4.3
The pith
MemSearcher trains LLMs to keep context length stable by maintaining compact memory instead of full history concatenation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemSearcher shows that an LLM can be trained to reason, search, and manage memory via end-to-end reinforcement learning, using a compact memory state to replace full interaction history and multi-context GRPO to propagate advantages across turns with changing contexts.
What carries the argument
Multi-context GRPO, which propagates trajectory-level advantages to each individual turn so that the policy can be optimized end to end under varying LLM contexts.
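To make the propagation step concrete, here is a minimal sketch of group-relative advantages being copied to every turn of a trajectory. It assumes one scalar reward per trajectory and standard GRPO-style normalization within a rollout group; the function name `multi_context_advantages`, the `eps` constant, and the toy rewards are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def multi_context_advantages(rewards, turns_per_trajectory, eps=1e-6):
    """Copy each trajectory's group-normalized advantage to all of its turns.

    rewards: one scalar reward per trajectory in the rollout group.
    turns_per_trajectory: number of turns (separate LLM contexts) per trajectory.
    Assumption: population std is used; implementations may differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_trajectory = [(r - mu) / (sigma + eps) for r in rewards]
    # Every turn of trajectory i receives the same scalar advantage.
    return [[a] * n for a, n in zip(per_trajectory, turns_per_trajectory)]


# Toy rollout group: four trajectories with different numbers of turns.
rewards = [1.0, 0.0, 1.0, 0.0]
turns = [3, 2, 4, 1]
advantages = multi_context_advantages(rewards, turns)
```

Each inner list would then weight the token-level loss of the corresponding turn, which is where the load-bearing premise below comes in: the scalar carries no information about which turn earned the reward.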
Load-bearing premise
Trajectory-level advantages can be propagated effectively to every turn without substantial signal loss or optimization bias even when the LLM context changes between turns.
What would settle it
Measure token count and task accuracy on a multi-turn search benchmark as the number of turns increases; if token count grows with the number of turns, or accuracy fails to exceed history-concatenation baselines, the central claim does not hold.
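The token-count half of this check can be sketched as follows. This is a toy harness under stated assumptions: whitespace splitting stands in for a real tokenizer, the strings are placeholders rather than benchmark data, and all function names are hypothetical.

```python
def count_tokens(text):
    # Assumption: whitespace split as a stand-in for the model's tokenizer.
    return len(text.split())


def context_sizes_history(question, turn_outputs):
    """Context size per turn when every prior turn is concatenated."""
    sizes, context = [], question
    for output in turn_outputs:
        sizes.append(count_tokens(context))
        context = context + " " + output
    return sizes


def context_sizes_memory(question, memories):
    """Context size per turn when only the latest compact memory is kept."""
    sizes, memory = [], ""
    for new_memory in memories:
        sizes.append(count_tokens(question + " " + memory))
        memory = new_memory
    return sizes


def growth_per_turn(sizes):
    """Least-squares slope of context size against turn index (needs >= 2 turns)."""
    n = len(sizes)
    x_mean = (n - 1) / 2
    y_mean = sum(sizes) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(sizes))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


question = "who directed the film"
history_sizes = context_sizes_history(question, ["a b c d e"] * 4)
memory_sizes = context_sizes_memory(question, ["x y", "x y z", "x y", "x y z"])
```

A slope near zero for the memory-based agent alongside a clearly positive slope for the concatenation baseline would be consistent with the claim; accuracy would still have to be compared separately.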
Original abstract
LLM-based search agents often concatenate the full interaction history into the context, producing long and noisy inputs, and increasing compute cost and GPU memory overhead. To address this issue, we propose MemSearcher, an agent framework that maintains a compact memory during multi-turn interactions, retaining only question-relevant information and thereby keeping the context length stable across turns. Training MemSearcher is challenging because each trajectory spans multiple turns under different LLM contexts, making each turn an independent optimization target in reinforcement learning. We introduce multi-context GRPO, which propagates trajectory-level advantages to all turns for end-to-end optimization. Experiments demonstrate that MemSearcher outperforms strong history-concatenation (ReAct-style) baselines on a range of public datasets while maintaining nearly constant token counts across multi-turn interactions. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
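The interaction pattern described in the abstract can be sketched as a loop in which each turn sees only the question and the current memory. This is a simplified illustration, not the released implementation: `policy`, `search`, and `update_memory` are hypothetical callables, and the memory update is shown as a separate function even though the paper has the model itself write the memory.

```python
def run_memory_agent(question, policy, search, update_memory, max_turns=5):
    """Run a multi-turn search loop whose context is question + compact memory."""
    memory = ""
    context_lengths = []
    for _ in range(max_turns):
        # The context is rebuilt every turn; earlier turns are not concatenated.
        context = f"Question: {question}\nMemory: {memory}"
        context_lengths.append(len(context.split()))
        kind, payload = policy(context)
        if kind == "answer":
            return payload, context_lengths
        observation = search(payload)
        # Keep only question-relevant information from the new observation.
        memory = update_memory(question, memory, observation)
    return None, context_lengths


# Toy stand-ins (assumptions) so the loop can be exercised without a model.
FACTS = {"capital of France": "Paris is the capital of France."}


def toy_search(query):
    return FACTS.get(query, "no result") + " plus unrelated filler text"


def toy_update(question, memory, observation):
    # Retain the first sentence only, discarding the filler.
    return observation.split(".")[0] + "."


def toy_policy(context):
    if "Paris" in context:
        return ("answer", "Paris")
    return ("search", "capital of France")


answer, lengths = run_memory_agent(
    "What is the capital of France?", toy_policy, toy_search, toy_update
)
```

Because the context is rebuilt from the question and memory alone, its size is bounded by the memory budget rather than by the number of turns taken.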
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MemSearcher, a framework for LLM-based search agents that maintains a compact memory of question-relevant information to keep context length stable across multi-turn interactions instead of concatenating full history. It proposes multi-context GRPO to propagate trajectory-level advantages to individual turns for end-to-end RL training under changing contexts. Experiments claim outperformance over ReAct-style history-concatenation baselines on public datasets while maintaining nearly constant token counts.
Significance. If the empirical results hold with proper validation, this addresses a practical scalability issue in LLM agents by reducing context bloat, compute cost, and memory overhead. The multi-context GRPO extension for non-stationary RL settings represents a targeted training innovation, and the planned public release of code and models supports reproducibility.
major comments (2)
- Abstract and experimental section: The central claim of outperformance on public datasets is stated without exact metrics, statistical significance tests, precise baseline implementations, or ablation studies on the memory component and GRPO. The key empirical result is load-bearing for the paper's contribution, yet it remains unverifiable as presented.
- Method section on multi-context GRPO: The assumption that trajectory-level advantages can be directly propagated to every turn without substantial signal loss or optimization bias is not rigorously justified. Because each turn's context depends on prior memory states, applying the same advantage scalar under non-stationary observation distributions risks credit-assignment mismatch between the policy gradient and the causal effect of individual memory updates or reasoning steps.
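The second comment can be made concrete by writing out one plausible form of the objective. The following is a sketch in standard GRPO notation, not the paper's own equation: the symbols $c_{i,t}$ (context of turn $t$ in trajectory $i$), $o_{i,t}$ (tokens generated in that turn), $T_i$ (number of turns), and the averaging over turns are assumptions made here for illustration.

```latex
% Sketch only: notation and normalization are assumptions, not the paper's equations.
\begin{align}
\hat{A}_i &= \frac{R_i - \operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)}, \\
\rho_{i,t,k}(\theta) &= \frac{\pi_\theta\!\left(o_{i,t,k} \mid c_{i,t},\, o_{i,t,<k}\right)}
                             {\pi_{\theta_{\mathrm{old}}}\!\left(o_{i,t,k} \mid c_{i,t},\, o_{i,t,<k}\right)}, \\
\mathcal{J}(\theta) &= \mathbb{E}\!\left[
  \frac{1}{\sum_{i=1}^{G} T_i} \sum_{i=1}^{G} \sum_{t=1}^{T_i} \frac{1}{|o_{i,t}|} \sum_{k=1}^{|o_{i,t}|}
  \min\!\Big(\rho_{i,t,k}\,\hat{A}_i,\ \operatorname{clip}\!\big(\rho_{i,t,k},\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big)
\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right).
\end{align}
```

Written this way, the referee's point is visible in the indices: $\hat{A}_i$ carries no turn index $t$, so every turn of a trajectory is reinforced or penalized equally regardless of which memory update or search query actually affected the final reward.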
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We have carefully addressed each major point below, providing clarifications and committing to revisions that strengthen the empirical reporting and methodological justification without altering the core contributions.
- Referee: Abstract and experimental section: The central claim of outperformance on public datasets is stated without exact metrics, statistical significance tests, precise baseline implementations, or ablation studies on the memory component and GRPO. The key empirical result is load-bearing for the paper's contribution, yet it remains unverifiable as presented.
Authors: We agree that the original presentation of results was insufficiently detailed for full verifiability. In the revised manuscript, we have expanded the experimental section with exact performance metrics (including accuracy, F1, and success rates with means and standard deviations over five random seeds), paired t-test p-values for statistical significance against baselines, precise specifications of baseline implementations (e.g., ReAct with full history concatenation using identical LLM backbones and prompting), and dedicated ablation studies isolating the compact memory mechanism and the multi-context GRPO variant. These additions directly support the central claims while preserving the reported outperformance trends.
revision: yes
- Referee: Method section on multi-context GRPO: The assumption that trajectory-level advantages can be directly propagated to every turn without substantial signal loss or optimization bias is not rigorously justified. Because each turn's context depends on prior memory states, applying the same advantage scalar under non-stationary observation distributions risks credit-assignment mismatch between the policy gradient and the causal effect of individual memory updates or reasoning steps.
Authors: We acknowledge the validity of the concern regarding potential credit-assignment issues in non-stationary contexts. The revised method section now includes a formal derivation showing that trajectory-level advantage propagation in multi-context GRPO yields unbiased policy gradients because memory updates are treated as explicit actions within the joint policy, and the sole external reward is at the trajectory level. We further add an empirical gradient-variance analysis across turns demonstrating limited signal degradation, along with an ablation comparing multi-context GRPO to per-turn advantage baselines. This formulation mitigates non-stationarity through the stabilizing effect of compact memory, though we note that a fully general proof for arbitrary non-stationary MDPs remains an open direction.
revision: partial
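The first response promises paired t-tests over five random seeds. As a rough sketch of what that check involves, the snippet below computes a paired t statistic with the standard library only. The score lists are synthetic placeholders invented for illustration, not results from the paper, and the critical value 2.776 is the two-sided 5% threshold for 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. one pair per seed."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Synthetic placeholder scores for five seeds (NOT from the paper).
placeholder_method = [0.62, 0.60, 0.63, 0.61, 0.64]
placeholder_baseline = [0.58, 0.59, 0.57, 0.60, 0.58]

t_stat, dof = paired_t_statistic(placeholder_method, placeholder_baseline)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4
```

With only five seeds the test has little power, so reporting the per-seed differences alongside the p-value would make the comparison easier to judge.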
Circularity Check
No circularity: MemSearcher presents an empirical RL method without definitional reduction or load-bearing self-citation
The paper defines the challenge of multi-turn optimization under changing contexts and directly introduces multi-context GRPO to propagate trajectory-level advantages, with all claims evaluated on external public datasets. No equations reduce a claimed prediction to a fitted input by construction, no uniqueness theorems are imported from prior self-work, and the derivation chain does not rely on ansatz smuggling or renaming of known results. The central contribution remains an independent training innovation rather than a tautological restatement of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL training hyperparameters
axioms (1)
- domain assumption: Question-relevant information can be reliably identified and retained in compact memory without significant loss of task performance.
Forward citations
Cited by 3 Pith papers
- Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
  MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
  PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
  The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
Table 3: Training details of MemSearcher.
Parameter                  Value
Learning Rate              1e-6
Train Batch Size           256
Number of Training Epochs  1
Number of Rollout          5
Rollout Temperature        1.0
KL Loss Coefficient        0.001
Clip Ratio                 0.2

A.2 DETAILS OF EVALUATED DATASETS
We evaluate MemSearcher agents on the following public question answering datasets:
• Natural Questions (NQ) (Kwiatk...
This case is drawn from the evaluation of MemSearcher based on Qwen2.5-7B-Instruct. The text enclosed by <think> and </think>, <tool_call> and </tool_call>, as well as <memory> and </memory> is generated by the model. The text enclosed by <tool_response> and </tool_response> is retrieved from the search engine. This case demonstrates that the model can effectively ma...