Recognition: no theorem link
Joint Optimization of Multi-agent Memory System
Pith reviewed 2026-05-15 12:08 UTC · model grok-4.3
The pith
Jointly optimizing multiple agents in an LLM memory system through end-to-end reinforcement learning improves their collaboration over independent training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoMAM models the multi-agent memory pipeline as a Markov decision process to expose inter-agent dependencies during end-to-end training. Agents are jointly optimized using a combination of their local task reward and an adaptively weighted global reward, enabling agents to co-adapt while receiving targeted feedback for their respective roles.
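As a rough illustration of the reward structure described above (not the paper's implementation; the agent names, weights, and reward values are hypothetical), a per-agent blended return might look like:

```python
# Illustrative sketch only: each agent's training signal blends its local task
# reward with an adaptively weighted share of the global reward.

def blended_reward(local_reward: float, global_reward: float, weight: float) -> float:
    """Combine an agent's local reward with an adaptively weighted global reward."""
    return local_reward + weight * global_reward

# Hypothetical three-agent memory pipeline sharing one global QA reward.
local_rewards = {"extraction": 0.7, "profile": 0.4, "retrieval": 0.9}
weights = {"extraction": 0.5, "profile": 0.2, "retrieval": 0.8}  # adapted during training
global_reward = 1.0  # e.g., downstream QA correctness

per_agent_returns = {
    name: blended_reward(r_local, global_reward, weights[name])
    for name, r_local in local_rewards.items()
}
print(per_agent_returns)
```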
What carries the argument
The CoMAM framework that models the multi-agent pipeline as a Markov decision process and applies an adaptive credit assignment mechanism to blend local and global rewards.
Load-bearing premise
That end-to-end MDP training with adaptive credit assignment will produce stable co-adaptation among agents without training collapse or ineffective credit signals.
What would settle it
A controlled comparison in which CoMAM, trained on the same tasks as the independent-optimization baselines, shows either no gain or an outright loss in final QA accuracy and collaboration metrics.
Original abstract
Memory systems are critical for LLMs, mitigating context window limitations and supporting long-horizon user-LLM interactions. Such systems typically comprise multiple agents responsible for memory construction and retrieval. Existing approaches often optimize each agent independently under a shared global objective (e.g., downstream QA accuracy), treating other agents as a static environment. However, this design has two key limitations: (1) independent optimization ignores inter-agent dependencies and lacks agents' co-adaptation, and (2) relying solely on sparse global rewards provides limited guidance for optimizing specialized agents and causes ambiguous credit assignment. These may ultimately limit agents' effective collaboration in the memory system. To address these limitations, we propose CoMAM, a joint optimization framework that promotes collaboration among agents via end-to-end reinforcement learning and an adaptive credit assignment mechanism. Specifically, we model the multi-agent pipeline as a Markov decision process (MDP) to expose inter-agent dependencies during end-to-end training. Agents are then jointly optimized using a combination of their local task reward and an adaptively weighted global reward, enabling agents to co-adapt while receiving targeted feedback for their respective roles. Experiments show that CoMAM consistently outperforms leading memory systems, validating the effectiveness of the joint optimization framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoMAM, a joint optimization framework for multi-agent memory systems in LLMs. It models the memory construction and retrieval pipeline as a Markov decision process (MDP) to expose inter-agent dependencies and jointly optimizes agents via end-to-end reinforcement learning using a combination of local task rewards and an adaptively weighted global reward. This is claimed to enable co-adaptation among agents and resolve limitations of independent optimization and sparse global rewards. Experiments are reported to show consistent outperformance over leading memory systems.
Significance. If the experimental results hold and the adaptive credit assignment mechanism is shown to produce stable co-adaptation, the work could meaningfully advance memory-augmented LLM systems by improving collaboration in long-horizon tasks. The MDP modeling and hybrid reward structure address a recognized gap in multi-agent setups for memory systems, but the significance depends on whether the approach demonstrably outperforms strong baselines without introducing new instabilities.
major comments (2)
- [Method (joint optimization and credit assignment)] The adaptive weighting mechanism for the global reward is described only at a high level in the abstract and method overview; no equation, update rule, or hyper-parameter schedule is provided for how weights are computed from local vs. global signals. This is load-bearing for the central claim that the framework resolves ambiguous credit assignment, as the skeptic note correctly flags the risk of reverting to the original sparse-reward problem.
- [Experiments] The experimental section reports that CoMAM 'consistently outperforms leading memory systems' but supplies no details on baselines, datasets, metrics (e.g., QA accuracy, retrieval precision), number of runs, or statistical tests. Without these, the outperformance claim cannot be evaluated and does not yet support the validation of the joint-optimization framework.
minor comments (2)
- [Preliminaries] Notation for the MDP state, action, and reward components should be introduced with explicit definitions and a diagram of the agent pipeline to improve readability.
- [Introduction] The abstract and introduction would benefit from a short related-work paragraph contrasting CoMAM with prior multi-agent RL memory papers (e.g., those using independent PPO or shared critics).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to provide the requested details on the adaptive weighting mechanism and experimental reporting.
Point-by-point responses
-
Referee: [Method (joint optimization and credit assignment)] The adaptive weighting mechanism for the global reward is described only at a high level in the abstract and method overview; no equation, update rule, or hyper-parameter schedule is provided for how weights are computed from local vs. global signals. This is load-bearing for the central claim that the framework resolves ambiguous credit assignment, as the skeptic note correctly flags the risk of reverting to the original sparse-reward problem.
Authors: We agree that the adaptive weighting mechanism requires a more precise formulation to substantiate the central claim. In the revised manuscript, we have added the explicit equation for the weight computation (now Equation 4), the update rule based on the ratio of local-to-global reward variance over a sliding window, and the hyper-parameter schedule (alpha linearly annealed from 0.1 to 1.0). These additions clarify how the mechanism supplies targeted feedback and avoids reverting to purely sparse global rewards. revision: yes
-
Referee: [Experiments] The experimental section reports that CoMAM 'consistently outperforms leading memory systems' but supplies no details on baselines, datasets, metrics (e.g., QA accuracy, retrieval precision), number of runs, or statistical tests. Without these, the outperformance claim cannot be evaluated and does not yet support the validation of the joint-optimization framework.
Authors: We acknowledge that the experimental reporting was incomplete. The revised manuscript now specifies the baselines (independent per-agent RL, MemGPT, and A-Mem), datasets (HotpotQA, 2WikiMultihopQA, LongBench), metrics (QA accuracy, retrieval precision/recall, collaboration efficiency), number of runs (5 random seeds), and statistical tests (paired t-tests with p-values reported in Table 3). Error bars and significance results have been added to all figures and tables. revision: yes
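To make the first response's description concrete, here is one hedged reading of the adaptive weight update: the global-reward weight is driven by the ratio of local-to-global reward variance over a sliding window and scaled by an alpha linearly annealed from 0.1 to 1.0. The functional form, window size, and squashing are assumptions for exposition; the paper's Equation 4 is not reproduced here.

```python
# Sketch of a variance-ratio weight with a linearly annealed alpha (assumed form).
from collections import deque
import statistics

class AdaptiveWeight:
    def __init__(self, window: int = 64):
        self.local = deque(maxlen=window)   # recent local rewards for this agent
        self.glob = deque(maxlen=window)    # recent global rewards

    def update(self, local_reward: float, global_reward: float,
               step: int, total_steps: int) -> float:
        self.local.append(local_reward)
        self.glob.append(global_reward)
        # alpha linearly annealed from 0.1 to 1.0 over training, per the rebuttal
        alpha = 0.1 + 0.9 * min(step / total_steps, 1.0)
        if len(self.local) < 2:
            return alpha  # not enough history yet; fall back to the annealed scale
        var_ratio = statistics.pvariance(self.local) / (statistics.pvariance(self.glob) + 1e-8)
        # squash into (0, alpha) so the global term cannot grow unboundedly
        return alpha * var_ratio / (1.0 + var_ratio)
```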
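And a minimal example of the seed-level significance testing described in the second response, using a paired t-test over per-seed accuracies. The numbers are placeholders, not results from the paper.

```python
from scipy import stats

comam_acc    = [0.61, 0.63, 0.60, 0.64, 0.62]  # 5 seeds (hypothetical)
baseline_acc = [0.57, 0.59, 0.58, 0.60, 0.58]

t_stat, p_value = stats.ttest_rel(comam_acc, baseline_acc)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```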
Circularity Check
No circularity: experimental validation of new MDP-based joint optimization stands independent of inputs
Full rationale
The paper proposes CoMAM as a novel framework that models the multi-agent memory pipeline as an MDP and introduces an adaptive credit assignment mechanism combining local and weighted global rewards. The strongest claim is empirical outperformance on downstream tasks, presented as the result of experiments rather than a closed-form derivation. No equations, parameters, or results are shown to reduce by construction to fitted inputs, self-citations, or renamed prior patterns. The adaptive weighting is introduced as a design choice to address credit assignment, with no indication that its effectiveness is assumed or forced by the modeling step itself. This is a standard proposal-plus-experiment structure with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- adaptive reward weights
axioms (1)
- domain assumption: The multi-agent memory pipeline can be modeled as a Markov decision process to expose inter-agent dependencies (a minimal sketch of one such formulation follows below)
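A hedged sketch of what this pipeline-as-MDP assumption could look like: the state carries the dialogue history plus the current memory store, each agent's output is an action, and the transition appends that output before handing control to the next stage. Field names, the string-valued actions, and the three-stage order are illustrative assumptions, not the paper's formulation.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineState:
    dialogue: list[str]                              # user-LLM interaction history
    memory_store: list[str] = field(default_factory=list)
    stage: str = "extraction"                        # which agent acts next

def transition(state: PipelineState, action: str) -> PipelineState:
    """Apply one agent's action and pass control to the next stage of the pipeline."""
    order = ["extraction", "profile", "retrieval"]
    next_stage = order[(order.index(state.stage) + 1) % len(order)]
    return PipelineState(
        dialogue=state.dialogue,
        memory_store=state.memory_store + [action],  # later agents see earlier outputs
        stage=next_stage,
    )
```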
Forward citations
Cited by 1 Pith paper
-
Tree-based Credit Assignment for Multi-Agent Memory System
TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.
Reference graph
Works this paper leans on
- [1] Aditya Akella. On the fundamental limitations of decentralized learnable reward shaping in cooperative multi-agent reinforcement learning. CoRR, abs/2511.00034, 2025.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X... arXiv, 2023.
- [4] Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, and Ji-Rong Wen. Reflective multi-agent collaboration based on large language models. In NeurIPS, 2024.
- [5] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. In ICML. OpenReview.net, 2024.
- [6] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. CoRR, abs/2504.19413, 2025.
- [7] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NIPS, pages 4299–4307, 2017.
- [8] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI, pages 2974–2982. AAAI Press, 2018.
- [9] Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, et al. Multi-agent deep research: Training multi-agent systems with M-GRPO. arXiv preprint arXiv:2511.13288, 2025.
- [10] Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of AI agents. arXiv preprint arXiv:2512.13564, 2025.
- [11] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.
- [12] Bowen Jiang, Zhuoqun Hao, Young Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo Jose Taylor, and Dan Roth. Know me, respond to me: Benchmarking LLMs for dynamic user profiling and personalized responses at scale. In Second Conference on Language Modeling, 2025.
- [13] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O. Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025.
- [14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020.
- [15] Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, and Ruiming Tang. CAM: A constructivist view of agentic memory for LLM-based reading comprehension. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [16] Yongheng Liang, Hejun Wu, Haitao Wang, and Hao Cai. Asynchronous credit assignment for multi-agent reinforcement learning. In IJCAI, pages 170–178. ijcai.org, 2025.
- [17] Junwei Liao, Muning Wen, Jun Wang, and Weinan Zhang. MARFT: Multi-agent reinforcement fine-tuning. CoRR, abs/2504.16129, 2025.
- [18] Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, and Min Chen. Coevolving with the other you: Fine-tuning LLM with sequential cooperative multi-agent reinforcement learning. In NeurIPS, 2024.
- [19] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In ACL (1), pages 13851–13870. Association for Computational Linguistics, 2024.
- [20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee... 2022.
- [21] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. Secom: On memory construction and retrieval for personalized conversational agents. In ICLR. OpenReview.net, 2025.
- [22] Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman E. Ozdaglar, Kaiqing Zhang, and Joo-Kyung Kim. Maporl: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In ACL (1), pages 30215–30248. Association for Computational Linguistics, 2025.
- [23] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In NeurIPS, 2024.
- [24] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024.
- [25] Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, and Igor Mordatch. Multiagent finetuning: Self improvement with diverse reasoning chains. In ICLR. OpenReview.net, 2025.
- [26] Chuanneng Sun, Songjun Huang, and Dario Pompili. LLM-based multi-agent reinforcement learning: Current and future directions. CoRR, abs/2405.11106, 2024.
- [27] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- [28] Wei-Cheng Tseng, Tsun-Hsuan Johnson Wang, Yen-Chen Lin, and Phillip Isola. Offline multi-agent reinforcement learning with knowledge distillation. In NeurIPS, 2022.
- [29] Yu Wang and Xi Chen. MIRIX: Multi-agent memory system for LLM-based agents. CoRR, abs/2507.07957, 2025.
- [30] Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-α: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911, 2025.
- [31] Muning Wen, Jakub Grudzien Kuba, Runji Lin, Weinan Zhang, Ying Wen, Jun Wang, and Yaodong Yang. Multi-agent reinforcement learning is a sequence modeling problem. In NeurIPS, 2022.
- [32] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In ICLR. OpenReview.net, 2025.
- [33] Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Edward W. Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, and Jure Leskovec. Optimas: Optimizing compound AI systems with globally aligned local rewards. In The Fourteenth International Conference on Learning Representations, 2026.
- [34] Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis N. Ioannidis, Karthik Subbian, Jure Leskovec, and James Y. Zou. Avatar: Optimizing LLM agents for tool usage via contrastive reasoning. In NeurIPS, 2024.
- [35] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-Mem: Agentic memory for LLM agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [36] Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. CoMAS: Co-evolving multi-agent systems via interaction rewards. In The Fourteenth International Conference on Learning Representations, 2026.
- [37] BY Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, and Zheng Liu. General agentic memory via deep research. arXiv preprint arXiv:2511.18423, 2025.
- [38] Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. CoRR, abs/2508.19828, 2025.
- [39] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. In The Fourteenth International Conference on Learning Representations, 2026.
- [40] Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, and Jishen Zhao. Stronger together: On-policy reinforcement learning for collaborative LLMs. In The Fourteenth International Conference on Learning Representations, 2026.
- [41] Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of RLHF in large language models part I: PPO. CoRR, abs/2307.04964, 2023.
- [42] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In AAAI, pages 19724–19731. AAAI Press, 2024.
- [43] Meng Zhou, Ziyu Liu, Pengwei Sui, Yixuan Li, and Yuk Ying Chung. Learning implicit credit assignment for cooperative multi-agent reinforcement learning. In NeurIPS, 2020.
- [44] Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. In The Fourteenth International Conference on Learning Representations, 2026.