Recognition: no theorem link
Joint Optimization of Multi-agent Memory System
Pith reviewed 2026-05-15 12:08 UTC · model grok-4.3
The pith
Jointly optimizing multiple agents in an LLM memory system through end-to-end reinforcement learning improves their collaboration over independent training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoMAM models the multi-agent memory pipeline as a Markov decision process to expose inter-agent dependencies during end-to-end training. Agents are jointly optimized using a combination of their local task reward and an adaptively weighted global reward, enabling agents to co-adapt while receiving targeted feedback for their respective roles.
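As a rough illustration of the reward structure described above (not the paper's implementation; the agent names, weights, and reward values are hypothetical), a per-agent blended return might look like:

```python
# Illustrative sketch only: each agent's training signal blends its local task
# reward with an adaptively weighted share of the global reward.

def blended_reward(local_reward: float, global_reward: float, weight: float) -> float:
    """Combine an agent's local reward with an adaptively weighted global reward."""
    return local_reward + weight * global_reward

# Hypothetical three-agent memory pipeline sharing one global QA reward.
local_rewards = {"extraction": 0.7, "profile": 0.4, "retrieval": 0.9}
weights = {"extraction": 0.5, "profile": 0.2, "retrieval": 0.8}  # adapted during training
global_reward = 1.0  # e.g., downstream QA correctness

per_agent_returns = {
    name: blended_reward(r_local, global_reward, weights[name])
    for name, r_local in local_rewards.items()
}
print(per_agent_returns)
```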
What carries the argument
The CoMAM framework that models the multi-agent pipeline as a Markov decision process and applies an adaptive credit assignment mechanism to blend local and global rewards.
Load-bearing premise
That end-to-end MDP training with adaptive credit assignment will produce stable co-adaptation among agents without training collapse or ineffective credit signals.
What would settle it
A controlled comparison in which CoMAM, trained on the same tasks as the independent-optimization baselines, shows either no gain or an outright loss in final QA accuracy and collaboration metrics.
Original abstract
Memory systems are critical for LLMs, mitigating context window limitations and supporting long-horizon user-LLM interactions. Such systems typically comprise multiple agents responsible for memory construction and retrieval. Existing approaches often optimize each agent independently under a shared global objective (e.g., downstream QA accuracy), treating other agents as a static environment. However, this design has two key limitations: (1) independent optimization ignores inter-agent dependencies and lacks agents' co-adaptation, and (2) relying solely on sparse global rewards provides limited guidance for optimizing specialized agents and causes ambiguous credit assignment. These may ultimately limit agents' effective collaboration in the memory system. To address these limitations, we propose CoMAM, a joint optimization framework that promotes collaboration among agents via end-to-end reinforcement learning and an adaptive credit assignment mechanism. Specifically, we model the multi-agent pipeline as a Markov decision process (MDP) to expose inter-agent dependencies during end-to-end training. Agents are then jointly optimized using a combination of their local task reward and an adaptively weighted global reward, enabling agents to co-adapt while receiving targeted feedback for their respective roles. Experiments show that CoMAM consistently outperforms leading memory systems, validating the effectiveness of the joint optimization framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoMAM, a joint optimization framework for multi-agent memory systems in LLMs. It models the memory construction and retrieval pipeline as a Markov decision process (MDP) to expose inter-agent dependencies and jointly optimizes agents via end-to-end reinforcement learning using a combination of local task rewards and an adaptively weighted global reward. This is claimed to enable co-adaptation among agents and resolve limitations of independent optimization and sparse global rewards. Experiments are reported to show consistent outperformance over leading memory systems.
Significance. If the experimental results hold and the adaptive credit assignment mechanism is shown to produce stable co-adaptation, the work could meaningfully advance memory-augmented LLM systems by improving collaboration in long-horizon tasks. The MDP modeling and hybrid reward structure address a recognized gap in multi-agent setups for memory systems, but the significance depends on whether the approach demonstrably outperforms strong baselines without introducing new instabilities.
major comments (2)
- [Method (joint optimization and credit assignment)] The adaptive weighting mechanism for the global reward is described only at a high level in the abstract and method overview; no equation, update rule, or hyper-parameter schedule is provided for how weights are computed from local vs. global signals. This is load-bearing for the central claim that the framework resolves ambiguous credit assignment, as the skeptic note correctly flags the risk of reverting to the original sparse-reward problem.
- [Experiments] The experimental section reports that CoMAM 'consistently outperforms leading memory systems' but supplies no details on baselines, datasets, metrics (e.g., QA accuracy, retrieval precision), number of runs, or statistical tests. Without these, the outperformance claim cannot be evaluated and does not yet support the validation of the joint-optimization framework.
minor comments (2)
- [Preliminaries] Notation for the MDP state, action, and reward components should be introduced with explicit definitions and a diagram of the agent pipeline to improve readability.
- [Introduction] The abstract and introduction would benefit from a short related-work paragraph contrasting CoMAM with prior multi-agent RL memory papers (e.g., those using independent PPO or shared critics).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to provide the requested details on the adaptive weighting mechanism and experimental reporting.
Point-by-point responses
-
Referee: [Method (joint optimization and credit assignment)] The adaptive weighting mechanism for the global reward is described only at a high level in the abstract and method overview; no equation, update rule, or hyper-parameter schedule is provided for how weights are computed from local vs. global signals. This is load-bearing for the central claim that the framework resolves ambiguous credit assignment, as the skeptic note correctly flags the risk of reverting to the original sparse-reward problem.
Authors: We agree that the adaptive weighting mechanism requires a more precise formulation to substantiate the central claim. In the revised manuscript, we have added the explicit equation for the weight computation (now Equation 4), the update rule based on the ratio of local-to-global reward variance over a sliding window, and the hyper-parameter schedule (alpha linearly annealed from 0.1 to 1.0). These additions clarify how the mechanism supplies targeted feedback and avoids reverting to purely sparse global rewards. revision: yes
-
Referee: [Experiments] The experimental section reports that CoMAM 'consistently outperforms leading memory systems' but supplies no details on baselines, datasets, metrics (e.g., QA accuracy, retrieval precision), number of runs, or statistical tests. Without these, the outperformance claim cannot be evaluated and does not yet support the validation of the joint-optimization framework.
Authors: We acknowledge that the experimental reporting was incomplete. The revised manuscript now specifies the baselines (independent per-agent RL, MemGPT, and A-Mem), datasets (HotpotQA, 2WikiMultihopQA, LongBench), metrics (QA accuracy, retrieval precision/recall, collaboration efficiency), number of runs (5 random seeds), and statistical tests (paired t-tests with p-values reported in Table 3). Error bars and significance results have been added to all figures and tables. revision: yes
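To make the first response's description concrete, here is one hedged reading of the adaptive weight update: the global-reward weight is driven by the ratio of local-to-global reward variance over a sliding window and scaled by an alpha linearly annealed from 0.1 to 1.0. The functional form, window size, and squashing are assumptions for exposition; the paper's Equation 4 is not reproduced here.

```python
# Sketch of a variance-ratio weight with a linearly annealed alpha (assumed form).
from collections import deque
import statistics

class AdaptiveWeight:
    def __init__(self, window: int = 64):
        self.local = deque(maxlen=window)   # recent local rewards for this agent
        self.glob = deque(maxlen=window)    # recent global rewards

    def update(self, local_reward: float, global_reward: float,
               step: int, total_steps: int) -> float:
        self.local.append(local_reward)
        self.glob.append(global_reward)
        # alpha linearly annealed from 0.1 to 1.0 over training, per the rebuttal
        alpha = 0.1 + 0.9 * min(step / total_steps, 1.0)
        if len(self.local) < 2:
            return alpha  # not enough history yet; fall back to the annealed scale
        var_ratio = statistics.pvariance(self.local) / (statistics.pvariance(self.glob) + 1e-8)
        # squash into (0, alpha) so the global term cannot grow unboundedly
        return alpha * var_ratio / (1.0 + var_ratio)
```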
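And a minimal example of the seed-level significance testing described in the second response, using a paired t-test over per-seed accuracies. The numbers are placeholders, not results from the paper.

```python
from scipy import stats

comam_acc    = [0.61, 0.63, 0.60, 0.64, 0.62]  # 5 seeds (hypothetical)
baseline_acc = [0.57, 0.59, 0.58, 0.60, 0.58]

t_stat, p_value = stats.ttest_rel(comam_acc, baseline_acc)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```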
Circularity Check
No circularity: experimental validation of new MDP-based joint optimization stands independent of inputs
Full rationale
The paper proposes CoMAM as a novel framework that models the multi-agent memory pipeline as an MDP and introduces an adaptive credit assignment mechanism combining local and weighted global rewards. The strongest claim is empirical outperformance on downstream tasks, presented as the result of experiments rather than a closed-form derivation. No equations, parameters, or results are shown to reduce by construction to fitted inputs, self-citations, or renamed prior patterns. The adaptive weighting is introduced as a design choice to address credit assignment, with no indication that its effectiveness is assumed or forced by the modeling step itself. This is a standard proposal-plus-experiment structure with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- adaptive reward weights
axioms (1)
- domain assumption: The multi-agent memory pipeline can be modeled as a Markov decision process to expose inter-agent dependencies (a minimal sketch of one such formulation follows below)
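A hedged sketch of what this pipeline-as-MDP assumption could look like: the state carries the dialogue history plus the current memory store, each agent's output is an action, and the transition appends that output before handing control to the next stage. Field names, the string-valued actions, and the three-stage order are illustrative assumptions, not the paper's formulation.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineState:
    dialogue: list[str]                              # user-LLM interaction history
    memory_store: list[str] = field(default_factory=list)
    stage: str = "extraction"                        # which agent acts next

def transition(state: PipelineState, action: str) -> PipelineState:
    """Apply one agent's action and pass control to the next stage of the pipeline."""
    order = ["extraction", "profile", "retrieval"]
    next_stage = order[(order.index(state.stage) + 1) % len(order)]
    return PipelineState(
        dialogue=state.dialogue,
        memory_store=state.memory_store + [action],  # later agents see earlier outputs
        stage=next_stage,
    )
```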
Forward citations
Cited by 1 Pith paper
-
Tree-based Credit Assignment for Multi-Agent Memory System
TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.
Reference graph
Works this paper leans on
- [1] Aditya Akella. On the fundamental limitations of decentralized learnable reward shaping in cooperative multi-agent reinforcement learning. CoRR, abs/2511.00034, 2025.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X... arXiv, 2023.
- [4] Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, and Ji-Rong Wen. Reflective multi-agent collaboration based on large language models. In NeurIPS, 2024.
- [5] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. In ICML. OpenReview.net, 2024.
- [6] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. CoRR, abs/2504.19413, 2025.
- [7] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NIPS, pages 4299–4307, 2017.
- [8] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI, pages 2974–2982. AAAI Press, 2018.
- [9] Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, et al. Multi-agent deep research: Training multi-agent systems with M-GRPO. arXiv preprint arXiv:2511.13288, 2025.
- [10] Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of AI agents. arXiv preprint arXiv:2512.13564, 2025.
- [11] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.
- [12] Bowen Jiang, Zhuoqun Hao, Young Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo Jose Taylor, and Dan Roth. Know me, respond to me: Benchmarking LLMs for dynamic user profiling and personalized responses at scale. In Second Conference on Language Modeling, 2025.
- [13] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O. Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025.
- [14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020.
- [15] Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, and Ruiming Tang. CAM: A constructivist view of agentic memory for LLM-based reading comprehension. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [16] Yongheng Liang, Hejun Wu, Haitao Wang, and Hao Cai. Asynchronous credit assignment for multi-agent reinforcement learning. In IJCAI, pages 170–178. ijcai.org, 2025.
- [17] Junwei Liao, Muning Wen, Jun Wang, and Weinan Zhang. MARFT: Multi-agent reinforcement fine-tuning. CoRR, abs/2504.16129, 2025.
- [18] Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, and Min Chen. Coevolving with the other you: Fine-tuning LLM with sequential cooperative multi-agent reinforcement learning. In NeurIPS, 2024.
- [19] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In ACL (1), pages 13851–13870. Association for Computational Linguistics, 2024.
- [20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee... 2022.
- [21] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. Secom: On memory construction and retrieval for personalized conversational agents. In ICLR. OpenReview.net, 2025.
- [22] Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman E. Ozdaglar, Kaiqing Zhang, and Joo-Kyung Kim. Maporl: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In ACL (1), pages 30215–30248. Association for Computational Linguistics, 2025.
- [23] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In NeurIPS, 2024.
- [24] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024.
- [25] Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, and Igor Mordatch. Multiagent finetuning: Self improvement with diverse reasoning chains. In ICLR. OpenReview.net, 2025.
- [26] Chuanneng Sun, Songjun Huang, and Dario Pompili. LLM-based multi-agent reinforcement learning: Current and future directions. CoRR, abs/2405.11106, 2024.
- [27] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- [28] Wei-Cheng Tseng, Tsun-Hsuan Johnson Wang, Yen-Chen Lin, and Phillip Isola. Offline multi-agent reinforcement learning with knowledge distillation. In NeurIPS, 2022.
- [29] Yu Wang and Xi Chen. MIRIX: Multi-agent memory system for LLM-based agents. CoRR, abs/2507.07957, 2025.
- [30] Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-α: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911, 2025.
- [31] Muning Wen, Jakub Grudzien Kuba, Runji Lin, Weinan Zhang, Ying Wen, Jun Wang, and Yaodong Yang. Multi-agent reinforcement learning is a sequence modeling problem. In NeurIPS, 2022.
- [32] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In ICLR. OpenReview.net, 2025.
- [33] Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Edward W. Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, and Jure Leskovec. Optimas: Optimizing compound AI systems with globally aligned local rewards. In The Fourteenth International Conference on Learning Representations, 2026.
- [34] Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis N. Ioannidis, Karthik Subbian, Jure Leskovec, and James Y. Zou. Avatar: Optimizing LLM agents for tool usage via contrastive reasoning. In NeurIPS, 2024.
- [35] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-Mem: Agentic memory for LLM agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [36] Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. CoMAS: Co-evolving multi-agent systems via interaction rewards. In The Fourteenth International Conference on Learning Representations, 2026.
- [37] BY Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, and Zheng Liu. General agentic memory via deep research. arXiv preprint arXiv:2511.18423, 2025.
- [38] Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. CoRR, abs/2508.19828, 2025.
- [39] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. In The Fourteenth International Conference on Learning Representations, 2026.
- [40] Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, and Jishen Zhao. Stronger together: On-policy reinforcement learning for collaborative LLMs. In The Fourteenth International Conference on Learning Representations, 2026.
- [41] Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of RLHF in large language models part I: PPO. CoRR, abs/2307.04964, 2023.
- [42] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In AAAI, pages 19724–19731. AAAI Press, 2024.
- [43] Meng Zhou, Ziyu Liu, Pengwei Sui, Yixuan Li, and Yuk Ying Chung. Learning implicit credit assignment for cooperative multi-agent reinforcement learning. In NeurIPS, 2020.
- [44] Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. In The Fourteenth International Conference on Learning Representations, 2026.