Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
Pith reviewed 2026-05-16 09:45 UTC · model grok-4.3
The pith
A centralized critic in multi-agent actor-critic training outperforms decentralized critics and Monte Carlo methods for LLM collaboration on long-horizon or sparse-reward tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop Multi-Agent Actor-Critic methods to optimize decentralized LLM collaboration. We propose CoLLM-CC with a centralized critic and CoLLM-DC with decentralized critics. Experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.
What carries the argument
Multi-Agent Actor-Critic (MAAC) framework for LLM collaboration, with CoLLM-CC using a single centralized critic to estimate joint values and reduce variance during decentralized execution.
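To make the distinction concrete, here is a minimal sketch of how the two critic layouts differ in what they see when computing a one-step TD advantage. It assumes a single shared team reward and uses toy stand-in critics; every name (`td_advantage`, `centralized_advantages`, `decentralized_advantages`, the toy critics) is hypothetical and not taken from the paper or its code release.

```python
from typing import Callable, List, Sequence

# All names below are hypothetical and chosen for illustration only.
History = Sequence[str]  # one agent's local history of messages
JointCritic = Callable[[Sequence[History]], float]
LocalCritic = Callable[[History], float]


def td_advantage(reward: float, v_now: float, v_next: float,
                 gamma: float = 0.99, terminal: bool = False) -> float:
    """One-step TD advantage: r + gamma * V(next) - V(now)."""
    bootstrap = 0.0 if terminal else gamma * v_next
    return reward + bootstrap - v_now


def centralized_advantages(critic: JointCritic,
                           joint_now: Sequence[History],
                           joint_next: Sequence[History],
                           reward: float, gamma: float = 0.99,
                           terminal: bool = False) -> List[float]:
    """A single critic scores the joint history; all agents share one advantage."""
    adv = td_advantage(reward, critic(joint_now), critic(joint_next),
                       gamma, terminal)
    return [adv] * len(joint_now)


def decentralized_advantages(critics: Sequence[LocalCritic],
                             joint_now: Sequence[History],
                             joint_next: Sequence[History],
                             reward: float, gamma: float = 0.99,
                             terminal: bool = False) -> List[float]:
    """Each agent's critic sees only that agent's own history."""
    return [td_advantage(reward, c(h_now), c(h_next), gamma, terminal)
            for c, h_now, h_next in zip(critics, joint_now, joint_next)]


# Toy stand-in critics (assumptions, not learned value heads).
def toy_joint_critic(joint: Sequence[History]) -> float:
    return 0.1 * sum(len(h) for h in joint)


def toy_local_critic(history: History) -> float:
    return 0.1 * len(history)
```

In this sketch the centralized variant hands every agent the same advantage signal, while the decentralized variant produces one estimate per agent from partial information, which is where the estimates can start to disagree.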
Load-bearing premise
That LLM collaboration tasks can be reliably cast as multi-agent reinforcement learning problems where the reward functions accurately capture collaboration quality and the environments admit stable actor-critic training.
What would settle it
An experiment on a long-horizon sparse-reward task where CoLLM-DC converges to performance matching or exceeding CoLLM-CC using fewer samples than Monte Carlo methods.
Original abstract
Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues, so we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at https://github.com/OpenMLRL/CoMLRL/releases/tag/v1.3.6.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes two Multi-Agent Actor-Critic (MAAC) approaches for fine-tuning decentralized LLM collaboration: CoLLM-CC (centralized critic) and CoLLM-DC (decentralized critics). It compares these to Monte Carlo methods across writing, coding, and game-playing domains. The claim is that Monte Carlo methods and CoLLM-DC achieve performance comparable to CoLLM-CC in short-horizon, dense-reward settings, but both underperform CoLLM-CC on long-horizon or sparse-reward tasks, with Monte Carlo methods requiring more samples and CoLLM-DC struggling to converge.
Significance. If the empirical findings hold after verification, the work provides useful practical guidance on when centralized critics are necessary for stable training in LLM collaboration tasks cast as MARL. The code release and domain coverage (writing, coding, games) are strengths that could aid reproducibility and extension.
major comments (2)
- [Abstract and §4 (Experiments)] The headline claim that CoLLM-DC 'struggles to converge' on long-horizon/sparse-reward tasks is not isolated from potential misspecification in reward decomposition or per-agent value-function approximation; without an ablation on credit-assignment mechanisms or variance of local signals, it is unclear whether the observed gap is inherent to decentralization or an artifact of the chosen critic architecture and reward shaping.
- [§4] No statistical tests, confidence intervals, hyperparameter settings, or full baseline descriptions are supplied for the reported performance gaps between CoLLM-CC, CoLLM-DC, and Monte Carlo methods; this prevents assessment of whether the underperformance is robust or sensitive to implementation details.
minor comments (1)
- [§3] Notation for the centralized vs. decentralized critic formulations could be clarified with explicit equations distinguishing the critic inputs and loss terms.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, providing the strongest honest defense of our claims while incorporating revisions to improve rigor and clarity.
-
Referee: [Abstract and §4 (Experiments)] The headline claim that CoLLM-DC 'struggles to converge' on long-horizon/sparse-reward tasks is not isolated from potential misspecification in reward decomposition or per-agent value-function approximation; without an ablation on credit-assignment mechanisms or variance of local signals, it is unclear whether the observed gap is inherent to decentralization or an artifact of the chosen critic architecture and reward shaping.
Authors: We agree that isolating the source of convergence difficulties requires explicit analysis of credit assignment and signal variance. In the revised manuscript we have added an ablation comparing alternative reward decompositions and reporting the empirical variance of local value estimates under CoLLM-DC versus the centralized critic. These new results (now in §4 and Appendix C) show substantially higher variance in the decentralized local signals precisely on the long-horizon/sparse-reward tasks, supporting that the observed gap is driven by decentralization rather than solely by the particular critic architecture chosen. The abstract has been updated to reflect this qualification. revision: yes
-
Referee: [§4] No statistical tests, confidence intervals, hyperparameter settings, or full baseline descriptions are supplied for the reported performance gaps between CoLLM-CC, CoLLM-DC, and Monte Carlo methods; this prevents assessment of whether the underperformance is robust or sensitive to implementation details.
Authors: We acknowledge the omission of these details. The revised version now includes paired t-tests with reported p-values for all key performance gaps, 95% confidence intervals on every metric, a complete hyperparameter table in the appendix, and expanded baseline descriptions that specify the exact Monte Carlo implementation, sampling budgets, and reward computation. These additions confirm that the reported underperformance of CoLLM-DC and Monte Carlo methods remains statistically significant and is not sensitive to the tested hyperparameter ranges. revision: yes
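As an illustration of the kind of analysis the rebuttal describes, the sketch below computes a paired t statistic and a 95% confidence interval on the mean per-run difference between two methods. It assumes five matched runs (so 4 degrees of freedom, for which the two-sided 95% critical value of Student's t is 2.776). The function name and the example scores are hypothetical placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

# Two-sided 95% critical value of Student's t for 4 degrees of freedom (5 paired runs).
T_CRIT_95_DF4 = 2.776


def paired_t_and_ci(a: Sequence[float], b: Sequence[float],
                    t_crit: float = T_CRIT_95_DF4
                    ) -> Tuple[float, Tuple[float, float]]:
    """Return (t statistic, confidence interval) for the mean of a[i] - b[i].

    t_crit must match the degrees of freedom len(a) - 1; the default
    is only valid for five paired runs.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    d_bar = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    half_width = t_crit * std_err
    return d_bar / std_err, (d_bar - half_width, d_bar + half_width)


if __name__ == "__main__":
    # Placeholder per-run scores, invented purely to exercise the function.
    method_a = [5.0, 6.0, 7.0, 8.0, 9.0]
    method_b = [4.0, 4.0, 6.0, 6.0, 8.0]
    t_stat, (low, high) = paired_t_and_ci(method_a, method_b)
    print(f"t = {t_stat:.3f}, 95% CI on mean difference = [{low:.3f}, {high:.3f}]")
```

A gap would be called significant at the 5% level here when the interval excludes zero, or equivalently when the absolute t statistic exceeds the critical value.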
Circularity Check
No circularity in empirical MAAC application to LLM collaboration
The paper applies standard multi-agent actor-critic methods (CoLLM-CC with centralized critic, CoLLM-DC with decentralized critics) to LLM collaboration tasks and supports its claims solely through experiments on writing, coding, and game-playing domains. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce any result to the paper's own inputs by construction. Performance differences (e.g., CoLLM-DC struggling on long-horizon/sparse-reward tasks) are reported as empirical observations rather than derived analytically from prior self-referential assumptions.
Forward citations
Cited by 5 Pith papers
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
-
Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.
-
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...
-
Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
-
[2]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., and Hooker, S. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740.
-
[3]
Gemini: A Family of Highly Capable Multimodal Models
Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
-
[4]
Program Synthesis with Large Language Models
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
-
[5]
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609.
-
[6]
Why Do Multi-Agent LLM Systems Fail?
Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657.
-
[7]
Evaluating Large Language Models Trained on Code
Chen, M., Tworek, J., Jun, H., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
-
[8]
Improving retrieval-augmented generation through multi-agent reinforcement learning
Chen, Y., Yan, L., Sun, W., Ma, X., Zhang, Y., Wang, S., Yin, D., Yang, Y., and Mao, J. Improving retrieval-augmented generation through multi-agent reinforcement learning. arXiv preprint arXiv:2501.15228.
-
[9]
Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?
De Witt, C. S., Gupta, T., Makoviichuk, D., Makoviychuk, V., Torr, P. H., Sun, M., and Whiteson, S. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533.
-
[10]
Ding, Z. and Ye, W. TreeGRPO: Tree-advantage GRPO for online RL post-training of diffusion models. arXiv preprint arXiv:2512.08153.
-
[11]
DiLoCo: Distributed Low-Communication Training of Language Models
Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. DiLoCo: Distributed low-communication training of language models. arXiv preprint arXiv:2311.08105.
-
[12]
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
-
[13]
Feng, S., Wang, Z., Goyal, P., Wang, Y., Shi, W., Xia, H., Palangi, H., Zettlemoyer, L., Tsvetkov, Y., Lee, C.-Y., et al. Heterogeneous swarms: Jointly optimizing model roles and weights for multi-LLM systems. arXiv preprint arXiv:2502.04510.
-
[14]
Gradientcoin: A peer-to-peer decentralized large language models
Gao, Y., Song, Z., and Yin, J. Gradientcoin: A peer-to-peer decentralized large language models. arXiv preprint arXiv:2308.10502.
-
[15]
Hong, H., Yin, J., Wang, Y., Liu, J., Chen, Z., Yu, A., Li, J., Ye, Z., Xiao, H., Chen, Y., et al. Multi-agent deep research: Training multi-agent systems with M-GRPO. arXiv preprint arXiv:2511.13288.
-
[16]
Tree Search for LLM Agent Reinforcement Learning
Ji, Y., Ma, Z., Wang, Y., Chen, G., Chu, X., and Wu, L. Tree search for LLM agent reinforcement learning. arXiv preprint arXiv:2509.21240, 2025.
-
[17]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
-
[18]
ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models
Li, Z., Xu, T., Zhang, Y., Yu, Y., Sun, R., and Luo, Z.-Q. ReMax: A simple, effective, and efficient method for aligning large language models. arXiv preprint arXiv:2310.10505.
-
[19]
MARFT: Multi-Agent Reinforcement Fine-Tuning
Liao, J., Wen, M., Wang, J., and Zhang, W. MARFT: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129, 2025.
-
[20]
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
Lifshitz, S., McIlraith, S. A., and Du, Y. Multi-agent verification: Scaling test-time compute with multiple verifiers. arXiv preprint arXiv:2502.20379, 2025.
-
[21]
Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., Liu, M., Tan, C., Shi, W., Lin, M., et al. SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. arXiv preprint arXiv:2506.24119, 2025a. Liu, M., Jiang, L., Liang, Y., Du, S. S., Choi, Y., Althoff, T., and Jaques, N. Chasing moving targ...
-
[22]
Contrasting Centralized and Decentralized Critics in Multi-Agent Reinforcement Learning
Lyu, X., Xiao, Y., Daley, B., and Amato, C. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402.
-
[23]
Learning to Cooperate via Policy Search
Peshkin, L., Kim, K.-E., Meuleau, N., and Kaelbling, L. P. Learning to cooperate via policy search. arXiv preprint cs/0105032.
-
[24]
Qi, M., Zhu, T., Zhang, L., Li, N., and Zhou, W. Towards transparent and incentive-compatible collaboration in decentralized LLM multi-agent systems: A blockchain-driven approach. arXiv preprint arXiv:2509.16736.
-
[25]
Sarkar, B., Xia, W., Liu, C. K., and Sadigh, D. Training language models for social deduction with multi-agent reinforcement learning. arXiv preprint arXiv:2502.06060.
-
[26]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
-
[27]
Errors are Useful Prompts: Instruction Guided Task Programming with Verifier-Assisted Iterative Prompting
Skreta, M., Yoshikawa, N., Arellano-Rubach, S., Ji, Z., Kristensen, L. B., Darvish, K., Aspuru-Guzik, A., Shkurti, F., and Garg, A. Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint arXiv:2303.14100.
-
[28]
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Subramaniam, V., Du, Y., Tenenbaum, J. B., Torralba, A., Li, S., and Mordatch, I. Multiagent finetuning: Self improvement with diverse reasoning chains. arXiv preprint arXiv:2501.05707.
-
[29]
Value-Decomposition Networks For Cooperative Multi-Agent Learning
Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
-
[30]
Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents
Talebirad, Y. and Nadiri, A. Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv preprint arXiv:2306.03314.
-
[31]
Solving math word problems with process- and outcome-based feedback
Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., and Higgins, I. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
-
[32]
Wu, L., Liu, X., Shi, T., Ye, Z., and Song, D. DeServe: Towards affordable offline LLM inference via decentralization. arXiv preprint arXiv:2501.14784, 2025a. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H., White, R. W., Burger, D., and Wang, C. AutoGen: Enabling next-gen LLM applicati...
-
[33]
CollabLLM: From Passive Responders to Active Collaborators
Wu, S., Galley, M., Peng, B., Cheng, H., Li, G., Dou, Y., Cai, W., Zou, J., Leskovec, J., and Gao, J. CollabLLM: From passive responders to active collaborators. arXiv preprint arXiv:2502.00640, 2025b. Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. SWE-agent: Agent-computer interfaces enable automated software en...
-
[34]
AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems
Yang, Y., Chai, H., Shao, S., Song, Y., Qi, S., Rui, R., and Zhang, W. AgentNet: Decentralized evolutionary coordination for LLM-based multi-agent systems. arXiv preprint arXiv:2504.00587, 2025a. Yang, Z., Guo, Z., Huang, Y., Liang, X., Wang, Y., and Tang, J. TreeRPO: Tree relative policy optimization. arXiv preprint arXiv:2506.05183, 2025b. Yao, S., Zh...
-
[35]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837.
-
[36]
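For agent $i$ at time $t$, define $g^k_{i,t} = \rho^k_{i,t} \nabla_{\theta_i} \log \pi_{\theta_i}(a^k_{i,t} \mid h_{i,t}) \, G^k_t$. By linearity of expectation, $$\mathbb{E}_\pi\left[G^k_t \mid h^k_{i,t}\right] = \mathbb{E}_\pi\left[\frac{1}{K} \sum_{\ell=1}^{K} G^\ell_{t+1} \,\middle|\, h^k_{i,t}\right] = \frac{1}{K} \sum_{\ell=1}^{K} \mathbb{E}_\pi\left[G^\ell_{t+1} \mid h^k_{i,t}\right],$$ where $\ell$ denotes the sampling index at $t+1$. At the terminal step $t = H-1$, $G^k_{H-1} = r^k_{H-1}$, and $g^k_{i,H-1}$ is unbiased. We can derive backward that $g^k_{i,t}$...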
-
[37]
Results are averaged over 5 runs
on CoopHE. Bold indicates the best performance. Results are averaged over 5 runs. Table 2 presents the pass@k performance of coding collaboration on CoopHE. Fine-tuning with GRPO and AC yields marginal improvements over the raw model. However, this improvement over the given model is not due to acquired algorithmic knowledge or increased capacity. Instead, ...
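For readers unfamiliar with the pass@k metric mentioned in this excerpt, the sketch below implements the unbiased estimator introduced in "Evaluating Large Language Models Trained on Code" (reference [7]): with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper computes its CoopHE numbers with exactly this estimator is an assumption, and the function name is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no correct sample.
    """
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n, and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem estimates are typically averaged over the benchmark to obtain the reported score.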