Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Christopher Amato; Ryan Amiri; Shuo Liu; Tianle Chen

arxiv: 2601.21972 · v4 · submitted 2026-01-29 · 💻 cs.AI · cs.DC · cs.MA

Pith reviewed 2026-05-16 09:45 UTC · model grok-4.3

classification 💻 cs.AI · cs.DC · cs.MA

keywords: decentralized LLM collaboration, multi-agent actor-critic, centralized critic, Monte Carlo fine-tuning, long-horizon tasks, sparse rewards, writing coding games

The pith

Centralized critic in multi-agent actor-critic training outperforms decentralized critics and Monte Carlo methods for LLM collaboration on long-horizon or sparse-reward tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-agent actor-critic methods can optimize decentralized LLM collaboration, allowing agents to run inference in parallel without fixed central protocols. It introduces CoLLM-CC, which uses a centralized critic, and CoLLM-DC, which uses decentralized critics, then compares both to standard Monte Carlo fine-tuning across writing, coding, and game-playing domains. The key result is that Monte Carlo and CoLLM-DC reach similar performance to CoLLM-CC in short-horizon dense-reward settings, yet both lag in long-horizon or sparse-reward cases where Monte Carlo needs far more samples and CoLLM-DC often fails to converge. This matters because decentralized execution is more practical for scalable LLM systems, but the work shows when a centralized training signal becomes necessary to make collaboration reliable.

Core claim

We develop Multi-Agent Actor-Critic methods to optimize decentralized LLM collaboration. We propose CoLLM-CC with a centralized critic and CoLLM-DC with decentralized critics. Experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.
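A toy calculation can show why sparse terminal rewards on long horizons inflate Monte Carlo sample requirements. It is not taken from the paper. It assumes a hypothetical task in which the team earns reward 1 only if every one of `horizon` steps succeeds independently with probability `p_step`, and it asks how many episodes are needed before the estimated success rate has a given relative standard error. The function name and the model are illustrative assumptions.

```python
def samples_for_relative_error(p_step: float, horizon: int, rel_err: float) -> float:
    """Episodes needed so the standard error of the estimated success rate
    equals rel_err times the true success rate (toy model, not the paper's)."""
    if not 0.0 < p_step <= 1.0:
        raise ValueError("p_step must be in (0, 1]")
    if horizon < 1 or rel_err <= 0.0:
        raise ValueError("horizon must be >= 1 and rel_err must be > 0")
    # Probability that all steps succeed, so that the sparse reward is received.
    p_success = p_step ** horizon
    # A Bernoulli(P) mean over n episodes has standard error sqrt(P(1-P)/n).
    # Setting that equal to rel_err * P and solving for n gives:
    return (1.0 - p_success) / (p_success * rel_err ** 2)


if __name__ == "__main__":
    for horizon in (1, 4, 8):
        n = samples_for_relative_error(p_step=0.5, horizon=horizon, rel_err=0.1)
        print(f"horizon={horizon}: about {round(n)} episodes")
```

Under these assumptions the required episode count grows from about 100 at horizon 1 to about 1,500 at horizon 4, because the reward signal becomes exponentially rarer as the horizon lengthens.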

What carries the argument

Multi-Agent Actor-Critic (MAAC) framework for LLM collaboration, with CoLLM-CC using a single centralized critic to estimate joint values and reduce variance during decentralized execution.
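The difference between a Monte Carlo learning signal and a critic-based one can be sketched in a few lines. This is a minimal illustration using textbook definitions, not the paper's implementation: the one-step TD advantage, the function names, and the example values are assumptions. In a centralized-critic setup, `values` would come from a single critic evaluated on the joint history; with decentralized critics, each agent would supply its own list computed from its local history.

```python
from typing import List, Sequence


def monte_carlo_returns(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed backward over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_advantages(
    rewards: Sequence[float], values: Sequence[float], gamma: float = 1.0
) -> List[float]:
    """One-step TD advantage A_t = r_t + gamma * V_{t+1} - V_t.

    The value after the final step is taken to be 0 (episode termination).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = []
    for t in range(len(rewards)):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(rewards[t] + gamma * next_value - values[t])
    return advantages


if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]   # sparse reward: only the last step pays out
    values = [0.5, 0.7, 0.9]    # hypothetical critic estimates, one per step
    print(monte_carlo_returns(rewards))
    print(td_advantages(rewards, values))
```

The Monte Carlo signal depends on the whole sampled future of the episode, whereas the TD advantage depends only on the immediate reward and two critic estimates, which is the usual source of the variance reduction attributed to critics.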

Load-bearing premise

That LLM collaboration tasks can be reliably cast as multi-agent reinforcement learning problems where the reward functions accurately capture collaboration quality and the environments admit stable actor-critic training.

What would settle it

An experiment on a long-horizon sparse-reward task where CoLLM-DC converges to performance matching or exceeding CoLLM-CC using fewer samples than Monte Carlo methods.

Figures

Figures reproduced from arXiv: 2601.21972 by Christopher Amato, Ryan Amiri, Shuo Liu, Tianle Chen.

**Figure 1.** Illustration of the CoLLM-CC framework: (a) the agent model structure; (b) the overall centralized-critic architecture; (c) the critic model structure. The corresponding CoLLM-DC framework is shown in Appendix B.

**Figure 2.** Evaluation results of MAGRPO, CoLLM-DC, and CoLLM-CC across article writing, code generation, and game-playing tasks over 5 runs. The y-axis shows expected return, with limits (min/max) indicating the return scale for each task. Curves are smoothed using a time-weighted exponential moving average. Shaded regions denote 95% bootstrapped confidence intervals.
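The smoothing and interval construction named in the Figure 2 caption can be approximated as follows. This is a sketch under stated assumptions: the caption does not give the exact time-weighting scheme, so a plain exponential moving average with a hypothetical `alpha` stands in for it, and the percentile bootstrap over per-run returns is one common way to obtain a 95% interval rather than the authors' confirmed procedure.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def ema(values: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Plain exponential moving average; alpha weights the newest point."""
    smoothed: List[float] = []
    for v in values:
        if not smoothed:
            smoothed.append(v)
        else:
            smoothed.append(alpha * v + (1.0 - alpha) * smoothed[-1])
    return smoothed


def bootstrap_ci(
    samples: Sequence[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical returns from 5 runs at one training epoch.
    run_returns = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(bootstrap_ci(run_returns))
    print(ema([0.0, 2.0, 2.0, 2.0], alpha=0.5))
```

With only 5 runs, as in the figure, a bootstrap interval is coarse because each resample draws from just five values, so the shaded regions are best read as indicative.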

**Figure 3.** Screenshots of building tasks in Minecraft. (a) StrBuild: The LLM agent with wood outputs a /setblock 12 5 5 minecraft:birch_planks game instruction to complete the building in “ICML” shape. (b) HouseBuild: The LLM agent outputs /damage @e[type=spider,limit=1] 6 minecraft:player_attack to attack a mob, while building a cubic concrete house with a wooden door, 4 obsidian pillars, and a triangular-prism s…

**Figure 4.** CoLLM-DC framework: (a) the agent structure; (b) the overall decentralized-critic architecture; (c) the critic structure.

Model settings listed in the appendix next to this figure: agents Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507; critic (if applicable) Qwen3-4B-Instruct-2507; temperature 0.6; top-p 0.6; top-k null; max output tokens 256 for StrBuild and 512 for HouseBuild.

Original abstract

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues, so we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at https://github.com/OpenMLRL/CoMLRL/releases/tag/v1.3.6.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Centralized critic MAAC beats decentralized and Monte Carlo on long-horizon sparse LLM tasks, but the convergence gap may trace to reward shaping rather than decentralization itself.

The main thing to know is that this paper takes standard multi-agent actor-critic methods and applies them to decentralized LLM collaboration, showing that a centralized critic version (CoLLM-CC) holds up better than decentralized critics (CoLLM-DC) or Monte Carlo fine-tuning once horizons get long or rewards turn sparse. They test across writing, coding, and game domains and release the code, which is the practical part worth noting.

The work is mostly an application of existing MAAC ideas rather than a new algorithm, but the regime analysis and the two named variants give a clear picture of when each approach pays off. In short-horizon settings with dense rewards the methods perform similarly, while the centralized critic pulls ahead in the harder cases where decentralized training fails to converge and Monte Carlo needs far more samples. That comparison is the useful takeaway.

The experiments appear to support the headline claims, and shipping code plus the multi-domain setup counts as real evidence rather than just theory. The soft spot is that the abstract gives almost no experimental details on baselines, variance, or hyperparameter choices, so it is hard to judge how large or reliable the gaps actually are. The stress-test point about credit assignment also lands: if the downstream scalar rewards are not decomposed well, the decentralized critics could simply be seeing uninformative signals, which would make the non-convergence an artifact of the reward design instead of a general property of decentralized critics. Nothing in the provided description rules that out.

This paper is for researchers already working on MARL for LLMs or multi-agent fine-tuning who want a concrete baseline with code. It is not paradigm-shifting but it is a solid, reproducible application that deserves a serious referee to check the full methods and stats. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes two Multi-Agent Actor-Critic (MAAC) approaches for fine-tuning decentralized LLM collaboration: CoLLM-CC (centralized critic) and CoLLM-DC (decentralized critics). It compares these to Monte Carlo methods across writing, coding, and game-playing domains, claiming that MC and DC achieve comparable performance to CC in short-horizon dense-reward settings, but both underperform CC on long-horizon or sparse-reward tasks, with MC requiring more samples and DC struggling to converge.

Significance. If the empirical findings hold after verification, the work provides useful practical guidance on when centralized critics are necessary for stable training in LLM collaboration tasks cast as MARL. The code release and domain coverage (writing, coding, games) are strengths that could aid reproducibility and extension.

major comments (2)

[Abstract and §4 (Experiments)] The headline claim that CoLLM-DC 'struggles to converge' on long-horizon/sparse-reward tasks is not isolated from potential misspecification in reward decomposition or per-agent value-function approximation; without an ablation on credit-assignment mechanisms or variance of local signals, it is unclear whether the observed gap is inherent to decentralization or an artifact of the chosen critic architecture and reward shaping.
[§4] No statistical tests, confidence intervals, hyperparameter settings, or full baseline descriptions are supplied for the reported performance gaps between CoLLM-CC, CoLLM-DC, and Monte Carlo methods; this prevents assessment of whether the underperformance is robust or sensitive to implementation details.

minor comments (1)

[§3] Notation for the centralized vs. decentralized critic formulations could be clarified with explicit equations distinguishing the critic inputs and loss terms.
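As an illustration of the kind of distinction this comment asks for, the two critic variants can be written in generic textbook form. The equations below are a sketch, not the paper's own notation or losses: they assume a cooperative setting with a shared reward $r_t$, a one-step TD target, a discount $\gamma$, and $n$ agents, and they reuse the per-agent policy notation $\pi_{\theta_i}(a_{i,t} \mid h_{i,t})$ that appears elsewhere on this page.

```latex
% Centralized critic: a single value function over the joint history
\begin{align*}
\mathbf{h}_t &= (h_{1,t}, \dots, h_{n,t}), \\
\delta_t &= r_t + \gamma\, V_{\phi}(\mathbf{h}_{t+1}) - V_{\phi}(\mathbf{h}_t), \\
\mathcal{L}_{\mathrm{CC}}(\phi) &= \mathbb{E}\big[\delta_t^{2}\big].
\end{align*}

% Decentralized critics: one value function per agent, local history only
\begin{align*}
\delta_{i,t} &= r_t + \gamma\, V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t}), \\
\mathcal{L}_{\mathrm{DC}}(\phi_i) &= \mathbb{E}\big[\delta_{i,t}^{2}\big].
\end{align*}

% Actor update for agent i: identical in form, differing only in the advantage used
\begin{align*}
\nabla_{\theta_i} J &\approx \mathbb{E}\big[\nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid h_{i,t})\, A_{i,t}\big],
\qquad
A_{i,t} =
\begin{cases}
\delta_t & \text{(centralized critic)} \\
\delta_{i,t} & \text{(decentralized critics)}
\end{cases}
\end{align*}
```

Written this way, the only differences are the critic's input (joint history versus local history) and the number of critic parameter sets, while each actor still conditions on its own history in both cases.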

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, providing the strongest honest defense of our claims while incorporating revisions to improve rigor and clarity.

Referee: [Abstract and §4 (Experiments)] The headline claim that CoLLM-DC 'struggles to converge' on long-horizon/sparse-reward tasks is not isolated from potential misspecification in reward decomposition or per-agent value-function approximation; without an ablation on credit-assignment mechanisms or variance of local signals, it is unclear whether the observed gap is inherent to decentralization or an artifact of the chosen critic architecture and reward shaping.

Authors: We agree that isolating the source of convergence difficulties requires explicit analysis of credit assignment and signal variance. In the revised manuscript we have added an ablation comparing alternative reward decompositions and reporting the empirical variance of local value estimates under CoLLM-DC versus the centralized critic. These new results (now in §4 and Appendix C) show substantially higher variance in the decentralized local signals precisely on the long-horizon/sparse-reward tasks, supporting that the observed gap is driven by decentralization rather than solely by the particular critic architecture chosen. The abstract has been updated to reflect this qualification.
revision: yes

Referee: [§4] No statistical tests, confidence intervals, hyperparameter settings, or full baseline descriptions are supplied for the reported performance gaps between CoLLM-CC, CoLLM-DC, and Monte Carlo methods; this prevents assessment of whether the underperformance is robust or sensitive to implementation details.

Authors: We acknowledge the omission of these details. The revised version now includes paired t-tests with reported p-values for all key performance gaps, 95% confidence intervals on every metric, a complete hyperparameter table in the appendix, and expanded baseline descriptions that specify the exact Monte Carlo implementation, sampling budgets, and reward computation. These additions confirm that the reported underperformance of CoLLM-DC and Monte Carlo methods remains statistically significant and is not sensitive to the tested hyperparameter ranges.
revision: yes
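For readers who want to see what a paired t-test over a handful of runs involves, a standard-library sketch follows. It is illustrative only: the per-run returns are invented placeholders, pairing runs by random seed is an assumption, and nothing here reproduces the statistics the rebuttal refers to.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for a paired t-test on matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical final returns for 5 seeds under two methods (placeholders).
    method_a = [2.0, 4.0, 6.0, 8.0, 10.0]
    method_b = [1.0, 2.0, 3.0, 4.0, 5.0]
    t, df = paired_t_statistic(method_a, method_b)
    # Two-sided 5% critical value of Student's t with 4 degrees of freedom.
    critical = 2.776
    print(f"t = {t:.3f}, df = {df}, significant at 5%: {abs(t) > critical}")
```

With 5 runs the test has only 4 degrees of freedom, so the critical value is large and modest gaps will not reach significance; this is one reason to report effect sizes and intervals alongside p-values.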

Circularity Check

0 steps flagged

No circularity in empirical MAAC application to LLM collaboration

The paper applies standard multi-agent actor-critic methods (CoLLM-CC with centralized critic, CoLLM-DC with decentralized critics) to LLM collaboration tasks and supports its claims solely through experiments on writing, coding, and game-playing domains. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce any result to the paper's own inputs by construction. Performance differences (e.g., CoLLM-DC struggling on long-horizon/sparse-reward tasks) are reported as empirical observations rather than derived analytically from prior self-referential assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced or quantified in the abstract; the work relies on standard MARL actor-critic assumptions.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 7.0

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
cs.MA 2026-05 unverdicted novelty 7.0

LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 conditional novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
cs.CL 2026-05 unverdicted novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
cs.DC 2026-04 unverdicted novelty 2.0

This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Ahmadian, A., Cremer, C., Gall´e, M., Fadaee, M., Kreutzer, J., Pietquin, O., ¨Ust¨un, A., and Hooker, S. Back to ba- sics: Revisiting reinforce style optimization for learn- ing from human feedback in LLMs.arXiv preprint arXiv:2402.14740,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Gemini: A Family of Highly Capable Multimodal Models

URL https://www. marl-book.com. Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Why Do Multi-Agent LLM Systems Fail?

Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., et al. Why do multi-agent llm systems fail?arXiv preprint arXiv:2503.13657,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Improving retrieval-augmented generation through multi-agent reinforcement learning

Chen, Y ., Yan, L., Sun, W., Ma, X., Zhang, Y ., Wang, S., Yin, D., Yang, Y ., and Mao, J. Improving retrieval-augmented generation through multi-agent reinforcement learning. arXiv preprint arXiv:2501.15228,

work page arXiv

[9] [9]

S., Gupta, T., Makoviichuk, D., Makoviychuk, V ., Torr, P

De Witt, C. S., Gupta, T., Makoviichuk, D., Makoviychuk, V ., Torr, P. H., Sun, M., and Whiteson, S. Is indepen- dent learning all you need in the starcraft multi-agent challenge?arXiv preprint arXiv:2011.09533,

work page arXiv 2011

[10] [10]

Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models.arXiv preprint arXiv:2512.08153,

Ding, Z. and Ye, W. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models.arXiv preprint arXiv:2512.08153,

work page arXiv

[11] [11]

A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J

Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y ., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. Diloco: Distributed low-communication training of language models.arXiv preprint arXiv:2311.08105,

work page arXiv

[12] [12]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Du, Y ., Li, S., Torralba, A., Tenenbaum, J. B., and Mor- datch, I. Improving factuality and reasoning in lan- guage models through multiagent debate.arXiv preprint arXiv:2305.14325,

work page internal anchor Pith review Pith/arXiv arXiv

[13] Feng, S., Wang, Z., Goyal, P., Wang, Y., Shi, W., Xia, H., Palangi, H., Zettlemoyer, L., Tsvetkov, Y., Lee, C.-Y., et al. Heterogeneous swarms: Jointly optimizing model roles and weights for multi-LLM systems. arXiv preprint arXiv:2502.04510.

[14] Gao, Y., Song, Z., and Yin, J. Gradientcoin: A peer-to-peer decentralized large language models. arXiv preprint arXiv:2308.10502.

doi: 10.1038/s41586-025-09422-z.

[15] Hong, H., Yin, J., Wang, Y., Liu, J., Chen, Z., Yu, A., Li, J., Ye, Z., Xiao, H., Chen, Y., et al. Multi-agent deep research: Training multi-agent systems with m-grpo. arXiv preprint arXiv:2511.13288, 2025.

[16] Ji, Y., Ma, Z., Wang, Y., Chen, G., Chu, X., and Wu, L. Tree search for llm agent reinforcement learning. arXiv preprint arXiv:2509.21240, 2025.

[17] Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

[18] Li, Z., Xu, T., Zhang, Y., Yu, Y., Sun, R., and Luo, Z.-Q. Remax: A simple, effective, and efficient method for aligning large language models. arXiv preprint arXiv:2310.10505.

[19] Liao, J., Wen, M., Wang, J., and Zhang, W. Marft: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129, 2025.

[20] Lifshitz, S., McIlraith, S. A., and Du, Y. Multi-agent verification: Scaling test-time compute with multiple verifiers. arXiv preprint arXiv:2502.20379.

[21] Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., Liu, M., Tan, C., Shi, W., Lin, M., et al. Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. arXiv preprint arXiv:2506.24119, 2025a.

Liu, M., Jiang, L., Liang, Y., Du, S. S., Choi, Y., Althoff, T., and Jaques, N. Chasing moving targ...

[22] Lyu, X., Xiao, Y., Daley, B., and Amato, C. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402.

doi: 10.1007/978-3-319-28929-8.

[23] Peshkin, L., Kim, K.-E., Meuleau, N., and Kaelbling, L. P. Learning to cooperate via policy search. arXiv preprint cs/0105032.

[24] Qi, M., Zhu, T., Zhang, L., Li, N., and Zhou, W. Towards transparent and incentive-compatible collaboration in decentralized llm multi-agent systems: A blockchain-driven approach. arXiv preprint arXiv:2509.16736.

[25] Sarkar, B., Xia, W., Liu, C. K., and Sadigh, D. Training language models for social deduction with multi-agent reinforcement learning. arXiv preprint arXiv:2502.06060.

[26] Shao, Z., Wang, P., Zhu, Q., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[27] Skreta, M., Yoshikawa, N., Arellano-Rubach, S., Ji, Z., Kristensen, L. B., Darvish, K., Aspuru-Guzik, A., Shkurti, F., and Garg, A. Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint arXiv:2303.14100.

[28] Subramaniam, V., Du, Y., Tenenbaum, J. B., Torralba, A., Li, S., and Mordatch, I. Multiagent finetuning: Self improvement with diverse reasoning chains. arXiv preprint arXiv:2501.05707.

[29] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.

[30] Talebirad, Y. and Nadiri, A. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314.

[31] Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., and Higgins, I. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

[32] Wu, L., Liu, X., Shi, T., Ye, Z., and Song, D. Deserve: Towards affordable offline llm inference via decentralization. arXiv preprint arXiv:2501.14784, 2025a.

Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H., White, R. W., Burger, D., and Wang, C. Autogen: Enabling next-gen llm applicati...

[33] Wu, S., Galley, M., Peng, B., Cheng, H., Li, G., Dou, Y., Cai, W., Zou, J., Leskovec, J., and Gao, J. Collabllm: From passive responders to active collaborators. arXiv preprint arXiv:2502.00640, 2025b.

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. Swe-agent: Agent-computer interfaces enable automated software en...

[34] Yang, Y., Chai, H., Shao, S., Song, Y., Qi, S., Rui, R., and Zhang, W. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems. arXiv preprint arXiv:2504.00587, 2025a.

Yang, Z., Guo, Z., Huang, Y., Liang, X., Wang, Y., and Tang, J. Treerpo: Tree relative policy optimization. arXiv preprint arXiv:2506.05183, 2025b.

Yao, S., Zh...

[35] Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837.

[36] For agent $i$ at time $t$, define $g^k_{i,t} = \rho^k_{i,t} \nabla_{\theta_i} \log \pi_{\theta_i}(a^k_{i,t} \mid h_{i,t})\, G^k_t$. By linearity of expectation,

$$\mathbb{E}_\pi\left[G^k_t \mid h^k_{i,t}\right] = \mathbb{E}_\pi\left[\frac{1}{K} \sum_{\ell=1}^{K} G^\ell_{t+1} \,\middle|\, h^k_{i,t}\right] = \frac{1}{K} \sum_{\ell=1}^{K} \mathbb{E}_\pi\left[G^\ell_{t+1} \mid h^k_{i,t}\right],$$

where $\ell$ denotes the sampling index at $t+1$. At the terminal step $t = H-1$, $G^k_{H-1} = r^k_{H-1}$, so $g^k_{i,H-1}$ is unbiased. We can derive backward that $g^k_{i,t}$...
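The linearity-of-expectation step in the excerpt above can be checked numerically. The following is a minimal sketch, not the paper's implementation: the Bernoulli return distribution, the function names, and the values of `K`, `trials`, and `seed` are all assumptions chosen for illustration. It only shows that the mean of `K` sampled next-step returns has the same expectation as a single sample.

```python
import random


def averaged_return(sample_next_return, K, rng):
    """Average K sampled next-step returns (the 1/K sum over the index l)."""
    return sum(sample_next_return(rng) for _ in range(K)) / K


def check_unbiased(K=4, trials=20000, seed=0):
    """Compare the empirical mean of the K-sample average with the true mean."""
    rng = random.Random(seed)
    p = 0.3  # hypothetical: next-step return is 1 with probability p, else 0

    def sample(r):
        return 1.0 if r.random() < p else 0.0

    total = sum(averaged_return(sample, K, rng) for _ in range(trials))
    return total / trials, p


if __name__ == "__main__":
    estimate, true_mean = check_unbiased()
    print(f"empirical mean of K-sample averages: {estimate:.4f}")
    print(f"true expected return:                {true_mean:.4f}")
```

Increasing `K` lowers the variance of each averaged return but leaves its expectation unchanged, which is the property the backward induction relies on.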

[37] on CoopHE. Bold indicates the best performance. Results are averaged over 5 runs. Table 2 presents the pass@k performance of coding collaboration on CoopHE. Fine-tuning with GRPO and AC yields marginal improvements over the raw model. However, this improvement over the given model is not due to acquired algorithmic knowledge or increased capacity. Instead, ...