pith. machine review for the scientific record.

arxiv: 2604.17693 · v2 · submitted 2026-04-20 · 💻 cs.LG · cs.AI · cs.MA

Recognition: no theorem link

COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.MA
keywords multi-agent reinforcement learning · credit assignment · counterfactual advantage · policy gradient · sequential teams · ridge regression · cooperative learning

The pith

COSAC assigns each agent in a sequential cooperative team its own credit signal from a shared reward by fitting an additive decomposition with ridge regression and estimating advantages from fictitious policy continuations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When agents must act in sequence and receive only a single team reward, credit assignment is hard because one agent's contribution is entangled with everyone else's. COSAC recovers each agent's share of the reward by solving one ridge regression over the collected rollouts. It then builds each agent's advantage estimate by running forward passes of the current policy in place of real environment steps, avoiding both importance sampling and extra environment or reward calls. The resulting estimator comes with bias and variance bounds that stay controlled as the number of agents grows, and experiments on sequential bandits and a four-agent reasoning task show lower advantage estimation error and faster learning than prior critic-free approaches.

Core claim

COSAC is a critic-free per-agent policy gradient for sequential cooperative teams. It fits an additive per-agent decomposition of the team reward by a single ridge regression on the rollout batch and computes each agent's counterfactual advantage from fictitious continuations of the current policy. The estimator realizes the Sequential Aristocrat Utility and comes with proofs that its bias and variance bounds remain controlled as team size increases.

What carries the argument

The Sequential Aristocrat Utility (SeqAU), realized through a single ridge regression that recovers per-agent reward shares, plus fictitious policy forward passes that generate counterfactual advantages without extra environment interactions or importance sampling.
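
A rough sketch of that machinery, under purely illustrative assumptions (NumPy, a bandit-style environment, and names invented here, not taken from the paper): fit an additive per-agent decomposition of the shared reward by ridge regression on a rollout batch, then form each agent's counterfactual advantage against its expected credit under the current policy, computed from policy probabilities alone rather than from new environment or reward calls.

```python
# Illustrative sketch only; feature choices, names, and the bandit-style
# setup are assumptions, not the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)
K, n_actions, batch = 4, 3, 512          # agents, actions per agent, rollouts

def one_hot(actions):
    """Per-agent one-hot action features, one row per rollout."""
    X = np.zeros((len(actions), K * n_actions))
    for i, a in enumerate(actions):
        for k in range(K):
            X[i, k * n_actions + a[k]] = 1.0
    return X

# --- collect a rollout batch from the current (here: uniform) policies ---
A = rng.integers(0, n_actions, size=(batch, K))      # joint actions
true_w = rng.normal(size=(K, n_actions))              # hidden per-agent effects
R = np.array([sum(true_w[k, a[k]] for k in range(K)) for a in A])
R += 0.1 * rng.normal(size=batch)                     # shared team reward

# --- (1) ridge regression: additive per-agent decomposition of R ---
X = one_hot(A)
lam = 1e-2
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ R)
w_hat = w_hat.reshape(K, n_actions)                   # credit per (agent, action)

# --- (2) counterfactual advantage from fictitious continuations ---
# Compare each agent's realized credit against its expected credit under the
# current policy; here the expectation is exact, but with richer policies it
# would be estimated by policy forward passes, still with no new environment
# or reward calls.
policy = np.full((K, n_actions), 1.0 / n_actions)     # current per-agent policies
baseline = (policy * w_hat).sum(axis=1)               # E_{a'_k ~ pi_k}[w_hat[k, a'_k]]
adv = w_hat[np.arange(K)[None, :], A] - baseline[None, :]   # (batch, K)

print("per-agent advantage means:", adv.mean(axis=0).round(3))
```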

If this is right

  • Each agent's policy gradient becomes independent of noise from teammates' actions.
  • Bias and variance of the per-agent credits stay bounded even as team size reaches 16.
  • Advantage estimation error is lower than other critic-free methods on sequential bandit tasks.
  • Learning regret remains consistently low across increasing team sizes in controlled studies.
  • Faster convergence occurs on collaborative reasoning tasks with multiple language-model agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce reliance on centralized critics in other large multi-agent systems where actions are ordered.
  • Checking how well the additive decomposition holds on new domains would directly test when the method remains reliable.
  • The fictitious-continuation technique might combine with other decomposition methods to handle non-additive rewards approximately.

Load-bearing premise

The shared team reward admits an additive decomposition into per-agent contributions that ridge regression on finite rollout batches can recover accurately enough to avoid bias in policy learning.
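
One hedged way to probe this premise, sketched below on a synthetic setup (the interaction term, features, and all names are invented for illustration): add a controlled non-additive coupling between adjacent agents and watch how the held-out residual of a purely additive ridge fit grows with it.

```python
# Hypothetical check of the load-bearing premise: how well a purely additive
# ridge fit explains a team reward as a non-additive coupling grows.
import numpy as np

rng = np.random.default_rng(1)
K, n_actions, n_fit, n_holdout = 4, 3, 512, 128
w = np.linspace(-1.0, 1.0, K * n_actions).reshape(K, n_actions)   # additive part

def features(A):
    X = np.zeros((len(A), K * n_actions))
    for i, a in enumerate(A):
        X[i, np.arange(K) * n_actions + a] = 1.0
    return X

def team_reward(A, interaction):
    additive = w[np.arange(K)[None, :], A].sum(axis=1)
    pairwise = (A[:, :-1] == A[:, 1:]).sum(axis=1)      # toy sequential coupling
    return additive + interaction * pairwise

for interaction in (0.0, 0.5, 1.0):
    A_fit = rng.integers(0, n_actions, size=(n_fit, K))
    A_out = rng.integers(0, n_actions, size=(n_holdout, K))
    X_fit, X_out = features(A_fit), features(A_out)
    R_fit, R_out = team_reward(A_fit, interaction), team_reward(A_out, interaction)
    theta = np.linalg.solve(X_fit.T @ X_fit + 1e-2 * np.eye(X_fit.shape[1]),
                            X_fit.T @ R_fit)
    rms = np.sqrt(((R_out - X_out @ theta) ** 2).mean())
    print(f"interaction={interaction:.1f}  held-out residual RMS={rms:.3f}")
```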

What would settle it

If advantage mean squared error or learning regret in the sequential bandit study grows with team size in the same way as importance-sampling baselines, the controlled-bounds claim would be contradicted.
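
A minimal version of that comparison might look like the sketch below, run on a synthetic sequential bandit rather than the paper's benchmark (all names and scales are assumptions): track per-agent advantage MSE as K grows for ridge-decomposed credits versus a naive shared-reward credit that exposes every agent to teammate noise.

```python
# Illustrative scaling check, not the paper's benchmark: per-agent advantage
# MSE as the team grows, for ridge-decomposed credits versus a shared-reward
# credit that couples every agent's signal to teammate noise.
import numpy as np

rng = np.random.default_rng(3)
n_actions, batch = 3, 512

def run(K):
    w = rng.normal(size=(K, n_actions))                 # true per-agent effects
    A = rng.integers(0, n_actions, size=(batch, K))
    per_agent = w[np.arange(K)[None, :], A]             # (batch, K)
    R = per_agent.sum(axis=1) + 0.1 * rng.normal(size=batch)

    # ground-truth advantage under a uniform policy
    true_adv = per_agent - w.mean(axis=1)[None, :]

    # ridge-decomposed credit
    X = np.zeros((batch, K * n_actions))
    X[np.arange(batch)[:, None], np.arange(K)[None, :] * n_actions + A] = 1.0
    w_hat = np.linalg.solve(X.T @ X + 1e-2 * np.eye(K * n_actions), X.T @ R)
    w_hat = w_hat.reshape(K, n_actions)
    ridge_adv = w_hat[np.arange(K)[None, :], A] - w_hat.mean(axis=1)[None, :]

    # shared-reward credit: every agent gets the centered team reward
    shared_adv = np.repeat((R - R.mean())[:, None], K, axis=1)

    mse = lambda est: ((est - true_adv) ** 2).mean()
    return mse(ridge_adv), mse(shared_adv)

for K in (2, 4, 8, 16):
    m_ridge, m_shared = run(K)
    print(f"K={K:2d}  ridge-credit MSE={m_ridge:.3f}  shared-reward MSE={m_shared:.3f}")
```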

Figures

Figures reproduced from arXiv: 2604.17693 by Jayakumar Subramanian, Nikos Vlassis, Raghavendra Addanki, Shripad Deshmukh.

Figure 1
Figure 1. Figure 1: Per-agent advantage MSE (log scale) as a function of the number of agents K, averaged over agents and 30 seeds. Absolute values rise with K because the team-reward scale itself grows with K; the relevant readings are the ratios between methods at each K. [Plot: advantage estimation quality vs interaction strength (K = 4); axes: interaction strength vs mean advantage MSE; methods: CAPO, C3, MA-GR…] view at source ↗
Figure 3
Figure 3. Figure 3: CAPO (orange) versus the direct-effect-only ablation CAPO-Direct (green): AUC regret as [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Per-agent advantage MSE at K = 16, broken down by agent index. view at source ↗
Figure 3
Figure 3. Figure 3: CAPO-Direct is at least as good at ρ = 0, and CAPO pulls ahead cleanly as ρ grows. [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-cell normalized regret AUC as a function of interaction strength [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Normalized regret as a function of policy iteration, across team sizes [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Regret versus cumulative real-environment calls for CAPO (orange) and CAPO-Direct [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Regret versus cumulative real-environment calls for CAPO (orange) and CAPO-Direct [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Regret versus cumulative real-environment calls for CAPO (orange) and CAPO-Direct [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Regret versus cumulative real-environment calls for CAPO (orange) and CAPO-Direct [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Regret versus cumulative real-environment calls for CAPO (orange) and CAPO-Direct [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
read the original abstract

In cooperative teams where agents act in a fixed order and share a single team-level reward (multi-agent language systems, sequential robotic tasks), per-agent credit assignment is under-determined. Critic-based approaches scale poorly as the number of agents grows owing to the costly maintenance of joint/factored critic(s), whereas the existing critic-free alternatives have other issues: common credit across agents that couples every agent's signal to teammate noise, importance-sampling corrections for upstream-update staleness that incur variance exponential in team size, or per-agent counterfactual replay that isolates each agent's effect at the price of extra environment or reward calls. We propose COSAC, a critic-free per-agent policy gradient for sequential cooperative teams. COSAC fits an additive per-agent decomposition of the team reward by a single ridge regression on the rollout batch (giving each agent a learning signal decoupled from teammate noise), and computes each agent's counterfactual advantage from fictitious continuations of the current policy (policy forward passes that replace both importance-sampling reweighting and per-agent environment replay, at no extra environment or reward cost). The estimator instantiates the Sequential Aristocrat Utility (SeqAU), our extension of Wolpert and Tumer's (2001) aristocrat utility to sequential teams. We prove bias and variance bounds on SeqAU credits that stay controlled as the team grows. Our controlled study on sequential bandits demonstrates that COSAC attains the lowest advantage MSE and consistently low learning regret across team sizes up to $K = 16$. On the AI2 Reasoning Challenge (ARC) task, where four Qwen3-0.6B agents reason in turn about a grade-school science question, COSAC attains faster convergence than the other critic-free baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes COSAC, a critic-free per-agent policy gradient method for sequential cooperative multi-agent teams that share a team-level reward. COSAC fits an additive per-agent reward decomposition via a single ridge regression on the rollout batch and computes each agent's counterfactual advantage (instantiating the Sequential Aristocrat Utility, or SeqAU) from fictitious continuations of the current policy. It proves bias and variance bounds on the SeqAU credits that remain controlled as the number of agents grows, and reports empirical results showing lower advantage MSE and faster convergence than critic-free baselines on sequential bandits (up to K=16) and a four-agent Qwen3-0.6B setup on the ARC task.

Significance. If the bias/variance bounds hold and the ridge-regression decomposition introduces negligible error, COSAC would offer a scalable, low-overhead alternative to joint-critic methods for credit assignment in ordered cooperative settings such as multi-agent language-model reasoning or sequential robotics. The use of fictitious policy continuations avoids both importance-sampling variance and extra environment calls, and the controlled scaling with team size addresses a known limitation of existing critic-free approaches. The ARC experiment provides concrete evidence of applicability to LLM-based agents.

major comments (2)
  1. [Theoretical analysis (bias/variance bounds)] Proof of bias and variance bounds on SeqAU credits: the stated guarantees that bounds remain controlled as team size grows rest on the assumption that the single ridge regression on the finite on-policy rollout batch recovers an additive decomposition whose per-agent terms yield unbiased counterfactual advantages. No misspecification analysis or sensitivity result is provided for the case in which the chosen regressors fail to span the true marginal contributions under temporal action dependence; this assumption is load-bearing for the central theoretical claim.
  2. [Empirical study (sequential bandits)] Experimental validation on sequential bandits: the reported lowest advantage MSE and low learning regret across K up to 16 rely on the fitted decomposition being accurate, yet the manuscript does not specify the exact feature set used in the ridge regression, the batch size relative to team size, or any controls for post-hoc selection effects when evaluating the estimator on the same rollout data used for fitting.
minor comments (1)
  1. [Method description] Notation for the fictitious continuation operator and the exact form of the ridge-regression objective could be clarified with an explicit equation reference to avoid ambiguity when readers compare to standard counterfactual estimators.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of COSAC as a scalable critic-free approach. We address each major comment below with clarifications and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Theoretical analysis (bias/variance bounds)] Proof of bias and variance bounds on SeqAU credits: the stated guarantees that bounds remain controlled as team size grows rest on the assumption that the single ridge regression on the finite on-policy rollout batch recovers an additive decomposition whose per-agent terms yield unbiased counterfactual advantages. No misspecification analysis or sensitivity result is provided for the case in which the chosen regressors fail to span the true marginal contributions under temporal action dependence; this assumption is load-bearing for the central theoretical claim.

    Authors: The bias and variance bounds for the SeqAU credits are formally derived conditional on the ridge regression recovering an additive per-agent decomposition of the team reward. Under this condition, the proof demonstrates that the counterfactual advantages obtained from fictitious policy continuations have bias and variance that remain bounded independently of team size K. The manuscript does not include a dedicated misspecification or sensitivity analysis for the case where the chosen regressors (linear features over actions and timesteps) fail to span the true marginal contributions in the presence of strong temporal action dependence. We view this as a valid point and will add a new subsection in the theoretical analysis that explicitly states the assumption, discusses its implications, and notes that the empirical advantage MSE results provide supporting evidence for the approximation quality in the evaluated settings. A full non-asymptotic sensitivity analysis would require additional technical development beyond the current scope. revision: partial

  2. Referee: [Empirical study (sequential bandits)] Experimental validation on sequential bandits: the reported lowest advantage MSE and low learning regret across K up to 16 rely on the fitted decomposition being accurate, yet the manuscript does not specify the exact feature set used in the ridge regression, the batch size relative to team size, or any controls for post-hoc selection effects when evaluating the estimator on the same rollout data used for fitting.

    Authors: We agree that these implementation details are essential for reproducibility and for addressing concerns about estimator accuracy. In the sequential bandits experiments the ridge regression uses a feature matrix whose columns are one-hot encodings of each agent's action together with the current timestep index (feature dimension linear in K). Each policy update is performed on a batch of 512 on-policy trajectories, with the decomposition fitted on this batch and advantage estimates subsequently evaluated on a disjoint held-out set of 128 trajectories to avoid post-hoc selection bias. These specifications, along with pseudocode for the ridge-regression step, will be added to the experimental section and appendix of the revised manuscript. revision: yes
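
A minimal sketch of that fit/held-out protocol as the simulated response above describes it (one-hot per-agent action features plus a timestep column, 512 trajectories for fitting, a disjoint 128-trajectory held-out set); the toy environment, reward model, and ridge penalty are assumptions, not values taken from the paper.

```python
# Sketch of the fit/held-out protocol from the simulated rebuttal; the
# environment and reward model below are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
K, n_actions, T = 4, 3, 5                      # agents, actions, timesteps
n_fit, n_holdout = 512, 128                    # batch sizes from the rebuttal
w_true = rng.normal(size=(K, n_actions))       # hidden per-agent reward effects

def design_matrix(actions, timesteps):
    """One-hot (agent, action) columns plus a single timestep-index column."""
    X = np.zeros((len(actions), K * n_actions + 1))
    for i, a in enumerate(actions):
        X[i, np.arange(K) * n_actions + a] = 1.0
        X[i, -1] = timesteps[i]
    return X

def rollout(n):
    acts = rng.integers(0, n_actions, size=(n, K))
    ts = rng.integers(0, T, size=n)
    rew = (w_true[np.arange(K)[None, :], acts].sum(axis=1)
           + 0.05 * ts + 0.1 * rng.normal(size=n))
    return acts, ts, rew

acts_f, ts_f, R_f = rollout(n_fit)             # batch used to fit the decomposition
acts_h, ts_h, R_h = rollout(n_holdout)         # disjoint held-out batch

X_f = design_matrix(acts_f, ts_f)
theta = np.linalg.solve(X_f.T @ X_f + 1e-2 * np.eye(X_f.shape[1]), X_f.T @ R_f)

X_h = design_matrix(acts_h, ts_h)
holdout_mse = float(((R_h - X_h @ theta) ** 2).mean())
print(f"held-out reward-reconstruction MSE: {holdout_mse:.4f}")
```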

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper proposes COSAC as an algorithmic estimator: it explicitly fits an additive per-agent reward decomposition via ridge regression on the observed rollout batch, then computes counterfactual advantages using fictitious policy continuations. This fitting step is presented as part of the method to obtain per-agent signals, not as a first-principles derivation that reduces by construction to its own inputs. The bias and variance bounds are proved for the resulting SeqAU credits (under the maintained assumption that the fitted decomposition is sufficiently accurate), but no equation or step is shown to be tautological or statistically forced in a self-referential loop. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or described chain. The approach has independent content in its use of regression for decoupling and policy simulations to avoid extra environment calls. Concerns about regression misspecification or bias belong to correctness analysis rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the approach assumes an additive decomposition of team rewards is recoverable by ridge regression and that policy forward passes suffice for counterfactual estimation. No explicit free parameters, axioms, or invented entities are detailed beyond the extension of prior aristocrat utility.

pith-pipeline@v0.9.0 · 5624 in / 1265 out tokens · 62327 ms · 2026-05-12T01:57:24.348531+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1]

    Difference rewards policy gradients

    Jacopo Castellini, Sam Devlin, Frans A. Oliehoek, and Rahul Savani. Difference rewards policy gradients. Neural Computing and Applications, 37(19): 13163--13186, 2025

  2. [2]

    Contextual counterfactual credit assignment for multi-agent reinforcement learning in LLM collaboration, 2026

    Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, and Wei Zhang. Contextual counterfactual credit assignment for multi-agent reinforcement learning in LLM collaboration, 2026

  3. [3]

    DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

  4. [4]

    Counterfactual multi-agent policy gradients

    Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2974--2982, 2018. doi:10.1609/aaai.v32i1.11794

  5. [5]

    Coordinated reinforcement learning

    Carlos Guestrin, Michail G. Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 227--234. Morgan Kaufmann, 2002

  6. [6]

    Multi-agent deep research: Training multi-agent systems with M-GRPO, 2025

    Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, and Jinjie Gu. Multi-agent deep research: Training multi-agent systems with M-GRPO, 2025

  7. [7]

    Collaborative multiagent reinforcement learning by payoff propagation

    Jelle R. Kok and Nikos Vlassis. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7(65): 1789--1828, 2006. URL http://jmlr.org/papers/v7/kok06a.html

  8. [8]

    Settling the variance of multi-agent policy gradients

    Jakub Grudzien Kuba, Muning Wen, Linghui Meng, Shangding Gu, Haifeng Zhang, David Henry Mguni, Jun Wang, and Yaodong Yang. Settling the variance of multi-agent policy gradients. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pages 13458--13470, 2021

  9. [9]

    Trust region policy optimisation in multi-agent reinforcement learning

    Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang. Trust region policy optimisation in multi-agent reinforcement learning. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=EcGGFkNTxdJ

  10. [10]

    Bandit algorithms

    Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020

  11. [11]

    Who deserves the reward? SHARP: Shapley credit-based optimization for multi-agent system, 2026a

    Yanming Li, Xuelin Zhang, WenJie Lu, Ziye Tang, Maodong Wu, Haotian Luo, Tongtong Wu, Zijie Peng, Hongze Mi, Yibo Feng, Naiqiang Tan, Chao Huang, Hong Chen, and Li Shen. Who deserves the reward? SHARP: Shapley credit-based optimization for multi-agent system, 2026a

  12. [12]

    Difference advantage estimation for multi-agent policy gradients

    Yueheng Li, Guangming Xie, and Zongqing Lu. Difference advantage estimation for multi-agent policy gradients. In Proceedings of the 39th International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, pages 13066--13085, 2022. URL https://proceedings.mlr.press/v162/li22w.html

  13. [13]

    Counterfactual credit policy optimization for multi-agent collaboration. arXiv preprint arXiv:2603.21563, 2026

    Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, and Fuzhen Zhuang. Counterfactual credit policy optimization for multi-agent collaboration. arXiv preprint arXiv:2603.21563, 2026b

  14. [14]

    LLM collaboration with multi-agent reinforcement learning, 2025

    Shuo Liu, Tianle Chen, Zeyu Liang, Xueguang Lyu, and Christopher Amato. LLM collaboration with multi-agent reinforcement learning, 2025

  15. [15]

    Sequential multi-agent dynamic algorithm configuration

    Chen Lu, Ke Xue, Lei Yuan, Yao Wang, Yaoyuan Wang, Sheng Fu, and Chao Qian. Sequential multi-agent dynamic algorithm configuration. In Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 2025

  16. [16]

    FACMAC: Factored multi-agent centralised policy gradients

    Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip H. S. Torr, Wendelin Böhmer, and Shimon Whiteson. FACMAC: Factored multi-agent centralised policy gradients. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pages 12208--12221, 2021

  17. [17]

    QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning

    Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 4...

  18. [18]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  19. [19]

    Value-decomposition networks for cooperative multi-agent learning based on team reward

    Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and...

  20. [20]

    Off-policy evaluation for slate recommendation

    Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 3632--3642, 2017

  21. [21]

    Hindsight credit assignment for long-horizon LLM agents, 2026

    Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, and Yu-Feng Li. Hindsight credit assignment for long-horizon LLM agents, 2026

  22. [22]

    Counterfactual effect decomposition in multi-agent sequential decision making

    Stelios Triantafyllou, Aleksa Sukovic, Yasaman Zolfimoselo, and Goran Radanovic. Counterfactual effect decomposition in multi-agent sequential decision making. In Proceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 of Proceedings of Machine Learning Research, pages 60072--60098, 2025

  23. [23]

    A Concise Introduction to Multiagent Systems and Distributed Artificial Intelligence

    Nikos Vlassis. A Concise Introduction to Multiagent Systems and Distributed Artificial Intelligence. Morgan and Claypool Publishers, 1st edition, 2007. ISBN 1598295268

  24. [24]

    DOP: Off-policy multi-agent decomposed policy gradients

    Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. DOP: Off-policy multi-agent decomposed policy gradients. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=6FqKiVAdI3Y

  25. [25]

    Optimal payoff functions for members of collectives

    David H. Wolpert and Kagan Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2--3): 265--279, 2002

  26. [26]

    Sequential cooperative multi-agent reinforcement learning

    Yifan Zang, Jinmin He, Kai Li, Haobo Fu, Qiang Fu, and Junliang Xing. Sequential cooperative multi-agent reinforcement learning. In Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 485--493, 2023

  27. [27]

    Multi-agent reinforcement learning: A selective overview of theories and algorithms

    Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Kyriakos G. Vamvoudakis, Yan Wan, Frank L. Lewis, and Derya Cansever, editors, Handbook of Reinforcement Learning and Control, volume 325 of Studies in Systems, Decision and Control, pages 321--384. Springer, 2021. doi:10...

  28. [28]

    Unlocking the power of multi-agent LLM for reasoning: From lazy agents to deliberation, 2025

    Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan, Linlin Wu, Minhua Lin, Fali Wang, Xianfeng Tang, Qi He, and Suhang Wang. Unlocking the power of multi-agent LLM for reasoning: From lazy agents to deliberation, 2025

  29. [29]

    Stronger-MAS: Multi-agent reinforcement learning for collaborative LLMs, 2025

    Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, and Jishen Zhao. Stronger-MAS: Multi-agent reinforcement learning for collaborative LLMs, 2025