SwarmCF runs decentralized low-rank matrix completion on masked broadcast data in the zero-knowledge MRTA regime, letting robots achieve low error on unseen task pairs with per-robot sample complexity that is linear in the rank d rather than in the task count n.
Value-Decomposition Networks For Cooperative Multi-Agent Learning
28 Pith papers cite this work.
abstract
We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the "lazy agent" problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
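The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes the additive form, in which the team value is the sum of the agent-wise values; the abstract itself does not spell out the form, and the names `agent_q`, `team_value`, `n_agents`, and `n_actions` are hypothetical, with random numbers standing in for learned networks. Under that assumption, each agent can pick its own greedy action and the resulting joint action still maximizes the team value.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

n_agents, n_actions = 3, 4  # illustrative sizes, not taken from the paper

# Hypothetical per-agent action values Q_i(o_i, a_i) for one fixed set of
# observations. In a trained system these would come from each agent's network.
agent_q = rng.normal(size=(n_agents, n_actions))


def team_value(joint_action):
    """Assumed additive decomposition: Q_tot(a_1..a_n) = sum_i Q_i(a_i)."""
    return sum(agent_q[i, a] for i, a in enumerate(joint_action))


# Decentralized execution: every agent maximizes only its own value function.
greedy_joint = tuple(int(a) for a in agent_q.argmax(axis=1))

# Centralized check: brute-force search over all joint actions.
best_joint = max(
    itertools.product(range(n_actions), repeat=n_agents), key=team_value
)

# With a sum, the per-agent greedy choice attains the best team value.
values_match = bool(np.isclose(team_value(greedy_joint), team_value(best_joint)))
print(greedy_joint, best_joint, values_match)
```

The brute-force search grows as `n_actions ** n_agents`, while the per-agent argmax grows linearly in the number of agents, which is one way to see why a decomposition helps with the large combined action spaces mentioned in the abstract.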
representative citing papers
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
HPML projects multi-agent update fields onto the closest metric-gradient potential flow via Hodge decomposition, yielding Lyapunov potentials and equilibrium-gap bounds.
DG-PG augments policy gradients with descent signals from analytical models to reduce estimator variance from O(N) to O(1), preserve game equilibria, and achieve agent-independent sample complexity while converging on 1500-agent tasks where baselines fail.
ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.
Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.
A new controlled testbed and coordination diagnostics show that multi-agent RL methods achieving similar returns can differ substantially in redundant assignments, diversity, and efficiency.
NonZero introduces an interaction score and bandit-formalized proposal rule for local agent deviations in multi-agent MCTS, delivering a sublinear local-regret guarantee and improved sample efficiency on game benchmarks without full joint-action enumeration.
Entangled QMARL agents approach the Tsirelson bound of 0.854 in CHSH while unentangled versions match classical baselines, and hybrid quantum-classical setups outperform both in CoopNav.
SACHI enriches agent representations via graph transformer convolutions over inter-agent graphs to enable holistic information integration, outperforming baselines across five cooperative tasks with statistical significance.
Optimistic ε-Greedy Exploration adds decoupled optimistic networks that converge in probability to maximum returns and samples from them with probability ε to increase optimal joint-action frequency in CTDE MARL.
Wolfpack attack framework disrupts MARL cooperation by targeting initial and assisting agents; WALL trains robust policies against it with reported experimental gains.
Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
LLM-generated coordination graph priors improve multi-agent reinforcement learning performance on MPE benchmarks, with models as small as 1.5B parameters proving effective.
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene
Clarus is a four-layer collaboration infrastructure with a project-agent-resource model that reformulates research as an open, traceable, multi-participant process.
CyberOps-Bots is a hierarchical LLM-empowered multi-agent RL framework that reports 68.5% higher network availability and 34.7% better jumpstart performance in new scenarios without retraining on real cloud datasets.
DAC models fully decentralized cooperative MARL as a context modeling problem, using latent variables for joint policies to fix non-stationarity in value updates and relative overgeneralization in value estimation.
CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
The ROE framework lets an LLM defeat the Very Hard bot in TextStarCraft II via keyframe selection, decisions based on expert and self-experience, and post-game reflection that produces new self-experience.
A curriculum of growing action spaces combined with simultaneous off-policy value estimation accelerates learning in large multi-agent action spaces.
APC adapts punishment via dynamic probability and a reward-guided defection awareness module to foster cooperation in iterated public goods games and sequential social dilemmas, outperforming baselines.
GLo-MAPPO applies centralized-training decentralized-execution MAPPO with a gain-based association scheme to jointly optimize LoRa parameters and UAV paths, yielding higher weighted energy efficiency than prior MARL baselines in simulations.
citing papers explorer
- Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
  DG-PG augments policy gradients with descent signals from analytical models to reduce estimator variance from O(N) to O(1), preserve game equilibria, and achieve agent-independent sample complexity while converging on 1500-agent tasks where baselines fail.
- Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning
  A new controlled testbed and coordination diagnostics show that multi-agent RL methods achieving similar returns can differ substantially in redundant assignments, diversity, and efficiency.
- Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning
  Optimistic ε-Greedy Exploration adds decoupled optimistic networks that converge in probability to maximum returns and samples from them with probability ε to increase optimal joint-action frequency in CTDE MARL.
- Adaptive Punishment for Cooperation in Mixed-Motive Games
  APC adapts punishment via dynamic probability and a reward-guided defection awareness module to foster cooperation in iterated public goods games and sequential social dilemmas, outperforming baselines.