Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.
Mixed citations
Is independent learning all you need in the starcraft multi-agent challenge?arXiv preprint arXiv:2011.09533, 2020
Mixed citation behavior. Most common role is background (40%).
citation-role summary
citation-polarity summary
representative citing papers
ARMS is an automatic reward-shaping framework for sparse-reward MARL that uses trajectory ranking and conditional best-response reasoning to preserve Nash equilibria while improving sampling efficiency in pathfinding tasks.
Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.
A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.
OSPO trains optimal order dispatch policies for homogeneous AV fleets using only one-step group rewards, outperforming GRPO on a real ride-hailing dataset.
HiComm converts MARL communication into structured three-stage hierarchical retrieval over sender observations, matching or exceeding baselines with up to 23x lower communication volume per receiver per episode.
pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
MARS replaces additive clipping and soft penalties in multi-agent trust-region methods with a symmetric geometric barrier, matching or exceeding MAPPO and MASPO performance across 47 tasks in eight environments.
SACHI enriches agent representations via graph transformer convolutions over inter-agent graphs to enable holistic information integration, outperforming baselines across five cooperative tasks with statistical significance.
A bilevel MARL framework with curriculum learning and closed-loop sequential updates learns stable tax policies in multi-group taxation simulations, extending effective game duration by 60.92% and reducing GDP disparities by 44.12% versus baseline.
A priority-driven RL algorithm learns joint communication priorities and control policies for decentralized multi-agent systems in a model-free way and outperforms baselines on benchmark tasks.
A survey of MARL with GNN-based communication that proposes a generalized process to organize and clarify existing methods.
CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
Introduces α-fair HATRPO and HAPPO algorithms that integrate α-fairness into HATRL via a weighted advantage function while claiming to preserve convergence to Nash equilibria.
This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a tagged corpus.
GLo-MAPPO applies centralized-training decentralized-execution MAPPO with a gain-based association scheme to jointly optimize LoRa parameters and UAV paths, yielding higher weighted energy efficiency than prior MARL baselines in simulations.
citing papers explorer
-
Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning
A priority-driven RL algorithm learns joint communication priorities and control policies for decentralized multi-agent systems in a model-free way and outperforms baselines on benchmark tasks.