Mixed citation behavior. Most common role is background (40%).
citing papers explorer
- Structural Equivalence and Learning Dynamics in Delayed MARL
  Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.
- ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning
  ARMS is an automatic reward-shaping framework for sparse-reward MARL that uses trajectory ranking and conditional best-response reasoning to preserve Nash equilibria while improving sampling efficiency in pathfinding tasks.
- Randomness is sometimes necessary for coordination
  Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.
- Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning
  A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.
- One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms
  OSPO trains optimal order dispatch policies for homogeneous AV fleets using only one-step group rewards, outperforming GRPO on a real ride-hailing dataset.
- One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
  pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.
- GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
  GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
- Shaping Zero-Shot Coordination via State Blocking
  SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
- Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning
  MARS replaces additive clipping and soft penalties in multi-agent trust-region methods with a symmetric geometric barrier, matching or exceeding MAPPO and MASPO performance across 47 tasks in eight environments.
- SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning
  SACHI enriches agent representations via graph transformer convolutions over inter-agent graphs to enable holistic information integration, outperforming baselines across five cooperative tasks with statistical significance.
- Hierarchical Multiagent Reinforcement Learning for Multi-Group Tax Game
  A bilevel MARL framework with curriculum learning and closed-loop sequential updates learns stable tax policies in multi-group taxation simulations, extending effective game duration by 60.92% and reducing GDP disparities by 44.12% versus baseline.
- Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning
  A priority-driven RL algorithm learns joint communication priorities and control policies for decentralized multi-agent systems in a model-free way and outperforms baselines on benchmark tasks.
- A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
  A survey of MARL with GNN-based communication that proposes a generalized process to organize and clarify existing methods.
- Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies
  CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
- $\alpha$-fair heterogeneous agent reinforcement learning
  Introduces α-fair HATRPO and HAPPO algorithms that integrate α-fairness into HATRL via a weighted advantage function while claiming to preserve convergence to Nash equilibria.
- Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
  This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a tagged corpus.
- GLo-MAPPO: Multi-Agent Deep Reinforcement Learning for Energy-Efficient UAV-Assisted LoRa Networks
  GLo-MAPPO applies centralized-training decentralized-execution MAPPO with a gain-based association scheme to jointly optimize LoRa parameters and UAV paths, yielding higher weighted energy efficiency than prior MARL baselines in simulations.
- Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty
- Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation
- TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning
- Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic