Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.
Mixed citations
Title resolution pending
Mixed citation behavior. Most common role is background (40%).
citation-role summary
citation-polarity summary
representative citing papers
ARMS is an automatic reward-shaping framework for sparse-reward MARL that uses trajectory ranking and conditional best-response reasoning to preserve Nash equilibria while improving sampling efficiency in pathfinding tasks.
Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.
A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.
OSPO trains optimal order dispatch policies for homogeneous AV fleets using only one-step group rewards, outperforming GRPO on a real ride-hailing dataset.
pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
MARS replaces additive clipping and soft penalties in multi-agent trust-region methods with a symmetric geometric barrier, matching or exceeding MAPPO and MASPO performance across 47 tasks in eight environments.
SACHI enriches agent representations via graph transformer convolutions over inter-agent graphs to enable holistic information integration, outperforming baselines across five cooperative tasks with statistical significance.
A bilevel MARL framework with curriculum learning and closed-loop sequential updates learns stable tax policies in multi-group taxation simulations, extending effective game duration by 60.92% and reducing GDP disparities by 44.12% versus baseline.
A priority-driven RL algorithm learns joint communication priorities and control policies for decentralized multi-agent systems in a model-free way and outperforms baselines on benchmark tasks.
A survey of MARL with GNN-based communication that proposes a generalized process to organize and clarify existing methods.
CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
Introduces α-fair HATRPO and HAPPO algorithms that integrate α-fairness into HATRL via a weighted advantage function while claiming to preserve convergence to Nash equilibria.
This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a tagged corpus.
GLo-MAPPO applies centralized-training decentralized-execution MAPPO with a gain-based association scheme to jointly optimize LoRa parameters and UAV paths, yielding higher weighted energy efficiency than prior MARL baselines in simulations.
citing papers explorer
No citing papers match the current filters.