pith. machine review for the scientific record.

arxiv: 2605.10377 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.MA

PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation

Pith reviewed 2026-05-12 05:16 UTC · model grok-4.3

classification 💻 cs.LG cs.MA
keywords cooperative multi-agent reinforcement learning · variable roster sizes · context distillation · decentralized policies · zero-shot adaptation · personalized contexts · episodic team variation

The pith

Decentralized agents recover personalized team context from local histories to cooperate with changing roster sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a training procedure using context distillation from a central coordinator can equip decentralized agents to handle changes in team size. A sympathetic reader would care because this removes the need for fixed teams or real-time communication in cooperative tasks. If the claim holds, agents could maintain performance when teammates join or leave between episodes using only their local data.

Core claim

PC3D trains decentralized policies by distilling agent-specific coordination tokens from a set-structured centralized teacher during training. At execution, each agent predicts its own context from local history and conditions its decision-making on it to handle episodic roster variations without communication or retraining.

What carries the argument

Personalized context distillation, in which a centralized teacher compresses the active team into coordination tokens and personalizes them for each agent before distilling the result into the decentralized policy for local prediction and adaptive use.
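
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: mean pooling as the team compressor, the subtraction used in `personalize`, and the cosine-based `distill_loss` are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math

def mean_pool(agent_embeddings):
    """Compress the active team into one token (assumed: element-wise mean)."""
    n = len(agent_embeddings)
    return [sum(col) / n for col in zip(*agent_embeddings)]

def personalize(team_token, agent_embedding):
    """Hypothetical personalization: the team token relative to one agent."""
    return [t - a for t, a in zip(team_token, agent_embedding)]

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def distill_loss(teacher_ctx, student_ctx):
    """Assumed distillation objective: 1 - cosine(teacher, student)."""
    return 1.0 - cosine(teacher_ctx, student_ctx)

# Teacher side: sees every active agent (privileged, training only).
team = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token = mean_pool(team)
teacher_ctxs = [personalize(token, emb) for emb in team]

# Student side: agent 0 guesses its context from local history alone.
# The guess is hard-coded here; in practice it would come from a learned predictor.
student_guess = [-0.3, 0.7]
loss = distill_loss(teacher_ctxs[0], student_guess)
```

The team size never appears as an explicit input: the teacher works for any number of embeddings, which is what lets the same procedure run over different rosters.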

If this is right

  • PC3D yields higher returns than the evaluated baselines on three cooperative benchmarks for both seen and unseen roster sizes.
  • Ablations attribute the gains specifically to the combination of context distillation and adaptive context use during execution.
  • Decentralized policies can operate under episodic roster variation without requiring online retraining or privileged coordinators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation approach could be applied to tasks with continuous rather than discrete action spaces to test whether context recovery scales beyond the current benchmarks.
  • If personalization tokens prove robust, the method might reduce reliance on centralized oversight in large agent teams by shifting adaptation to local prediction.
  • Connecting this context mechanism to existing work on partial observability could clarify how much history length is needed for reliable team-size inference.

Load-bearing premise

Each agent can recover relevant context about the active team solely from its local interaction history, without execution-time communication or privileged information, and this recovered context is sufficient to adapt behavior effectively across roster variations.

What would settle it

An experiment showing that PC3D policies achieve no higher returns than non-adaptive decentralized baselines on episodes with previously unseen roster sizes would indicate that the distilled context does not support the claimed adaptation.
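
A sketch of how this check could be scored, assuming per-episode returns are grouped by roster size. The function names, the seen/unseen split, and all numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean

def split_means(returns_by_size, seen_sizes):
    """Average returns separately over seen and unseen roster sizes."""
    seen = [r for n, rs in returns_by_size.items() if n in seen_sizes for r in rs]
    unseen = [r for n, rs in returns_by_size.items() if n not in seen_sizes for r in rs]
    return mean(seen), mean(unseen)

def claim_survives(method, baseline, seen_sizes):
    """True when the method beats the baseline on unseen roster sizes."""
    return split_means(method, seen_sizes)[1] > split_means(baseline, seen_sizes)[1]

# Illustrative data only: roster size -> list of episode returns.
seen_sizes = {2, 3}
method = {2: [1.0, 1.2], 3: [0.9, 1.1], 5: [0.8, 0.6]}
baseline = {2: [1.0, 1.0], 3: [0.8, 1.0], 5: [0.4, 0.6]}

survives = claim_survives(method, baseline, seen_sizes)
```

A raw comparison of means like this ignores seed-to-seed variance, so a real test would pair it with confidence intervals or a significance test.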

Figures

Figures reproduced from arXiv: 2605.10377 by Ahmet Onur Akman, Rafał Kucharski.

Figure 1: PC3D at a glance. PC3D trains with centralized information over a distribution of roster sizes for a given cooperative task (a). During training, the centralized teacher provides personalized coordination contexts, which decentralized agents learn to recover from local interaction histories (b). At execution, agents act only from local histories, without communication or retraining, and coordinate across b…

Figure 3: Evaluation benchmarks. We evaluate PC3D on Spread, LBF, and RWARE, adapting each benchmark to episodic roster variation under fixed local observation interfaces.
Baselines. We compare PC3D against three MARL baselines chosen to evaluate its contribution under the same decentralized execution setting: agents act from local histories without execution-time communication, privileged coordinators, global obser…

Figure 4: Training returns. Curves show mean training returns (±95% CI) across seeds, with each colored patch corresponding to one curriculum stage and its active training roster set. Higher returns are better.

Figure 5 shows the evaluation returns for each method and benchmark across the used roster sizes. These plots better highlight count-specific performances that are compressed in the split means we report in …

Figure 6 displays the mean training returns across repetitions, using subplots for each curriculum stage. This visualization displays the same data as in …

Figure 7: Context recovery and use. We run reported final checkpoints for 8 rollouts for each task and roster size pair. Cell color shows teacher-student cosine alignment for the personalized coordination context; cell text shows mean ± standard deviation of the context-reliance gate across repetitions. The plots test whether PC3D’s centralized context is both locally recoverable and adaptively used by the decentral…
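
The Figure 7 caption refers to a context-reliance gate without giving its form. One common way to build such a gate is a scalar sigmoid that scales how much predicted context is mixed into the agent's local features; the sketch below assumes exactly that. The formula, the function names, and the parameters are illustrative assumptions, not the paper's definition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_features(hidden, context, gate_weights, gate_bias):
    """Blend a predicted context into local features through a scalar gate.

    Assumed form: g = sigmoid(w . [hidden; context] + b),
                  out = hidden + g * context.
    hidden and context must have the same length.
    """
    joint = hidden + context  # list concatenation, length 2 * d
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias)
    out = [h + g * c for h, c in zip(hidden, context)]
    return out, g

hidden = [1.0, 2.0]
context = [0.5, -0.5]
# Zero weights and zero bias give a half-open gate.
out, g = gated_features(hidden, context, [0.0] * 4, 0.0)
# g == 0.5, out == [1.25, 1.75]
```

A gate near 0 means the policy ignores the recovered context and a gate near 1 means it relies on it fully, which is the kind of quantity a per-task, per-roster-size summary could report.
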
Original abstract

Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.
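
Episodic roster variation, as the abstract describes it, fixes the team within an episode and changes its size between episodes. A minimal sampling sketch follows; the specific sizes, the pool of eight agents, and the names `TRAIN_SIZES` and `HELD_OUT_SIZES` are hypothetical and do not come from the paper.

```python
import random

TRAIN_SIZES = [2, 3, 4]   # hypothetical "seen" roster sizes
HELD_OUT_SIZES = [5, 6]   # hypothetical "unseen" roster sizes

def sample_roster(pool, sizes, rng):
    """Draw one episode's active team: pick a size, then that many agents."""
    n = rng.choice(sizes)
    return rng.sample(pool, n)  # sampling without replacement

rng = random.Random(0)
pool = list(range(8))  # homogeneous agents, identified only by index

# Training episodes use seen sizes; zero-shot evaluation uses held-out sizes.
train_rosters = [sample_roster(pool, TRAIN_SIZES, rng) for _ in range(100)]
eval_rosters = [sample_roster(pool, HELD_OUT_SIZES, rng) for _ in range(100)]
```

Keeping the two size sets disjoint is what makes evaluation on `HELD_OUT_SIZES` a zero-shot test rather than a repeat of training conditions.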

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PC3D, a distillation-based method for cooperative MARL under episodic roster variation. A set-structured centralized teacher generates coordination tokens and agent-specific contexts during training; these are distilled into decentralized policies that, at execution, predict personalized contexts from local histories alone (no communication or privileged information) and condition actions on them. The central claim is that this enables higher returns than baselines on three benchmarks for both seen and unseen team sizes, with ablations attributing gains to the distillation step and adaptive context use.

Significance. If the empirical claims are substantiated with quantitative results and the zero-shot generalization holds, the work would address a practical gap in decentralized MARL by allowing policies to adapt to dynamic team cardinalities without retraining or execution-time coordination. The teacher-to-student distillation pipeline for context recovery is a technically interesting approach that could generalize to other variable-agent settings.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central empirical claim states that PC3D achieves higher returns than baselines for both seen and unseen roster sizes, yet the manuscript provides no numerical values, standard deviations, error bars, statistical tests, or details on baseline implementations and data splits. This leaves the magnitude, reliability, and reproducibility of the reported gains unverified.
  2. [§3.2 (Distillation objective)] The context predictor is trained on trajectories generated under the same roster-size distribution as the policy, but the objective does not explicitly regularize the student to recover roster cardinality or to produce contexts that remain informative for cardinalities outside the training support. This creates a mild circular dependence that directly affects the zero-shot claim.
  3. [§3.1 and §4.3 (Ablations)] The weakest assumption, that strictly local histories suffice to disambiguate active roster size for unseen cardinalities, is load-bearing for the zero-shot result, yet no auxiliary experiments (e.g., context-prediction accuracy or confusion matrices on held-out sizes) are reported to test whether histories generated by different team sizes are statistically distinguishable under the per-agent dynamics.
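
The confusion-matrix check requested in comment 3 could look like the sketch below. It assumes a separate probe that maps each agent's local history to a predicted roster size; that probe, the function names, and the labels are illustrative and not part of the paper.

```python
from collections import Counter

def confusion_matrix(true_sizes, predicted_sizes):
    """Rows are true roster sizes, columns are predicted roster sizes."""
    counts = Counter(zip(true_sizes, predicted_sizes))
    labels = sorted(set(true_sizes) | set(predicted_sizes))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

def accuracy(true_sizes, predicted_sizes):
    """Fraction of histories whose roster size was inferred correctly."""
    hits = sum(t == p for t, p in zip(true_sizes, predicted_sizes))
    return hits / len(true_sizes)

# Made-up labels for illustration only, not results from the paper.
true_sizes = [5, 5, 5, 6, 6, 6]
pred_sizes = [5, 5, 6, 6, 6, 5]

labels, matrix = confusion_matrix(true_sizes, pred_sizes)
acc = accuracy(true_sizes, pred_sizes)
# labels == [5, 6], matrix == [[2, 1], [1, 2]]
```

Heavy off-diagonal mass on held-out sizes would suggest that local histories from different team sizes are hard to tell apart.
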
minor comments (2)
  1. [§3] Notation for “coordination tokens” and “personalized context” is introduced without a compact mathematical definition or diagram showing the exact tensor shapes and conditioning points.
  2. [§3.1] The description of the set-structured teacher would benefit from an explicit statement of how permutation invariance is enforced (e.g., via sum-pooling or attention).
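
As a small illustration of the sum-pooling option named in minor comment 2, the sketch below checks that a summed team representation is unchanged under every reordering of the agents. Whether the paper's teacher uses sum-pooling, attention, or something else is not stated here, so this shows only the property being asked about.

```python
import itertools

def sum_pool(agent_features):
    """Element-wise sum over agents; output size is independent of team size."""
    return [sum(col) for col in zip(*agent_features)]

features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = sum_pool(features)  # [9.0, 12.0]

# Every ordering of the same agents yields the same pooled vector.
all_equal = all(
    sum_pool(list(perm)) == pooled
    for perm in itertools.permutations(features)
)

# A smaller roster still produces a vector of the same dimensionality.
smaller = sum_pool(features[:2])  # [4.0, 6.0]
```

The values here are small integers stored as floats, so the equality check is exact; with general floating-point features, compare within a tolerance.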

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of empirical reporting, methodological clarity, and supporting evidence for the zero-shot claims. We address each major comment below and will revise the manuscript accordingly to improve reproducibility, strengthen the presentation of results, and provide additional analysis where needed.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central empirical claim states that PC3D achieves higher returns than baselines for both seen and unseen roster sizes, yet the manuscript provides no numerical values, standard deviations, error bars, statistical tests, or details on baseline implementations and data splits. This leaves the magnitude, reliability, and reproducibility of the reported gains unverified.

    Authors: We agree that the current presentation relies on figures without accompanying numerical summaries, which limits verifiability. In the revised manuscript we will add a dedicated results table in §4 reporting mean returns, standard deviations across random seeds, and 95% confidence intervals for PC3D and all baselines on both seen and unseen roster sizes. We will also expand the experimental details to include baseline implementation specifics (e.g., network architectures, training hyperparameters, and exact data splits), and we will include pairwise statistical significance tests (e.g., Welch’s t-test with p-values) comparing PC3D against each baseline. These additions will be referenced in the abstract where space allows. revision: yes

  2. Referee: [§3.2 (Distillation objective)] The context predictor is trained on trajectories generated under the same roster-size distribution as the policy, but the objective does not explicitly regularize the student to recover roster cardinality or to produce contexts that remain informative for cardinalities outside the training support. This creates a mild circular dependence that directly affects the zero-shot claim.

    Authors: The training distribution explicitly samples episodes with varying roster sizes drawn from the same support used at test time for seen sizes, and the teacher provides supervision derived from the full joint state. The distillation loss therefore trains the student to recover contexts that are useful for cooperation under that distribution. While we do not add an explicit cardinality-prediction term, the empirical zero-shot results on held-out sizes indicate that the learned contexts remain informative. We will revise §3.2 to more explicitly describe the roster-size sampling procedure and to discuss the implicit generalization mechanism. We view this as a clarification rather than a change to the objective itself. revision: partial

  3. Referee: [§3.1 and §4.3 (Ablations)] The weakest assumption, that strictly local histories suffice to disambiguate active roster size for unseen cardinalities, is load-bearing for the zero-shot result, yet no auxiliary experiments (e.g., context-prediction accuracy or confusion matrices on held-out sizes) are reported to test whether histories generated by different team sizes are statistically distinguishable under the per-agent dynamics.

    Authors: We concur that direct evidence of distinguishability would strengthen the zero-shot argument. In the revised §4.3 we will add auxiliary evaluation metrics: context-prediction accuracy (measured against the teacher-provided contexts) and confusion matrices for roster-size inference on held-out cardinalities. These experiments will be performed on the same local histories used by the decentralized policies, thereby testing whether different team sizes produce statistically distinguishable observation sequences under the per-agent dynamics. revision: yes
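
The summary statistics promised in the first response can be computed with the standard library alone. The sketch below reports a mean, a sample standard deviation, a normal-approximation 95% confidence interval, and Welch's t statistic with its Welch-Satterthwaite degrees of freedom. The per-seed returns are made up, and the helper names are hypothetical.

```python
import math
import statistics

def summarize(returns, z=1.96):
    """Mean, sample std, and normal-approximation 95% CI half-width."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)
    half_width = z * std / math.sqrt(len(returns))
    return mean, std, half_width

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up per-seed returns for illustration only.
method_a = [10.0, 12.0, 11.0, 13.0, 14.0]
method_b = [8.0, 9.0, 10.0, 9.0, 9.0]

mean_a, std_a, ci_a = summarize(method_a)
t_stat, dof = welch_t(method_a, method_b)
```

With only a handful of seeds, the z = 1.96 interval is narrower than a t-based one would be. Turning the statistic into a p-value needs the t distribution's CDF; SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test and returns the p-value directly.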

Circularity Check

0 steps flagged

No significant circularity; derivation relies on learned distillation from privileged teacher to local-history student

full rationale

The paper describes a training procedure with a centralized teacher that has privileged access to the full team roster and distills coordination contexts into decentralized policies conditioned only on local histories. The context predictor is trained to match teacher outputs on the training distribution, but this is a standard supervised distillation step rather than a self-definitional reduction or fitted input renamed as prediction. No equations or claims reduce the zero-shot generalization result to the inputs by construction. Evaluations on external MARL benchmarks (with both seen and unseen roster sizes) provide independent empirical content. No self-citation chains or uniqueness theorems imported from prior author work are load-bearing in the provided description. The central claim therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that local histories contain recoverable team context and on the introduction of coordination tokens and personalized contexts whose utility is demonstrated only internally via the same training loop.

axioms (1)
  • [domain assumption] Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining.
    Explicitly stated as the execution constraint that makes context recovery necessary.
invented entities (2)
  • coordination tokens (no independent evidence)
    purpose: Compress the active team state into a compact representation for distillation
    New construct introduced by the centralized teacher; no external validation provided.
  • personalized context (no independent evidence)
    purpose: Agent-specific version of coordination tokens used to condition decentralized decisions
    Core novel element distilled into each policy; utility shown only through internal ablations.

