PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation
Pith reviewed 2026-05-12 05:16 UTC · model grok-4.3
The pith
Decentralized agents recover personalized team context from local histories to cooperate under changing roster sizes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PC3D trains decentralized policies by distilling agent-specific coordination contexts from a set-structured centralized teacher. At execution, each agent predicts its own context from local history and conditions its decisions on it, handling episodic roster variation without communication or retraining.
What carries the argument
Personalized context distillation, in which a centralized teacher compresses the active team into coordination tokens and personalizes them for each agent before distilling the result into the decentralized policy for local prediction and adaptive use.
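The mechanism can be written compactly as follows. This is only a sketch of one possible formalization: the symbols $\phi$, $\psi$, $g_\theta$, $\lambda$, the stop-gradient $\operatorname{sg}$, and the squared-error distillation term are assumptions introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
z &= \phi\big(\{s_j\}_{j \in \mathcal{A}}\big)
  && \text{teacher: permutation-invariant compression of the active team } \mathcal{A} \text{ into tokens} \\
c_i &= \psi(z, s_i)
  && \text{teacher: personalization of the tokens for agent } i \\
\hat{c}_i &= g_\theta(h_i)
  && \text{student: context predicted from the local history } h_i \\
a_i &\sim \pi_\theta(\,\cdot \mid h_i, \hat{c}_i)
  && \text{decentralized policy conditioned on the predicted context} \\
\mathcal{L}(\theta) &= \mathcal{L}_{\mathrm{RL}}(\theta)
  + \lambda \, \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}}
    \big\lVert \hat{c}_i - \operatorname{sg}(c_i) \big\rVert_2^2
  && \text{assumed form of the combined objective}
\end{aligned}
```

Under this reading, only $g_\theta$ and $\pi_\theta$ are needed at execution, which is what allows the policy to run without the teacher or any communication.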
If this is right
- PC3D yields higher returns than the evaluated baselines on three cooperative benchmarks for both seen and unseen roster sizes.
- Ablations attribute the gains to both context distillation and adaptive context use during execution.
- Decentralized policies can operate under episodic roster variation without requiring online retraining or privileged coordinators.
Where Pith is reading between the lines
- The same distillation approach could be applied to tasks with continuous rather than discrete action spaces to test whether context recovery scales beyond the current benchmarks.
- If the personalized contexts prove robust, the method might reduce reliance on centralized oversight in large agent teams by shifting adaptation to local prediction.
- Connecting this context mechanism to existing work on partial observability could clarify how much history length is needed for reliable team-size inference.
Load-bearing premise
Each agent can recover relevant context about the active team solely from its local interaction history, without execution-time communication or privileged information, and this recovered context is sufficient to adapt behavior effectively across roster variations.
What would settle it
An experiment showing that PC3D policies achieve no higher returns than non-adaptive decentralized baselines on episodes with previously unseen roster sizes would indicate that the distilled context does not support the claimed adaptation.
Original abstract
Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.
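The episodic roster variation setting in the abstract can be illustrated with a small sampling sketch. The function names, the uniform sampling, and the specific seen and unseen sizes below are hypothetical and chosen only for illustration; the paper's actual sizes and sampling scheme are not given on this page.

```python
import random

def sample_roster_size(rng, sizes):
    """Draw the team size for one episode, uniformly from the allowed sizes."""
    return rng.choice(sizes)

def make_episode_rosters(rng, sizes, num_episodes):
    """Return one roster (a list of interchangeable agent indices) per episode."""
    rosters = []
    for _ in range(num_episodes):
        n = sample_roster_size(rng, sizes)
        rosters.append(list(range(n)))  # homogeneous agents: indices carry no meaning
    return rosters

# Hypothetical split: these sizes are illustrative, not the paper's.
TRAIN_SIZES = [2, 4, 6]   # "seen" roster sizes used during training
UNSEEN_SIZES = [3, 5, 8]  # held out for zero-shot evaluation

rng = random.Random(0)
train_rosters = make_episode_rosters(rng, TRAIN_SIZES, num_episodes=100)
eval_rosters = make_episode_rosters(rng, UNSEEN_SIZES, num_episodes=100)
```

The team size is fixed within an episode and varies only across episodes, and the evaluation sizes never appear during training, which is what makes the unseen-size results zero-shot.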
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PC3D, a distillation-based method for cooperative MARL under episodic roster variation. A set-structured centralized teacher generates coordination tokens and agent-specific contexts during training; these are distilled into decentralized policies that, at execution, predict personalized contexts from local histories alone (no communication or privileged information) and condition actions on them. The central claim is that this enables higher returns than baselines on three benchmarks for both seen and unseen team sizes, with ablations attributing gains to the distillation step and adaptive context use.
Significance. If the empirical claims are substantiated with quantitative results and the zero-shot generalization holds, the work would address a practical gap in decentralized MARL by allowing policies to adapt to dynamic team cardinalities without retraining or execution-time coordination. The teacher-to-student distillation pipeline for context recovery is a technically interesting approach that could generalize to other variable-agent settings.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the central empirical claim states that PC3D achieves higher returns than baselines for both seen and unseen roster sizes, yet the manuscript provides no numerical values, standard deviations, error bars, statistical tests, or details on baseline implementations and data splits. This leaves the magnitude, reliability, and reproducibility of the reported gains unverified.
- [§3.2] §3.2 (Distillation objective): the context predictor is trained on trajectories generated under the same roster-size distribution as the policy, but the objective does not explicitly regularize the student to recover roster cardinality or to produce contexts that remain informative for cardinalities outside the training support. This creates a mild circular dependence that directly affects the zero-shot claim.
- [§3.1 and §4.3] §3.1 and §4.3 (Ablations): the weakest assumption (that strictly local histories suffice to disambiguate active roster size for unseen cardinalities) is load-bearing for the zero-shot result, yet no auxiliary experiments (e.g., context-prediction accuracy or confusion matrices on held-out sizes) are reported to test whether histories generated by different team sizes are statistically distinguishable under the per-agent dynamics.
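The reporting requested in the first major comment could be produced along the following lines. This is a minimal sketch using only the Python standard library; the helper names and the return values are placeholders for illustration and are not results from the paper.

```python
import math
import statistics

def summarize(returns):
    """Mean, sample standard deviation, and a normal-approximation 95% CI half-width."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)  # sample std (n - 1 in the denominator)
    # 1.96 is the normal critical value; with few seeds a t critical value is more appropriate.
    half_width = 1.96 * std / math.sqrt(len(returns))
    return mean, std, half_width

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom (unpaired samples)."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Placeholder per-seed returns for illustration only; not results from the paper.
method_returns = [10.0, 12.0, 11.0, 13.0, 14.0]
baseline_returns = [8.0, 9.0, 10.0, 9.0, 9.0]

mean, std, half_width = summarize(method_returns)
t_stat, dof = welch_t(method_returns, baseline_returns)
```

Converting the statistic to a p-value requires the t distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns the p-value directly.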
minor comments (2)
- [§3] Notation for “coordination tokens” and “personalized context” is introduced without a compact mathematical definition or diagram showing the exact tensor shapes and conditioning points.
- [§3.1] The description of the set-structured teacher would benefit from an explicit statement of how permutation invariance is enforced (e.g., via sum-pooling or attention).
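As an illustration of the second minor comment, the sketch below shows one standard way to obtain permutation invariance over a variable-size team: a shared per-agent embedding followed by sum- or mean-pooling. Whether the paper's teacher uses pooling, attention, or something else is not stated on this page, so the functions `embed` and `pool_team` and all dimensions are assumptions.

```python
import random

def embed(obs, weights):
    """Shared per-agent embedding: a linear map followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, obs))) for row in weights]

def pool_team(observations, weights, mode="mean"):
    """Permutation-invariant summary of a variable-size team."""
    embedded = [embed(obs, weights) for obs in observations]
    summed = [sum(col) for col in zip(*embedded)]
    if mode == "sum":
        return summed
    # Mean-pooling keeps the output scale independent of team size.
    return [s / len(embedded) for s in summed]

rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3-d obs -> 4-d embedding
team = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]     # 5 active agents
shuffled = team[:]
rng.shuffle(shuffled)

pooled = pool_team(team, weights)
pooled_shuffled = pool_team(shuffled, weights)  # equal to `pooled` up to float rounding
pooled_small = pool_team(team[:2], weights)     # same output size for a 2-agent team
```

Because the pooled vector has a fixed dimension regardless of how many agents are active, the same teacher can process every roster size. Sum-pooling retains cardinality information in the magnitude of the output, while mean-pooling discards it, which is relevant to whether roster size is recoverable from the tokens.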
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of empirical reporting, methodological clarity, and supporting evidence for the zero-shot claims. We address each major comment below and will revise the manuscript accordingly to improve reproducibility, strengthen the presentation of results, and provide additional analysis where needed.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the central empirical claim states that PC3D achieves higher returns than baselines for both seen and unseen roster sizes, yet the manuscript provides no numerical values, standard deviations, error bars, statistical tests, or details on baseline implementations and data splits. This leaves the magnitude, reliability, and reproducibility of the reported gains unverified.
  Authors: We agree that the current presentation relies on figures without accompanying numerical summaries, which limits verifiability. In the revised manuscript we will add a dedicated results table in §4 reporting mean returns, standard deviations across random seeds, and 95% confidence intervals for PC3D and all baselines on both seen and unseen roster sizes. We will also expand the experimental details to include baseline implementation specifics (e.g., network architectures, training hyperparameters, and exact data splits), and we will include statistical significance tests (e.g., Welch’s t-test with p-values) comparing PC3D against each baseline. These additions will be referenced in the abstract where space allows.
  Revision: yes
- Referee: [§3.2] §3.2 (Distillation objective): the context predictor is trained on trajectories generated under the same roster-size distribution as the policy, but the objective does not explicitly regularize the student to recover roster cardinality or to produce contexts that remain informative for cardinalities outside the training support. This creates a mild circular dependence that directly affects the zero-shot claim.
  Authors: The training distribution explicitly samples episodes with varying roster sizes drawn from the same support used at test time for seen sizes, and the teacher provides supervision derived from the full joint state. The distillation loss therefore trains the student to recover contexts that are useful for cooperation under that distribution. While we do not add an explicit cardinality-prediction term, the empirical zero-shot results on held-out sizes indicate that the learned contexts remain informative. We will revise §3.2 to more explicitly describe the roster-size sampling procedure and to discuss the implicit generalization mechanism. We view this as a clarification rather than a change to the objective itself.
  Revision: partial
- Referee: [§3.1 and §4.3] §3.1 and §4.3 (Ablations): the weakest assumption (that strictly local histories suffice to disambiguate active roster size for unseen cardinalities) is load-bearing for the zero-shot result, yet no auxiliary experiments (e.g., context-prediction accuracy or confusion matrices on held-out sizes) are reported to test whether histories generated by different team sizes are statistically distinguishable under the per-agent dynamics.
  Authors: We concur that direct evidence of distinguishability would strengthen the zero-shot argument. In the revised §4.3 we will add auxiliary evaluation metrics: context-prediction accuracy (measured against the teacher-provided contexts) and confusion matrices for roster-size inference on held-out cardinalities. These experiments will be performed on the same local histories used by the decentralized policies, thereby testing whether different team sizes produce statistically distinguishable observation sequences under the per-agent dynamics.
  Revision: yes
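The roster-size confusion matrix promised in the last response could be tabulated as in the sketch below. The helper names and the toy labels are hypothetical, and how a size prediction is obtained from a local history (for example, a probe trained on the predicted context) is left open here because the page does not specify it.

```python
from collections import Counter

def confusion_matrix(true_sizes, predicted_sizes):
    """Count (true, predicted) pairs; rows are true roster sizes, columns are predictions."""
    counts = Counter(zip(true_sizes, predicted_sizes))
    labels = sorted(set(true_sizes) | set(predicted_sizes))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

def accuracy(true_sizes, predicted_sizes):
    """Fraction of histories whose roster size was inferred exactly."""
    hits = sum(t == p for t, p in zip(true_sizes, predicted_sizes))
    return hits / len(true_sizes)

# Toy labels for illustration only; not data from the paper.
true_sizes = [3, 3, 5, 5, 5, 8]
predicted_sizes = [3, 5, 5, 5, 3, 8]

labels, matrix = confusion_matrix(true_sizes, predicted_sizes)
acc = accuracy(true_sizes, predicted_sizes)
```

Large off-diagonal counts between two sizes would indicate that local histories from those team sizes are hard to tell apart, which is exactly the failure mode the referee asks the authors to rule out.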
Circularity Check
No significant circularity; derivation relies on learned distillation from privileged teacher to local-history student
The paper describes a training procedure with a centralized teacher that has privileged access to the full team roster and distills coordination contexts into decentralized policies conditioned only on local histories. The context predictor is trained to match teacher outputs on the training distribution, but this is a standard supervised distillation step rather than a self-definitional reduction or fitted input renamed as prediction. No equations or claims reduce the zero-shot generalization result to the inputs by construction. Evaluations on external MARL benchmarks (with both seen and unseen roster sizes) provide independent empirical content. No self-citation chains or uniqueness theorems imported from prior author work are load-bearing in the provided description. The central claim therefore remains non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining.
invented entities (2)
- coordination tokens: no independent evidence
- personalized context: no independent evidence