KD-MARL: Resource-Aware Knowledge Distillation in Multi-Agent Reinforcement Learning

Mahardhika Pratama; Monirul Islam Pavel; Muhammad Anwar Masum; Ryszard Kowalczyk; Siyi Hu; Zehong Jimmy Cao

arxiv: 2604.06691 · v1 · submitted 2026-04-08 · 💻 cs.AI

Pith reviewed 2026-05-10 18:45 UTC · model grok-4.3

keywords: knowledge distillation, multi-agent reinforcement learning, decentralized policies, coordination preservation, resource-aware training, advantage signals, SMAC benchmark, MPE benchmark

The pith

KD-MARL distills coordinated behavior from centralized expert policies into lightweight decentralized students that retain over 90 percent performance at up to 28.6 times lower computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a knowledge distillation approach for multi-agent reinforcement learning that moves coordinated decision-making from expensive centralized expert models to efficient decentralized student agents. Students learn both individual actions and team coordination patterns through distilled advantage signals and structured policy supervision rather than relying on a shared critic. This setup supports agents with different model sizes and limited observations, which matters for running MARL on devices with tight memory, power, and time budgets. If the method holds, it removes a key barrier to deploying coordinated multi-agent systems outside of high-end servers.

Core claim

The central claim is that a two-stage distillation framework can transfer both action-level behavior and structural coordination from a centralized expert to heterogeneous decentralized students. The students train without a critic by using distilled advantage signals and structured policy supervision, allowing each agent to match its capacity to its own observation complexity while preserving coordination under partial observability. On SMAC and MPE benchmarks the resulting policies keep more than 90 percent of expert performance while cutting FLOPs by as much as 28.6 times.

What carries the argument

The two-stage KD-MARL framework that first extracts coordinated knowledge from a centralized expert and then supervises lightweight decentralized students with distilled advantage signals and structured policy outputs.
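A critic-free student update of this kind can be pictured with the sketch below. It is an illustration under assumptions, not the paper's loss: it combines an advantage-weighted log-likelihood term, where the teacher-supplied advantage is a fixed number, with a KL term that pulls the student's action distribution toward the teacher's. The function names, the weight `beta`, and the KL direction are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def student_loss(student_logits, teacher_probs, action, teacher_advantage, beta=1.0):
    """Hypothetical per-step loss for one student agent (assumed form).

    policy term:  -A_teacher * log pi_student(action), with the advantage
                  treated as a fixed target, so no student critic is needed
    distill term: beta * KL(teacher || student)
    """
    student_probs = softmax(student_logits)
    policy_term = -teacher_advantage * math.log(student_probs[action] + 1e-12)
    distill_term = beta * kl_divergence(teacher_probs, student_probs)
    return policy_term + distill_term

# Illustrative call with made-up numbers.
loss = student_loss([5.0, 0.0, 0.0], [0.1, 0.1, 0.8], action=2, teacher_advantage=1.0)
print(f"example loss: {loss:.3f}")
```

Because the advantage enters as a constant, the only learned component in this sketch is the student policy itself, which is the property the critic-free claim depends on.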

If this is right

Decentralized agents can match expert-level team behavior without access to a central critic during execution.
Student architectures can be sized differently per agent to fit each agent's observation load.
The same distillation pipeline applies across heterogeneous benchmarks such as SMAC and MPE.
Computational cost drops enough to fit coordinated policies on edge hardware with limited onboard resources.
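To make the per-agent sizing and the cost reduction concrete, here is a rough sketch of how a FLOPs ratio could be estimated for recurrent agents. It assumes each agent is a single GRU cell and uses an approximate count (three gates, 2 FLOPs per multiply-accumulate, biases and nonlinearities ignored). All layer sizes are hypothetical and are not taken from the paper.

```python
def gru_cell_flops(input_dim, hidden_dim):
    """Approximate FLOPs for one GRU cell step.

    Counts the input and recurrent matrix products of the three gates,
    at 2 FLOPs per multiply-accumulate. Biases and nonlinearities are ignored.
    """
    return 2 * 3 * hidden_dim * (input_dim + hidden_dim)

def team_flops_per_step(agent_specs):
    """Sum per-step FLOPs over agents; agent_specs is a list of (obs_dim, hidden_dim)."""
    return sum(gru_cell_flops(obs, hid) for obs, hid in agent_specs)

# Hypothetical sizes: every teacher agent uses 256 hidden units, while each
# student is sized to its own observation width.
teacher = [(64, 256), (64, 256), (32, 256)]
students = [(64, 48), (64, 48), (32, 16)]

reduction = team_flops_per_step(teacher) / team_flops_per_step(students)
print(f"approximate FLOPs reduction: {reduction:.1f}x")
```

A per-episode figure like the one the paper reports would also depend on episode length and on any non-recurrent layers, which this estimate leaves out.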

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The distilled advantage signals might let teams adapt when agents join or leave during an episode, without retraining the full expert.
Extending the approach to continuous control domains could test whether coordination patterns survive when action spaces are no longer discrete.
Running the students on physical robots would reveal whether the observed FLOPs savings translate to real-time latency gains under sensor noise.

Load-bearing premise

Distilled advantage signals and structured policy supervision are enough to keep multi-agent coordination intact when students act without a critic and have only partial or heterogeneous observations.

What would settle it

A controlled test on a new multi-agent task with severely restricted observations where student teams fall below 70 percent of expert win rate or coordination score would show the distillation fails to preserve necessary structure.
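The check described above reduces to a ratio and a threshold. The sketch below is a minimal version, assuming a single scalar score per policy (for example a win rate); the function names and the default threshold argument are illustrative.

```python
def retention(student_score, expert_score):
    """Fraction of expert performance retained by the student team."""
    if expert_score <= 0:
        raise ValueError("expert_score must be positive")
    return student_score / expert_score

def distillation_holds(student_score, expert_score, threshold=0.70):
    """True when the student keeps at least `threshold` of expert performance."""
    return retention(student_score, expert_score) >= threshold

# Hypothetical win rates, not results from the paper.
print(distillation_holds(student_score=0.45, expert_score=0.90))
print(distillation_holds(student_score=0.85, expert_score=0.90))
```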

Figures

Figures reproduced from arXiv: 2604.06691 by Mahardhika Pratama, Monirul Islam Pavel, Muhammad Anwar Masum, Ryszard Kowalczyk, Siyi Hu, Zehong Jimmy Cao.

**Figure 2.** Proposed KD-MARL architecture with two-stage training strategy in limited and heterogeneous setup.

**Figure 3.** FLOPs per episode comparison across SMAC and MPE maps, demonstrating resource-aware advantages of KD-MARL.

**Figure 4.** The heatmaps show action-selection frequencies in the 3s5z SMAC scenario under constraints, with non-KD (left) and KD-MARL (right). Warmer colours indicate higher frequency. KD-MARL exhibits more concentrated and stable attack patterns, while the non-KD policy shows more dispersed actions, indicating reduced coordination.

Original abstract

Real-world deployment of multi-agent reinforcement learning (MARL) systems is fundamentally constrained by limited compute, memory, and inference time. While expert policies achieve high performance, they rely on costly decision cycles and large-scale models that are impractical for edge devices or embedded platforms. Knowledge distillation (KD) offers a promising path toward resource-aware execution, but existing KD methods in MARL focus narrowly on action imitation, often neglecting coordination structure and assuming uniform agent capabilities. We propose resource-aware Knowledge Distillation for Multi-Agent Reinforcement Learning (KD-MARL), a two-stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents. The student policies are trained without a critic, relying instead on distilled advantage signals and structured policy supervision to preserve coordination under heterogeneous and limited observations. Our approach transfers both action-level behavior and structural coordination patterns from expert policies while supporting heterogeneous student architectures, allowing each agent's model capacity to match its observation complexity, which is crucial for efficient execution under partial or limited observability and limited onboard resources. Extensive experiments on SMAC and MPE benchmarks demonstrate that KD-MARL achieves high performance retention while substantially reducing computational cost. Across standard multi-agent benchmarks, KD-MARL retains over 90 percent of expert performance while reducing computational cost by up to 28.6 times in FLOPs. The proposed approach achieves expert-level coordination and preserves it through structured distillation, enabling practical MARL deployment across resource-constrained onboard platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

KD-MARL shows a workable two-stage distillation route for lighter MARL agents with big reported FLOPs cuts, but the claim that static advantage signals alone preserve coordination without a critic still looks like the least secure part.

KD-MARL gives a two-stage framework that distills both actions and coordination structure from a centralized expert into decentralized students. The students run without their own critic and use the transferred advantage signals plus structured supervision instead. The paper reports that this keeps over 90 percent of expert performance on SMAC and MPE while cutting compute by up to 28.6 times in FLOPs, and it explicitly allows different student architectures per agent to match their observation load.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes KD-MARL, a two-stage resource-aware knowledge distillation framework for multi-agent reinforcement learning. It transfers coordinated behavior from a centralized expert to lightweight decentralized heterogeneous student agents trained without a critic, relying instead on distilled advantage signals and structured policy supervision. The approach supports varying student model capacities matched to observation complexity and is evaluated on SMAC and MPE benchmarks, claiming retention of over 90% of expert performance alongside computational cost reductions of up to 28.6 times in FLOPs.

Significance. If the empirical results hold under rigorous validation, KD-MARL could meaningfully advance practical MARL deployment on resource-constrained edge and embedded platforms by enabling efficient decentralized execution while preserving coordination in heterogeneous, partially observable settings. The explicit support for heterogeneous student architectures is a practical strength not commonly emphasized in prior MARL distillation work.

major comments (2)

[Abstract] Abstract: The headline performance claims (>90% expert retention and up to 28.6x FLOPs reduction) are stated without error bars, number of independent runs, statistical significance tests, or ablation studies on the relative contributions of distilled advantages versus structured policy supervision, leaving the robustness of coordination preservation unclear.
[Method (two-stage KD process)] Method description of the two-stage KD process: The central claim that static distilled advantage signals plus structured policy supervision suffice to transfer expert coordination to decentralized heterogeneous students without a critic (under partial or limited observations) is load-bearing but lacks analysis of how non-stationarity and credit assignment are mitigated when student observations diverge from the expert's joint view; this directly engages the potential failure mode under high partial observability.

minor comments (1)

[Abstract] Abstract contains minor grammatical and formatting inconsistencies (e.g., inconsistent hyphenation of 'multi-agent' and run-on sentences) that reduce readability but do not affect technical content.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the robustness of our empirical claims and the mechanistic analysis of coordination transfer. We address each major comment below, indicating planned revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: The headline performance claims (>90% expert retention and up to 28.6x FLOPs reduction) are stated without error bars, number of independent runs, statistical significance tests, or ablation studies on the relative contributions of distilled advantages versus structured policy supervision, leaving the robustness of coordination preservation unclear.

Authors: We agree that the abstract would benefit from additional context on experimental rigor to support the headline claims. The full manuscript already reports all metrics as means over 5 independent runs with standard deviations, includes ablation studies in Section 4.3 that isolate the contributions of distilled advantage signals versus structured policy supervision, and applies paired t-tests for significance where appropriate. We will revise the abstract to briefly reference the number of runs, error bars, and the presence of these ablations, directing readers to the experimental section for details on coordination preservation. revision: yes
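The reporting described in this response (means over 5 runs, standard deviations, paired t-tests) can be sketched with the Python standard library. The win rates below are hypothetical placeholders, not results from the paper, and pairing the runs by seed is an assumption.

```python
import math
import statistics

def mean_std(values):
    """Sample mean and sample standard deviation of a list of run scores."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical win rates over 5 seeds, paired by seed (assumed setup).
kd_student = [0.91, 0.89, 0.93, 0.90, 0.92]
plain_student = [0.78, 0.80, 0.75, 0.79, 0.77]

m, s = mean_std(kd_student)
t, dof = paired_t_statistic(kd_student, plain_student)
print(f"KD student: {m:.3f} +/- {s:.3f}; paired t = {t:.2f} with {dof} degrees of freedom")
```

With 5 paired runs there are 4 degrees of freedom, so a two-sided test at the 0.05 level compares |t| against a critical value of about 2.776.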
Referee: [Method (two-stage KD process)] Method description of the two-stage KD process: The central claim that static distilled advantage signals plus structured policy supervision suffice to transfer expert coordination to decentralized heterogeneous students without a critic (under partial or limited observations) is load-bearing but lacks analysis of how non-stationarity and credit assignment are mitigated when student observations diverge from the expert's joint view; this directly engages the potential failure mode under high partial observability.

Authors: The two-stage design directly targets these issues. Stage 1 distills fixed advantage signals from the expert's centralized critic, yielding stationary targets that students optimize against without maintaining their own online critics; this decouples student learning from the non-stationarity induced by decentralized execution and mismatched observations. Stage 2 then applies structured policy supervision that transfers not only actions but also the expert's relational coordination patterns, providing an implicit mechanism for credit assignment by encoding joint structure. While the manuscript demonstrates effective transfer via results on SMAC and MPE under partial observability, we concur that an explicit discussion would improve clarity. We will add a dedicated paragraph in the method section analyzing these mitigation strategies with reference to the existing empirical evidence. revision: yes
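The fixed advantage targets mentioned in this response can be illustrated with standard generalized advantage estimation (GAE). This is a minimal sketch assuming the teacher's critic supplies a value estimate for every state of a recorded episode. Whether the paper computes its targets in exactly this way is not stated on this page, and the parameter names and defaults are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one episode.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T) from the teacher's critic; the last entry is
             the bootstrap value (0.0 if the episode terminated)
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical rollout: computed once from teacher data, then kept fixed.
targets = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.5, 0.8, 0.0])
print(targets)
```

Because the targets are computed once from the teacher's rollouts, they do not move as the students learn, which is the stationarity property the response relies on.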

Circularity Check

0 steps flagged

No circularity: empirical validation on benchmarks with no derivation chain or fitted-parameter loops

The paper presents KD-MARL as a two-stage distillation framework transferring coordinated behavior from a centralized expert to decentralized heterogeneous students via distilled advantages and policy supervision. All load-bearing claims (90%+ performance retention, up to 28.6x FLOP reduction) are supported solely by empirical results on SMAC and MPE benchmarks rather than any equations, uniqueness theorems, or self-citations that reduce the outcome to the inputs by construction. No self-definitional steps, fitted predictions, or ansatz smuggling appear in the described method or evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL and KD assumptions plus the untested premise that coordination transfers via the proposed signals; no new physical entities or free parameters are introduced in the abstract.

axioms (1)

domain assumption: Expert policies exist and contain transferable coordination structure that can be captured by advantage signals and policy supervision.
Implicit in the two-stage distillation design and the claim that students preserve coordination without a critic.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: two-stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents... distilled advantage signals and structured policy supervision

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat · unclear
Paper passage: Teacher-Guided Advantage Distillation... Distilled GAE Advantage Targets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1]

Recent advances on cooperative control of heterogeneous multi-agent systems subject to constraints: A survey

Bao, G., Ma, L., Yi, X., 2022. Recent advances on cooperative control of heterogeneous multi-agent systems subject to constraints: A survey. Systems Science & Control Engineering 10, 539–551

work page 2022
[2]

Multiagent meta-reinforcement learning for adaptive multipath routing optimization

Chen, L., Hu, B., Guan, Z.H., Zhao, L., Shen, X., 2021. Multiagent meta-reinforcement learning for adaptive multipath routing optimization. IEEE Transactions on Neural Networks and Learning Systems 33, 5374– 5386

work page 2021
[3]

Portfolio management with multi-agent reinforcement learning: A role-aware distillation approach

Chen, R., Lin, J., Zhou, Y ., 2023. Portfolio management with multi-agent reinforcement learning: A role-aware distillation approach. Journal of Financial Data Science

work page 2023
[4]

Chen, Y ., Mao, H., Mao, J., Wu, S., Zhang, T., Zhang, B., Yang, W., Chang, H., 2024. Ptde: personalized training with distilled execution for multi-agent reinforcement learning, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 31–39

work page 2024
[5]

Distilling policy distillation, in: The 22nd international conference on artificial intelligence and statistics, PMLR

Czarnecki, W.M., Pascanu, R., Osindero, S., Jayakumar, S., Swirszcz, G., Jaderberg, M., 2019. Distilling policy distillation, in: The 22nd international conference on artificial intelligence and statistics, PMLR. pp. 1331–1340

work page 2019
[6]

Pdd: Pruning during distillation for efficient multi-agent reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence

Dan, X., Wang, L., He, Z., 2024. Pdd: Pruning during distillation for efficient multi-agent reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence

work page 2024
[7]

Constrained multiagent markov decision processes: A taxonomy of problems and algorithms

De Nijs, F., Walraven, E., De Weerdt, M., Spaan, M., 2021. Constrained multiagent markov decision processes: A taxonomy of problems and algorithms. Journal of Artificial Intelligence Research 70, 955–1001

work page 2021
[8]

Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning

Ellis, B., Cook, J., Moalla, S., Samvelyan, M., Sun, M., Mahajan, A., Foerster, J., Whiteson, S., 2023. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems 36, 37567–37593

work page 2023
[9]

Learning to communicate with deep multi-agent reinforcement learning

Foerster, J., Assael, I.A., De Freitas, N., Whiteson, S., 2016. Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems 29

work page 2016
[10]

Knowru: Knowledge reuse in multi-agent reinforcement learning

Gao, Y ., Zhang, K., Yang, Y ., Li, Y ., Li, Z., Hu, H., 2021. Knowru: Knowledge reuse in multi-agent reinforcement learning. Neurocomput- ing 453, 464–475

work page 2021
[11]

Reciprocal teacher-student learning via forward and feedback knowledge distillation

Gou, J., Chen, Y ., Yu, B., Liu, J., Du, L., Wan, S., Yi, Z., 2024. Reciprocal teacher-student learning via forward and feedback knowledge distillation. IEEE transactions on multimedia 26, 7901–7916

work page 2024
[12]

Multi-agent deep reinforcement learning: a survey

Gronauer, S., Diepold, K., 2022. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review 55, 895–943

work page 2022
[13]

Reinforce- ment learning via auxiliary task distillation, in: European Conference on Computer Vision, Springer

Harish, A.N., Heck, L., Hanna, J.P., Kira, Z., Szot, A., 2024. Reinforce- ment learning via auxiliary task distillation, in: European Conference on Computer Vision, Springer. pp. 214–230

work page 2024
[14]

arXiv preprint arXiv:2308.04268 , year=

Hu, C., Li, X., Liu, D., Wu, H., Chen, X., Wang, J., Liu, X., 2023. Teacher-student architecture for knowledge distillation: A survey. arXiv preprint arXiv:2308.04268

work page arXiv 2023
[15]

Value-based deep multi-agent reinforcement learning with dynamic sparse training

Hu, P., Li, S., Li, Z., Pan, L., Huang, L., 2024. Value-based deep multi-agent reinforcement learning with dynamic sparse training. arXiv preprint arXiv:2409.19391

work page arXiv 2024
[16]

Multi-agent reinforcement learning based cooperative content caching for mobile edge networks

Jiang, W., Feng, G., Qin, S., Liu, Y ., 2019. Multi-agent reinforcement learning based cooperative content caching for mobile edge networks. IEEE Access 7, 61856–61867

work page 2019
[17]

Double distillation network for robust multi-agent coordination

Li, Z., Hu, X., Tang, J., 2025. Double distillation network for robust multi-agent coordination. IEEE Transactions on Pattern Analysis and Machine Intelligence In press

work page 2025
[18]

Feature-level knowledge distillation for place recognition based on soft-hard labels teaching paradigm

Li, Z., Xu, P., Dong, Z., Zhang, R., Deng, Z., 2024. Feature-level knowledge distillation for place recognition based on soft-hard labels teaching paradigm. IEEE Transactions on Intelligent Transportation Systems

work page 2024
[19]

A survey of model compression techniques: Past, present, and future

Liu, D., Zhu, Y ., Liu, Z., Liu, Y ., Han, C., Tian, J., Li, R., Yi, W., 2025. A survey of model compression techniques: Past, present, and future. Frontiers in Robotics and AI 12, 1518965

work page 2025
[20]

Fine- grained learning behavior-oriented knowledge distillation for graph neural networks

Liu, K., Huang, Z., Wang, C.D., Gao, B., Chen, Y ., 2024. Fine- grained learning behavior-oriented knowledge distillation for graph neural networks. IEEE Transactions on Neural Networks and Learning Systems

work page 2024
[21]

Model compression in multi-agent reinforcement learning via reinforcement learning-guided pruning

Liu, W., Chen, J., Zhang, M., 2023. Model compression in multi-agent reinforcement learning via reinforcement learning-guided pruning. IEEE Transactions on Neural Networks and Learning Systems To appear

work page 2023
[22]

Multi-agent actor-critic for mixed cooperative-competitive environments

Lowe, R., Wu, Y .I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I., 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems 30

work page 2017
[23]

Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning, in: Conference on Lifelong Learning Agents, PMLR

Nekoei, H., Badrinaaraayanan, A., Sinha, A., Amini, M., Rajendran, J., Mahajan, A., Chandar, S., 2023. Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning, in: Conference on Lifelong Learning Agents, PMLR. pp. 376–398

work page 2023
[24]

Relational knowledge distillation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Park, W., Kim, D., Lu, Y ., Cho, M., 2019. Relational knowledge distillation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3967–3976

work page 2019
[25]

Policy distil- lation for efficient decentralized execution in multi-agent reinforcement learning

Pei, Y ., Ren, T., Zhang, Y ., Sun, Z., Champeyrol, M., 2025. Policy distil- lation for efficient decentralized execution in multi-agent reinforcement learning. Neurocomputing , 129617

work page 2025
[26]

Monotonic value function factorisation for deep multi-agent reinforcement learning

Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J., Whiteson, S., 2020. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21, 1–51

work page 2020
[27]

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi, V ., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K., et al.,

work page
[28]

2085– 2087

Value-decomposition networks for cooperative multi-agent learn- ing based on team reward, in: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085– 2087

work page 2085
[29]

Offline multi- agent reinforcement learning with knowledge distillation

Tseng, W.C., Wang, T.H.J., Lin, Y .C., Isola, P., 2022. Offline multi- agent reinforcement learning with knowledge distillation. Advances in Neural Information Processing Systems 35, 226–237

work page 2022
[30]

Offline multi-agent reinforcement learning via knowledge distillation, in: Advances in Neural Information Processing Systems (NeurIPS)

Wang, X., Zhao, Y ., Liu, Q., 2022. Offline multi-agent reinforcement learning via knowledge distillation, in: Advances in Neural Information Processing Systems (NeurIPS)

work page 2022
[31]

Deep multiagent reinforcement learning: Challenges and directions

Wong, A., B ¨ack, T., Kononova, A.V ., Plaat, A., 2023. Deep multiagent reinforcement learning: Challenges and directions. Artificial Intelligence Review 56, 5023–5056

work page 2023
[32]

A survey of reinforcement learning-driven knowledge distillation: Techniques, challenges, and applications

Xu, Z., Wang, J., Xu, X., Yu, P., Huang, T., Yi, J., 2025. A survey of reinforcement learning-driven knowledge distillation: Techniques, challenges, and applications

work page 2025
[33]

Multi-teacher knowledge distillation with reinforcement learning for visual recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp

Yang, C., Yu, X., Yang, H., An, Z., Yu, C., Huang, L., Xu, Y ., 2025. Multi-teacher knowledge distillation with reinforcement learning for visual recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9148–9156

work page 2025
[34]

Beyond the edge: An advanced exploration of reinforcement learning for mobile edge computing, its applications, and future research trajectories

Yang, N., Chen, S., Zhang, H., Berry, R., 2024. Beyond the edge: An advanced exploration of reinforcement learning for mobile edge computing, its applications, and future research trajectories. IEEE Communications Surveys & Tutorials 27, 546–594

work page 2024
[35]

Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y ., Bayen, A., Wu, Y .,

work page
[36]

Advances in neural information processing systems 35, 24611– 24624

The surprising effectiveness of ppo in cooperative multi-agent games. Advances in neural information processing systems 35, 24611– 24624

work page
[37]

Entropy- regularized diffusion policy with q-ensembles for offline reinforcement learning

Zhang, R., Luo, Z., Sj ¨olund, J., Sch ¨on, T., Mattsson, P., 2024. Entropy- regularized diffusion policy with q-ensembles for offline reinforcement learning. Advances in Neural Information Processing Systems 37, 98871–98897

work page 2024
[38]

Ctds: Cen- tralized teacher with decentralized student for multiagent reinforcement learning

Zhao, J., Hu, X., Yang, M., Zhou, W., Zhu, J., Li, H., 2022. Ctds: Cen- tralized teacher with decentralized student for multiagent reinforcement learning. IEEE Transactions on Games 16, 140–150

work page 2022
[39]

Heterogeneous-agent reinforcement learning

Zhong, Y ., Kuba, J.G., Feng, X., Hu, S., Ji, J., Yang, Y ., 2024. Heterogeneous-agent reinforcement learning. Journal of Machine Learn- ing Research 25, 1–67

work page 2024

[1] [1]

Recent advances on cooperative control of heterogeneous multi-agent systems subject to constraints: A survey

Bao, G., Ma, L., Yi, X., 2022. Recent advances on cooperative control of heterogeneous multi-agent systems subject to constraints: A survey. Systems Science & Control Engineering 10, 539–551

work page 2022

[2] [2]

Multiagent meta-reinforcement learning for adaptive multipath routing optimization

Chen, L., Hu, B., Guan, Z.H., Zhao, L., Shen, X., 2021. Multiagent meta-reinforcement learning for adaptive multipath routing optimization. IEEE Transactions on Neural Networks and Learning Systems 33, 5374– 5386

work page 2021

[3] [3]

Portfolio management with multi-agent reinforcement learning: A role-aware distillation approach

Chen, R., Lin, J., Zhou, Y ., 2023. Portfolio management with multi-agent reinforcement learning: A role-aware distillation approach. Journal of Financial Data Science

work page 2023

[4] [4]

Chen, Y ., Mao, H., Mao, J., Wu, S., Zhang, T., Zhang, B., Yang, W., Chang, H., 2024. Ptde: personalized training with distilled execution for multi-agent reinforcement learning, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 31–39

work page 2024

[5] [5]

Distilling policy distillation, in: The 22nd international conference on artificial intelligence and statistics, PMLR

Czarnecki, W.M., Pascanu, R., Osindero, S., Jayakumar, S., Swirszcz, G., Jaderberg, M., 2019. Distilling policy distillation, in: The 22nd international conference on artificial intelligence and statistics, PMLR. pp. 1331–1340

work page 2019

[6] [6]

Pdd: Pruning during distillation for efficient multi-agent reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence

Dan, X., Wang, L., He, Z., 2024. Pdd: Pruning during distillation for efficient multi-agent reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence

work page 2024

[7] [7]

Constrained multiagent markov decision processes: A taxonomy of problems and algorithms

De Nijs, F., Walraven, E., De Weerdt, M., Spaan, M., 2021. Constrained multiagent markov decision processes: A taxonomy of problems and algorithms. Journal of Artificial Intelligence Research 70, 955–1001

work page 2021

[8] [8]

Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning

Ellis, B., Cook, J., Moalla, S., Samvelyan, M., Sun, M., Mahajan, A., Foerster, J., Whiteson, S., 2023. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems 36, 37567–37593

work page 2023

[9] [9]

Learning to communicate with deep multi-agent reinforcement learning

Foerster, J., Assael, I.A., De Freitas, N., Whiteson, S., 2016. Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems 29

work page 2016

[10] [10]

Knowru: Knowledge reuse in multi-agent reinforcement learning

Gao, Y ., Zhang, K., Yang, Y ., Li, Y ., Li, Z., Hu, H., 2021. Knowru: Knowledge reuse in multi-agent reinforcement learning. Neurocomput- ing 453, 464–475

work page 2021

[11] [11]

Reciprocal teacher-student learning via forward and feedback knowledge distillation

Gou, J., Chen, Y ., Yu, B., Liu, J., Du, L., Wan, S., Yi, Z., 2024. Reciprocal teacher-student learning via forward and feedback knowledge distillation. IEEE transactions on multimedia 26, 7901–7916

work page 2024

[12] [12]

Multi-agent deep reinforcement learning: a survey

Gronauer, S., Diepold, K., 2022. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review 55, 895–943

work page 2022

[13] [13]

Reinforce- ment learning via auxiliary task distillation, in: European Conference on Computer Vision, Springer

Harish, A.N., Heck, L., Hanna, J.P., Kira, Z., Szot, A., 2024. Reinforce- ment learning via auxiliary task distillation, in: European Conference on Computer Vision, Springer. pp. 214–230

work page 2024

[14] Hu, C., Li, X., Liu, D., Wu, H., Chen, X., Wang, J., Liu, X., 2023. Teacher-student architecture for knowledge distillation: A survey. arXiv preprint arXiv:2308.04268

[15] Hu, P., Li, S., Li, Z., Pan, L., Huang, L., 2024. Value-based deep multi-agent reinforcement learning with dynamic sparse training. arXiv preprint arXiv:2409.19391

[16] Jiang, W., Feng, G., Qin, S., Liu, Y., 2019. Multi-agent reinforcement learning based cooperative content caching for mobile edge networks. IEEE Access 7, 61856–61867

[17] Li, Z., Hu, X., Tang, J., 2025. Double distillation network for robust multi-agent coordination. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press

[18] Li, Z., Xu, P., Dong, Z., Zhang, R., Deng, Z., 2024. Feature-level knowledge distillation for place recognition based on soft-hard labels teaching paradigm. IEEE Transactions on Intelligent Transportation Systems

[19] Liu, D., Zhu, Y., Liu, Z., Liu, Y., Han, C., Tian, J., Li, R., Yi, W., 2025. A survey of model compression techniques: Past, present, and future. Frontiers in Robotics and AI 12, 1518965

[20] Liu, K., Huang, Z., Wang, C.D., Gao, B., Chen, Y., 2024. Fine-grained learning behavior-oriented knowledge distillation for graph neural networks. IEEE Transactions on Neural Networks and Learning Systems

[21] Liu, W., Chen, J., Zhang, M., 2023. Model compression in multi-agent reinforcement learning via reinforcement learning-guided pruning. IEEE Transactions on Neural Networks and Learning Systems, to appear

[22] Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I., 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems 30

[23] Nekoei, H., Badrinaaraayanan, A., Sinha, A., Amini, M., Rajendran, J., Mahajan, A., Chandar, S., 2023. Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning, in: Conference on Lifelong Learning Agents, PMLR. pp. 376–398

[24] Park, W., Kim, D., Lu, Y., Cho, M., 2019. Relational knowledge distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967–3976

[25] Pei, Y., Ren, T., Zhang, Y., Sun, Z., Champeyrol, M., 2025. Policy distillation for efficient decentralized execution in multi-agent reinforcement learning. Neurocomputing, 129617

[26] Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J., Whiteson, S., 2020. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21, 1–51

[27] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K., et al., 2018. Value-decomposition networks for cooperative multi-agent learning based on team reward, in: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087

[29] Tseng, W.C., Wang, T.H.J., Lin, Y.C., Isola, P., 2022. Offline multi-agent reinforcement learning with knowledge distillation. Advances in Neural Information Processing Systems 35, 226–237

[30] Wang, X., Zhao, Y., Liu, Q., 2022. Offline multi-agent reinforcement learning via knowledge distillation, in: Advances in Neural Information Processing Systems (NeurIPS)

[31] Wong, A., Bäck, T., Kononova, A.V., Plaat, A., 2023. Deep multiagent reinforcement learning: Challenges and directions. Artificial Intelligence Review 56, 5023–5056

[32] Xu, Z., Wang, J., Xu, X., Yu, P., Huang, T., Yi, J., 2025. A survey of reinforcement learning-driven knowledge distillation: Techniques, challenges, and applications

[33] Yang, C., Yu, X., Yang, H., An, Z., Yu, C., Huang, L., Xu, Y., 2025. Multi-teacher knowledge distillation with reinforcement learning for visual recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9148–9156

[34] Yang, N., Chen, S., Zhang, H., Berry, R., 2024. Beyond the edge: An advanced exploration of reinforcement learning for mobile edge computing, its applications, and future research trajectories. IEEE Communications Surveys & Tutorials 27, 546–594

[35] Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A., Wu, Y., 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems 35, 24611–24624

[37] Zhang, R., Luo, Z., Sjölund, J., Schön, T., Mattsson, P., 2024. Entropy-regularized diffusion policy with q-ensembles for offline reinforcement learning. Advances in Neural Information Processing Systems 37, 98871–98897

[38] Zhao, J., Hu, X., Yang, M., Zhou, W., Zhu, J., Li, H., 2022. Ctds: Centralized teacher with decentralized student for multiagent reinforcement learning. IEEE Transactions on Games 16, 140–150

[39] Zhong, Y., Kuba, J.G., Feng, X., Hu, S., Ji, J., Yang, Y., 2024. Heterogeneous-agent reinforcement learning. Journal of Machine Learning Research 25, 1–67