STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

Mengnuo Zhang; Mingyang Feng; Shaoyuan Li; Xiang Yin

arxiv: 2605.20929 · v1 · pith:74YGJJ3Xnew · submitted 2026-05-20 · 💻 cs.RO

STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

Mingyang Feng , Mengnuo Zhang , Shaoyuan Li , Xiang Yin This is my paper

Pith reviewed 2026-05-21 04:41 UTC · model grok-4.3

classification 💻 cs.RO

keywords decentralized multi-agent path findingcongestion awarenesstraining-free enhancementtest-time interventionMAPFmulti-agent systemspath planning

0 comments

The pith

STEAM enhances any pretrained decentralized MAPF policy at test time by detecting future congestion via shortest-path rollouts and applying targeted spatial, temporal, and density corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STEAM as a lightweight, training-free add-on that improves existing learned decentralized policies for multi-agent path finding in discrete grids. It works by first simulating shortest paths from the current cost-to-go estimates to locate likely upcoming congestion points, then adjusting agent cost maps for avoidable spatial conflicts, applying logit corrections for fixed bottlenecks, and using neighbor density information to ease local crowding. A reader would care because many practical multi-robot systems depend on decentralized learned policies that frequently fail under dynamic crowding, and this approach offers measurable gains in success rate, completion time, and total cost with negligible extra computation. Experiments across several representative algorithms confirm consistent improvements, including success-rate increases reaching 60 percent.

Core claim

STEAM requires no retraining or architectural changes to a given pretrained decentralized policy. It first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is then mitigated by updating agent-specific cost-to-go information, spatially unavoidable bottlenecks are handled through temporal logit correction, and emergent local congestion is reduced by density-aware logit correction based on neighboring agents' corrected maps. This process yields higher success rates, lower makespan, and lower solution cost with only minor added overhead.

What carries the argument

Shortest-path rollouts on current cost-to-go maps that feed into spatial cost-to-go updates, temporal logit corrections, and density-aware logit corrections to steer policy execution away from predicted congestion.

If this is right

Any existing learning-based decentralized MAPF policy can be enhanced without retraining or replacement by a centralized planner.
Success rate, makespan, and solution cost all improve, with success-rate gains reaching 60 percent.
The added computation remains minor across tested algorithms and environments.
The same corrections apply uniformly to representative decentralized MAPF methods in discrete settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Congestion failures in learned policies often arise from incomplete foresight that can be supplied after training by simple rollout checks.
The same detection-plus-correction pattern could be examined in continuous-space or partially observable multi-agent tasks.
Widespread use might lower reliance on fully centralized solvers for large robot fleets.

Load-bearing premise

That rolling out the shortest paths induced by the current cost-to-go maps will reliably identify the future congestion hotspots that actually matter for the learned policy's execution.

What would settle it

A controlled test in which shortest-path rollouts from cost-to-go maps systematically miss the actual bottlenecks that arise during policy execution, resulting in zero or negative performance change under STEAM.

Figures

Figures reproduced from arXiv: 2605.20929 by Mengnuo Zhang, Mingyang Feng, Shaoyuan Li, Xiang Yin.

**Figure 1.** Figure 1: Overview of STEAM. The Spatial-, Temporal-, and Emergent congestion-Aware modules for MAPF capture congestion from global and local perspectives. Black arrows denote the original decentralized MAPF policy pipeline. Red arrows indicate interactions between the proposed modules and the original pipeline, while blue arrows denote our extensions. global and local guidance during test-time execution. Specifica… view at source ↗

**Figure 2.** Figure 2: Optimization performance of MAPF-GPT across different scenarios. The first column shows the map type, and the [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion-aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent-specific cost-to-go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density-aware logit correction based on neighboring agents' corrected cost-to-go maps. Extensive experiments on representative learning-based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success-rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM-MAPF-7A62.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STEAM is a lightweight test-time plug-in that tries to fix congestion in pretrained decentralized MAPF policies via shortest-path rollouts and three corrections, but its gains depend on how well those rollouts match actual policy trajectories.

read the letter

Colleague, the main takeaway is that this paper introduces STEAM as a training-free add-on for learning-based decentralized MAPF. It rolls out shortest paths from current cost-to-go maps to spot future hotspots, then applies a spatial cost-to-go update for avoidable congestion, a temporal logit correction for unavoidable bottlenecks, and a density-aware logit correction drawn from neighbors' maps. The whole thing runs at execution time with no retraining or switch to centralized planning, and the abstract reports consistent lifts in success rate, makespan, and cost across several base algorithms, with peaks near 60% and only small extra compute. Code is released too, which helps. That combination of corrections looks like a concrete new recipe rather than a direct lift from prior MAPF work, and the framing as a plug-in for existing policies is a practical strength for settings where central solvers do not scale. The experiments are described as extensive, so if the tables and ablations hold up they would give a clear picture of where each piece helps. The soft spot is the core assumption that shortest-path rollouts will reliably flag the congestion that actually arises under the learned policy. Because the policy uses local observations and can deviate from pure shortest-path behavior, the predicted hotspots could miss the mark or point to the wrong places and times. The paper would need to show that the approximation is close enough in practice, perhaps by comparing rollout predictions to real executed trajectories or by testing sensitivity to policy stochasticity. Without that check the claimed gains could shrink. The work is aimed at researchers building or deploying learned multi-robot controllers who want a quick reliability boost without redesigning everything. A reader focused on practical MAPF enhancements would find the details and open implementation useful. I would bring it to a reading group to walk through the rollout step and the experimental controls. It deserves peer review so the empirical claims and the hotspot-prediction assumption can be examined in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces STEAM, a training-free test-time enhancement for pretrained decentralized learning-based MAPF policies in discrete grids. It identifies future congestion by rolling out shortest paths implied by current cost-to-go maps, then applies three corrections: spatial updates to agent-specific cost-to-go values for avoidable congestion, temporal logit corrections for unavoidable bottlenecks, and density-aware logit corrections based on neighbors' corrected maps. The central empirical claim is that STEAM yields consistent gains in success rate (up to 60%), makespan, and solution cost across representative algorithms with only minor overhead.

Significance. If the gains prove robust and the hotspot-identification assumption holds, STEAM would offer a practical, modular addition to the decentralized MAPF literature by improving existing policies without retraining or centralization. The open-source implementation supports reproducibility. The approach is noteworthy for its additive, policy-agnostic design.

major comments (2)

[§3] §3 (Method): The framework's core step—rolling out shortest paths from cost-to-go maps to locate congestion hotspots—assumes these trajectories approximate the actual future density patterns that arise under the learned decentralized policy. Because the policy is trained on local observations and may deviate from greedy shortest-path behavior due to stochasticity or interactions, the predicted hotspots can diverge from reality. The manuscript provides no direct validation (e.g., correlation analysis or ablation comparing predicted vs. observed congestion) that this approximation is reliable; if the divergence is large, the corrections target incorrect locations and the reported success-rate gains are undermined.
[Experimental section] Experimental section / Table 1 (or equivalent results table): The abstract asserts 'consistent' improvements and 'up to 60%' success-rate gains across algorithms, yet the provided description lacks quantitative details on baseline comparisons, per-component ablations (spatial vs. temporal vs. density-aware), statistical significance, or variation across map sizes and agent densities. Without these, the load-bearing claim of reliable, generalizable enhancement cannot be assessed.

minor comments (2)

[§3.3] Clarify the exact formulation of the temporal logit correction and how it interacts with the original policy logits; an equation or pseudocode block would improve precision.
Figure captions for any congestion or rollout visualizations should explicitly state the map, agent count, and time step shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [§3] §3 (Method): The framework's core step—rolling out shortest paths from cost-to-go maps to locate congestion hotspots—assumes these trajectories approximate the actual future density patterns that arise under the learned decentralized policy. Because the policy is trained on local observations and may deviate from greedy shortest-path behavior due to stochasticity or interactions, the predicted hotspots can diverge from reality. The manuscript provides no direct validation (e.g., correlation analysis or ablation comparing predicted vs. observed congestion) that this approximation is reliable; if the divergence is large, the corrections target incorrect locations and the reported success-rate gains are undermined.

Authors: The rollout of shortest paths from the learned cost-to-go maps is intended as a lightweight, training-free proxy for anticipated future trajectories. Because the cost-to-go values are optimized to guide agents toward goals under the decentralized policy, the induced paths provide a reasonable estimate of where congestion is likely to emerge. The consistent empirical gains reported across multiple policies and environments indicate that the identified hotspots are sufficiently accurate for the subsequent corrections to be effective. Nevertheless, we agree that a direct validation would strengthen the claims. In the revised manuscript we will add a dedicated analysis subsection that reports correlation between predicted and observed congestion (computed from full policy rollouts) together with an ablation that replaces the predicted hotspots with ground-truth future density; these results will quantify the approximation error and its impact on final performance. revision: yes
Referee: [Experimental section] Experimental section / Table 1 (or equivalent results table): The abstract asserts 'consistent' improvements and 'up to 60%' success-rate gains across algorithms, yet the provided description lacks quantitative details on baseline comparisons, per-component ablations (spatial vs. temporal vs. density-aware), statistical significance, or variation across map sizes and agent densities. Without these, the load-bearing claim of reliable, generalizable enhancement cannot be assessed.

Authors: The manuscript already reports results on several representative learning-based decentralized MAPF algorithms and states success-rate gains of up to 60% together with improvements in makespan and solution cost. To make the quantitative support fully transparent, we will expand the experimental section in the revision. The updated version will include: explicit numerical baseline comparisons, per-component ablations that isolate the spatial, temporal, and density-aware corrections, statistical significance tests (e.g., paired Wilcoxon signed-rank tests with p-values), and performance breakdowns across map sizes and agent densities. These additions will appear in revised tables and supplementary figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in STEAM derivation or claims

full rationale

The paper describes STEAM as a training-free, additive test-time framework that applies heuristic rollouts of shortest paths from existing cost-to-go maps to detect congestion, followed by spatial updates, temporal logit corrections, and density-aware corrections. No equations, fitted parameters, or self-citations are presented that reduce the reported performance gains (success rate, makespan) to the inputs by construction. The method operates on external pretrained decentralized policies without retraining or architectural changes, and the central claims rest on experimental validation rather than self-referential definitions or load-bearing self-citations. The rollout assumption is a methodological heuristic whose validity is separate from circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that pretrained decentralized policies produce usable cost-to-go maps and that short-horizon shortest-path rollouts are sufficient proxies for future congestion; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Pretrained decentralized policies produce cost-to-go maps that can be used to induce shortest paths for congestion prediction.
Invoked when the method first rolls out shortest paths from current cost-to-go maps.

pith-pipeline@v0.9.0 · 5737 in / 1217 out tokens · 27488 ms · 2026-05-21T04:41:49.929552+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

DeepFleet: Multi-Agent Foundation Models for Mobile Robots

Ameya Agaskar, Sriram Siva, William Pickering, Kyle O’Brien, Charles Kekeh, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, et al. Deepfleet: Multi-agent foundation models for mobile robots.arXiv:2508.08574, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Advancing learnable multi-agent pathfinding solvers with active fine-tuning

Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, and Alexey Skrynnik. Advancing learnable multi-agent pathfinding solvers with active fine-tuning. InIEEE/RSJ International Conf. Intelligent Robots and Systems, pages 10564–10571, 2025

work page 2025
[3]

Mapf-gpt: Imitation learning for multi-agent pathfinding at scale

Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, and Alexey Skrynnik. Mapf-gpt: Imitation learning for multi-agent pathfinding at scale. InProceedings of the AAAI Conf. Artificial Intelligence, volume 39, pages 23126–23134, 2025

work page 2025
[4]

Primal 2: Pathfinding via reinforcement and imitation multi- agent learning-lifelong.IEEE Robotics and Automation Letters, 6(2):2666–2673, 2021

Mehul Damani, Zhiyao Luo, Emerson Wenzel, and Guillaume Sar- toretti. Primal 2: Pathfinding via reinforcement and imitation multi- agent learning-lifelong.IEEE Robotics and Automation Letters, 6(2):2666–2673, 2021

work page 2021
[5]

On multiple moving objects.Algorithmica, 2(1):477–521, 1987

Michael Erdmann and Tomas Lozano-Perez. On multiple moving objects.Algorithmica, 2(1):477–521, 1987

work page 1987
[6]

Solving multiagent path finding on highly centralized networks

Foivos Fioravantes, Du ˇsan Knop, Jan Maty ´aˇs K ˇriˇst’an, Nikolaos Melissinos, Michal Opler, and Tung Anh Vu. Solving multiagent path finding on highly centralized networks. InProceedings of the AAAI Conf. Artificial Intelligence, volume 39, pages 23186–23193, 2025

work page 2025
[7]

Interpretable multi-agent path finding via decision tree extraction from neural policies

Thibault Lahire. Interpretable multi-agent path finding via decision tree extraction from neural policies. 2026

work page 2026
[8]

Eecbs: A bounded- suboptimal search for multi-agent path finding

Jiaoyang Li, Wheeler Ruml, and Sven Koenig. Eecbs: A bounded- suboptimal search for multi-agent path finding. InProceedings of the AAAI Conf. Artificial Intelligence, volume 35, pages 12353–12362, 2021

work page 2021
[9]

Multi-agent path finding for large agents

Jiaoyang Li, Pavel Surynek, Ariel Felner, Hang Ma, TK Satish Kumar, and Sven Koenig. Multi-agent path finding for large agents. In Proceedings of the AAAI Conf. Artificial Intelligence, volume 33, pages 7627–7634, 2019

work page 2019
[10]

Message- aware graph attention networks for large-scale multi-robot path plan- ning.IEEE Robotics and Automation Letters, 6(3):5533–5540, 2021

Qingbiao Li, Weizhe Lin, Zhe Liu, and Amanda Prorok. Message- aware graph attention networks for large-scale multi-robot path plan- ning.IEEE Robotics and Automation Letters, 6(3):5533–5540, 2021

work page 2021
[11]

Lacam: Search-based algorithm for quick multi- agent pathfinding

Keisuke Okumura. Lacam: Search-based algorithm for quick multi- agent pathfinding. InProceedings of the AAAI Conf. Artificial Intelligence, volume 37, pages 11655–11662, 2023

work page 2023
[12]

Primal: Pathfinding via reinforcement and imitation multi-agent learning.IEEE Robotics and Automation Letters, 4(3):2378–2385, 2019

Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, TK Satish Kumar, Sven Koenig, and Howie Choset. Primal: Pathfinding via reinforcement and imitation multi-agent learning.IEEE Robotics and Automation Letters, 4(3):2378–2385, 2019

work page 2019
[13]

Cooperative pathfinding

David Silver. Cooperative pathfinding. InProceedings of the AAAI Conf. Artificial Intelligence and Interactive Digital Entertainment, volume 1, pages 117–122, 2005

work page 2005
[14]

Pogema: A benchmark platform for cooperative multi-agent pathfinding

Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, and Aleksandr Panov. Pogema: A benchmark platform for cooperative multi-agent pathfinding. arXiv:2407.14931, 2024

work page arXiv 2024
[15]

Multi-agent path finding–an overview.Artificial Intel- ligence: 5th RAAI Summer School, Tutorial Lectures, pages 96–115, 2019

Roni Stern. Multi-agent path finding–an overview.Artificial Intel- ligence: 5th RAAI Summer School, Tutorial Lectures, pages 96–115, 2019

work page 2019
[16]

Deadlock-free hybrid rl-mapf framework for zero-shot multi-robot navigation.arXiv:2511.22685, 2025

Haoyi Wang, Licheng Luo, Yiannis Kantaros, Bruno Sinopoli, and Mingyu Cai. Deadlock-free hybrid rl-mapf framework for zero-shot multi-robot navigation.arXiv:2511.22685, 2025

work page arXiv 2025
[17]

Intel- ligent games meeting with multi-agent deep reinforcement learning: a comprehensive review.Artificial Intelligence Review, 58(6):165, 2025

Yiqin Wang, Yufeng Wang, Feng Tian, Jianhua Ma, and Qun Jin. Intel- ligent games meeting with multi-agent deep reinforcement learning: a comprehensive review.Artificial Intelligence Review, 58(6):165, 2025

work page 2025

[1] [1]

DeepFleet: Multi-Agent Foundation Models for Mobile Robots

Ameya Agaskar, Sriram Siva, William Pickering, Kyle O’Brien, Charles Kekeh, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, et al. Deepfleet: Multi-agent foundation models for mobile robots.arXiv:2508.08574, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Advancing learnable multi-agent pathfinding solvers with active fine-tuning

Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, and Alexey Skrynnik. Advancing learnable multi-agent pathfinding solvers with active fine-tuning. InIEEE/RSJ International Conf. Intelligent Robots and Systems, pages 10564–10571, 2025

work page 2025

[3] [3]

Mapf-gpt: Imitation learning for multi-agent pathfinding at scale

Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, and Alexey Skrynnik. Mapf-gpt: Imitation learning for multi-agent pathfinding at scale. InProceedings of the AAAI Conf. Artificial Intelligence, volume 39, pages 23126–23134, 2025

work page 2025

[4] [4]

Primal 2: Pathfinding via reinforcement and imitation multi- agent learning-lifelong.IEEE Robotics and Automation Letters, 6(2):2666–2673, 2021

Mehul Damani, Zhiyao Luo, Emerson Wenzel, and Guillaume Sar- toretti. Primal 2: Pathfinding via reinforcement and imitation multi- agent learning-lifelong.IEEE Robotics and Automation Letters, 6(2):2666–2673, 2021

work page 2021

[5] [5]

On multiple moving objects.Algorithmica, 2(1):477–521, 1987

Michael Erdmann and Tomas Lozano-Perez. On multiple moving objects.Algorithmica, 2(1):477–521, 1987

work page 1987

[6] [6]

Solving multiagent path finding on highly centralized networks

Foivos Fioravantes, Du ˇsan Knop, Jan Maty ´aˇs K ˇriˇst’an, Nikolaos Melissinos, Michal Opler, and Tung Anh Vu. Solving multiagent path finding on highly centralized networks. InProceedings of the AAAI Conf. Artificial Intelligence, volume 39, pages 23186–23193, 2025

work page 2025

[7] [7]

Interpretable multi-agent path finding via decision tree extraction from neural policies

Thibault Lahire. Interpretable multi-agent path finding via decision tree extraction from neural policies. 2026

work page 2026

[8] [8]

Eecbs: A bounded- suboptimal search for multi-agent path finding

Jiaoyang Li, Wheeler Ruml, and Sven Koenig. Eecbs: A bounded- suboptimal search for multi-agent path finding. InProceedings of the AAAI Conf. Artificial Intelligence, volume 35, pages 12353–12362, 2021

work page 2021

[9] [9]

Multi-agent path finding for large agents

Jiaoyang Li, Pavel Surynek, Ariel Felner, Hang Ma, TK Satish Kumar, and Sven Koenig. Multi-agent path finding for large agents. In Proceedings of the AAAI Conf. Artificial Intelligence, volume 33, pages 7627–7634, 2019

work page 2019

[10] [10]

Message- aware graph attention networks for large-scale multi-robot path plan- ning.IEEE Robotics and Automation Letters, 6(3):5533–5540, 2021

Qingbiao Li, Weizhe Lin, Zhe Liu, and Amanda Prorok. Message- aware graph attention networks for large-scale multi-robot path plan- ning.IEEE Robotics and Automation Letters, 6(3):5533–5540, 2021

work page 2021

[11] [11]

Lacam: Search-based algorithm for quick multi- agent pathfinding

Keisuke Okumura. Lacam: Search-based algorithm for quick multi- agent pathfinding. InProceedings of the AAAI Conf. Artificial Intelligence, volume 37, pages 11655–11662, 2023

work page 2023

[12] [12]

Primal: Pathfinding via reinforcement and imitation multi-agent learning.IEEE Robotics and Automation Letters, 4(3):2378–2385, 2019

Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, TK Satish Kumar, Sven Koenig, and Howie Choset. Primal: Pathfinding via reinforcement and imitation multi-agent learning.IEEE Robotics and Automation Letters, 4(3):2378–2385, 2019

work page 2019

[13] [13]

Cooperative pathfinding

David Silver. Cooperative pathfinding. InProceedings of the AAAI Conf. Artificial Intelligence and Interactive Digital Entertainment, volume 1, pages 117–122, 2005

work page 2005

[14] [14]

Pogema: A benchmark platform for cooperative multi-agent pathfinding

Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, and Aleksandr Panov. Pogema: A benchmark platform for cooperative multi-agent pathfinding. arXiv:2407.14931, 2024

work page arXiv 2024

[15] [15]

Multi-agent path finding–an overview.Artificial Intel- ligence: 5th RAAI Summer School, Tutorial Lectures, pages 96–115, 2019

Roni Stern. Multi-agent path finding–an overview.Artificial Intel- ligence: 5th RAAI Summer School, Tutorial Lectures, pages 96–115, 2019

work page 2019

[16] [16]

Deadlock-free hybrid rl-mapf framework for zero-shot multi-robot navigation.arXiv:2511.22685, 2025

Haoyi Wang, Licheng Luo, Yiannis Kantaros, Bruno Sinopoli, and Mingyu Cai. Deadlock-free hybrid rl-mapf framework for zero-shot multi-robot navigation.arXiv:2511.22685, 2025

work page arXiv 2025

[17] [17]

Intel- ligent games meeting with multi-agent deep reinforcement learning: a comprehensive review.Artificial Intelligence Review, 58(6):165, 2025

Yiqin Wang, Yufeng Wang, Feng Tian, Jianhua Ma, and Qun Jin. Intel- ligent games meeting with multi-agent deep reinforcement learning: a comprehensive review.Artificial Intelligence Review, 58(6):165, 2025

work page 2025