pith. machine review for the scientific record.

arxiv: 2605.08391 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

James Zachary Hare, Jesse Milzman, Nikunj Gupta, Rajgopal Kannan, Viktor Prasanna

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords: multi-agent reinforcement learning · coordination graphs · graph transformers · partial observability · information integration · message passing · cooperative agents

The pith

Graph transformer convolutions on coordination graphs integrate receiver-sensitive teammate signals to overcome the partial-observability bottleneck in cooperative multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to resolve the core information bottleneck in cooperative multi-agent reinforcement learning, where each agent sees only local partial observations yet must select actions that are jointly optimal with teammates whose observations, intentions, and choices remain hidden. It does so by framing coordination as structured holistic information integration and introducing graph transformer convolutions that run over an explicit inter-agent coordination graph. These convolutions enrich every agent's internal representation with signals that are both receiver-sensitive and content-dependent before any action is chosen. A sympathetic reader would care because the approach replaces ad-hoc compression or separate learned channels with a single, architecture-level mechanism that demonstrably yields higher team performance across diverse task types.
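To make that mechanism concrete, here is a minimal sketch of a graph transformer convolution of this kind in PyTorch. It is an illustration, not the paper's exact SACHI layer: the class name, single attention head, and dimensions are assumed. What it shows is how a receiver-side query and sender-side keys and values make the routed messages receiver-sensitive and content-dependent, masked to the edges of the coordination graph.

```python
# Minimal sketch of receiver-sensitive, content-dependent message passing over a
# coordination graph. Illustrative assumptions throughout: single head, layer
# names, and dimensions are ours, not the paper's exact SACHI architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphTransformerConv(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)    # receiver-side query: messages become receiver-sensitive
        self.k = nn.Linear(dim, dim)    # sender-side key: routing becomes content-dependent
        self.v = nn.Linear(dim, dim)    # sender-side value: the content that gets routed
        self.out = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   (n_agents, dim) per-agent embeddings from local observations
        # adj: (n_agents, n_agents) coordination graph (1 = edge, 0 = no edge)
        scores = self.q(h) @ self.k(h).T / h.shape[-1] ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))   # attend only along graph edges
        attn = F.softmax(scores, dim=-1)                       # row i: how agent i weighs teammates
        messages = attn @ self.v(h)                            # integrate teammates' content
        return h + self.out(messages)                          # residual update before action selection

# Toy usage: 4 agents on a fully connected coordination graph (self-loops included
# so every row of the attention matrix has at least one valid edge).
h = torch.randn(4, 32)
adj = torch.ones(4, 4)
layer = GraphTransformerConv(32)
print(layer(h, adj).shape)  # torch.Size([4, 32])
```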

Core claim

By treating action coordination as a problem of holistic information integration, SACHI applies graph transformer convolutions over an inter-agent coordination graph to supply each agent with receiver-sensitive, content-dependent signals drawn from teammates prior to action selection. The resulting agents match or exceed the strongest of twelve baselines on every one of five tasks spanning spatial, communicative, and adversarial settings, and aggregate statistical tests (normalized scores, bootstrap intervals, Friedman ranking, and performance profiling) confirm the advantage is significant, robust, and independent of model capacity.

What carries the argument

Graph transformer convolutions over the inter-agent coordination graph, which extract and route receiver-sensitive, content-dependent signals into each agent's representation before action selection.

If this is right

  • SACHI matches or outperforms the strongest baseline on every task tested.
  • Statistical analyses with bootstrap intervals and Friedman ranking establish that the performance edge is significant and consistent across environments (both tests are sketched after this list).
  • Parameter-matched ablations isolate the source of gains to the degree of content dependence in the message-passing operator.
  • The same architecture succeeds across spatial, communicative, and adversarial coordination problems without requiring extra model capacity.
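The two aggregate tests named above can be sketched on synthetic numbers. The scores below are invented and the baseline names are stand-ins; only the procedure (bootstrap resampling of means, Friedman test over per-environment results) mirrors the analysis the paper describes.

```python
# Toy version of the paper's two aggregate tests, on synthetic scores.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
# rows = environments (5); values = hypothetical normalized final scores per method
scores = {
    "SACHI": rng.normal(1.05, 0.05, size=5),
    "QMIX":  rng.normal(0.90, 0.08, size=5),
    "MAPPO": rng.normal(0.93, 0.07, size=5),
}

def bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    # 95% bootstrap confidence interval for the mean normalized score.
    means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

for name, x in scores.items():
    lo, hi = bootstrap_ci(x)
    print(f"{name}: mean={x.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")

# Friedman test: do the methods' per-environment results differ significantly in rank?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")
```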

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit use of a coordination graph may scale more predictably to larger agent teams than fully learned communication protocols.
  • If the graph structure itself can be learned or adapted online, the method could extend to environments where optimal coordination patterns change over time.
  • The receiver-sensitive nature of the signals suggests similar graph-based enrichment could improve other partially observed multi-agent settings such as sensor networks or distributed robotics.

Load-bearing premise

A coordination graph can be specified such that the graph transformer convolutions reliably deliver the exact receiver-sensitive signals needed to resolve the information bottleneck without introducing new training instabilities or scalability limits.

What would settle it

The claim would be falsified by a controlled experiment in which SACHI, evaluated on a held-out cooperative task, either fails to match the best baseline or shows no performance difference once the message-passing operator is made content-independent.
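A hypothetical reading of the second arm of that test: the same message-passing update with content-dependent routing versus a content-independent control whose routing weights ignore teammate content. The uniform-averaging control here is our stand-in, not necessarily the paper's ablation.

```python
# Hypothetical contrast: content-dependent routing (weights computed from teammate
# content) versus a content-independent control (uniform neighbor averaging).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_dependent_update(h, adj):
    # Routing weights depend on what teammates currently encode.
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores = np.where(adj > 0, scores, -np.inf)
    return softmax(scores) @ h

def content_independent_update(h, adj):
    # Routing weights depend only on graph structure: uniform over neighbors.
    weights = adj / adj.sum(axis=1, keepdims=True)
    return weights @ h

h = np.random.default_rng(1).normal(size=(4, 8))
adj = np.ones((4, 4))  # fully connected coordination graph with self-loops
print(np.allclose(content_dependent_update(h, adj),
                  content_independent_update(h, adj)))  # False in general
```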

Figures

Figures reproduced from arXiv: 2605.08391 by James Zachary Hare, Jesse Milzman, Nikunj Gupta, Rajgopal Kannan, Viktor Prasanna.

Figure 1. Solving the MARL information bottleneck. (a) Each agent holds only a fragment of the information needed to determine the globally optimal joint action; (b) since agents cannot see their teammates' observations or intentions, their independent "rational" choices often lead to group failure; (c) SACHI resolves this by allowing agents to intelligently filter and "borrow" relevant context from teammates through …

Figure 2. Overview of SACHI. Local observations are encoded into agent embeddings, refined through soft-attention-modulated graph transformer message passing, and mapped to per-agent Q-values. The model performs receiver-dependent information integration prior to action selection, allowing coordinated behavior to emerge under decentralized execution.

Figure 3. REFERENCE: a1 observes a2's goal (orange dashed) and vice versa (teal dashed). Each agent signals this (purple) so its partner can navigate to the correct landmark (solid arrows). Success requires learning a shared protocol in which each agent acts as both an informative sender and a faithful receiver; the task directly tests content-dependent coordination.

Figure 5. SPEAKER LISTENER [6]: fixed role asymmetry. The speaker observes the target (⋆) but cannot move, and broadcasts a discrete symbol (orange); the listener interprets the symbol and navigates (teal), unable to see the target directly.

Figure 7. DISPERSE [33]: n agents must occupy n identical resource zones with exactly one agent per zone. Rewards depend on conflict-free coverage, but all zones are locally indistinguishable, making symmetry-breaking the central challenge; greedy local policies lead to clustering and under-coverage.

Figure 8. Learning curves across five cooperative environments (mean …).

Figure 9. Aggregate normalized scores with 95% bootstrap confidence intervals. SACHI leads on all four metrics; its CI does not overlap with any baseline on Mean, IQM, or OG. The OG metric measures by how much, on average, a method falls short of the best-baseline ceiling (v̂ = 1); a method that always matches or exceeds the best baseline has OG = 0. For each metric, 95% confidence intervals are constructed by bootstrap resampling.

Figure 10. Average rank across five environments (lower is better).

Figure 11. Performance profile: SACHI stochastically dominates every baseline at every threshold τ. The accompanying text defines learning-trajectory efficiency: with R_{m,e,s}(t) the test return of method m on environment e with seed s at training timestep t, and T the total training budget, the normalized area under the learning curve is AUC_{m,e,s} = (1/T) ∫₀ᵀ R_{m,e,s}(t) dt (Eq. 16), estimated via the trapezoidal rule.

Figure 12. Normalized area under the learning curve (higher is better).

Figure 13. Jump-start performance over the first 100K steps (percentage of best baseline's final score); SACHI ranks first. 0% means a method matches the worst baseline's early performance, 100% that it matches the best; the aggregate score is the mean of score(m, e) across environments.

Figure 14. Parameter breakdown per method. Blue: agent + …

Figure 15. Ablation learning curves on REFERENCE. (a) Encoder architecture (parameter-matched). (b) Number of attention heads. (c) Layer depth. The default configuration is shown in red. Adjacent text notes that SACHI beats every baseline at every performance threshold with fewer parameters (31K) than most baselines, the highest jump-start performance (138%), and the highest learning-trajectory AUC (1.01); the ablations trace the source of …
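The learning-trajectory metric in the Figure 11 caption reduces to a short computation. Here is a minimal sketch, with an invented learning curve standing in for the paper's measured returns.

```python
# Sketch of the normalized learning-curve AUC from the Figure 11 caption,
# AUC = (1/T) * integral_0^T R(t) dt, via the trapezoidal rule. The returns
# below are synthetic; the paper aggregates this over methods, environments, seeds.
import numpy as np

t = np.linspace(0, 1_000_000, 201)     # training timesteps up to budget T
returns = 1.0 - np.exp(-t / 200_000)   # a made-up learning curve R(t)
T = t[-1]
auc = np.trapz(returns, t) / T         # normalized area under the curve
print(f"normalized AUC = {auc:.3f}")
```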
read the original abstract

Cooperative multi-agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose structured agent coordination via holistic information integration, or SACHI, in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content-dependence in the message-passing operator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces SACHI, an architecture for cooperative multi-agent reinforcement learning under partial observations. It frames coordination as structured information integration and uses graph transformer convolutions over an inter-agent coordination graph to supply each agent with receiver-sensitive, content-dependent signals from teammates before action selection. The work evaluates the method on five cooperative tasks (spatial, communicative, and adversarial) against twelve baselines, reporting consistent matching or outperformance of the best baseline, supported by normalized metrics with bootstrap confidence intervals, Friedman ranking, performance profiling, and parameter-matched ablations that attribute gains specifically to the degree of content-dependence in the message-passing operator rather than increased model capacity.

Significance. If the results hold under a general graph-construction procedure that does not rely on privileged information, SACHI would offer a concrete architectural route to overcoming the information bottleneck in POMDP-style MARL without scalar value mixing or fully learned communication channels. The statistical rigor (bootstrap CIs, Friedman tests, profiling) and the ablation design that isolates content-dependence are genuine strengths that exceed the typical empirical standard in the area and would make the claims more falsifiable and reproducible.

major comments (1)
  1. [Abstract] The central performance advantage is attributed to 'graph transformer convolutions over an inter-agent coordination graph' that deliver 'receiver-sensitive, content-dependent signals.' No procedure is given for constructing or obtaining this graph from the agents' partial observations alone. If the graph edges encode task-specific dependencies (spatial layout, full-state information, or hand-specified topology unavailable under the POMDP protocol), then the reported gains do not demonstrate that the architecture itself solves the information bottleneck in a general setting; the ablations control for capacity but not for this presupposition.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation of our statistical analyses, ablation design, and overall empirical rigor. We address the single major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] The central performance advantage is attributed to 'graph transformer convolutions over an inter-agent coordination graph' that deliver 'receiver-sensitive, content-dependent signals.' No procedure is given for constructing or obtaining this graph from the agents' partial observations alone. If the graph edges encode task-specific dependencies (spatial layout, full-state information, or hand-specified topology unavailable under the POMDP protocol), then the reported gains do not demonstrate that the architecture itself solves the information bottleneck in a general setting; the ablations control for capacity but not for this presupposition.

    Authors: We appreciate the referee highlighting this point of clarity. The full manuscript (Section 3.2) specifies that the coordination graph is constructed via a task-agnostic procedure using only information available under the POMDP protocol: either a fixed complete graph (when no spatial features are observable) or edges determined by locally observable proximity when positions form part of each agent's partial observation. No full-state or hand-specified privileged topology is used. To eliminate any ambiguity in the abstract, we will revise it to include a brief clause describing this construction rule and will add an explicit paragraph in the methods confirming that the procedure respects partial observability. The existing ablations already vary graph topology independently of the message-passing operator (including random and learned graphs), showing that gains derive from content-dependent integration rather than the specific graph. These changes will be incorporated in the revised manuscript. revision: yes
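As we read the rebuttal, the construction rule reduces to a simple procedure. The sketch below is our interpretation: the sensing radius, array shapes, and self-loop handling are assumptions, not values from the paper.

```python
# Sketch of the graph-construction rule described in the rebuttal: a fixed complete
# graph when no spatial features are observable, otherwise edges from locally
# observable proximity. Threshold and shapes are illustrative assumptions.
import numpy as np

def coordination_graph(positions=None, n_agents=4, radius=1.5):
    if positions is None:
        # No spatial features observable: fall back to a fixed complete graph.
        return np.ones((n_agents, n_agents))
    # positions: (n_agents, 2) locally observable coordinates
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    adj = (dists <= radius).astype(float)  # connect agents within sensing radius
    np.fill_diagonal(adj, 1.0)             # keep self-loops so every row has an edge
    return adj

pos = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.5, 5.0]])
print(coordination_graph(pos))  # two proximity clusters, plus self-loops
```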

Circularity Check

0 steps flagged

No significant circularity in empirical architectural proposal

full rationale

The paper proposes an empirical architecture (graph transformer convolutions over a coordination graph) for MARL and evaluates it on external tasks with statistical tests and ablations. No mathematical derivation chain, equations, or predictions are presented that reduce by construction to fitted parameters, self-referential quantities, or self-citation load-bearing premises. The coordination graph is treated as an input structure; its construction details do not appear as a derived result within any claimed equations. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard components of graph neural networks and reinforcement learning; the core contribution is an architectural application rather than new mathematical primitives.

pith-pipeline@v0.9.0 · 5534 in / 1247 out tokens · 61712 ms · 2026-05-12T01:25:31.495051+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1] J. Mooney, The Principles of Organization. Harper & Row, 1947. Available: https://books.google.com/books?id=d7rczgEACAAJ

  2. [2] F. A. Oliehoek, C. Amato et al., A Concise Introduction to Decentralized POMDPs. Springer, 2016, vol. 1.

  3. [3] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, "The complexity of decentralized control of Markov decision processes," Mathematics of Operations Research, vol. 27, no. 4, pp. 819–840, 2002.

  4. [4] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., "Value-decomposition networks for cooperative multi-agent learning," arXiv preprint arXiv:1706.05296, 2017.

  5. [5] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, "Monotonic value function factorisation for deep multi-agent reinforcement learning," Journal of Machine Learning Research, vol. 21, no. 178, pp. 1–51, 2020.

  6. [6] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," Neural Information Processing Systems (NIPS), 2017.

  7. [7] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, "The surprising effectiveness of PPO in cooperative multi-agent games," Advances in Neural Information Processing Systems, vol. 35, pp. 24611–24624, 2022.

  8. [8] T. Li, K. Zhu, N. C. Luong, D. Niyato, Q. Wu, Y. Zhang, and B. Chen, "Applications of multi-agent reinforcement learning in future internet: A comprehensive survey," IEEE Communications Surveys & Tutorials, vol. 24, no. 2, pp. 1240–1279, 2022.

  9. [9] A. Oroojlooy and D. Hajinezhad, "A review of cooperative multi-agent deep reinforcement learning," Applied Intelligence, vol. 53, no. 11, pp. 13677–13722, 2023.

  10. [10] Y. Wang, M. Damani, P. Wang, Y. Cao, and G. Sartoretti, "Distributed reinforcement learning for robot teams: A review," Current Robotics Reports, vol. 3, no. 4, pp. 239–257, 2022.

  11. [11] D. Marek, P. Biernacki, J. Szyguła, A. Domański, M. Paszkuta, M. Szczygieł, M. Król, and K. Wojciechowski, "Collision avoidance mechanism for swarms of drones," Sensors, vol. 25, no. 4, p. 1141, 2025.

  12. [12] Y. Shi, Z. Huang, S. Feng, H. Zhong, W. Wang, and Y. Sun, "Masked label prediction: Unified message passing model for semi-supervised classification," Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pp. 1548–1554, 2021.

  13. [13] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.

  14. [14] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A survey and critique of multiagent deep reinforcement learning," Autonomous Agents and Multi-Agent Systems, vol. 33, no. 6, pp. 750–797, 2019.

  15. [15] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.

  16. [16] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems," The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.

  17. [17] C. Claus and C. Boutilier, "The dynamics of reinforcement learning in cooperative multiagent systems," AAAI/IAAI, vol. 1998, no. 746-752, p. 2, 1998.

  18. [18] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," Advances in Neural Information Processing Systems, vol. 29, 2016.

  19. [19] S. Sukhbaatar, R. Fergus et al., "Learning multiagent communication with backpropagation," Advances in Neural Information Processing Systems, vol. 29, 2016.

  20. [20] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau, "TarMAC: Targeted multi-agent communication," in International Conference on Machine Learning. PMLR, 2019, pp. 1538–1546.

  21. [21] T. Wang, H. Dong, V. Lesser, and C. Zhang, "ROMA: Multi-agent reinforcement learning with emergent roles," in Proceedings of the 37th International Conference on Machine Learning, ser. ICML'20. JMLR.org, 2020.

  22. [22] C. Li, T. Wang, C. Wu, Q. Zhao, J. Yang, and C. Zhang, "Celebrating diversity in shared multi-agent reinforcement learning," Advances in Neural Information Processing Systems, vol. 34, pp. 3991–4002, 2021.

  23. [23] C. S. De Witt, T. Gupta, D. Makoviichuk, V. Makoviychuk, P. H. Torr, M. Sun, and S. Whiteson, "Is independent learning all you need in the StarCraft multi-agent challenge?" arXiv preprint arXiv:2011.09533, 2020.

  24. [24] J. G. Kuba, R. Chen, M. Wen, Y. Wen, F. Sun, J. Wang, and Y. Yang, "Trust region policy optimisation in multi-agent reinforcement learning," in International Conference on Learning Representations, 2022. Available: https://openreview.net/forum?id=EcGGFkNTxdJ

  25. [25] L. Kraemer and B. Banerjee, "Multi-agent reinforcement learning as a rehearsal for decentralized planning," Neurocomputing, vol. 190, pp. 82–94, 2016.

  26. [26] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, "QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning," in International Conference on Machine Learning. PMLR, 2019, pp. 5887–5896.

  27. [27] J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang, "QPLEX: Duplex dueling multi-agent Q-learning," arXiv preprint arXiv:2008.01062, 2020.

  28. [28] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.

  29. [29] C. Guestrin, M. Lagoudakis, and R. Parr, "Coordinated reinforcement learning," in ICML, vol. 2, 2002, pp. 227–234.

  30. [30] W. Böhmer, V. Kurin, and S. Whiteson, "Deep coordination graphs," in International Conference on Machine Learning. PMLR, 2020, pp. 980–991.

  31. [31] S. Li, J. K. Gupta, P. Morales, R. Allen, and M. J. Kochenderfer, "Deep implicit coordination graphs for multi-agent reinforcement learning," in Adaptive Agents and Multi-Agent Systems, 2020. Available: https://api.semanticscholar.org/CorpusID:219966887

  32. [32] Q. Yang, W. Dong, Z. Ren, J. Wang, T. Wang, and C. Zhang, "Self-organized polynomial-time coordination graphs," in International Conference on Machine Learning. PMLR, 2022, pp. 24963–24979.

  33. [33] T. Wang, L. Zeng, W. Dong, Q. Yang, Y. Yu, and C. Zhang, "Context-aware sparse deep coordination graphs," in International Conference on Learning Representations, 2022. Available: https://openreview.net/forum?id=wQfgfb8VKTn

  34. [34] N. Gupta, J. Z. Hare, R. Kannan, and V. Prasanna, "Deep meta coordination graphs for multi-agent reinforcement learning," arXiv preprint arXiv:2502.04028, 2025.

  35. [35] N. Gupta, G. Srinivasaraghavan, S. Mohalik, N. Kumar, and M. E. Taylor, "HAMMER: Multi-level coordination of reinforcement learning agents via learned messaging," Neural Computing and Applications, vol. 37, no. 19, pp. 13221–13236, 2025.

  36. [36] N. Gupta, L. Twardecka, J. Z. Hare, J. Milzman, R. Kannan, and V. Prasanna, "TIGER-MARL: Enhancing multi-agent reinforcement learning with temporal information through graph-based embeddings and representations," arXiv preprint arXiv:2511.08832, 2025.

  37. [37] N. Gupta, J. Z. Hare, J. Milzman, R. Kannan, and V. Prasanna, "Action-graph policies: Learning action co-dependencies in multi-agent reinforcement learning," arXiv preprint arXiv:2602.17009, 2026.

  38. [38] J. Jiang, C. Dun, T. Huang, and Z. Lu, "Graph convolutional reinforcement learning," in International Conference on Learning Representations, 2020. Available: https://openreview.net/forum?id=HkxdQkSYDB

  39. [39] A. Malysheva, T. T. Sung, C.-B. Sohn, D. Kudenko, and A. Shpilman, "Deep multi-agent reinforcement learning with relevance graphs," arXiv preprint arXiv:1811.12557, 2018.

  40. [40] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint arXiv:1806.01261, 2018.

  41. [41] S. Nayak, K. Choi, W. Ding, S. Dolan, K. Gopalakrishnan, and H. Balakrishnan, "Scalable multi-agent reinforcement learning through intelligent information aggregation," in International Conference on Machine Learning. PMLR, 2023, pp. 25817–25833.

  42. [42] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in International Conference on Learning Representations, 2018.

  43. [43] [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ

  44. [44] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.

  45. [45] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, 2017. Available: https://openreview.net/forum?id=SJU4ayYgl

  46. [46] N. Naderializadeh, F. H. Hung, S. Soleyman, and D. Khosla, "Graph convolutional value decomposition in multi-agent reinforcement learning," arXiv preprint arXiv:2010.04740, 2020.

  47. [47] Y. Kang, T. Wang, Q. Yang, X. Wu, and C. Zhang, "Non-linear coordination graphs," Advances in Neural Information Processing Systems, vol. 35, pp. 25655–25666, 2022.

  48. [48] W. Duan, J. Lu, and J. Xuan, "Group-aware coordination graph for multi-agent reinforcement learning," in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, ser. IJCAI '24, 2024.

  49. [49] [Online]. Available: https://doi.org/10.24963/ijcai.2024/434

  50. [50] T. Zhang, Y. Li, C. Wang, G. Xie, and Z. Lu, "FOP: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning," in International Conference on Machine Learning. PMLR, 2021, pp. 12491–12500.

  51. [51] G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht, "Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks," in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS), 2021. Available: http://arxiv.org/abs/2006.07869

  52. [52] M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C.-M. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson, "The StarCraft Multi-Agent Challenge," CoRR, vol. abs/1902.04043, 2019.

  53. [53] J. Hu, S. Wang, S. Jiang, and M. Wang, "Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning," in The Second Blogpost Track at ICLR 2023, 2023. Available: https://openreview.net/forum?id=Y8hONVbMSDj

  54. [54] R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare, "Deep reinforcement learning at the edge of the statistical precipice," Advances in Neural Information Processing Systems, vol. 34, pp. 29304–29320, 2021.

  55. [55] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.

  56. [56] A. Shehzad, F. Xia, S. Abid, C. Peng, S. Yu, D. Zhang, and K. Verspoor, "Graph transformers: A survey," IEEE Transactions on Neural Networks and Learning Systems, 2026.

  57. [57] T. K. Rusch, M. M. Bronstein, and S. Mishra, "A survey on oversmoothing in graph neural networks," arXiv preprint arXiv:2303.10993, 2023.

  58. [58] C. Cai and Y. Wang, "A note on over-smoothing for graph neural networks," arXiv preprint arXiv:2006.13318, 2020.