
arxiv: 2605.08391 · v2 · pith:DS6RY4OGnew · submitted 2026-05-08 · 💻 cs.LG

SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

Pith reviewed 2026-05-20 22:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-agent reinforcement learning · cooperative MARL · graph neural networks · graph transformers · information bottleneck · partial observability · agent coordination · message passing

The pith

Graph transformer convolutions over a coordination graph let each agent receive tailored signals from teammates before acting on partial observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In cooperative multi-agent reinforcement learning, each agent faces an information bottleneck: jointly optimal actions require knowledge scattered across the team, yet every agent decides from its own local view alone. The paper frames coordination as structured holistic information integration and introduces SACHI, which builds an inter-agent coordination graph whose graph transformer convolutions enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates. These enriched representations are used for action selection without relying on scalar mixing values or separate learned communication channels. Across five tasks that cover spatial, communicative, and adversarial challenges, SACHI matches or exceeds the best of twelve baselines. Aggregate statistical tests indicate that the advantage is significant, and parameter-matched ablations attribute the gains to the degree of content-dependence in the message-passing operator.

Core claim

The paper claims that graph transformer convolutions applied to an inter-agent coordination graph produce receiver-sensitive, content-dependent signals that integrate scattered teammate knowledge into each agent's local representation. This reduces the partial-observation information bottleneck and yields performance that matches or exceeds prior mixing and communication baselines on every evaluated task, with an aggregate advantage that is statistically significant.

What carries the argument

An inter-agent coordination graph whose nodes carry local observations and whose edges are processed by graph transformer convolutions that generate content-dependent and receiver-sensitive messages before action selection.

If this is right

  • The same architecture matches or exceeds the best baseline on tasks that require spatial coordination, explicit communication, and adversarial interaction.
  • Parameter-matched ablations isolate the performance lift to the degree of content-dependence inside the message-passing operator rather than to raw model size.
  • Bootstrap confidence intervals, Friedman rankings, and performance profiles all indicate that the advantage is robust across environments and not an artifact of a single metric.
  • The method works inside the standard centralized-training decentralized-execution loop without requiring agents to transmit raw observations or intentions at decision time.
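
The aggregate statistics named above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names `iqm`, `optimality_gap`, and `bootstrap_ci` and the example scores are hypothetical, scores are assumed to be normalized so that 1.0 is the best-baseline ceiling (as the Figure 9 caption describes), and the bootstrap here simply resamples the given list of scores.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: mean of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    n = len(ordered)
    cut = n // 4  # number of scores dropped from each tail
    return statistics.fmean(ordered[cut:n - cut])


def optimality_gap(scores, ceiling=1.0):
    """Average shortfall below the ceiling; scores at or above it contribute 0."""
    return statistics.fmean([max(ceiling - s, 0.0) for s in scores])


def bootstrap_ci(scores, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `metric` over `scores`."""
    rng = random.Random(seed)
    estimates = sorted(
        metric(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = estimates[int(alpha / 2 * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up normalized scores for one method (illustrative only).
    example = [0.8, 0.9, 1.0, 1.1, 1.05, 0.95, 1.2, 0.7]
    print("IQM:", iqm(example))
    print("OG:", optimality_gap(example))
    print("95% CI for IQM:", bootstrap_ci(example, iqm))
```

A method whose normalized scores never fall below 1.0 has an optimality gap of exactly 0 under this definition, which matches the caption's description.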

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the coordination graph can be updated online, the same integration pattern could handle environments where team membership changes during an episode.
  • The approach suggests that explicit communication protocols may be unnecessary when implicit, receiver-tailored integration already supplies the missing joint knowledge.
  • Similar graph-based enrichment might improve single-agent reinforcement learning under strong partial observability by treating past states or auxiliary sensors as virtual teammates.

Load-bearing premise

The coordination graph and its graph transformer convolution operator can be constructed and trained so that the resulting signals actually reduce the information bottleneck without creating new representational or optimization failures.

What would settle it

Replacing the graph transformer convolution with a content-independent aggregator such as mean pooling while keeping parameter count fixed and observing no drop in performance would falsify the claim that content-dependence is the source of the reported gains.
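
The contrast behind this test can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's ablation: `mean_pool_weights` and `attention_weights` are hypothetical names, the attention is plain scaled dot-product without learned projections, and no parameter matching is attempted.

```python
import math


def mean_pool_weights(receiver, senders):
    """Content-independent: each sender gets weight 1/n whatever it contains.

    `receiver` is accepted only to mirror the attention signature; it is unused.
    """
    return [1.0 / len(senders)] * len(senders)


def attention_weights(receiver, senders):
    """Content-dependent: softmax over scaled dot products with the receiver."""
    scale = math.sqrt(len(receiver))
    logits = [
        sum(r * s for r, s in zip(receiver, sender)) / scale for sender in senders
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(weights, senders):
    """Weighted sum of sender vectors."""
    dim = len(senders[0])
    return [sum(w * s[d] for w, s in zip(weights, senders)) for d in range(dim)]


if __name__ == "__main__":
    senders = [[2.0, 0.0], [0.0, 2.0]]
    for receiver in ([1.0, 0.0], [0.0, 1.0]):
        print(receiver, "mean:", mean_pool_weights(receiver, senders))
        print(receiver, "attn:", attention_weights(receiver, senders))
```

With mean pooling the two receivers weight their teammates identically; with attention each receiver favors the sender whose content aligns with its own vector. If swapping the second operator for the first left task performance unchanged, content-dependence would not be doing the work.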

Figures

Figures reproduced from arXiv: 2605.08391 by James Zachary Hare, Jesse Milzman, Nikunj Gupta, Rajgopal Kannan, Viktor Prasanna.

Figure 1
Solving the MARL Information Bottleneck. (a) Each agent holds only a fragment of the information needed to determine the globally optimal joint action; (b) Since agents cannot see their teammates’ observations or intentions, their independent “rational” choices often lead to group failure; (c) SACHI resolves this by allowing agents to intelligently filter and “borrow” relevant context from teammates throug…
Figure 2
Overview of SACHI. Local observations are encoded into agent embeddings, refined through soft-attention-modulated graph transformer message passing, and mapped to per-agent Q-values. The model performs receiver-dependent information integration prior to action selection, allowing coordinated behavior to emerge under decentralized execution. messages along the edges of Ã. At layer ℓ, the embedding of agen…
Figure 3
REFERENCE: a1 observes a2’s goal (orange dashed) and vice versa (teal dashed). Each agent signals this (purple) so its partner can navigate to the correct landmark (solid arrows). its actions. Success requires learning a shared protocol in which each agent acts as both an informative sender and a faithful receiver. The task directly tests content-dependent coordination, as agents must extract and transmit …
Figure 7
DISPERSE: Agents must try to achieve full coverage. e) DISPERSE [33]: requires n agents to occupy n identical resource zones with exactly one agent per zone. Rewards depend on conflict-free coverage, but all zones are locally indistinguishable, making symmetry-breaking the central challenge. Greedy local policies lead to clustering and under-coverage. Solving the task requires agents to incorporate teamm…
Figure 5
SPEAKER LISTENER: speaker observes target (⋆) and broadcasts a discrete symbol (orange); listener interprets the symbol and navigates (teal), unable to see the target directly. c) SPEAKER LISTENER [6]: introduces fixed role asymmetry. The speaker observes the target but cannot move, while the listener can move but cannot observe the target. The speaker must broadcast a discrete signal encoding the target,…
Figure 8
Learning curves across five cooperative environments (mean…
Figure 9
Aggregate normalized scores with 95% bootstrap confidence intervals. SACHI leads on all four metrics; its CI does not overlap with any baseline on Mean, IQM, or OG. measures by how much, on average, a method falls short of the best-baseline ceiling (v̂ = 1); a method that always matches or exceeds the best baseline has OG = 0. For each metric, we construct 95% confidence intervals by bootstrap resampling t…
Figure 10
Average rank across five environments (lower is…
Figure 11
Performance profile. SACHI stochastically dominates every baseline at every threshold τ. a) Learning-trajectory efficiency: Let $R_{m,e,s}(t)$ denote the test return of method $m$ on environment $e$ with seed $s$ at training timestep $t$, and let $T$ denote the total training budget. We compute the normalized area under the learning curve: $\mathrm{AUC}_{m,e,s} = \frac{1}{T} \int_0^T R_{m,e,s}(t)\, dt$, (16) estimated via the trapezoidal rule and m…
Figure 13
Jump-start performance over the first 100K steps (percentage of best baseline’s final score). SACHI ranks first. matches the worst baseline’s early performance; 100% means it matches the best. The aggregate score is the mean of score(m, e) across environments
Figure 12
Normalized area under the learning curve (higher is…
Figure 14
Parameter breakdown per method. Blue: agent +…
Figure 15
Ablation learning curves on REFERENCE. (a) Encoder architecture (parameter-matched). (b) Number of attention heads. (c) Layer depth. The default configuration is shown in red. baseline at every performance threshold. It achieves this with fewer parameters (31K) than most baselines, the highest jump-start performance (138%), and the highest learning-trajectory AUC (1.01). The ablations trace the source of …
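
The normalized area under the learning curve from Eq. (16), quoted in the Figure 11 caption, can be estimated along these lines. This is a minimal sketch under assumptions: `normalized_auc` is a hypothetical name, evaluation timesteps are assumed to be sorted and to span the full training budget T, and any further normalization the paper applies after the truncated caption text is not reproduced.

```python
def normalized_auc(timesteps, returns):
    """Trapezoidal estimate of (1/T) * integral of R(t) dt over the logged span."""
    if len(timesteps) != len(returns) or len(timesteps) < 2:
        raise ValueError("need at least two (timestep, return) pairs of equal length")
    area = 0.0
    for i in range(len(timesteps) - 1):
        width = timesteps[i + 1] - timesteps[i]
        # Area of one trapezoid between consecutive evaluation points.
        area += width * (returns[i] + returns[i + 1]) / 2.0
    span = timesteps[-1] - timesteps[0]  # plays the role of T
    return area / span


if __name__ == "__main__":
    # Made-up curve: return rises from 0 to 1 in the first half, then stays flat.
    print(normalized_auc([0, 50, 100], [0.0, 1.0, 1.0]))
```

Dividing by the span makes the value an average return over training, so a method that learns early scores higher than one that reaches the same final return late.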
Original abstract

Cooperative multi-agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose structured agent coordination via holistic information integration, or SACHI, in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content-dependence in the message-passing operator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SACHI for cooperative MARL under partial observations, framing coordination as holistic information integration via graph transformer convolutions over an inter-agent coordination graph. This produces receiver-sensitive, content-dependent signals that enrich each agent's representation before action selection. Evaluated on five tasks (spatial, communicative, adversarial) against twelve baselines, SACHI matches or outperforms the best baseline on every task; aggregate statistics (normalized metrics with bootstrap CIs, Friedman ranking, performance profiling) establish statistical significance and robustness, while parameter-matched ablations attribute gains specifically to the degree of content-dependence in the message-passing operator.

Significance. If the central architectural claim holds, the work offers a structured alternative to scalar mixing or generic communication channels for reducing information bottlenecks in cooperative MARL. Strengths include the breadth of evaluation, rigorous statistical aggregation across environments, and explicit ablations that isolate content-dependence rather than capacity. These elements make the empirical contribution more convincing than typical MARL ablation studies and could inform future designs that require fine-grained, receiver-aware coordination.

major comments (2)
  1. [§3 (Method, coordination graph definition)] The inter-agent coordination graph construction is underspecified. It is unclear whether edges are fixed (e.g., complete graph), learned via adjacency matrix, or environment-specific; without this definition it is impossible to verify that the resulting signals are receiver-sensitive and content-dependent as claimed in the abstract and §3. This detail is load-bearing for the central claim that gains arise from the architectural property rather than implicit capacity or tuning differences.
  2. [§3.2 (graph transformer convolution)] The graph transformer convolution operator lacks explicit specification of how receiver identity and message content condition the attention keys/queries (or equivalent). It is therefore difficult to distinguish the operator from standard GAT or QMIX-style mixing and to confirm that it actually reduces the information bottleneck without introducing new representational failures. This directly affects the validity of the ablation results tracing gains to content-dependence.
minor comments (2)
  1. [Figure 2] Figure 2 (or equivalent architecture diagram) would benefit from explicit annotation of receiver-specific conditioning paths to make the claimed property visually verifiable.
  2. [Abstract] The abstract states 'five cooperative tasks' but does not name them; a parenthetical list would improve readability without lengthening the paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and agree that greater explicitness in the method section will strengthen the manuscript. We have prepared revisions to clarify the coordination graph construction and the conditioning mechanisms in the graph transformer convolution.

Point-by-point responses
  1. Referee: [§3 (Method, coordination graph definition)] The inter-agent coordination graph construction is underspecified. It is unclear whether edges are fixed (e.g., complete graph), learned via adjacency matrix, or environment-specific; without this definition it is impossible to verify that the resulting signals are receiver-sensitive and content-dependent as claimed in the abstract and §3. This detail is load-bearing for the central claim that gains arise from the architectural property rather than implicit capacity or tuning differences.

    Authors: We agree that the coordination graph requires a more explicit definition to support verification of the claimed properties. The manuscript constructs the inter-agent coordination graph as a fixed complete graph over the set of agents (i.e., every pair of agents shares an edge), independent of environment-specific features and without learned adjacency. This fixed structure is chosen precisely to isolate the effects of the subsequent graph transformer convolution. We will revise §3 to include a formal definition of the graph, pseudocode for its construction, and an accompanying figure. With this clarification, the receiver-sensitive and content-dependent character of the signals can be attributed directly to the convolution operator rather than to the graph topology, preserving the validity of the central architectural claim and the parameter-matched ablations. revision: yes

  2. Referee: [§3.2 (graph transformer convolution)] The graph transformer convolution operator lacks explicit specification of how receiver identity and message content condition the attention keys/queries (or equivalent). It is therefore difficult to distinguish the operator from standard GAT or QMIX-style mixing and to confirm that it actually reduces the information bottleneck without introducing new representational failures. This directly affects the validity of the ablation results tracing gains to content-dependence.

    Authors: We appreciate the referee highlighting the need for greater mathematical detail. In SACHI the graph transformer convolution computes attention scores by forming queries from each receiver agent's local representation (thereby incorporating receiver identity) while keys and values are produced from the sender agent's message content via content-dependent linear transformations; agent identity embeddings are added to both queries and keys to further emphasize receiver sensitivity. This formulation is distinct from standard GAT (which does not explicitly separate receiver conditioning in this manner) and from QMIX-style mixing (which performs centralized value decomposition rather than per-agent message passing). We will insert the precise attention equations, including the conditioning steps, into the revised §3.2. These additions will allow direct verification that the operator reduces the information bottleneck and will reinforce the ablation results that isolate the contribution of content dependence. revision: yes
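
The two clarifications in the simulated rebuttal, a fixed complete graph and attention with receiver-side queries and sender-side keys and values, can be put together in a small sketch. Everything here is an assumption for illustration: the names `complete_graph` and `attention_layer`, the single attention head, the square projection matrices, and the residual sum are not taken from the paper, and the agent identity embeddings mentioned in the rebuttal are omitted.

```python
import math


def complete_graph(n_agents):
    """(sender, receiver) edges for every ordered pair of distinct agents."""
    return [(j, i) for i in range(n_agents) for j in range(n_agents) if i != j]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(embeddings, w_q, w_k, w_v):
    """One single-head message-passing step over the complete graph.

    Queries come from the receiver, keys and values from each sender.
    Returns the updated embeddings and, per receiver, a dict
    mapping sender index to attention weight.
    """
    n = len(embeddings)
    queries = [matvec(w_q, h) for h in embeddings]
    keys = [matvec(w_k, h) for h in embeddings]
    values = [matvec(w_v, h) for h in embeddings]
    scale = math.sqrt(len(queries[0]))

    incoming = {i: [] for i in range(n)}
    for sender, receiver in complete_graph(n):
        incoming[receiver].append(sender)

    updated, all_weights = [], []
    for i in range(n):
        senders = incoming[i]
        logits = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / scale for j in senders
        ]
        weights = softmax(logits)
        message = [
            sum(w * values[j][d] for w, j in zip(weights, senders))
            for d in range(len(values[0]))
        ]
        # Residual sum (an assumption): the message enriches the local embedding.
        updated.append([h + m for h, m in zip(embeddings[i], message)])
        all_weights.append(dict(zip(senders, weights)))
    return updated, all_weights


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    new_embeddings, weights = attention_layer(embeddings, identity, identity, identity)
    for receiver, w in enumerate(weights):
        print("receiver", receiver, "weights", w)
```

Because the graph is fixed and complete, any difference in how two receivers weight the same sender comes from the query-key interaction, which is the property the rebuttal says the revised §3.2 will spell out.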

Circularity Check

0 steps flagged

No circularity: empirical claims rest on held-out task performance and ablations, not self-referential definitions or fitted inputs renamed as predictions

full rationale

The paper's central claims concern empirical outperformance on five cooperative MARL tasks, supported by statistical analyses (bootstrap CIs, Friedman ranking, performance profiling) and parameter-matched ablations that isolate content-dependence in the message-passing operator. No derivation chain reduces a claimed result to its own inputs by construction: the coordination graph and graph transformer are architectural choices evaluated against baselines, not fitted parameters whose outputs are then presented as independent predictions. Self-citations to prior MARL literature are not load-bearing for the uniqueness or correctness of the reported gains, which are externally falsifiable via the described experiments. The method is self-contained against the provided benchmarks and does not invoke self-citation chains or ansatzes that collapse the result to the input.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard cooperative MARL assumptions and the effectiveness of graph transformers; no new free parameters, axioms, or invented entities are introduced beyond the architectural choice itself.

axioms (1)
  • domain assumption: Agents operate under partial local observations and must select actions without direct access to teammates' observations or intentions.
    This premise defines the information bottleneck that SACHI is designed to mitigate.

pith-pipeline@v0.9.0 · 5765 in / 1226 out tokens · 33685 ms · 2026-05-20T22:31:29.515101+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 3 internal anchors

  1. [1]

    The principles of organization

    J. Mooney, The principles of organization. Harper & Row, 1947. [Online]. Available: https://books.google.com/books?id=d7rczgEACAAJ

  2. [2]

    F. A. Oliehoek, C. Amato et al., A concise introduction to decentralized POMDPs. Springer, 2016, vol. 1

  3. [3]

    The complexity of decentralized control of markov decision processes

    D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of markov decision processes,” Mathematics of operations research, vol. 27, no. 4, pp. 819–840, 2002

  4. [4]

    Value-Decomposition Networks For Cooperative Multi-Agent Learning

    P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., “Value-decomposition networks for cooperative multi-agent learning,” arXiv preprint arXiv:1706.05296, 2017

  5. [5]

    Monotonic value function factorisation for deep multi-agent reinforcement learning

    T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi-agent reinforcement learning,” Journal of Machine Learning Research, vol. 21, no. 178, pp. 1–51, 2020

  6. [6]

    Multi-agent actor-critic for mixed cooperative-competitive environments

    R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Neural Information Processing Systems (NIPS), 2017

  7. [7]

    The surprising effectiveness of ppo in cooperative multi-agent games

    C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” Advances in neural information processing systems, vol. 35, pp. 24 611–24 624, 2022

  8. [8]

    Applications of multi-agent reinforcement learning in future internet: A comprehensive survey

    T. Li, K. Zhu, N. C. Luong, D. Niyato, Q. Wu, Y. Zhang, and B. Chen, “Applications of multi-agent reinforcement learning in future internet: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 24, no. 2, pp. 1240–1279, 2022

  9. [9]

    A review of cooperative multi-agent deep reinforcement learning

    A. Oroojlooy and D. Hajinezhad, “A review of cooperative multi-agent deep reinforcement learning,” Applied Intelligence, vol. 53, no. 11, pp. 13 677–13 722, 2023

  10. [10]

    Distributed reinforcement learning for robot teams: A review

    Y. Wang, M. Damani, P. Wang, Y. Cao, and G. Sartoretti, “Distributed reinforcement learning for robot teams: A review,” Current Robotics Reports, vol. 3, no. 4, pp. 239–257, 2022

  11. [11]

    Collision avoidance mechanism for swarms of drones

    D. Marek, P. Biernacki, J. Szyguła, A. Domański, M. Paszkuta, M. Szczygieł, M. Król, and K. Wojciechowski, “Collision avoidance mechanism for swarms of drones,” Sensors, vol. 25, no. 4, p. 1141, 2025

  12. [12]

    Masked label prediction: Unified message passing model for semi-supervised classification

    S. Yunsheng, H. Zhengjie, F. Shikun, Z. Hui, W. Wenjing, and S. Yu, “Masked label prediction: Unified message passing model for semi-supervised classification,” Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pp. 1548–1554, 08 2021

  13. [13]

    A comprehensive survey of multiagent reinforcement learning

    L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008

  14. [14]

    A survey and critique of multiagent deep reinforcement learning

    P. Hernandez-Leal, B. Kartal, and M. E. Taylor, “A survey and critique of multiagent deep reinforcement learning,” Autonomous Agents and Multi-Agent Systems, vol. 33, no. 6, pp. 750–797, 2019

  15. [15]

    Multi-agent reinforcement learning: Independent vs. cooperative agents

    M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the tenth international conference on machine learning, 1993, pp. 330–337

  16. [16]

    Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems

    L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems,” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.

  17. [17]

    The dynamics of reinforcement learning in cooperative multiagent systems

    C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” AAAI/IAAI, vol. 1998, no. 746-752, p. 2, 1998

  18. [18]

    Learning to communicate with deep multi-agent reinforcement learning

    J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” Advances in neural information processing systems, vol. 29, 2016

  19. [19]

    Learning multiagent communication with backpropagation

    S. Sukhbaatar, R. Fergus et al., “Learning multiagent communication with backpropagation,” Advances in neural information processing systems, vol. 29, 2016

  20. [20]

    Tarmac: Targeted multi-agent communication

    A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau, “Tarmac: Targeted multi-agent communication,” in International Conference on machine learning. PMLR, 2019, pp. 1538–1546

  21. [21]

    Roma: multi-agent reinforcement learning with emergent roles

    T. Wang, H. Dong, V. Lesser, and C. Zhang, “Roma: multi-agent reinforcement learning with emergent roles,” in Proceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020

  22. [22]

    Celebrating diversity in shared multi-agent reinforcement learning

    C. Li, T. Wang, C. Wu, Q. Zhao, J. Yang, and C. Zhang, “Celebrating diversity in shared multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 3991–4002, 2021

  23. [23]

    Is independent learning all you need in the starcraft multi-agent challenge?

    C. S. De Witt, T. Gupta, D. Makoviichuk, V. Makoviychuk, P. H. Torr, M. Sun, and S. Whiteson, “Is independent learning all you need in the starcraft multi-agent challenge?” arXiv preprint arXiv:2011.09533, 2020

  24. [24]

    Trust region policy optimisation in multi-agent reinforcement learning

    J. G. Kuba, R. Chen, M. Wen, Y. Wen, F. Sun, J. Wang, and Y. Yang, “Trust region policy optimisation in multi-agent reinforcement learning,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=EcGGFkNTxdJ

  25. [25]

    Multi-agent reinforcement learning as a rehearsal for decentralized planning

    L. Kraemer and B. Banerjee, “Multi-agent reinforcement learning as a rehearsal for decentralized planning,” Neurocomputing, vol. 190, pp. 82–94, 2016

  26. [26]

    Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning

    K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, “Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” in International conference on machine learning. PMLR, 2019, pp. 5887–5896

  27. [27]

    Qplex: Duplex dueling multi-agent q-learning

    J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang, “Qplex: Duplex dueling multi-agent q-learning,” arXiv preprint arXiv:2008.01062, 2020

  28. [28]

    Counterfactual multi-agent policy gradients

    J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

  29. [29]

    Coordinated reinforcement learning

    C. Guestrin, M. Lagoudakis, and R. Parr, “Coordinated reinforcement learning,” in ICML, vol. 2, 2002, pp. 227–234

  30. [30]

    Deep coordination graphs

    W. Böhmer, V. Kurin, and S. Whiteson, “Deep coordination graphs,” in International Conference on Machine Learning. PMLR, 2020, pp. 980–991

  31. [31]

    Deep implicit coordination graphs for multi-agent reinforcement learning

    S. Li, J. K. Gupta, P. Morales, R. Allen, and M. J. Kochenderfer, “Deep implicit coordination graphs for multi-agent reinforcement learning,” in Adaptive Agents and Multi-Agent Systems, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:219966887

  32. [32]

    Self-organized polynomial-time coordination graphs

    Q. Yang, W. Dong, Z. Ren, J. Wang, T. Wang, and C. Zhang, “Self-organized polynomial-time coordination graphs,” in International conference on machine learning. PMLR, 2022, pp. 24 963–24 979

  33. [33]

    Context-aware sparse deep coordination graphs

    T. Wang, L. Zeng, W. Dong, Q. Yang, Y. Yu, and C. Zhang, “Context-aware sparse deep coordination graphs,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=wQfgfb8VKTn

  34. [34]

    Deep meta coordination graphs for multi-agent reinforcement learning

    N. Gupta, J. Z. Hare, R. Kannan, and V. Prasanna, “Deep meta coordination graphs for multi-agent reinforcement learning,” arXiv preprint arXiv:2502.04028, 2025

  35. [35]

    Hammer: Multi-level coordination of reinforcement learning agents via learned messaging

    N. Gupta, G. Srinivasaraghavan, S. Mohalik, N. Kumar, and M. E. Taylor, “Hammer: Multi-level coordination of reinforcement learning agents via learned messaging,” Neural Computing and Applications, vol. 37, no. 19, pp. 13 221–13 236, 2025

  36. [36]

    Tiger-marl: Enhancing multi-agent reinforcement learning with temporal information through graph-based embeddings and representations

    N. Gupta, L. Twardecka, J. Z. Hare, J. Milzman, R. Kannan, and V. Prasanna, “Tiger-marl: Enhancing multi-agent reinforcement learning with temporal information through graph-based embeddings and representations,” arXiv preprint arXiv:2511.08832, 2025

  37. [37]

    Action-graph policies: Learning action co-dependencies in multi-agent reinforcement learning

    N. Gupta, J. Z. Hare, J. Milzman, R. Kannan, and V. Prasanna, “Action-graph policies: Learning action co-dependencies in multi-agent reinforcement learning,” arXiv preprint arXiv:2602.17009, 2026

  38. [38]

    Graph convolutional reinforcement learning

    J. Jiang, C. Dun, T. Huang, and Z. Lu, “Graph convolutional reinforcement learning,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HkxdQkSYDB

  39. [39]

    Deep Multi-Agent Reinforcement Learning with Relevance Graphs

    A. Malysheva, T. T. Sung, C.-B. Sohn, D. Kudenko, and A. Shpilman, “Deep multi-agent reinforcement learning with relevance graphs,” arXiv preprint arXiv:1811.12557, 2018

  40. [40]

    Relational inductive biases, deep learning, and graph networks

    P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018

  41. [41]

    Scalable multi-agent reinforcement learning through intelligent information aggregation

    S. Nayak, K. Choi, W. Ding, S. Dolan, K. Gopalakrishnan, and H. Balakrishnan, “Scalable multi-agent reinforcement learning through intelligent information aggregation,” in International conference on machine learning. PMLR, 2023, pp. 25 817–25 833

  42. [42]

    Graph attention networks

    P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations,

  43. [43]

    Available: https://openreview.net/forum?id=rJXMpikCZ

    [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ

  44. [44]

    Deep reinforcement learning with double q-learning

    H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016

  45. [45]

    Semi-supervised classification with graph convolutional networks

    T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=SJU4ayYgl

  46. [46]

    Graph convolutional value decomposition in multi-agent reinforcement learning

    N. Naderializadeh, F. H. Hung, S. Soleyman, and D. Khosla, “Graph convolutional value decomposition in multi-agent reinforcement learning,” arXiv preprint arXiv:2010.04740, 2020

  47. [47]

    Non-linear coordination graphs

    Y. Kang, T. Wang, Q. Yang, X. Wu, and C. Zhang, “Non-linear coordination graphs,” Advances in neural information processing systems, vol. 35, pp. 25 655–25 666, 2022

  48. [48]

    Group-aware coordination graph for multi-agent reinforcement learning

    W. Duan, J. Lu, and J. Xuan, “Group-aware coordination graph for multi-agent reinforcement learning,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, ser. IJCAI ’24,

  49. [49]

    2023/120

    [Online]. Available: https://doi.org/10.24963/ijcai.2024/434

  50. [50]

    Fop: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning

    T. Zhang, Y. Li, C. Wang, G. Xie, and Z. Lu, “Fop: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning,” in International conference on machine learning. PMLR, 2021, pp. 12 491–12 500

  51. [51]

    Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks

    G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht, “Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS), 2021. [Online]. Available: http://arxiv.org/abs/2006.07869

  52. [52]

    The StarCraft Multi-Agent Challenge

    M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C.-M. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson, “The StarCraft Multi-Agent Challenge,” CoRR, vol. abs/1902.04043, 2019

  53. [53]

    Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning

    J. Hu, S. Wang, S. Jiang, and M. Wang, “Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning,” in The Second Blogpost Track at ICLR 2023, 2023. [Online]. Available: https://openreview.net/forum?id=Y8hONVbMSDj

  54. [54]

    Deep reinforcement learning at the edge of the statistical precipice

    R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare, “Deep reinforcement learning at the edge of the statistical precipice,” Advances in neural information processing systems, vol. 34, pp. 29 304–29 320, 2021

  55. [55]

    Statistical comparisons of classifiers over multiple data sets

    J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine learning research, vol. 7, no. Jan, pp. 1–30, 2006

  56. [56]

    Graph transformers: A survey

    A. Shehzad, F. Xia, S. Abid, C. Peng, S. Yu, D. Zhang, and K. Verspoor, “Graph transformers: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2026

  57. [57]

    A survey on oversmoothing in graph neural networks

    T. K. Rusch, M. M. Bronstein, and S. Mishra, “A survey on oversmoothing in graph neural networks,” arXiv preprint arXiv:2303.10993, 2023

  58. [58]

    A Note on Over-Smoothing for Graph Neural Networks, June 2020

    C. Cai and Y. Wang, “A note on over-smoothing for graph neural networks,” arXiv preprint arXiv:2006.13318, 2020