arxiv: 2509.16606 · v5 · submitted 2025-09-20 · 💻 cs.MA · cs.LG

Bayesian Ego-graph Inference for Networked Multi-Agent Reinforcement Learning

Pith reviewed 2026-05-18 15:56 UTC · model grok-4.3

classification 💻 cs.MA cs.LG
keywords Networked Multi-Agent Reinforcement Learning · Bayesian Variational Inference · Ego-graph · Latent Communication Mask · Decentralized MARL · Traffic Control · Dynamic Interaction Structures

The pith

Decentralized agents learn dynamic interaction structures by sampling latent communication masks over ego-graphs using variational inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that networked multi-agent reinforcement learning can move beyond static neighborhoods by letting each agent infer a context-aware subgraph from its local physical graph. It shows this through a Bayesian variational approach where agents sample latent masks to guide message passing and policy updates, all trained end-to-end with an ELBO objective in a fully decentralized way. A sympathetic reader would care because real systems like traffic networks involve dozens or hundreds of agents whose useful connections change with context, yet centralized graph learning remains impractical. The method is tested on large-scale traffic control with up to 167 agents and reports better scalability and performance than strong baselines.

Core claim

We introduce BayesG, a decentralized actor-critic framework in which each agent operates over an ego-graph and samples a latent communication mask from a variational distribution to condition its message passing and policy. The variational distribution is trained jointly with the policy via the evidence lower bound, allowing agents to discover sparse, context-aware interaction structures without access to global states or centralized infrastructure.
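
The sampling step described above can be sketched as follows. This is a minimal illustration, assuming a factorized Bernoulli mask whose keep-probabilities are sigmoids of per-neighbor logits (the form that appears in the appendix derivation quoted further down this page). The function name `sample_mask`, the neighbor ids, and the logit values are hypothetical. The sketch draws hard samples only; training through the sample would need a relaxation or a gradient estimator, which is omitted here.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_mask(logits: dict[int, float], rng: random.Random) -> dict[int, int]:
    """Draw one binary keep (1) or drop (0) decision per physical neighbor.

    `logits` maps a neighbor id to its variational logit. Only neighbors
    present in this dict can ever be kept, so the sampled subgraph is
    always a subset of the agent's ego-graph.
    """
    return {j: int(rng.random() < sigmoid(phi)) for j, phi in logits.items()}


# Hypothetical logits for an agent with three physical neighbors.
logits = {1: 2.0, 2: 0.0, 3: -2.0}
mask = sample_mask(logits, random.Random(0))
```

Each agent would run this locally on its own logits, so no global state is needed to produce the mask.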

What carries the argument

The ego-graph paired with a sampled latent communication mask, which selects the neighbors that participate in message passing for each decision.

If this is right

  • Agents can adapt their effective neighborhoods to current context without global coordination.
  • Communication remains sparse and local even as the number of agents grows to hundreds.
  • Joint optimization of topology and policy becomes possible in fully decentralized settings.
  • The approach scales to heterogeneous environments where static neighborhoods fail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ego-graph sampling idea could apply to other domains with costly or unreliable links, such as sensor networks or robot teams.
  • If the variational approximation remains accurate, the method might reduce total messages exchanged compared with full-neighborhood baselines.
  • Testing on environments where the physical graph itself changes over time would reveal whether the current fixed-graph assumption limits generality.

Load-bearing premise

The variational distribution over latent masks can be trained end-to-end via ELBO using only local observations without introducing significant bias from the approximation or the fixed underlying graph.

What would settle it

A controlled run on the traffic benchmark where forcing the masks to be fixed or removing the ELBO term causes performance to drop to the level of standard graph-based MARL baselines.

Figures

Figures reproduced from arXiv: 2509.16606 by Jie Lu, Junyu Xuan, Wei Duan.

Figure 1: (a) In CTDE, the global state is available for learning both the centralized critic and the interaction graph. (b) Overview of BayesG. In networked MARL, each agent's state and action are influenced by its neighbors, forming local data $D_i = \{s_{V_i}, u_i, u_{N_i}\}$. We formulate latent graph learning as Bayesian variational inference, where each agent infers a binary mask $Z_i$ over its neighborhood from the environ…

Figure 2: Training reward curves of BayesG and baselines across five ATSC environments. BayesG …

Figure 3: Visualization of traffic congestion on the Grid map at 3500 simulation seconds. Road …

Figure 4: Case study on the ATSC_Grid map at time step 1400. (a) The latent interaction graph inferred by BayesG. Each agent samples a probabilistic binary mask over its ego-graph and the global latent graph is formed by aggregating these per-agent ego-graph masks. Edge thickness reflects the inferred likelihood of communication between intersections. (b) The vehicle density snapshot from the SUMO simulation. Thicke…

Figure 5: Ablation study on the Monaco and NewYork51 environments. Left: Performance comparison of different graph masking strategies. Right: Effect of different feature types used to generate the variational mask.
  Adjacent text: 5.4 Ablation Studies. To better understand the impact of BayesG's components, we conduct two sets of ablation studies on the Monaco and NewYork51 environments, focusing on (i) how the graph mask is generat…

Figure 6: Visualization of the NewYork33, NewYork51, and NewYork167 environments. Top: SUMO network with signalized intersections. Bottom: extracted graph structure used in networked MARL, including traffic light nodes and their physical neighbors.
  Adjacent text: • Phase duration: Time elapsed in current phase. For neighborhood-aware coordination, agents also receive aggregated statistics from immediate neighbors $N_i$ (e.g., neighbor…

Figure 6 (second excerpt): Each row in …

Figure 7: Training loss curves of BayesG on ATSC_Grid. [Plot panels: prior loss, mask_entropy loss, and policy loss against training steps (×10^4), Runs 1 to 5 …]

Figure 8: Training loss curves of BayesG on Monaco.
  Adjacent text: E Training Loss Analysis. Figures 7, 8, and 9 illustrate the evolution of training losses for BayesG on three representative environments: ATSC_Grid, Monaco, and NewYork33. We report the component-wise losses across five random seeds. Policy loss $\mathcal{L}_{\theta,\varphi}$. The policy loss reflects the negative log-probability of selected actions, weighted by the estimated advantage (see…

Figure 9: Training loss curves of BayesG on NewYork33.
  Adjacent text: momentarily reduce alignment with advantage estimates, leading to transient spikes. However, as both the policy and graph inference converge, the loss gradually stabilizes, indicating improved policy learning under the inferred interaction structures. ELBO loss $\mathcal{L}_{\mathrm{ELBO}}$. The ELBO combines the actor loss with KL regularization terms (see Definition 5). On Grid and …

Figure 10: Qualitative comparison at different simulation times (1000–3599s) on …

Figure 11: Extended case study visualizations for BayesG on …
Abstract

In networked multi-agent reinforcement learning (Networked-MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real-world decentralized systems. We propose a stochastic graph-based policy for Networked-MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor-critic framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end-to-end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision-making strategies. BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BayesG, a decentralized actor-critic framework for Networked-MARL. Each agent maintains an ego-graph over its local physical neighborhood and samples a latent communication mask from a variational distribution to form a stochastic subgraph for message passing. The variational distribution is trained jointly with the policy via an ELBO objective, enabling end-to-end learning of both interaction topology and control policies. The central empirical claim is that BayesG outperforms strong MARL baselines on large-scale traffic control tasks involving up to 167 agents, with gains in scalability, efficiency, and performance.

Significance. If the decentralized variational inference produces interaction structures whose quality is not materially degraded by local observations or the fixed physical graph, the approach would offer a practical route to adaptive communication in fully decentralized settings where centralized graph learning is infeasible. The ego-graph formulation combined with stochastic policies is a natural extension of existing graph-based MARL methods and could improve robustness in heterogeneous or dynamic environments such as traffic networks.

major comments (2)
  1. [Experiments] Experiments section: the headline claim of outperformance on traffic tasks with up to 167 agents is load-bearing for the paper's contribution, yet the abstract (and by extension the experimental reporting) supplies no details on the specific baselines, metrics (e.g., average travel time, throughput), number of independent runs, variance, or statistical tests. Without these, it is impossible to assess whether the reported gains are attributable to the learned masks rather than simulator-specific artifacts or implementation choices.
  2. [§3 (Method)] §3 (Method) and ELBO derivation: the variational distribution q(z | local obs) is trained end-to-end in a fully decentralized manner over a static physical graph. The manuscript does not provide analysis or ablations demonstrating that the resulting masks remain consistent across agents or capture globally relevant edges rather than being biased by the mean-field approximation and local-only observations. This assumption is load-bearing for the claim that the inferred structures improve message passing without introducing significant bias into the policy gradients.
minor comments (2)
  1. [Notation] Clarify in the notation section how the sampled mask is exactly multiplied into the message-passing update and whether the physical neighborhood is strictly enforced or can be overridden.
  2. [Related Work] Add a short paragraph in Related Work contrasting BayesG with prior centralized graph-learning MARL methods and with other variational approaches in multi-agent RL.
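
The first minor comment asks how the sampled mask enters the message-passing update. One common pattern is to use the binary mask as a multiplicative gate on neighbor messages before aggregation. The sketch below shows that pattern with mean aggregation. It is an assumption made for illustration, not the paper's confirmed update rule, and `masked_mean` is a hypothetical helper. In this sketch the physical neighborhood is strictly enforced, because only messages from physical neighbors are ever passed in and mask entries for other ids are ignored.

```python
def masked_mean(messages: dict[int, list[float]], mask: dict[int, int]) -> list[float]:
    """Average the messages of neighbors whose mask entry is 1.

    `messages` maps each physical neighbor id to its message vector.
    Neighbors missing from `mask` are treated as dropped, and mask
    entries for ids that are not physical neighbors have no effect.
    """
    if not messages:
        return []
    dim = len(next(iter(messages.values())))
    kept = [vec for j, vec in messages.items() if mask.get(j, 0) == 1]
    if not kept:
        # No neighbor selected: fall back to a zero message.
        return [0.0] * dim
    return [sum(vec[k] for vec in kept) / len(kept) for k in range(dim)]


# Hypothetical two-dimensional messages from three physical neighbors.
messages = {1: [1.0, 2.0], 2: [3.0, 4.0], 3: [5.0, 6.0]}
aggregated = masked_mean(messages, {1: 1, 2: 0, 3: 1, 9: 1})
```

Here neighbor 2 is dropped and id 9 is ignored because it is not a physical neighbor, so the result is the mean of the messages from neighbors 1 and 3.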

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity and strengthen the empirical and methodological claims.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline claim of outperformance on traffic tasks with up to 167 agents is load-bearing for the paper's contribution, yet the abstract (and by extension the experimental reporting) supplies no details on the specific baselines, metrics (e.g., average travel time, throughput), number of independent runs, variance, or statistical tests. Without these, it is impossible to assess whether the reported gains are attributable to the learned masks rather than simulator-specific artifacts or implementation choices.

    Authors: We agree that the abstract is too concise and does not convey the necessary experimental details. The full experimental section reports results on average travel time and throughput, using baselines including MADDPG, QMIX, and several graph-based MARL methods. All results are averaged over five independent random seeds with standard deviations shown, and statistical significance is assessed via paired t-tests. We will revise the abstract to explicitly mention the primary metrics, number of runs, and key baselines. We will also add a short paragraph in the experimental section summarizing the evaluation protocol to make these elements immediately accessible. revision: yes

  2. Referee: [§3 (Method)] §3 (Method) and ELBO derivation: the variational distribution q(z | local obs) is trained end-to-end in a fully decentralized manner over a static physical graph. The manuscript does not provide analysis or ablations demonstrating that the resulting masks remain consistent across agents or capture globally relevant edges rather than being biased by the mean-field approximation and local-only observations. This assumption is load-bearing for the claim that the inferred structures improve message passing without introducing significant bias into the policy gradients.

    Authors: We acknowledge that additional analysis of the learned masks would strengthen the paper. The current manuscript already contains ablation studies that isolate the contribution of the learned stochastic masks versus fixed neighborhoods, together with qualitative visualizations of sampled ego-graphs on the traffic network. Because the method is strictly decentralized, direct verification of global edge consistency is not possible without violating the problem setting. In the revision we will add a dedicated discussion subsection addressing potential biases arising from local observations and the mean-field variational approximation. We will also include new quantitative experiments on smaller synthetic networks where centralized ground-truth comparisons are feasible, to quantify any systematic deviation from globally optimal edges. revision: partial
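
The evaluation protocol mentioned in the first response (five seeds, paired t-tests) corresponds to a computation like the sketch below. The per-seed numbers are made-up placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper. The sketch assumes that run k of each method shares the same seed, so the runs can be paired.

```python
import math
import statistics


def paired_t_statistic(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return the paired t statistic and degrees of freedom for a minus b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    return statistics.fmean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-seed scores (lower is better); NOT taken from the paper.
method_a = [10.0, 12.0, 11.0, 13.0, 12.0]
method_b = [11.0, 14.0, 12.0, 15.0, 13.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

With five seeds there are four degrees of freedom, so the statistic would be compared against a t distribution with four degrees of freedom to obtain a p-value.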

Circularity Check

0 steps flagged

No significant circularity; standard ELBO applied to new ego-graph policy structure

full rationale

The paper's core derivation introduces a stochastic graph-based policy where agents sample latent communication masks from a variational distribution q(z | local obs) over their ego-graph neighborhood, then optimizes this jointly with the policy via the standard ELBO objective. This is a direct, non-circular application of variational inference to the decentralized MARL setting; the ELBO is not redefined in terms of the target performance metric, nor is any fitted parameter renamed as a prediction. No self-citation chains, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing. Empirical outperformance on traffic control tasks (up to 167 agents) is presented as an external validation rather than a quantity forced by construction from the inputs. The derivation remains self-contained against external benchmarks.
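
For reference, the standard ELBO that this rationale refers to can be written for a single agent as below. The symbols $D_i$ (local data) and $Z_i$ (latent mask) follow the Figure 1 caption. The generic prior $p(Z_i)$ and likelihood $p(D_i \mid Z_i)$ are placeholders in this sketch, which states the textbook bound and does not restate the paper's own definition.

```latex
\log p(D_i)
  \;\ge\; \mathbb{E}_{q(Z_i)}\big[\log p(D_i \mid Z_i)\big]
          - \mathrm{KL}\big(q(Z_i)\,\|\,p(Z_i)\big)
  \;=\; \mathbb{E}_{q(Z_i)}\big[\log p(D_i \mid Z_i)\big]
        + \mathbb{E}_{q(Z_i)}\big[\log p(Z_i)\big]
        - \mathbb{E}_{q(Z_i)}\big[\log q(Z_i)\big]
```

The bound contains no reference to the downstream performance metric, which is the sense in which the check above finds the objective non-circular.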

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The approach rests on standard variational inference assumptions plus the domain premise that local physical neighborhoods contain sufficient structure for useful subgraph sampling; no new physical entities are postulated.

free parameters (1)
  • variational distribution parameters
    Parameters of the approximate posterior over latent communication masks are learned via ELBO but their exact form and initialization are not specified in the abstract.
axioms (2)
  • [domain assumption] Agents operate under local observability and constrained communication over fixed physical graphs.
    Stated in the problem setup and used to motivate the ego-graph sampling.
  • [standard math] The evidence lower bound provides a tractable objective for joint optimization of topology and policy.
    Invoked when training the variational distribution end-to-end with the policy.
invented entities (1)
  • latent communication mask (no independent evidence)
    purpose: To represent a sampled sparse subgraph for message passing within each agent's ego-graph.
    Introduced as the mechanism for context-aware interaction structure learning.

pith-pipeline@v0.9.0 · 5710 in / 1333 out tokens · 35432 ms · 2026-05-18T15:56:51.849464+00:00 · methodology



Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  1. Zhou, M., J. Luo, J. Villela, et al. SMARTS: an open-source scalable multi-agent RL training school for autonomous driving. In 4th Conference on Robot Learning, CoRL 2020, 16-18 November 2020, Virtual Event / Cambridge, MA, USA, vol. 155 of Proceedings of Machine Learning Research, pages 264–285. PMLR, 2020
  2. Yeh, J., V. Soo. Toward socially friendly autonomous driving using multi-agent deep reinforcement learning. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), Auckland, New Zealand, May 6-10, pages 2573–2575. 2024
  3. Naderializadeh, N., J. J. Sydir, M. Simsek, et al. Resource management in wireless networks via multi-agent deep reinforcement learning. IEEE Trans. Wirel. Commun., 20(6):3507–3523, 2021
  4. Lv, Z., L. Xiao, Y. Du, et al. Efficient communications in multi-agent reinforcement learning for mobile applications. IEEE Trans. Wirel. Commun., 23(9):12440–12454, 2024
  5. Samvelyan, M., T. Rashid, C. S. de Witt, et al. The starcraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS 2019), Montreal, QC, Canada, May 13-17, pages 2186–2188. 2019
  6. Shao, J., Z. Lou, H. Zhang, et al. Self-organized group for cooperative multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems (NIPS 2022), November 28 - December 9, New Orleans, LA, USA. 2022
  7. Chu, T., S. Chinchali, S. Katti. Multi-agent reinforcement learning for networked system control. In 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, April 26-30. 2020
  8. Zhang, Y., Y. Zhou, H. Fujita. Distributed multi-agent reinforcement learning for cooperative low-carbon control of traffic network flow using cloud-based parallel optimization. IEEE Trans. Intell. Transp. Syst., 25(12):20715–20728, 2024
  9. Rashid, T., M. Samvelyan, C. S. de Witt, et al. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, vol. 80, pages 4292–4301. 2018
  10. Wang, T., H. Dong, V. R. Lesser, et al. ROMA: multi-agent reinforcement learning with emergent roles. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Event, vol. 119 of Proceedings of Machine Learning Research, pages 9876–9886. 2020
  11. Wang, J., Z. Ren, T. Liu, et al. QPLEX: duplex dueling multi-agent q-learning. In 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria. 2021
  12. Lowe, R., Y. Wu, A. Tamar, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pages 6379–6390. 2017
  13. Liu, Y., W. Wang, Y. Hu, et al. Multi-agent game abstraction via graph attention neural network. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, pages 7211–7218. AAAI Press, 2020
  14. Wang, T., L. Zeng, W. Dong, et al. Context-aware sparse deep coordination graphs. In The Tenth International Conference on Learning Representations (ICLR 2022), Virtual Event. OpenReview.net, 2022
  15. Zhang, K., Z. Yang, H. Liu, et al. Fully decentralized multi-agent reinforcement learning with networked agents. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10-15, vol. 80 of Proceedings of Machine Learning Research, pages 5867–5876. PMLR, 2018
  16. Qu, G., A. Wierman, N. Li. Scalable reinforcement learning of localized policies for multi-agent networked systems. In Proceedings of the 2nd Annual Conference on Learning for Dynamics and Control (L4DC 2020), Online Event, Berkeley, CA, USA, 11-12 June, vol. 120, pages 256–266. PMLR, 2020
  17. Chu, T., J. Wang, L. Codecà, et al. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans. Intell. Transp. Syst., 21(3):1086–1095, 2020
  18. Yi, Y., G. Li, Y. Wang, et al. Learning to share in networked multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems (NIPS 2022), New Orleans, LA, USA, November 28 - December 9. 2022
  19. Du, Y., B. Liu, V. Moens, et al. Learning correlated communication topology in multi-agent reinforcement learning. In 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), May 3-7, Virtual Event, United Kingdom, pages 456–464. 2021
  20. Duan, W., J. Lu, J. Xuan. Inferring latent temporal sparse coordination graph for multiagent reinforcement learning. IEEE Trans. Neural Networks Learn. Syst., pages 1–13, 2024
  21. Duan, W., J. Lu, J. Xuan. Group-aware coordination graph for multi-agent reinforcement learning. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, South Korea, August 3-9, 2024, pages 3926–3934. 2024
  22. Lin, Y., G. Qu, L. Huang, et al. Multi-agent reinforcement learning in stochastic networked systems. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems (NeurIPS 2021), December 6-14, virtual, pages 7825–7837. 2021
  23. Du, Y., C. Ma, Y. Liu, et al. Scalable model-based policy optimization for decentralized networked systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022), Kyoto, Japan, October 23-27, pages 9019–9026. IEEE, 2022
  24. Ma, C., A. Li, Y. Du, et al. Efficient and scalable reinforcement learning for large-scale network control. Nature Machine Intelligence, 6(9):1006–1020, 2024
  25. Qu, G., Y. Lin, A. Wierman, et al. Scalable multi-agent reinforcement learning for networked systems with average reward. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. 2020
  26. Anand, E., G. Qu. Efficient reinforcement learning for global decision making in the presence of local agents at scale. CoRR, abs/2403.00222, 2024
  27. Jiang, J., C. Dun, T. Huang, et al. Graph convolutional reinforcement learning. In 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, April 26-30. OpenReview.net, 2020
  28. Li, S., J. K. Gupta, P. Morales, et al. Deep implicit coordination graphs for multi-agent reinforcement learning. In AAMAS '21: 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Virtual Event, United Kingdom, pages 764–772. ACM, 2021
  29. Lin, B., C. Lee. HGAP: boosting permutation invariant and permutation equivariant in multi-agent reinforcement learning via graph attention network. In Forty-first International Conference on Machine Learning (ICML 2024), Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024
  30. Duan, W., J. Xuan, M. Qiao, et al. Graph convolutional neural networks with diverse negative samples via decomposed determinant point processes. IEEE Transactions on Neural Networks and Learning Systems, 35(12):18160–18171, 2024
  31. Duan, W., J. Lu, Y. G. Wang, et al. Layer-diverse negative sampling for graph neural networks. Trans. Mach. Learn. Res., 2024, 2024
  32. Boehmer, W., V. Kurin, S. Whiteson. Deep coordination graphs. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Event, vol. 119 of Proceedings of Machine Learning Research, pages 980–991. PMLR, 2020
  33. Yang, Q., W. Dong, Z. Ren, et al. Self-organized polynomial-time coordination graphs. In International Conference on Machine Learning (ICML 2022), Baltimore, Maryland, USA, vol. 162, pages 24963–24979. 2022
  34. Jang, E., S. Gu, B. Poole. Categorical reparameterization with gumbel-softmax. In the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France. 2017
  35. Maddison, C. J., A. Mnih, Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France. 2017
  36. Haarnoja, T., A. Zhou, P. Abbeel, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, vol. 80 of Proceedings of Machine Learning Research, pages 1856–1865
  37. Lopez, P. A., M. Behrisch, L. Bieker-Walz, et al. Microscopic traffic simulation using sumo. In The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE, 2018
  38. Foerster, J. N., N. Nardelli, G. Farquhar, et al. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 6-11 August, vol. 70, pages 1146–1155. PMLR, 2017
  39. Sukhbaatar, S., A. Szlam, R. Fergus. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS 2016), December 5-10, Barcelona, Spain, pages 2244–
  40. Foerster, J. N., Y. M. Assael, N. de Freitas, et al. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, pages 2137–2145. 2016
  41. Kipf, T. N., M. Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, April 24-26. 2017
  42. Yu, E., J. Lu, G. Zhang. Generalized incremental learning under concept drift across evolving data streams. CoRR, abs/2506.05736, 2025
  43. Yu, E., J. Lu, X. Yang, et al. Learning robust spectral dynamics for temporal domain generalization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025). 2025

  44. Yang, X., J. Lu, E. Yu. Walking the tightrope: Disentangling beneficial and detrimental drifts in non-stationary custom-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025
      Appendix A Detailed Derivation. A.1 Derivation of Graph-based Policy. We expand on Definition 3, which defines each agent's policy as a two-sta...
  45. Likelihood term: $\log p(D_i \mid Z_i)$. This is modeled using the graph-conditioned policy loss, i.e., $\log p(D_i \mid Z_i) \approx -\mathcal{L}_{\theta,\varphi}$ (27), where $\mathcal{L}_{\theta,\varphi}$ is the actor loss under the sampled subgraph $Z_i \odot G^{\mathrm{env}}_{V_i}$
  46. Prior term: $\log p(Z_i)$. We define the prior as an element-wise Bernoulli with retention bias $\lambda$: $p(Z_i) = \prod_{j \in N_i} \lambda^{z_{ij}} (1-\lambda)^{1-z_{ij}}$ (28). Then: $\log p(Z_i) = \sum_{j \in N_i} \left[ z_{ij} \log\lambda + (1-z_{ij}) \log(1-\lambda) \right]$ (29). Taking expectation under $q(Z_i)$: $\mathbb{E}_q[\log p(Z_i)] = \sum_{j \in N_i} \left[ \sigma(\phi_{ij}) \log\lambda + (1-\sigma(\phi_{ij})) \log(1-\lambda) \right]$ (30)
  47. Entropy term: $-\log q(Z_i; \phi_i)$. Since $q(Z_i)$ is a factorized Bernoulli: $H(q(Z_{ij})) = -\sigma(\phi_{ij}) \log\sigma(\phi_{ij}) - (1-\sigma(\phi_{ij})) \log(1-\sigma(\phi_{ij}))$ (31). Then: $\mathbb{E}_q[\log q(Z_i)] = -\sum_{j \in N_i} H(q(Z_{ij}))$ (32). Final objective. Combining all terms: $\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(Z_i;\phi_i)}[-\mathcal{L}_{\theta,\varphi}] + \sum_{j \in N_i} \left[ \lambda \log\sigma(\phi_{ij}) + (1-\lambda) \log(1-\sigma(\phi_{ij})) \right] - \sum_{j \in N_i} H(q(Z_{ij}))$ $= \mathbb{E}_{q(Z_i;\phi_i)}[-\mathcal{L}_{\theta,\varphi}] + \sum_{j \in N_i} [\lambda \log\sigma(\phi_{ij}) + (1-\lambda) \log$...

  48. Localized dynamics. Traffic flow is governed by physical proximity: upstream intersections release vehicles that propagate to downstream intersections. Each agent's state evolution depends on its immediate neighbors' actions, not the global joint action of all agents
  49. Fixed physical topology. The road network structure is fixed and sparse, with agents (intersections) only interacting with directly connected neighbors via shared road segments
  50. Decentralized execution requirement. In real-world deployments, traffic signals operate independently with limited communication bandwidth. Centralized control is impractical due to: • Scalability: City-scale networks have hundreds of intersections; centralized joint action spaces grow exponentially • Communication constraints: Real-time global state agg...
  51. Local observability. Each intersection has sensors only for its incoming lanes, consistent with the partial observability assumption in Spatiotemporal-MDP. These properties make ATSC fundamentally different from cooperative benchmarks (e.g., StarCraft) that assume global rewards, unrestricted communication, and arbitrary coordination graphs. Our method e...
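
The closed-form prior and entropy terms quoted in entries 46 and 47 above (equations 30 and 31) can be evaluated numerically with a sketch like the following. The function names and example values are hypothetical. The `kl_to_prior` helper uses the standard identity KL(q || p) = -E_q[log p] - H(q) for a factorized Bernoulli; it is included as a sanity check and is not claimed to reproduce the truncated final objective.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def expected_log_prior(logits: list[float], lam: float) -> float:
    """Equation (30): expected Bernoulli(lam) log-prior under q."""
    if not 0.0 < lam < 1.0:
        raise ValueError("retention bias lam must lie strictly between 0 and 1")
    return sum(
        sigmoid(phi) * math.log(lam) + (1.0 - sigmoid(phi)) * math.log(1.0 - lam)
        for phi in logits
    )


def bernoulli_entropy(phi: float) -> float:
    """Equation (31): entropy of a Bernoulli with keep-probability sigmoid(phi)."""
    q = sigmoid(phi)
    if q <= 0.0 or q >= 1.0:
        return 0.0  # limit of -q log q as q approaches 0 or 1
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)


def total_entropy(logits: list[float]) -> float:
    """Sum of per-neighbor entropies for a factorized mask distribution."""
    return sum(bernoulli_entropy(phi) for phi in logits)


def kl_to_prior(logits: list[float], lam: float) -> float:
    """Standard identity: KL(q || p) = -E_q[log p] - H(q)."""
    return -expected_log_prior(logits, lam) - total_entropy(logits)


# Hypothetical logits for two neighbors and a hypothetical retention bias.
example_kl = kl_to_prior([0.0, 0.0], 0.5)
```

When every keep-probability equals the retention bias, as in this example, the variational distribution matches the prior and the KL term vanishes.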