Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

Anders Jonsson; Miguel Calvo-Fullana; Santiago Amaya-Corredor

arxiv: 2605.30461 · v1 · pith:P47TLXVInew · submitted 2026-05-28 · 💻 cs.LG · cs.AI

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

Santiago Amaya-Corredor , Miguel Calvo-Fullana , Anders Jonsson This is my paper

Pith reviewed 2026-06-29 08:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords constrained MARLstate augmentationconsensus optimizationLagrange multipliersseparable dynamicsdistributed reinforcement learningsmart grid demand responsemulti-agent systems

0 comments

The pith

State augmentation with neighbor consensus on Lagrange multipliers enables scalable constrained MARL under separable dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agents with independent dynamics can still coordinate to meet shared constraints by augmenting each policy with a shared dual variable that tracks the constraint and then averaging that variable locally with neighbors. A sympathetic reader would care because standard independent learning leaves agents unable to allocate the shared resource correctly, often leading to either violations or useless solutions like never using the resource. The authors prove that the local averaging keeps the dual variables close enough across agents that the resulting joint behavior violates the constraint by only a bounded amount, and this bound improves with denser communication graphs or extra averaging steps. This keeps both training and runtime linear in the number of agents rather than quadratic, enabling experiments with thousands of agents on a demand-response task where the method succeeds and baselines fail.

Core claim

The central claim is that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. Under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, translating to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds.

What carries the argument

State-augmented policies conditioned on a dual variable, combined with distributed consensus over those dual variables via neighbor-to-neighbor communication.

Load-bearing premise

Each agent's state transition depends only on its own action and state, not on those of other agents.

What would settle it

Observing whether constraint violations remain bounded when the communication graph is disconnected or when one agent's transition depends on another's action.

Figures

Figures reproduced from arXiv: 2605.30461 by Anders Jonsson, Miguel Calvo-Fullana, Santiago Amaya-Corredor.

**Figure 2.** Figure 2: Evolution of the global multipliers λ i for the first three agents during execution. type, in episodes of 80 timesteps (about 3 days of demand), with multipliers sampled from λ ∈ [0, 15] and ν ∈ [−25, 25]. The dataset contains 3,000 hours of data with random episode starting points. All multiplier step sizes are set to α = ϵ = η = 0.01, and we use L = 1 consensus round per timestep. The step sizes were sel… view at source ↗

**Figure 3.** Figure 3: Cumulative unmet demand with consensus. Stable negative values (c=0.3, 0.4) indicate proactive [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Cost and demand-satisfaction trajectories. The dashed horizontal line in (b) marks zero unmet [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation: 414 IPPO agents trained with fixed [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Average performance of state-of-the-art MARL baselines. The grey dashed line marks the grid [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Centralized oracle vs. distributed consensus ( [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Scalability of the execution phase. (a) Wall-clock execution time versus agent count. (b) Running [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

read the original abstract

We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is \emph{essential for feasibility}: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non-solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows state augmentation plus local consensus on duals can enforce global constraints in separable-dynamics MARL at large scale, but the linear execution claim is not yet secure on sparse graphs.

read the letter

The main takeaway is that agents can learn a single policy conditioned on both local state and a dual variable, then run lightweight neighbor consensus on that dual at runtime to coordinate on a global constraint. This keeps training fully decentralized and independent while still producing feasible joint behavior, which independent learners cannot achieve on their own.

The paper does this cleanly. The smart-grid demand-response experiments show the practical problem: without consensus every agent defers consumption indefinitely to meet the capacity limit, a degenerate outcome. With consensus the agents converge to a shared multiplier and satisfy both the grid constraint and actual demand, scaling to thousands of agents where CTDE methods already fail. That contrast is direct evidence that the coordination step matters.

The soft spot is the communication scaling. The abstract states that consensus error is bounded under mild connectivity and that the bound on violation improves with connectivity and rounds. But mild connectivity includes path or tree graphs whose diameter grows with N. On those graphs the number of rounds needed to drive multiplier disagreement below the level that keeps violation acceptable can grow with diameter, so per-step communication is no longer constant. The proof sketch does not appear to restrict the graph family or quantify this dependence, which leaves the linear-in-N execution claim open to the stress-test concern.

The work is aimed at people building constrained MARL for resource allocation settings. It deserves a serious referee because the motivation is concrete, the empirical demonstration is useful, and the technical move is straightforward to evaluate even if the graph dependence needs tightening.

Referee Report

2 major / 0 minor

Summary. The paper proposes a distributed constrained MARL algorithm for agents with separable dynamics but global resource constraints. Each agent learns an offline policy augmented with a dual variable (Lagrange multiplier) encoding constraint feedback; at execution, agents perform neighbor-to-neighbor consensus on the multipliers. The central claim is that, under mild connectivity assumptions, the resulting consensus error is bounded and this bound translates into a bounded constraint violation that improves with graph connectivity and the number of consensus rounds. The method is asserted to scale linearly in both training and execution (unlike CTDE), with experiments on a smart-grid demand-response task showing that consensus is required for feasible solutions while independent learning produces degenerate policies.

Significance. If the claimed translation from bounded multiplier consensus error to bounded constraint violation holds with explicit dependence only on local graph properties that preserve linear scaling, the approach would offer a practical route to large-scale constrained MARL that avoids the quadratic cost of centralized training. The empirical contrast between consensus-enabled feasibility and independent-learning degeneracy is a useful demonstration. The combination of state augmentation with distributed dual consensus is a novel synthesis for this separable-dynamics setting.

major comments (2)

[Abstract] Abstract: the proof sketch asserts that bounded consensus error on multipliers translates to bounded constraint violation that decreases with graph connectivity and consensus rounds, yet provides no explicit dependence of the required number of rounds on graph diameter or spectral gap. Under the stated 'mild connectivity' assumption, graphs such as paths or trees (spectral gap O(1/N)) would require rounds scaling with diameter to keep violation below a fixed tolerance, contradicting the linear-in-N execution claim. This dependence is load-bearing for the scalability assertion.
[Abstract] Abstract and experimental section: the manuscript reports that consensus is 'essential for feasibility' on the smart-grid task and that the method scales to thousands of agents, but supplies neither the full derivation of the error-to-violation bound, experiment hyperparameters, nor error bars on the reported constraint violations or demand-fulfillment metrics. Without these, the empirical support for the central claim cannot be assessed at the level required for a soundness judgment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating planned revisions where appropriate to improve clarity and completeness.

read point-by-point responses

Referee: [Abstract] Abstract: the proof sketch asserts that bounded consensus error on multipliers translates to bounded constraint violation that decreases with graph connectivity and consensus rounds, yet provides no explicit dependence of the required number of rounds on graph diameter or spectral gap. Under the stated 'mild connectivity' assumption, graphs such as paths or trees (spectral gap O(1/N)) would require rounds scaling with diameter to keep violation below a fixed tolerance, contradicting the linear-in-N execution claim. This dependence is load-bearing for the scalability assertion.

Authors: The abstract summarizes the high-level result; the full proof in the appendix derives the consensus error bound explicitly in terms of the number of rounds k and the spectral gap (second-smallest eigenvalue of the normalized Laplacian). The constraint violation is then bounded proportionally to this error. We agree the dependence is important. The 'mild connectivity' assumption is intended to cover graphs with spectral gap bounded below by a positive constant independent of N (e.g., expanders), allowing fixed k for any target violation tolerance and thereby preserving linear scaling. For path or tree graphs the gap scales as O(1/N) and more rounds would be required, which we acknowledge would impact the claim. We will revise the abstract and add a dedicated paragraph clarifying the precise graph conditions needed for linear scaling. This constitutes a partial revision. revision: partial
Referee: [Abstract] Abstract and experimental section: the manuscript reports that consensus is 'essential for feasibility' on the smart-grid task and that the method scales to thousands of agents, but supplies neither the full derivation of the error-to-violation bound, experiment hyperparameters, nor error bars on the reported constraint violations or demand-fulfillment metrics. Without these, the empirical support for the central claim cannot be assessed at the level required for a soundness judgment.

Authors: We will ensure the revised manuscript supplies all requested details. The complete derivation of the error-to-violation bound already appears in the appendix; we will move a concise version into the main text for accessibility. A new 'Experimental Details' subsection will list all hyperparameters (learning rates, consensus rounds per step, network sizes, training episodes, etc.). Error bars are included on the figures; we will add explicit discussion of them in the text and report numerical values with standard deviations. These changes will allow full assessment of the empirical claims. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on independent proof of consensus bound

full rationale

The paper's core claim is a proof that, for separable dynamics, neighbor-to-neighbor consensus on Lagrange multipliers yields bounded consensus error that translates to bounded constraint violation decreasing with connectivity and rounds. This is presented as a mathematical result under mild connectivity assumptions, not as a fit to data or a renaming of prior results. No equations reduce by construction to their own inputs, no parameters are fitted then relabeled as predictions, and no load-bearing step rests on self-citation chains. The method combines standard dual-variable consensus with state augmentation; the novelty is the combination for this MARL setting, but the derivation chain itself does not collapse into definitional equivalence or fitted-input renaming. The provided abstract and reader assessment confirm the absence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach relies on standard assumptions in distributed optimization and RL; no new entities or free parameters are introduced in the abstract.

axioms (2)

domain assumption Agents have separable dynamics but global resource constraints
Explicitly stated as the target setting in the abstract.
domain assumption Mild connectivity assumptions on the communication graph
Invoked for the consensus error bound and constraint violation bound.

pith-pipeline@v0.9.1-grok · 5804 in / 1320 out tokens · 30729 ms · 2026-06-29T08:36:16.014166+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Miguel Calvo-Fullana, Santiago Paternain, Luiz F

doi: 10.1126/science.aay2400. Miguel Calvo-Fullana, Santiago Paternain, Luiz F. O. Chamon, and Alejandro Ribeiro. State augmented constrained reinforcement learning: Overcoming the limitations of learning with rewards.IEEE Transactions on Automatic Control, 69(7):4275–4290, 2024. doi: 10.1109/TAC.2023.3325279. Dong Chen, Kaian Chen, Zhaojian Li, Tianshu C...

work page doi:10.1126/science.aay2400 2024
[2]

doi: 10.1007/s10462-025-11340-5. Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, Cambridge, United Kingdom, second edition, 2012. Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. InProceedings of the 19th International Conference on Machine Learning, pp. 267–274, 2002. Michael Kearn...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s10462-025-11340-5 2012
[3]

Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li

doi: 10.1109/TAC.2022.3152724. Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li. Scalable multi-agent reinforcement learning for networked systems with average reward. InAdvances in Neural Information Processing Systems, volume 33, 2020. Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QM...

work page doi:10.1109/tac.2022.3152724 2022
[4]

Then, Lsym is diagonalizable, and its eigenvalues are real

Real symmetry and positive semidefiniteness:The matrixLsym is real symmetric (sinceL is symmetric andD−1/2is diagonal). Then, Lsym is diagonalizable, and its eigenvalues are real. Standard results in spectral graph theory further showLsym is positive semidefinite, implying its eigenvalues are nonnegative (Chung, 1997; Godsil & Royle, 2001)

1997
[5]

The eigenvalues ofLrw also lie in[0,2]

Eigenvalues in[0, 2]:From classical bounds on the spectrum ofLsym (e.g., using the structure of the degree and adjacency matrices), one obtains 0 = Λ 1≤Λ2≤···≤ΛN≤2(Horn & Johnson, 2012; Mohar et al., 1991). The eigenvalues ofLrw also lie in[0,2]. Because0≤Λi≤2and0<ϵ<1, we have |νi|=|1−ϵΛi|≤1, Since G is connected, the multiplicity of the zero eigenvalue o...

2012

[1] [1]

Miguel Calvo-Fullana, Santiago Paternain, Luiz F

doi: 10.1126/science.aay2400. Miguel Calvo-Fullana, Santiago Paternain, Luiz F. O. Chamon, and Alejandro Ribeiro. State augmented constrained reinforcement learning: Overcoming the limitations of learning with rewards.IEEE Transactions on Automatic Control, 69(7):4275–4290, 2024. doi: 10.1109/TAC.2023.3325279. Dong Chen, Kaian Chen, Zhaojian Li, Tianshu C...

work page doi:10.1126/science.aay2400 2024

[2] [2]

doi: 10.1007/s10462-025-11340-5. Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, Cambridge, United Kingdom, second edition, 2012. Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. InProceedings of the 19th International Conference on Machine Learning, pp. 267–274, 2002. Michael Kearn...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s10462-025-11340-5 2012

[3] [3]

Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li

doi: 10.1109/TAC.2022.3152724. Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li. Scalable multi-agent reinforcement learning for networked systems with average reward. InAdvances in Neural Information Processing Systems, volume 33, 2020. Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QM...

work page doi:10.1109/tac.2022.3152724 2022

[4] [4]

Then, Lsym is diagonalizable, and its eigenvalues are real

Real symmetry and positive semidefiniteness:The matrixLsym is real symmetric (sinceL is symmetric andD−1/2is diagonal). Then, Lsym is diagonalizable, and its eigenvalues are real. Standard results in spectral graph theory further showLsym is positive semidefinite, implying its eigenvalues are nonnegative (Chung, 1997; Godsil & Royle, 2001)

1997

[5] [5]

The eigenvalues ofLrw also lie in[0,2]

Eigenvalues in[0, 2]:From classical bounds on the spectrum ofLsym (e.g., using the structure of the degree and adjacency matrices), one obtains 0 = Λ 1≤Λ2≤···≤ΛN≤2(Horn & Johnson, 2012; Mohar et al., 1991). The eigenvalues ofLrw also lie in[0,2]. Because0≤Λi≤2and0<ϵ<1, we have |νi|=|1−ϵΛi|≤1, Since G is connected, the multiplicity of the zero eigenvalue o...

2012