LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

Sanghyeon Lee; Sangjun Bae; Seungyul Han; Yisak Park

arxiv: 2605.18077 · v2 · pith:LL4Z54LJnew · submitted 2026-05-18 · 💻 cs.AI · cs.LG · cs.MA

Pith reviewed 2026-05-20 11:03 UTC · model grok-4.3

keywords multi-agent reinforcement learning · communication protocols · large language models · state reconstruction · cooperative MARL · LLM-guided communication

The pith

An LLM can design and iteratively refine a communication protocol so that multi-agent RL agents reconstruct the full state more accurately and uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LMAC, a method that uses large language models to create communication protocols for multi-agent reinforcement learning. The LLM generates a protocol aimed at letting every agent reconstruct the underlying environment state as accurately and uniformly as possible. It then refines this protocol step by step using a state-awareness criterion that measures how well agents know the state. This process narrows knowledge gaps between agents and leads to better overall performance on standard MARL tasks compared to earlier communication approaches.

Core claim

LMAC leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge.

What carries the argument

The iterative refinement loop in LMAC, where an LLM proposes a communication protocol that is evaluated and improved according to an explicit state-awareness criterion measuring state reconstruction accuracy and uniformity across agents.
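The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy task (`STATE`, `OBS`), the message format, the coverage-based `evaluate` score, and `toy_propose` (a deterministic stand-in for the LLM call) are all hypothetical names introduced here. Only the overall shape, propose a protocol, score it, feed the score back, repeat for two steps, follows the description.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Message = Tuple[str, str]        # hypothetical format: (sender, state feature)
Protocol = FrozenSet[Message]    # hypothetical: a protocol is a set of broadcast messages

# Toy task (assumed): two state features, three agents with partial views.
STATE: Set[str] = {"enemy_pos", "ally_hp"}
OBS: Dict[str, Set[str]] = {"a1": {"enemy_pos"}, "a2": {"ally_hp"}, "a3": set()}


def evaluate(protocol: Protocol) -> Tuple[float, float]:
    """Toy state-awareness score: (mean coverage, max-min coverage gap)."""
    # A message only carries information if its sender observes that feature.
    shared = {feat for sender, feat in protocol if feat in OBS[sender]}
    coverage = [len(obs | shared) / len(STATE) for obs in OBS.values()]
    return sum(coverage) / len(coverage), max(coverage) - min(coverage)


def toy_propose(protocol: Protocol, feedback: str) -> Protocol:
    """Stand-in for the LLM call (ignores feedback): share one missing feature."""
    shared = {feat for _, feat in protocol}
    for sender in sorted(OBS):
        for feat in sorted(OBS[sender]):
            if feat not in shared:
                return protocol | {(sender, feat)}
    return protocol


def refine_protocol(
    propose: Callable[[Protocol, str], Protocol],
    evaluate_fn: Callable[[Protocol], Tuple[float, float]],
    initial: Protocol,
    max_steps: int = 2,
) -> List[Protocol]:
    """Return the protocol sequence [f(0), f(1), ..., f(max_steps)]."""
    sequence = [initial]
    for _ in range(max_steps):
        accuracy, gap = evaluate_fn(sequence[-1])
        feedback = f"mean coverage={accuracy:.2f}, inter-agent gap={gap:.2f}"
        sequence.append(propose(sequence[-1], feedback))
    return sequence


sequence = refine_protocol(toy_propose, evaluate, frozenset())
for k, protocol in enumerate(sequence):
    print(k, sorted(protocol), evaluate(protocol))
```

In this toy setting, two refinement steps are enough for every agent to cover both features. A real system would replace `toy_propose` with a prompted LLM and `evaluate` with whatever criterion the paper defines.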

Load-bearing premise

An LLM's reasoning can reliably produce and refine a communication protocol that achieves accurate and uniform state reconstruction across agents.

What would settle it

Running the MARL benchmarks with LMAC disabled, or with its refinement loop replaced by a fixed non-iterative protocol, and finding that the full method shows no gains over these ablations in state-recovery metrics or task rewards.

Figures

Figures reproduced from arXiv: 2605.18077 by Sanghyeon Lee, Sangjun Bae, Seungyul Han, Yisak Park.

**Figure 1.** Protocol refinement in LMAC. Given natural-language descriptions of the task goal and the state and observation dimensions, the LLM designs an initial communication protocol to support accurate state reconstruction (star position in the figure). The protocol is then refined via criterion-based two-step feedback to make reconstruction more accurate and uniform by sharing missing local information: Step 1 im…

**Figure 2.** Comparison on the StarCraft II environment after 2M timesteps. (a) Average reconstruction error of the enemy position. (b) Inter-agent variance of the reconstruction error. (c) The true positions of ally agents (green/yellow) and the enemy (red), and each ally's estimated enemy position. For all methods, the metrics are computed by performing an additional state-inference step.

**Figure 3.** Overview of the iterative communication protocol refinement framework: (a) input prompt $x$ with task description $I_T$ and design instruction $I_P$, (b) generated protocol $f_C^{(k)}$ that maps local observations to agent-specific messages, and (c) feedback instruction $\tilde{x}^{(k+1)}$ guiding the optimization for the next step, structurally divided into the step-wise analysis instruction performing analysis using criterion…

**Figure 5.** Overall framework of the proposed LMAC protocol sequence $f_C = (f_C^{(0)}, f_C^{(1)}, f_C^{(2)})$, which progressively adds messages to improve both recovery accuracy and uniformity. We limit refinement to two steps, since experiments with additional iterations ($k > 2$) in Appendix D.1 show negligible further gains.

**Figure 6.** Performance comparison in various MARL benchmarks.

**Figure 7.** MARL benchmarks used in our experiments: (a) SMACComm, (b) LBF, and (c) GRF. … discourages redundant content and encourages $z_t^i$ to retain only reconstructable, task-relevant features aligned with reconstruction reliability. Finally, we incorporate $z_t^i$ into the individual utilities as $Q_i(\tau_t^i, z_t^i)$ and optimize the joint action-value $Q_{tot}$ via TD learning. In summary, the overall LMAC framework is shown …

**Figure 8.** Protocol refinement analysis on SMAC 1o_10b_vs_1r: (a) task scenario with an Overseer, a Roach, and Banelings under partial observability, (b) protocol messages and corresponding feedback at each step $k$, (c) trajectory-level recovery success rate and inter-agent knowledge imbalance with ($k = 0, 1, 2$) or without messages (No-comm), and (d) average win rates across steps.

**Figure 9.** Win rate (%) under simplified prompt. … and bane_vs_hM, where effective communication strongly depends on positional and unit-status information. In this variant, the task description $I_T$ is reduced to a coarse summary of allied and enemy unit status and spatial configuration, while the observation description is given only at the chunk level rather than dimension-wise. The protocol design instruction $I_P$ i…
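Panels (a) and (b) of Figure 2 report two quantities: the average reconstruction error of the enemy position and the inter-agent variance of that error. The sketch below shows one plausible way to compute them. The Euclidean distance, the population variance, and the function name `reconstruction_metrics` are assumptions made for illustration; the page does not give the paper's exact definitions.

```python
import math
from statistics import mean, pvariance
from typing import Sequence, Tuple

Point = Tuple[float, float]


def reconstruction_metrics(true_pos: Point, estimates: Sequence[Point]) -> Tuple[float, float]:
    """Return (average error, inter-agent variance of the error).

    Assumed definitions: per-agent error is the Euclidean distance between
    the true position and that agent's estimate; variance is the population
    variance of those per-agent errors.
    """
    errors = [math.dist(true_pos, est) for est in estimates]
    return mean(errors), pvariance(errors)


# Two teams with the same average error but different uniformity.
uniform_team = reconstruction_metrics((0.0, 0.0), [(3.0, 4.0), (3.0, 4.0)])
uneven_team = reconstruction_metrics((0.0, 0.0), [(0.0, 0.0), (6.0, 8.0)])
print(uniform_team, uneven_team)
```

The two example teams share the same average error, yet only the second has one well-informed and one poorly informed agent. That gap is what a uniformity term is meant to expose and what an accuracy-only metric would miss.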

Original abstract

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

LMAC uses an LLM to iteratively refine a MARL communication protocol for better state reconstruction, but the state-awareness criterion looks hard to ground without privileged information.

Hi, the main point is that this paper has agents use an LLM to design and then iteratively tweak a communication protocol so they can reconstruct the global state more accurately and with less variation across the team. The loop is driven by an explicit state-awareness criterion that scores how well the protocol is working. This is presented as a way to handle partial observability better than earlier fixed or learned communication schemes in MARL.

The iterative LLM angle is the clearest new piece; most prior work either learns message content directly or uses attention over fixed channels, so treating protocol design itself as an LLM-guided refinement process stands out. The paper reports that this leads to improved state recovery and noticeable performance lifts over baselines on standard MARL benchmarks, which is the kind of concrete claim that can be checked. The experiments are described as covering diverse environments, so the empirical side gets some credit for trying to show practical effect.

The soft spot is exactly the one the stress-test flags. Under partial observability agents lack the true state, so any criterion that measures reconstruction quality has to be computed from local observations and messages alone or it requires an oracle during training. The abstract gives no indication of which route is taken or how the LLM is prompted to optimize for it without introducing bias or instability. If the full paper does not spell out a self-contained way to evaluate the criterion, the refinement loop rests on an assumption that may not transfer outside the lab.

This is aimed at the MARL communication subgroup rather than a broad audience. A reader already working on coordination or message passing would get the most out of it and could test the protocol idea in their own setups. It is worth sending for peer review so the implementation details and numbers can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper proposes LLM-driven Multi-Agent Communication (LMAC) for cooperative MARL. It uses an LLM to iteratively design and refine a communication protocol according to an explicit state-awareness criterion, with the goal of enabling all agents to reconstruct the global state as accurately and uniformly as possible despite partial observability. The approach is claimed to narrow knowledge differences across agents and to deliver substantial performance gains over prior communication baselines on diverse MARL benchmarks.

Significance. If the central claims are substantiated, the work would offer a novel LLM-guided mechanism for adaptive protocol design in MARL, potentially improving upon fixed or learned communication schemes by leveraging reasoning to target state reconstruction directly. The iterative refinement loop is a distinctive element that could influence subsequent research on LLM-assisted multi-agent coordination.

major comments (2)

[Method (LMAC description)] The state-awareness criterion that drives protocol refinement is described only at a high level in the abstract and method overview; it is unclear whether reconstruction accuracy is scored using an external oracle (true global state) or solely from local observations and messages. This distinction is load-bearing for the central claim, because standard MARL settings provide no oracle and any dependence on privileged information would render the method non-deployable under the partial-observability regime the paper targets.
[Experiments] The abstract asserts 'substantial performance gains' and 'improved state reconstruction' yet supplies no quantitative metrics, error bars, baseline names, or statistical significance tests. Without these details the empirical support for the performance claim cannot be evaluated and the cross-benchmark generality asserted in the abstract remains unverified.

minor comments (1)

[Abstract] The abstract refers to 'diverse MARL benchmarks' without naming the environments or citing them; listing the specific tasks (e.g., StarCraft, MPE) would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, providing clarifications and noting planned revisions to the manuscript.

Referee: [Method (LMAC description)] The state-awareness criterion that drives protocol refinement is described only at a high level in the abstract and method overview; it is unclear whether reconstruction accuracy is scored using an external oracle (true global state) or solely from local observations and messages. This distinction is load-bearing for the central claim, because standard MARL settings provide no oracle and any dependence on privileged information would render the method non-deployable under the partial-observability regime the paper targets.

Authors: We thank the referee for highlighting this critical clarification. The state-awareness criterion in LMAC is computed exclusively from the agents' local observations and exchanged messages, without any external oracle or access to the true global state. The LLM evaluates protocol quality by reasoning over expected reconstruction consistency from the partial views available to each agent. This ensures the approach remains fully compatible with standard partial-observability settings and decentralized execution. We will revise the method section to include a formal definition and explicit statement that no privileged information is used.
Revision: yes
Referee: [Experiments] The abstract asserts 'substantial performance gains' and 'improved state reconstruction' yet supplies no quantitative metrics, error bars, baseline names, or statistical significance tests. Without these details the empirical support for the performance claim cannot be evaluated and the cross-benchmark generality asserted in the abstract remains unverified.

Authors: We agree that the abstract is high-level and omits specific numbers. The full manuscript reports quantitative metrics for both state reconstruction accuracy and task performance, error bars across multiple seeds, comparisons against named baselines, and statistical significance tests in the Experiments section. We will revise the abstract to incorporate key quantitative highlights (e.g., average gains and reconstruction improvements) while retaining its concise style.
Revision: yes
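The oracle question in the first exchange can be made concrete with a small contrast. The sketch below is purely illustrative and is not the paper's criterion: `oracle_error` needs the true state, while `disagreement` is a hypothetical oracle-free proxy built only from the agents' own estimates. Both function names and the choice of Euclidean distance are assumptions.

```python
import math
from itertools import combinations
from statistics import mean
from typing import Sequence, Tuple

Point = Tuple[float, float]


def oracle_error(true_state: Point, estimates: Sequence[Point]) -> float:
    """Mean distance to the true state; requires privileged access to it."""
    return mean(math.dist(true_state, est) for est in estimates)


def disagreement(estimates: Sequence[Point]) -> float:
    """Hypothetical oracle-free proxy: mean pairwise distance between estimates."""
    return mean(math.dist(a, b) for a, b in combinations(estimates, 2))


# Both agents agree on the same wrong estimate.
estimates = [(3.0, 4.0), (3.0, 4.0)]
print(oracle_error((0.0, 0.0), estimates))  # nonzero: the estimates are wrong
print(disagreement(estimates))              # zero: the agents fully agree
```

The example shows why the distinction matters. A consistency-style score computed without the true state can report perfect agreement while every agent is wrong in the same way, so a formal definition of the criterion has to say how such shared errors are detected.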

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces LMAC as a novel LLM-based method for designing and iteratively refining communication protocols in MARL via an explicit state-awareness criterion. No equations, fitted parameters, or self-referential definitions appear in the provided abstract or description that would reduce the claimed state reconstruction improvements or performance gains to a construction by definition. The approach is presented as an external proposal leveraging LLM reasoning capabilities, with experiments on external benchmarks providing independent validation. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to force the central claims. The derivation chain remains self-contained without reducing predictions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unverified premise that LLMs can serve as effective protocol designers for MARL; no free parameters or invented physical entities are mentioned, but the LLM reasoning step functions as a domain assumption without independent evidence in the abstract.

axioms (1)

Domain assumption: Large language models possess reasoning capabilities that can be used to design and iteratively refine effective communication protocols for multi-agent reinforcement learning.
Invoked when the abstract states that LMAC leverages an LLM's reasoning capability to design the protocol.

invented entities (1)

LMAC (LLM-driven Multi-Agent Communication) · no independent evidence
purpose: Framework that uses LLMs to create and refine communication protocols for better state reconstruction in MARL.
New method name and structure introduced to address the stated limitations of prior approaches.

pith-pipeline@v0.9.0 · 5643 in / 1367 out tokens · 38905 ms · 2026-05-20T11:03:41.211534+00:00 · methodology
