TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System

Chunyu Miao; Henry Peng Zou; Tommaso Tognoli; Wenzhe Fan; Xinhua Zhang; Yibo Wang

arxiv: 2602.03688 · v2 · submitted 2026-02-03 · 💻 cs.AI

Wenzhe Fan, Tommaso Tognoli, Henry Peng Zou, Chunyu Miao, Yibo Wang, Xinhua Zhang

Pith reviewed 2026-05-16 07:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systems · LLM agents · dynamic communication · policy gradient · collaboration topology · task optimization · adversarial settings

The pith

TodyComm learns round-specific communication topologies for multi-agent LLM systems via policy gradients to adapt to shifting tasks and adversaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-round LLM-based multi-agent systems often suffer when communication structures stay fixed across rounds, even as agent roles shift due to adversaries, task changes, or bandwidth limits. TodyComm addresses this by generating behavior-driven collaboration topologies that evolve at each round. It trains these topologies end-to-end through policy gradient to maximize task utility. On five benchmarks the method delivers higher task success than fixed baselines while using fewer tokens and scaling across different adversarial strengths.

Core claim

TodyComm is a task-oriented dynamic communication algorithm that produces behavior-driven collaboration topologies adapting to the dynamics at each round and optimizes the utility for the task through policy gradient, achieving superior task performance while maintaining token efficiency, scalability, and strong generalizability across varying adversarial conditions.

What carries the argument

Policy-gradient training of round-by-round communication topologies that select which agents communicate based on observed behaviors and immediate task reward.
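
To make the mechanism concrete, here is a minimal sketch of REINFORCE over a per-edge Bernoulli topology policy. It is not the paper's implementation: the fixed logit table (rather than a behavior-conditioned policy), the single-round setup, the `toy_reward` function, and every name and hyperparameter below are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_topology(logits, rng):
    """Sample a directed edge mask: edge (i, j) is on with probability sigmoid(logits[i][j])."""
    n = len(logits)
    return [[1 if i != j and rng.random() < sigmoid(logits[i][j]) else 0
             for j in range(n)] for i in range(n)]


def reinforce_step(logits, mask, reward, baseline, lr):
    """One REINFORCE update. For a Bernoulli edge, d log p(mask) / d logit = mask - p."""
    advantage = reward - baseline
    n = len(logits)
    for i in range(n):
        for j in range(n):
            if i != j:
                logits[i][j] += lr * advantage * (mask[i][j] - sigmoid(logits[i][j]))


def toy_reward(mask, adversary):
    """Hypothetical utility: +1 per message sent by an honest agent, -1 per adversary message."""
    return sum((-1 if i == adversary else 1) * sum(row) for i, row in enumerate(mask))


def train(n_agents=3, adversary=2, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * n_agents for _ in range(n_agents)]
    baseline = 0.0
    for _ in range(steps):
        mask = sample_topology(logits, rng)
        reward = toy_reward(mask, adversary)
        reinforce_step(logits, mask, reward, baseline, lr)
        baseline += 0.05 * (reward - baseline)  # running-mean baseline to reduce variance
    return logits
```

Calling `train()` returns the learned logit table. Under this toy reward, edges leaving the assumed adversary (agent 2) should drift toward low probability and edges leaving honest agents toward high probability. A round-specific policy would condition the logits on observed agent behavior instead of storing them in a table.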

If this is right

Communication topologies become round-specific rather than static, improving resilience when adversaries or constraints evolve.
Task performance rises on benchmarks while total tokens exchanged stay lower than fixed dense graphs.
The same learned policy generalizes across different levels of adversarial strength without retraining.
The approach respects explicit communication budgets without collapsing task accuracy.
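
One simple way a communication budget could be enforced is to keep only the highest-scoring edges each round. The sketch below assumes that reading. The abstract does not say how TodyComm applies its budget, and `apply_budget`, the edge scores, and the budget value are all hypothetical.

```python
def apply_budget(edge_scores, budget):
    """Keep at most `budget` directed edges with the highest scores; drop the rest."""
    ranked = sorted(edge_scores.items(), key=lambda kv: kv[1], reverse=True)
    return {edge for edge, _ in ranked[:max(budget, 0)]}


# Placeholder scores for four directed edges (sender, receiver); not taken from the paper.
scores = {(0, 1): 0.9, (1, 0): 0.4, (0, 2): 0.7, (2, 1): 0.1}
round_edges = apply_budget(scores, budget=2)
```

A hard top-k cut like this keeps the per-round message count bounded regardless of what the policy proposes, at the cost of being non-differentiable. That is one reason a score-function estimator such as REINFORCE is a natural fit.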

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same round-wise topology selection could be applied to non-LLM agent teams where message costs differ by modality.
Adding a small auxiliary reward for topology sparsity might further cut token use without separate tuning.
The learned topologies could serve as a diagnostic tool to reveal which agent pairs matter most at each stage of a task.
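
The sparsity extension above could be prototyped as a shaped reward that charges a small cost per active edge. This is a sketch of that editorial suggestion, not something the paper describes; `shaped_reward` and the default `sparsity_weight` are assumed for illustration.

```python
def shaped_reward(task_reward, mask, sparsity_weight=0.05):
    """Task reward minus a per-edge penalty; `mask` is a 0/1 adjacency matrix for one round."""
    n_edges = sum(sum(row) for row in mask)
    return task_reward - sparsity_weight * n_edges
```

With `sparsity_weight=0` this reduces to the plain task reward, so the penalty can be swept from zero to see where task accuracy starts to degrade.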

Load-bearing premise

Policy gradient can reliably learn effective round-by-round communication topologies from task feedback without prohibitive sample complexity or instability when agent roles and constraints change across rounds.

What would settle it

Running TodyComm and a strong fixed-topology baseline on the same five benchmarks under dynamic adversarial conditions and observing no performance gap or worse token usage would falsify the claimed advantage.

Original abstract

Multi-round LLM-based multi-agent systems rely on effective communication structures to support collaboration across rounds. However, most existing methods employ a fixed communication topology during inference, which falls short in many realistic applications where the agents' roles may change across rounds due to dynamic adversary, task progression, or time-varying constraints such as communication bandwidth. In this paper, we propose addressing this issue through TodyComm, a task-oriented dynamic communication algorithm. It produces behavior-driven collaboration topologies that adapt to the dynamics at each round, optimizing the utility for the task through policy gradient. Experiments on five benchmarks demonstrate that, under both dynamic adversarial settings and communication budget constraints, TodyComm achieves superior task performance while maintaining token efficiency, scalability, and strong generalizability across varying adversarial conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TodyComm uses policy gradients to pick per-round communication links in multi-agent LLM systems, but the abstract gives no implementation details or results to check if it actually works.

TodyComm treats the choice of which agents communicate each round as a policy-gradient decision that maximizes task utility. This directly targets the problem of fixed topologies failing when roles shift due to adversaries or bandwidth limits, which is a real constraint in deployed multi-agent setups. The framing is straightforward and applies standard RL machinery to a new domain without obvious circularity. What stands out is the focus on behavior-driven adaptation rather than static graphs or hand-coded rules. The paper does a clean job naming the gap in existing work.

The soft spots are in the evidence. The abstract claims better task performance and token efficiency on five benchmarks under dynamic conditions, yet supplies no numbers, baselines, ablations, or training details. Policy-gradient updates on discrete agent-pair choices will see high variance from LLM sampling noise and non-stationary rewards when roles change; without any mention of variance reduction or entropy terms, convergence is not guaranteed within reasonable sample budgets. The stress-test note on instability looks plausible given the setup.

This is for researchers already working on multi-agent LLM coordination who need ideas for adaptive communication. A reader could pull the core idea and try it, but the current write-up does not let you verify the claims. I would send it to peer review so the authors can add the missing method and result sections.

Referee Report

2 major / 0 minor

Summary. The paper proposes TodyComm, a task-oriented dynamic communication algorithm for multi-round LLM-based multi-agent systems. It generates behavior-driven collaboration topologies that adapt each round to dynamics such as changing agent roles, adversaries, or constraints, optimizing task utility via policy gradient. Experiments on five benchmarks are claimed to demonstrate superior task performance, token efficiency, scalability, and generalizability under dynamic adversarial settings and communication budget constraints.

Significance. If the empirical results hold after proper validation, the work would be significant for multi-agent LLM systems by replacing fixed communication topologies with learned, round-by-round adaptive structures. This directly addresses a practical limitation in dynamic environments and applies standard policy-gradient machinery to a new application domain.

major comments (2)

[Abstract] Abstract: The abstract asserts superior performance on five benchmarks but supplies no method details, baselines, statistical tests, or ablation results, so the data cannot be checked against the claim. This is load-bearing for the central empirical assertion.
[Method] Policy gradient section: The claim that policy gradient can reliably learn effective round-by-round topologies from task feedback lacks any derivation, variance-reduction technique, baseline, or ablation addressing high variance from stochastic LLM outputs and non-stationarity induced by role changes across rounds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each point below and will incorporate changes to improve clarity and rigor in the revised manuscript.

Referee: [Abstract] Abstract: The abstract asserts superior performance on five benchmarks but supplies no method details, baselines, statistical tests, or ablation results, so the data cannot be checked against the claim. This is load-bearing for the central empirical assertion.

Authors: We agree that the abstract is concise and omits these supporting details. In the revised version, we will expand the abstract to briefly outline the TodyComm policy-gradient approach, name the primary baselines, and report key performance gains with mention of statistical significance. Full method descriptions, baselines, statistical tests, and ablation results remain in Sections 3 and 4; the abstract update will make the central claims more verifiable without exceeding length limits. revision: yes
Referee: [Method] Policy gradient section: The claim that policy gradient can reliably learn effective round-by-round topologies from task feedback lacks any derivation, variance-reduction technique, baseline, or ablation addressing high variance from stochastic LLM outputs and non-stationarity induced by role changes across rounds.

Authors: We acknowledge the need for greater rigor here. The current text presents the policy-gradient formulation at a high level. In revision we will (1) add an explicit derivation of the REINFORCE-style estimator with task-utility reward, (2) introduce a learned baseline for variance reduction, (3) include an ablation isolating the effects of LLM sampling noise and round-wise role non-stationarity, and (4) describe the periodic re-optimization schedule used to mitigate non-stationarity. These additions will directly address the reliability concerns. revision: yes
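
For reference, the kind of estimator the simulated rebuttal promises would typically take the standard REINFORCE-with-baseline form below. The notation is assumed here, not taken from the paper: $G_t$ is the topology sampled at round $t$, $s_t$ the observed agent behavior, $\pi_\theta$ the topology policy, $R$ the task utility, and $b$ a learned baseline.

```latex
% Standard score-function (REINFORCE) gradient with a state-dependent baseline.
% All symbols are assumed notation, not the paper's.
\nabla_\theta J(\theta)
  = \mathbb{E}_{G_{1:T} \sim \pi_\theta}\!\left[
      \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(G_t \mid s_t)\,\bigl(R - b(s_t)\bigr)
    \right],
\qquad
\mathbb{E}_{G_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[
      \nabla_\theta \log \pi_\theta(G_t \mid s_t)\, b(s_t)
    \right] = 0 .
```

The second identity is why subtracting a baseline that depends only on $s_t$ leaves the gradient unbiased while reducing its variance, which is the referee's core concern.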

Circularity Check

0 steps flagged

No circularity; standard policy-gradient application to dynamic multi-agent topologies

The paper applies off-the-shelf policy gradient to learn per-round communication topologies from task utility feedback. No derivation step reduces to a self-definition, fitted input renamed as prediction, or self-citation chain. The central claim rests on the empirical performance of the learned policy rather than any tautological equivalence between inputs and outputs. This is a normal, non-circular use of RL machinery in a new application domain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the ledger is therefore empty.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: optimizing the utility for the task through policy gradient... REINFORCE algorithm... agent-level credit values... GRN

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: TodyComm models multi-round agent interactions using two graph structures... Markov decision process

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer
cs.LG 2026-05 unverdicted novelty 7.0

MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 7.0

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 conditional novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.