Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
Pith reviewed 2026-05-09 21:44 UTC · model grok-4.3
The pith
Multi-agent language models improve reasoning by learning to communicate through internal latent representations instead of text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiffMAS treats latent communication as a learnable component by performing parameter-efficient supervised training over multi-agent latent trajectories. This lets the agents jointly learn how to encode and interpret information across interactions. The paper reports improved reasoning accuracy and decoding stability, with scores of 26.7% on AIME24 and 20.2% on GPQA-Diamond.
What carries the argument
DiffMAS, the training framework that performs parameter-efficient supervised training over multi-agent latent trajectories to jointly optimize how agents encode and interpret shared information.
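The page gives no equations for the adapters, so what follows is only a toy illustration of the general idea and not DiffMAS itself. Two frozen "agents" exchange a two-dimensional latent message, and the only trainable parameters are a per-channel scaling applied to that message. Every name here (`sender`, `receiver`, `train_adapter`), the linear agents, and the toy data are assumptions made for illustration.

```python
import random

# ASSUMPTION: a frozen "sender" agent that maps an input to a 2-d latent message.
def sender(x):
    return [x[0] + x[1], x[0] - x[1]]

# ASSUMPTION: a frozen "receiver" agent that reads a latent message linearly.
RECEIVER_WEIGHTS = [1.0, 0.5]

def receiver(message):
    return sum(w * m for w, m in zip(RECEIVER_WEIGHTS, message))

def communicate(x, adapter):
    # The adapter rescales each latent channel before the receiver reads it.
    latent = sender(x)
    return receiver([a * h for a, h in zip(adapter, latent)])

def mse(data, adapter):
    return sum((communicate(x, adapter) - y) ** 2 for x, y in data) / len(data)

def train_adapter(data, steps=2000, lr=0.05, seed=0):
    """Gradient descent on the adapter only; sender and receiver stay frozen."""
    rng = random.Random(seed)
    adapter = [rng.uniform(-0.1, 0.1) for _ in range(2)]  # random initialization
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            err = communicate(x, adapter) - y
            latent = sender(x)
            for i in range(2):
                # d(prediction)/d(adapter[i]) = RECEIVER_WEIGHTS[i] * latent[i]
                grad[i] += 2 * err * RECEIVER_WEIGHTS[i] * latent[i] / len(data)
        adapter = [a - lr * g for a, g in zip(adapter, grad)]
    return adapter

# Toy supervised set (hypothetical): the ground-truth answer is 2 * x[0].
DATA = [((1.0, 0.0), 2.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), 2.0), ((2.0, 1.0), 4.0)]

adapter = train_adapter(DATA)
print("learned adapter:", adapter)
print("final loss:", mse(DATA, adapter))
```

The point of the sketch is the division of labor: the supervised loss is computed on the receiver's final answer, but the gradient updates only the small set of parameters sitting on the communication channel.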
If this is right
- Consistent accuracy gains across mathematical reasoning, scientific QA, code generation, and commonsense benchmarks.
- Superior performance to single-agent inference, text-based multi-agent systems, and prior latent communication methods.
- Improved decoding stability during multi-agent interactions.
- Reported scores of 26.7% on AIME24 and 20.2% on GPQA-Diamond.
Where Pith is reading between the lines
- This latent approach might reduce communication overhead when scaling to teams of more than two agents.
- It could allow agent groups to develop task-specific internal protocols without human-readable prompts.
- Similar trajectory-based optimization might apply to non-text modalities such as vision or structured data.
Load-bearing premise
That supervised training on multi-agent latent trajectories can be done in a parameter-efficient way that jointly optimizes encoding and interpretation, without instability and without requiring task-specific data that may be unavailable.
What would settle it
The claim would be undermined if DiffMAS produced no accuracy gains, or reduced stability, on held-out reasoning benchmarks relative to single-agent or text-based baselines, or if it turned out to require full model updates rather than parameter-efficient training.
Original abstract
Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DiffMAS, a training framework that treats latent communication in multi-agent LLM systems as a learnable component via parameter-efficient supervised training over multi-agent latent trajectories. This enables agents to jointly optimize how information is encoded and interpreted across interactions. Experiments across mathematical reasoning, scientific QA, code generation, and commonsense benchmarks report consistent gains over single-agent inference, text-based multi-agent systems, and prior latent communication methods, including 26.7% on AIME24 and 20.2% on GPQA-Diamond.
Significance. If the empirical results and training procedure hold, this would advance multi-agent LLM systems by shifting from fixed or text-based communication interfaces to jointly optimized latent protocols. The parameter-efficient supervised approach could improve scalability and reasoning stability on complex tasks, offering a practical path beyond role-orchestration-focused methods.
major comments (2)
- [§3] §3 (Method): The procedure for collecting and labeling multi-agent latent trajectories for supervised training is not specified in sufficient detail. If trajectory generation depends on running base multi-agent systems on tasks with known answers or requires task-specific ground-truth labels, this creates potential circularity and data-scarcity issues that directly undermine the central claim of general, end-to-end optimization without pre-existing optimized systems.
- [§4] §4 (Experiments): The reported benchmark improvements (e.g., 26.7% on AIME24, 20.2% on GPQA-Diamond) lack accompanying details on training procedure, data construction, baseline implementations, number of runs, variance, or statistical significance testing. These omissions make it impossible to verify reproducibility or assess whether the gains are robust, which is load-bearing for the claim of consistent outperformance.
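One way to address the request for significance testing on paired runs is an exact Wilcoxon signed-rank test, which is practical to compute by brute force when there are only a handful of runs. The sketch below is an illustration under assumptions: the function name `wilcoxon_signed_rank` is hypothetical, and the example counts are made-up placeholders, not results from the paper.

```python
from itertools import product

def wilcoxon_signed_rank(baseline, treatment):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Zero differences are dropped; tied magnitudes get their average rank.
    """
    diffs = [t - b for b, t in zip(baseline, treatment) if t != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    stat = min(w_plus, total - w_plus)
    # Enumerate all 2^n sign patterns to build the exact null distribution.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= stat:
            hits += 1
    return stat, hits / 2 ** n

# HYPOTHETICAL data: problems solved per run by a baseline and by a trained system.
baseline_runs = [5, 6, 5, 7, 6]
trained_runs = [8, 8, 6, 11, 11]
stat, p_value = wilcoxon_signed_rank(baseline_runs, trained_runs)
print(stat, p_value)
```

With five paired runs the smallest achievable two-sided p-value is 2/32, so even a clean sweep cannot reach the conventional 0.05 threshold. That is a practical argument for reporting more runs alongside the variance.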
minor comments (2)
- [Abstract] Abstract: The phrase 'decoding stability' is invoked as a benefit but is neither defined nor quantified, leaving its meaning and measurement unclear.
- [Throughout] Throughout: Notation for latent representations (e.g., key-value caches) would benefit from explicit equations or diagrams to clarify the communication mechanism.
Simulated Author's Rebuttal
We are grateful to the referee for the detailed and constructive feedback. We believe the suggested revisions will strengthen the paper by enhancing clarity on the method and experimental rigor. Below we provide point-by-point responses to the major comments.
-
Referee: [§3] §3 (Method): The procedure for collecting and labeling multi-agent latent trajectories for supervised training is not specified in sufficient detail. If trajectory generation depends on running base multi-agent systems on tasks with known answers or requires task-specific ground-truth labels, this creates potential circularity and data-scarcity issues that directly undermine the central claim of general, end-to-end optimization without pre-existing optimized systems.
Authors: We thank the referee for highlighting this important point. On review, we acknowledge that the description in §3 needs more detail to avoid ambiguity. In the revised manuscript, we will specify that the multi-agent latent trajectories are collected by running standard forward passes with the base (unoptimized) LLMs on the training set of tasks. The available ground-truth answers supply the supervised loss on the reasoning outputs, while the latent communication parameters are trained to improve how information is encoded and decoded between agents. This setup does not rely on pre-existing optimized systems: the base models remain frozen and training starts from randomly initialized adapters. We will also discuss how the approach extends to tasks without labels through consistency-based pseudo-labeling, which mitigates the data-scarcity concern.
Revision: yes
-
Referee: [§4] §4 (Experiments): The reported benchmark improvements (e.g., 26.7% on AIME24, 20.2% on GPQA-Diamond) lack accompanying details on training procedure, data construction, baseline implementations, number of runs, variance, or statistical significance testing. These omissions make it impossible to verify reproducibility or assess whether the gains are robust, which is load-bearing for the claim of consistent outperformance.
Authors: We agree that the experimental section lacks sufficient detail for full reproducibility and robustness assessment. We will revise §4 and add an appendix covering: the training procedure (e.g., learning rate, batch size, number of training steps); data construction (how trajectories are sampled from the benchmark datasets); baseline implementations (model versions, hyperparameters); results averaged over multiple independent runs with reported variance; and p-values from statistical tests (e.g., the Wilcoxon signed-rank test) to confirm the significance of the improvements. These additions will directly support the claims of consistent outperformance.
Revision: yes
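The consistency-based pseudo-labeling mentioned in the response to the §3 comment is not specified further on this page. A common reading is majority voting over several sampled answers, keeping the label only when agreement is high enough. The sketch below assumes that reading; the function name `pseudo_label` and the `min_agreement` threshold are hypothetical choices, not details from the paper.

```python
from collections import Counter

def pseudo_label(sampled_answers, min_agreement=0.6):
    """Return the majority answer if enough samples agree, else None.

    ASSUMPTION: answers are already normalized strings, so equal answers
    compare equal. The 0.6 threshold is an illustrative default.
    """
    if not sampled_answers:
        return None
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    if votes / len(sampled_answers) >= min_agreement:
        return answer
    return None  # too little agreement: leave the example unlabeled

# Hypothetical samples for one unlabeled question.
print(pseudo_label(["42", "42", "41", "42", "42"]))
print(pseudo_label(["a", "b", "c"]))
```

Discarding low-agreement examples trades coverage for label quality. Whether that trade-off holds up on hard benchmarks, where the majority answer can be confidently wrong, is exactly the kind of detail the revision would need to report.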
Circularity Check
No significant circularity: empirical training method without derivation chain
The paper describes DiffMAS as a parameter-efficient supervised training framework that operates over multi-agent latent trajectories to jointly optimize encoding and interpretation. No equations, first-principles derivations, or mathematical predictions are presented in the abstract or described claims. Performance gains are reported via empirical benchmarks (AIME24, GPQA-Diamond, etc.) rather than any reduction of a result to a fitted quantity defined by the method itself. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are invoked in the provided text. The work is self-contained as an empirical contribution and does not exhibit any of the enumerated circularity patterns.