What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems
Pith reviewed 2026-05-21 06:12 UTC · model grok-4.3
The pith
Absence of reasoning and verification in inter-agent communication degrades multi-agent performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Category-Aware Recovery Augmentation enforces the presence of critical information during communication and recovers up to 86.2% of failed cases. The results highlight the key role of information quality in effective MA collaboration.
What carries the argument
Category-Aware Recovery Augmentation, which categorizes critical information such as reasoning and verification and augments inter-agent messages to ensure their inclusion.
If this is right
- Error propagation in multi-agent systems can be mitigated by ensuring messages contain explicit reasoning.
- Verification steps in communications help maintain the integrity of information across agent iterations.
- Performance in collaborative tasks improves when agents exchange complete categories of information.
- The design of communication protocols is central to the success of multi-agent LLM setups.
Where Pith is reading between the lines
- This technique could be integrated into agent frameworks to automatically check and supplement messages.
- Similar principles might apply to improving communication in other AI systems like tool-using agents.
- Exploring the minimal set of categories needed could lead to more efficient augmentation methods.
Load-bearing premise
The categories of critical information such as reasoning and verification are broadly applicable across different tasks and agent architectures without introducing new failure modes.
What would settle it
Applying the Category-Aware Recovery Augmentation to a different multi-agent task or architecture and measuring a recovery rate much lower than 86 percent would falsify the general applicability of the claim.
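One way to pin down the quantity being tested, assuming "recovery rate" means the share of baseline-failed cases that succeed once the augmentation is applied. The symbols F and R are introduced here for illustration and do not come from the paper.

```latex
\[
\text{recovery rate} = \frac{|R|}{|F|}, \qquad R \subseteq F
\]
% F: cases the unaugmented multi-agent system fails (assumed definition)
% R: cases in F that are solved once the augmentation is applied
```

Under this reading, the reported figure corresponds to |R|/|F| = 0.862 in the best reported setting, and a replication on a new task or architecture would compare its own ratio against that value.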
Original abstract
Large Language Models (LLMs) have enabled collaborative Multi-Agent (MA) systems, where interacting agents improve performance through diverse reasoning and iterative refinement. However, these systems remain vulnerable to error propagation, where early-stage information degrades downstream reasoning. To address this, we conduct a systematic analysis of inter-agent communication to identify which information drives MA performance. We find that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Based on these insights, we propose Category-Aware Recovery Augmentation (CARA), which enforces the presence of critical information during communication. CARA recovers up to 86.2% of failed cases. Our results highlight the key role of information quality in effective MA collaboration. Our code is available at https://anonymous.4open.science/r/cara_mas
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a systematic empirical analysis of inter-agent communications in LLM-based multi-agent systems. The authors observe that the lack of explicit reasoning and verification steps in these communications leads to degraded performance due to error propagation. They introduce a Category-Aware Recovery Augmentation (CARA) method that injects these critical information categories into the communication process. Experiments show that this approach recovers up to 86.2% of cases that previously failed. The work underscores the importance of information quality for successful collaboration in such systems.
Significance. Should the findings prove robust, this paper makes a meaningful contribution by characterizing the types of information that are pivotal in multi-agent LLM interactions and offering a targeted intervention to mitigate common failure modes. The high recovery rate indicates potential practical utility for improving MA system reliability. Explicit code release aids in verifying and extending the results.
major comments (2)
- [Section 3] Section 3 (Communication Analysis): The identification of reasoning and verification as the primary missing categories is derived from pattern observation in the evaluated trajectories. However, the paper does not provide a quantitative measure of how frequently these categories appear or are absent across different agent setups, which is necessary to establish them as the load-bearing factors for the performance claims.
- [Section 5] Section 5 (Experimental Evaluation): The reported 86.2% recovery is presented in aggregate without per-task breakdowns or a control condition that injects equivalent additional information without restricting to the identified categories. This leaves open whether the gains arise from category enforcement specifically or from stronger general prompting, which is central to validating the information-quality hypothesis.
minor comments (2)
- [Abstract] Abstract: The claim that the technique 'recovers up to 86.2% of failed cases' would be strengthened by stating the total number of evaluated cases and the baseline failure rate for context.
- [Method] Method section: The description of how categories are detected and enforced during augmentation would benefit from a concise pseudocode listing or explicit decision rules to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important areas for strengthening the empirical support in our analysis of inter-agent communication. We address each major comment below and commit to revisions that will clarify the role of specific information categories without overstating current results.
- Referee: [Section 3] Section 3 (Communication Analysis): The identification of reasoning and verification as the primary missing categories is derived from pattern observation in the evaluated trajectories. However, the paper does not provide a quantitative measure of how frequently these categories appear or are absent across different agent setups, which is necessary to establish them as the load-bearing factors for the performance claims.
Authors: We agree that quantitative frequency measures would provide stronger grounding for identifying reasoning and verification as load-bearing factors. The current analysis in Section 3 relies on systematic pattern observation across trajectories, but we will revise the section to include explicit counts and percentages of category presence/absence, broken down by agent setup and task type. This will be presented in a new table or figure to directly support the performance claims.
Revision: yes
- Referee: [Section 5] Section 5 (Experimental Evaluation): The reported 86.2% recovery is presented in aggregate without per-task breakdowns or a control condition that injects equivalent additional information without restricting to the identified categories. This leaves open whether the gains arise from category enforcement specifically or from stronger general prompting, which is central to validating the information-quality hypothesis.
Authors: We acknowledge that the aggregate reporting of the 86.2% recovery rate limits interpretability. In the revised manuscript, we will add per-task breakdowns of the recovery rates in Section 5 to show consistency across tasks. To directly test whether gains stem from the specific categories rather than general prompting, we will also include a control condition that injects comparable amounts of additional information without category restrictions; results from this ablation will be reported alongside the main CARA results to better isolate the effect of information quality.
Revision: yes
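The per-task breakdown and control comparison discussed in this exchange could be tabulated along the following lines. This is a minimal sketch, not the authors' evaluation code: the record layout, the task names (`task_a`, `task_b`), the condition labels, and every value in `example` are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per previously failed case: (task, condition, recovered).
# The layout and labels are assumptions made for this illustration.
Record = Tuple[str, str, bool]


def recovery_by_task(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Fraction of previously failed cases recovered, per task and condition."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [recovered, total]
    for task, condition, recovered in records:
        cell = counts[task][condition]
        cell[0] += int(recovered)
        cell[1] += 1
    return {
        task: {cond: rec / total for cond, (rec, total) in conds.items()}
        for task, conds in counts.items()
    }


# Made-up placeholder data, NOT taken from the paper.
example = [
    ("task_a", "category_enforced", True),
    ("task_a", "category_enforced", True),
    ("task_a", "length_matched_control", True),
    ("task_a", "length_matched_control", False),
    ("task_b", "category_enforced", True),
    ("task_b", "category_enforced", False),
    ("task_b", "length_matched_control", False),
    ("task_b", "length_matched_control", False),
]

rates_by_task = recovery_by_task(example)
for task, rates in rates_by_task.items():
    # A positive gap would suggest the gain is not explained by extra text alone.
    gap = rates["category_enforced"] - rates["length_matched_control"]
    print(f"{task}: gap={gap:.2f}")
```

Reporting the gap per task, rather than one pooled number, is what would let a reader see whether category enforcement helps consistently or only on a subset of tasks.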
Circularity Check
Empirical study derives augmentation from observed communication patterns with no reduction to inputs by construction
The paper performs a systematic empirical analysis of inter-agent messages in LLM-based multi-agent systems, identifies the absence of reasoning and verification steps as a performance-degrading factor through direct observation of trajectories, and introduces Category-Aware Recovery Augmentation as a technique motivated by those observations. The reported 86.2% recovery is an experimental outcome on previously failed cases rather than a quantity forced by fitting or redefinition. No equations, uniqueness theorems, or self-citations are invoked to make the central claim equivalent to its inputs; the derivation remains self-contained and externally falsifiable via the provided code and task evaluations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Interacting LLM agents improve performance through diverse reasoning and iterative refinement but remain vulnerable to error propagation.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We propose Category-Aware Recovery Augmentation (CARA), which enforces the presence of critical information during communication. CARA recovers up to 86.2% of failed cases."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Occlusion analysis to systematically mask each identified information category"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix
This appendix complements the main paper by providing additional experimental details, prompt templates, supplementary ...
- Prompt Augmentation. The system prompt for each agent is extended with explicit instructions specifying the critical information categories that must be present in the response.
- Response Verification. After generation, the response is checked to verify whether all required categories are present. If any are missing, the agent is re-invoked with a correction instruction, for up to a fixed number of retries, until all critical information is included in its response.
Layer 1: Prompt Augmentation. CARA System Prompt (Initial Response) <MA...
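The prompt-augmentation and response-verification steps described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the keyword cues in `CATEGORY_CUES`, the wording of the appended instruction and of the correction message, the `agent` callable signature, and the default of two retries are all assumptions, since the source does not specify how categories are detected or how many retries are used.

```python
from typing import Callable, Dict, List

# Hypothetical detectors: keyword cues stand in for whatever category
# detection the paper actually uses (not specified in the source).
CATEGORY_CUES: Dict[str, List[str]] = {
    "reasoning": ["because", "therefore"],
    "verification": ["verify", "verified", "check"],
}


def augment_system_prompt(system_prompt: str, categories: List[str]) -> str:
    """Prompt augmentation: append instructions naming the required categories."""
    return f"{system_prompt}\nYour response must include: {', '.join(categories)}."


def missing_categories(response: str, categories: List[str]) -> List[str]:
    """Return the required categories for which no cue appears in the response."""
    text = response.lower()
    return [c for c in categories if not any(cue in text for cue in CATEGORY_CUES[c])]


def respond_with_verification(
    agent: Callable[[str, str], str],
    system_prompt: str,
    message: str,
    categories: List[str],
    max_retries: int = 2,  # assumed value; the source only says "a fixed number"
) -> str:
    """Response verification: re-invoke the agent while categories are missing."""
    prompt = augment_system_prompt(system_prompt, categories)
    response = agent(prompt, message)
    for _ in range(max_retries):
        missing = missing_categories(response, categories)
        if not missing:
            break
        correction = f"{message}\nYour previous response lacked: {', '.join(missing)}."
        response = agent(prompt, correction)
    return response


def stub_agent(system_prompt: str, message: str) -> str:
    """Toy stand-in for an LLM call: complies only after a correction."""
    if "lacked" in message:
        return "The answer is 4 because 2 + 2 = 4. I checked this by counting."
    return "The answer is 4."


final = respond_with_verification(
    stub_agent, "You are a solver.", "What is 2 + 2?", ["reasoning", "verification"]
)
print(final)
```

Bounding the loop with `max_retries` keeps the cost predictable when an agent never produces the required categories; in that case the last response is returned as-is, which is one possible policy among several.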