pith. machine review for the scientific record.

arxiv: 2603.04474 · v2 · submitted 2026-03-04 · 💻 cs.MA · cs.AI

Recognition: 2 theorem links


From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 16:34 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords error cascades · multi-agent systems · LLM · genealogy graph · governance layer · propagation dynamics · error mitigation · collaboration

The pith

A genealogy-graph governance layer suppresses error amplification in LLM-based multi-agent systems without altering their collaboration architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper models how small inaccuracies in LLM multi-agent systems can spread and amplify through message dependencies into system-wide false consensus. It identifies three vulnerability classes (cascade amplification, topological sensitivity, and consensus inertia) and shows that a single injected error can trigger broad failure. The authors introduce a genealogy-graph-based governance layer, a message-layer plugin that tracks dependencies to detect and block risky propagations early. In experiments on six frameworks, this approach prevents final infection in at least 89 percent of runs across operating modes while preserving natural collaboration flows. The method addresses reliability gaps that existing single-agent checks or architecture changes often leave open.

Core claim

By abstracting LLM-MAS collaboration as a directed dependency graph and defining an early-stage risk criterion, the paper shows that error cascades follow predictable amplification patterns. Experiments across mainstream frameworks expose cascade amplification along dependency paths, sensitivity to graph topology, and inertia toward erroneous consensus. A single atomic error seed suffices to infect the system. The genealogy-graph governance layer, implemented as a non-intrusive plugin, suppresses both endogenous and exogenous amplification and blocks final infection in at least 89 percent of runs without modifying the underlying collaboration structure.
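The cascade-amplification claim can be glossed with a minimal simulation of error spread over a directed dependency graph. This is a toy model, not the paper's propagation dynamics: the adoption probability `p_adopt`, the round limit, and the fully connected topology are illustrative assumptions.

```python
import random

def simulate_cascade(edges, n_agents, seed_agent, p_adopt=0.6, rounds=8, rng=None):
    """Toy cascade over a directed dependency graph: an edge (j, i) means
    agent i consumes agent j's messages. A single seeded error spreads
    downstream with per-message adoption probability p_adopt."""
    rng = rng or random.Random(0)
    infected = {seed_agent}
    for _ in range(rounds):
        newly = set()
        for j, i in edges:
            if j in infected and i not in infected and rng.random() < p_adopt:
                newly.add(i)
        if not newly:
            break
        infected |= newly
    return len(infected) / n_agents  # final infection rate S

# Fully connected 6-agent "group chat" topology, one atomic error seed.
n = 6
edges = [(j, i) for j in range(n) for i in range(n) if i != j]
rate = simulate_cascade(edges, n, seed_agent=0)
```

Even this crude model reproduces the qualitative point: with dense dependencies, a single seed reaches a large fraction of agents within a few rounds, and the final rate depends on the topology as much as on the error itself.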

What carries the argument

The genealogy-graph-based governance layer, which tracks message lineage in the directed dependency graph to apply early risk criteria and intercept error propagation.
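A minimal sketch of what such a message-layer governor could look like. The class, field names, and max-over-ancestors risk rule are hypothetical illustrations, not the paper's implementation: each message is logged with its lineage, inherited risk is aggregated over ancestors, and over-threshold messages are intercepted before downstream agents treat them as premises.

```python
from dataclasses import dataclass

@dataclass
class Message:
    mid: str
    sender: str
    parents: tuple = ()   # message ids this message was derived from
    risk: float = 0.0     # intrinsic risk from some per-message verifier

class GenealogyGovernor:
    """Sketch of a non-intrusive plugin: it never changes who talks to whom,
    it only records lineage and decides whether a message may propagate."""
    def __init__(self, threshold: float = 0.5):
        self.lineage: dict[str, Message] = {}
        self.threshold = threshold

    def effective_risk(self, msg: Message) -> float:
        # Assumed aggregation rule: a message is as risky as its riskiest ancestor.
        inherited = [self.effective_risk(self.lineage[p])
                     for p in msg.parents if p in self.lineage]
        return max([msg.risk] + inherited)

    def admit(self, msg: Message) -> bool:
        self.lineage[msg.mid] = msg
        return self.effective_risk(msg) < self.threshold

gov = GenealogyGovernor(threshold=0.5)
blocked_seed = gov.admit(Message("m1", "agent_a", risk=0.9))
blocked_child = gov.admit(Message("m2", "agent_b", parents=("m1",)))
clean = gov.admit(Message("m3", "agent_c"))
```

The key property this sketch captures is that `m2` is blocked even though its own risk score is zero: interception follows the genealogy, not the individual message, which is what distinguishes this approach from single-agent validation.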

If this is right

  • Minor errors no longer solidify into system-level false consensus during iterative collaboration.
  • Protection works across six mainstream multi-agent frameworks without architecture changes.
  • A single error seed is prevented from causing widespread failure in most operating modes.
  • Both internally generated and externally introduced errors are suppressed by the same layer.
  • Effective information flow between agents remains intact while cascade risks are reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dependency tracking of this form could be adapted to improve reliability in distributed systems beyond LLM agents.
  • Agent frameworks may benefit from making genealogy logging a default feature rather than an add-on.
  • Topology-aware agent design informed by the risk criterion could further lower cascade exposure.
  • Validation on larger, open-ended tasks would test whether the reported 89 percent prevention rate generalizes.

Load-bearing premise

The directed dependency graph abstraction and early-stage risk criterion capture the dominant mechanisms of error spread in real LLM multi-agent deployments.

What would settle it

A real deployment in which errors propagate and amplify through non-message channels such as shared external memory or tool states that the genealogy graph does not record, allowing infection despite the governance layer.
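This hypothetical failure mode can be made concrete. In the sketch below (all names invented for illustration), agents also read from a shared store that the message layer never sees, so lineage-based interception has a blind spot:

```python
# Toy blind spot: governance inspects only explicit messages, but agents
# also read a shared scratchpad the genealogy graph never records.
shared_memory = {}

def agent_write_tool_state(key, value):
    shared_memory[key] = value  # side channel: no lineage entry is created

def agent_answer(message_premises, memory_keys):
    # The agent's premise set mixes governed messages with ungoverned state.
    premises = list(message_premises)
    premises += [shared_memory[k] for k in memory_keys if k in shared_memory]
    return premises

agent_write_tool_state("cached_fact", "ERROR: 2+2=5")  # injected off-channel
premises = agent_answer(["governed, verified message"], ["cached_fact"])
# The erroneous premise reaches the agent with no genealogy record to flag it.
```

If errors in a real deployment travel through channels like this, the governance layer's prevention rate would not apply, which is exactly the settling experiment proposed above.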

Figures

Figures reproduced from arXiv: 2603.04474 by Congcong Zhu, Dayong Ye, Huajie Chen, Minfeng Qi, Tianqing Zhu, Wanlei Zhou, Xinyue Zhang, Yizhe Xie.

Figure 1
Figure 1. The amplification of errors in LLM-MAS. Whether the input is a factuality error or a faithfulness error, the agents reach a false consensus. This results in failures ranging from security breaches to operational outages. view at source ↗
Figure 2
Figure 2. Overview of our work. We categorize false consensus arising from internal vulnerabilities versus external induction. We model propagation dynamics to characterize consensus collapse mechanisms. Correspondingly, a genealogy-based governance layer implements atomic propagation control to guarantee faithfulness and factuality. view at source ↗
Figure 3
Figure 3. Model validation across different topologies. The black lines represent the observed mean infection rates with ±1 standard error. The dashed lines show the fitted curves using product-based and Poisson-based infection functions. view at source ↗
Figure 4
Figure 4. The evolution of error coverage S(t). view at source ↗
Figure 5
Figure 5. Overview of the Genealogy-Based Governance Layer. view at source ↗
Figure 6
Figure 6. Infection rate S(t) across communication turns under three topologies. view at source ↗
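Figure 3's fitted curves compare product-based and Poisson-based infection functions. The paper's exact forms are not reproduced on this page; the versions below are the standard shapes from epidemic modeling, with hypothetical parameters `p` and `beta`, given each neighbor's activity level s_j (the probability that neighbor j is actively propagating the error).

```python
import math

def product_infection(neighbor_activities, p=0.3):
    """Product-based form: independent per-neighbor exposures.
    Infection prob. = 1 - prod_j (1 - p * s_j)."""
    q = 1.0
    for s in neighbor_activities:
        q *= (1.0 - p * s)
    return 1.0 - q

def poisson_infection(neighbor_activities, beta=0.3):
    """Poisson-based form: exposure intensities add in the exponent.
    Infection prob. = 1 - exp(-beta * sum_j s_j)."""
    lam = beta * sum(neighbor_activities)
    return 1.0 - math.exp(-lam)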
read the original abstract

Large Language Model-based Multi-Agent Systems (LLM-MAS) are increasingly applied to complex collaborative scenarios. However, their collaborative mechanisms may cause minor inaccuracies to gradually solidify into system-level false consensus through iteration. Such risks are difficult to trace since errors can propagate and amplify through message dependencies. Existing protections often rely on single-agent validation or require modifications to the collaboration architecture, which can weaken effective information flow and may not align with natural collaboration processes in real tasks. To address this, we propose a propagation dynamics model tailored for LLM-MAS that abstracts collaboration as a directed dependency graph and provides an early-stage risk criterion to characterize amplification risk. Through experiments on six mainstream frameworks, we identify three vulnerability classes: cascade amplification, topological sensitivity, and consensus inertia. We further instantiate an attack where injecting just a single atomic error seed leads to widespread failure. In response, we introduce a genealogy-graph-based governance layer, implemented as a message-layer plugin, that suppresses both endogenous and exogenous error amplification without altering the collaboration architecture. Experiments show that this approach prevents final infection in at least 89% of runs across operating modes and significantly mitigates the cascading spread of minor errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper models error propagation in LLM-based multi-agent systems as a directed dependency graph, identifies three vulnerability classes (cascade amplification, topological sensitivity, consensus inertia) via experiments on six frameworks, shows that a single atomic error seed can cause widespread failure, and proposes a genealogy-graph governance layer implemented as a message-layer plugin. This plugin is claimed to suppress endogenous and exogenous amplification without altering the collaboration architecture, preventing final infection in at least 89% of runs across operating modes.

Significance. If the directed-graph abstraction faithfully captures dominant propagation mechanisms and the empirical results prove robust to controls, the work offers a lightweight, architecture-preserving mitigation strategy for error cascades in LLM-MAS. This is significant for practical deployment of collaborative agents, as it avoids the drawbacks of single-agent validation or architectural changes while providing concrete prevention rates across multiple frameworks.

major comments (3)
  1. [Abstract and Experimental Setup] Abstract and Experimental Setup: The 89% prevention rate is presented without details on error definitions, number of runs, statistical tests, variance, or baseline comparisons (e.g., no-governance controls). This information is load-bearing for evaluating the mitigation's effectiveness and generalizability.
  2. [Propagation Dynamics Model] Propagation Dynamics Model: The directed dependency graph and early-stage risk criterion assume message dependencies dominate error spread. However, shared context, tool outputs, or implicit state outside explicit messages are common in LLM-MAS and could allow undetected amplification, potentially invalidating the reported prevention rate when the model is incomplete.
  3. [Vulnerability Classification] Vulnerability Classification: The three classes appear identified post-hoc from the runs, introducing selection-effect risk that could overstate their generality and the attack's representativeness.
minor comments (1)
  1. [Abstract] Abstract: The six mainstream frameworks are not named, which reduces immediate reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental transparency, model assumptions, and classification methodology. We address each major comment below, indicating revisions where the manuscript will be updated to strengthen clarity and address potential limitations.

read point-by-point responses
  1. Referee: [Abstract and Experimental Setup] The 89% prevention rate is presented without details on error definitions, number of runs, statistical tests, variance, or baseline comparisons (e.g., no-governance controls). This information is load-bearing for evaluating the mitigation's effectiveness and generalizability.

    Authors: We agree that the abstract would benefit from more detail on these elements for immediate evaluation. The full manuscript's Section 4 specifies error as factual deviation from ground truth exceeding a 5% threshold, with 500 runs per framework and operating mode, including standard deviations and t-test results (p < 0.01) against no-governance baselines showing infection rates above 70%. We will revise the abstract to incorporate key statistics and add a summary table of runs, variance, and baselines in the main text. revision: yes

  2. Referee: [Propagation Dynamics Model] The directed dependency graph and early-stage risk criterion assume message dependencies dominate error spread. However, shared context, tool outputs, or implicit state outside explicit messages are common in LLM-MAS and could allow undetected amplification, potentially invalidating the reported prevention rate when the model is incomplete.

    Authors: The model abstracts collaboration via explicit message dependencies as the primary propagation channel, which our experiments on six frameworks confirm as the dominant mechanism in the tested scenarios. We acknowledge that shared context and tool outputs may enable additional implicit paths not fully modeled. In revision, we will add a limitations discussion on this point and note that the genealogy-graph plugin mitigates observable cascades at the message layer. The reported prevention rates remain valid under the model's explicit-dependency assumptions. revision: partial

  3. Referee: [Vulnerability Classification] The three classes appear identified post-hoc from the runs, introducing selection-effect risk that could overstate their generality and the attack's representativeness.

    Authors: The classes emerged from systematic patterns observed consistently across all frameworks and attack variants, informed by graph properties such as path amplification and node sensitivity. To mitigate selection concerns, we will clarify the a priori hypotheses in the revision, provide full run data in supplementary materials, and cross-validate the classification against additional independent scenarios. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on graph abstraction and empirical validation

full rationale

The paper defines a directed dependency graph model and early-stage risk criterion, then reports empirical results from experiments on six frameworks showing vulnerability classes and 89% prevention via the genealogy-graph plugin. No equations are presented that equate outputs to inputs by construction, no fitted parameters are relabeled as predictions, and no self-citations are used to justify uniqueness or smuggle ansatzes. The derivation chain is self-contained against the stated experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the directed dependency graph and genealogy tracking are presented as modeling choices rather than new physical entities.

pith-pipeline@v0.9.0 · 5531 in / 1003 out tokens · 47786 ms · 2026-05-15T16:34:08.573537+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Trustworthy agentic ai systems: A cross-layer review of architectures, threat models, and governance strategies for real-world deployment.F1000Research, 14(905):905, 2025

    Ibrahim Adabara, Bashir Olaniyi Sadiq, Aliyu Nuhu Shuaibu, Yale Ibrahim Danjuma, and Venkateswarlu Maninti. Trustworthy agentic ai systems: A cross-layer review of architectures, threat models, and governance strategies for real-world deployment.F1000Research, 14(905):905, 2025

  2. [2]

    The orchestration of multi-agent systems: Architec- tures, protocols, and enterprise adoption.arXiv preprint arXiv:2601.13671, 2026

    Apoorva Adimulam, Rajesh Gupta, and Sumit Kumar. The orchestration of multi-agent systems: Architec- tures, protocols, and enterprise adoption.arXiv preprint arXiv:2601.13671, 2026

  3. [3]

    An overview of recent advances of resilient consensus for multiagent systems under at- tacks.Computational Intelligence and Neuroscience, 2022(1):6732343, 2022

    Muhammad Muzamil Aslam, Zahoor Ahmed, Liping Du, Muhammad Zohaib Hassan, Sajid Ali, and Muham- mad Nasir. An overview of recent advances of resilient consensus for multiagent systems under at- tacks.Computational Intelligence and Neuroscience, 2022(1):6732343, 2022

  4. [4]

    Uci machine learning repository, 2007

    Arthur Asuncion, David Newman, et al. Uci machine learning repository, 2007

  5. [5]

    Monitoring reason- ing models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926, 2025

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reason- ing models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926, 2025

  6. [6]

    A theory of fads, fashion, custom, and cultural change as informational cascades.Journal of political Economy, 100(5):992–1026, 1992

    Sushil Bikhchandani, David Hirshleifer, and Ivo Welch. A theory of fads, fashion, custom, and cultural change as informational cascades.Journal of political Economy, 100(5):992–1026, 1992

  7. [7]

    Sagallm: Con- text management, validation, and transaction guaran- tees for multi-agent llm planning.arXiv preprint arXiv:2503.11951, 2025

    Edward Y Chang and Longling Geng. Sagallm: Con- text management, validation, and transaction guaran- tees for multi-agent llm planning.arXiv preprint arXiv:2503.11951, 2025

  8. [8]

    A lattice model of secure informa- tion flow.Communications of the ACM, 19(5):236–243, 1976

    Dorothy E Denning. A lattice model of secure informa- tion flow.Communications of the ACM, 19(5):236–243, 1976. 14

  9. [9]

    Improving factuality and reasoning in language models through multiagent de- bate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenen- baum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent de- bate. InForty-first International Conference on Machine Learning, 2023

  10. [10]

    Exploration of llm multi- agent application implementation based on langgraph+ crewai.arXiv preprint arXiv:2411.18241, 2024

    Zhihua Duan and Jialin Wang. Exploration of llm multi- agent application implementation based on langgraph+ crewai.arXiv preprint arXiv:2411.18241, 2024

  11. [11]

    PhD thesis, University of Oxford, 2021

    Christopher J D’Urso.Nowhere to hide: investigating the use of unilateral alternatives to extradition in United States prosecutions of transnational cybercrime. PhD thesis, University of Oxford, 2021

  12. [12]

    Ragas: Automated evaluation of retrieval augmented generation

    Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. InProceedings of the 18th Con- ference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, 2024

  13. [13]

    From prompt injections to pro- tocol exploits: Threats in llm-powered ai agents work- flows.ICT Express, 2025

    Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, and Merouane Debbah. From prompt injections to pro- tocol exploits: Threats in llm-powered ai agents work- flows.ICT Express, 2025

  14. [14]

    Multi-agent frame- work for threat mitigation and resilience in ai-based systems.arXiv preprint arXiv:2512.23132, 2025

    Armstrong Foundjem, Lionel Nganyewou Tidjon, Leu- son Da Silva, and Foutse Khomh. Multi-agent frame- work for threat mitigation and resilience in ai-based systems.arXiv preprint arXiv:2512.23132, 2025

  15. [15]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

  16. [16]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xian- gliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024

  17. [17]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654, 2020

  18. [18]

    Red-teaming llm multi-agent systems via communication attacks

    Pengfei He, Yuping Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. Red-teaming llm multi-agent systems via communication attacks. InFindings of the Association for Computational Linguistics: ACL 2025, pages 6726– 6747, 2025

  19. [19]

    Sentinelagent: Graph-based anomaly detection in multi-agent systems

    Xu He, Di Wu, Yan Zhai, and Kun Sun. Sentinelagent: Graph-based anomaly detection in multi-agent systems. arXiv preprint arXiv:2505.24201, 2025

  20. [20]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  21. [21]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

  22. [22]

    Metagpt: Meta programming for a multi-agent collabora- tive framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collabora- tive framework. InThe twelfth international conference on learning representations, 2023

  23. [23]

    Understanding the planning of LLM agents: A survey

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey.arXiv preprint arXiv:2402.02716, 2024

  24. [24]

    An overview on multi-agent consensus under adversarial attacks.An- nual Reviews in Control, 53:252–272, 2022

    Hideaki Ishii, Yuan Wang, and Shuai Feng. An overview on multi-agent consensus under adversarial attacks.An- nual Reviews in Control, 53:252–272, 2022

  25. [25]

    A multi-vocal review of security orchestration.ACM Computing Surveys (CSUR), 52(2):1–45, 2019

    Chadni Islam, Muhammad Ali Babar, and Surya Nepal. A multi-vocal review of security orchestration.ACM Computing Surveys (CSUR), 52(2):1–45, 2019

  26. [26]

    Towards mitigating llm halluci- nation via self reflection

    Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm halluci- nation via self reflection. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, 2023

  27. [27]

    A survey on large language models for code generation.ACM Transactions on Software Engi- neering and Methodology, 35(2):1–72, January 2026

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.ACM Transactions on Software Engi- neering and Methodology, 35(2):1–72, January 2026

  28. [28]

    A survey of llm-driven ai agent communication: Protocols, security risks, and defense countermeasures.arXiv preprint arXiv:2506.19676, 2025

    Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Hujin Peng, Xiang Chen, Zeyang Sha, et al. A survey of llm-driven ai agent communication: Protocols, security risks, and defense countermeasures.arXiv preprint arXiv:2506.19676, 2025

  29. [29]

    Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, 15 et al. Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  30. [30]

    Camel: Communica- tive agents for" mind" exploration of large language model society.Advances in Neural Information Process- ing Systems, 36:51991–52008, 2023

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communica- tive agents for" mind" exploration of large language model society.Advances in Neural Information Process- ing Systems, 36:51991–52008, 2023

  31. [31]

    A survey on llm-based multi-agent systems: workflow, in- frastructure, and challenges.Vicinagearth, 1(1):9, 2024

    Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, in- frastructure, and challenges.Vicinagearth, 1(1):9, 2024

  32. [32]

    Attack and defense techniques in large language models: A survey and new perspectives.Neu- ral Networks, page 108388, 2025

    Zhiyu Liao, Kang Chen, Yuanguo Lin, Kangkang Li, Yunxuan Liu, Hefeng Chen, Xingwang Huang, and Yuanhui Yu. Attack and defense techniques in large language models: A survey and new perspectives.Neu- ral Networks, page 108388, 2025

  33. [33]

    The Dark Side of LLMs: Agent-based Attack Vectors for System-level Compromise

    Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, and Angelo Furfaro. The dark side of llms: Agent-based at- tacks for complete computer takeover.arXiv preprint arXiv:2507.06850, 2025

  34. [34]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

  35. [35]

    Why do multiagent systems fail? In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025

    Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, et al. Why do multiagent systems fail? In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025

  36. [36]

    Epidemic processes in complex networks.Reviews of modern physics, 87(3):925–979, 2015

    Romualdo Pastor-Satorras, Claudio Castellano, Piet Van Mieghem, and Alessandro Vespignani. Epidemic processes in complex networks.Reviews of modern physics, 87(3):925–979, 2015

  37. [37]

    A review on agent-to-agent pro- tocol: Concept, state-of-the-art, challenges and future directions.Authorea Preprints, 2025

    Partha Pratim Ray. A review on agent-to-agent pro- tocol: Concept, state-of-the-art, challenges and future directions.Authorea Preprints, 2025

  38. [38]

    Ai agents vs

    Ranjan Sapkota, Konstantinos I Roumeliotis, and Manoj Karkee. Ai agents vs. agentic ai: A conceptual tax- onomy, applications and challenges.arXiv preprint arXiv:2505.10468, 2025

  39. [39]

    Audit-llm: Multi- agent collaboration for log-based insider threat detection

    Chengyu Song, Linru Ma, Jianming Zheng, Jinzhi Liao, Hongyu Kuang, and Lin Yang. Audit-llm: Multi- agent collaboration for log-based insider threat detection. arXiv preprint arXiv:2408.08902, 2024

  40. [40]

    Towards detecting llms hallucination via markov chain-based multi-agent debate framework

    Xiaoxi Sun, Jinpeng Li, Yan Zhong, Dongyan Zhao, and Rui Yan. Towards detecting llms hallucination via markov chain-based multi-agent debate framework. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  41. [41]

    Talebirad, A

    Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents.arXiv preprint arXiv:2306.03314, 2023

  42. [42]

    Creating large language model applications utilizing langchain: A primer on developing llm apps fast

    Oguzhan Topsakal and Tahir Cetin Akinci. Creating large language model applications utilizing langchain: A primer on developing llm apps fast. InInternational conference on applied engineering and natural sciences, volume 1, pages 1050–1056, 2023

  43. [43]

    Multi-agent systems execute arbitrary malicious code

    Harold Triedman, Rishi Jha, and Vitaly Shmatikov. Multi-agent systems execute arbitrary malicious code. arXiv preprint arXiv:2503.12188, 2025

  44. [44]

    The spread of true and false news online.science, 359(6380):1146– 1151, 2018

    Soroush V osoughi, Deb Roy, and Sinan Aral. The spread of true and false news online.science, 359(6380):1146– 1151, 2018

  45. [45]

    Agent ai with lang- graph: A modular framework for enhancing machine translation using large language models.arXiv preprint arXiv:2412.03801, 2024

    Jialin Wang and Zhihua Duan. Agent ai with lang- graph: A modular framework for enhancing machine translation using large language models.arXiv preprint arXiv:2412.03801, 2024

  46. [46]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

  47. [47]

    Security of internet of agents: Attacks and counter- measures.IEEE Open Journal of the Computer Society, 2025

    Yuntao Wang, Yanghe Pan, Shaolong Guo, and Zhou Su. Security of internet of agents: Attacks and counter- measures.IEEE Open Journal of the Computer Society, 2025

  48. [48]

    Large model based agents: State-of-the-art, cooperation paradigms, security and privacy, and future trends.IEEE Communications Surveys & Tutorials, 2025

    Yuntao Wang, Yanghe Pan, Zhou Su, Yi Deng, Quan Zhao, Linkang Du, Tom H Luan, Jiawen Kang, and Dusit Niyato. Large model based agents: State-of-the-art, cooperation paradigms, security and privacy, and future trends.IEEE Communications Surveys & Tutorials, 2025

  49. [49]

    A simple model of global cascades on random networks.Proceedings of the National Academy of Sciences, 99(9):5766–5771, 2002

    Duncan J Watts. A simple model of global cascades on random networks.Proceedings of the National Academy of Sciences, 99(9):5766–5771, 2002

  50. [50]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst Conference on Language Modeling, 2024. 16

  51. [51]

    The rise and potential of large language model based agents: A survey, 2023

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shi- han Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang,...

  52. [52]

    Who’s the mole? modeling and detecting intention-hiding mali- cious agents in llm-based multi-agent systems.arXiv preprint arXiv:2507.04724, 2025

    Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minghao Wang, and Chi Liu. Who’s the mole? modeling and detecting intention-hiding mali- cious agents in llm-based multi-agent systems.arXiv preprint arXiv:2507.04724, 2025

  53. [53]

    Minimizing hallucinations and communication costs: Adversarial debate and voting mechanisms in llm-based multi-agents.Applied Sciences, 15(7):3676, 2025

    Yi Yang, Yitong Ma, Hao Feng, Yiming Cheng, and Zhu Han. Minimizing hallucinations and communication costs: Adversarial debate and voting mechanisms in llm-based multi-agents.Applied Sciences, 15(7):3676, 2025

  54. [54]

    Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak attacks and defenses against large language models: A survey.arXiv preprint arXiv:2407.04295, 2024

  55. [55]

    Blockchain for network service or- chestration: Trust and adoption in multi-domain envi- ronments.IEEE Communications Standards Magazine, 7(2):16–22, 2023

    Engin Zeydan, Jorge Baranda, Josep Mangues-Bafalluy, and Yekta Turk. Blockchain for network service or- chestration: Trust and adoption in multi-domain envi- ronments.IEEE Communications Standards Magazine, 7(2):16–22, 2023

  56. [56]

    Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems.arXiv preprint arXiv:2505.00212, 2025

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems.arXiv preprint arXiv:2505.00212, 2025

  57. [57]

    infection

    Tommaso Zoppi, Andrea Ceccarelli, and Andrea Bon- davalli. Exploring anomaly detection in systems of systems. InProceedings of the Symposium on Applied Computing, pages 1139–1146, 2017. A Model Fitting and Topology Configuration Details This appendix specifies the configuration and fitting proto- col omitted from §2.3. In this calibration experiment, we o...