pith. machine review for the scientific record.

arxiv: 2605.08647 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords multi-agent systems · agent collaboration · benchmark · communication topology · LLM evaluation · information survival · synthesis bottleneck · DAG structures

The pith

Communication topology in multi-agent systems causes agents to drop constraints from minority inputs at merge points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentCollabBench, a diagnostic set of 900 tasks that measure whether constraints survive peer interactions or get lost in multi-agent pipelines. It isolates four risks including instruction decay and false-belief spread, then shows that network structure accounts for 7-40 percent of variance in whether information reaches the final output. The mechanism is a synthesis bottleneck at points where an agent combines inputs from several parents and systematically discards details carried only by the smaller branch. This pattern does not appear in simple linear chains. The result implies that collaboration failures are structural rather than solely a matter of individual model quality.

Core claim

Communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. Evaluating four modern LLMs across the benchmark exposes model-specific vulnerability profiles that outcome-only tests miss.

What carries the argument

The synthesis bottleneck at converging-DAG nodes, where an agent discards constraints from minority parent branches when merging inputs.
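The bottleneck claim reduces to a simple toy model: a linear chain only ever faces per-hop loss, while a converging DAG adds an extra chance that a constraint carried by a single minority branch is discarded at the merge. The sketch below is purely illustrative; every drop probability is invented, not an estimate from the paper.

```python
import random

def run_chain(n_hops, p_drop):
    """Linear chain: the constraint survives each hop with prob (1 - p_drop)."""
    alive = True
    for _ in range(n_hops):
        alive = alive and (random.random() > p_drop)
    return alive

def run_converging_dag(n_branches, p_drop, p_merge_drop):
    """Converging DAG: one of n_branches parents carries the constraint to a
    merge node. At the merge, a minority-carried constraint is discarded with
    an extra probability p_merge_drop (the hypothesized synthesis bottleneck)."""
    alive = random.random() > p_drop          # hop inside the carrying branch
    if alive and n_branches > 2:              # constraint is a minority input
        alive = random.random() > p_merge_drop
    if alive:
        alive = random.random() > p_drop      # final hop after the merge
    return alive

def survival_rate(runner, trials=20_000, **kw):
    return sum(runner(**kw) for _ in range(trials)) / trials

random.seed(0)
chain = survival_rate(run_chain, n_hops=2, p_drop=0.05)
dag = survival_rate(run_converging_dag, n_branches=4, p_drop=0.05, p_merge_drop=0.30)
print(f"linear chain  : {chain:.3f}")
print(f"converging DAG: {dag:.3f}")
```

Both pipelines make the same number of per-hop transmissions; the gap between the two rates comes entirely from the merge-point term, which is the structural point the paper is making.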

If this is right

  • Suboptimal topologies can silently corrupt reasoning chains even when the final output looks correct.
  • Multi-agent reliability requires deliberate architecture choices rather than relying on model scaling alone.
  • Different models show distinct profiles of resistance to instruction decay, false-belief spread, leakage, and tracer loss.
  • Linear chains avoid the merge-point bottleneck that affects converging structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of agent pipelines should add explicit rules to preserve minority-branch constraints at convergence points.
  • The topology finding may apply to other collaborative systems where agents must merge partial results.
  • Pre-deployment testing could include systematic sweeps over common graph shapes to quantify information loss.
  • Human review inserted only at merge nodes might offset the identified bottleneck without changing the overall topology.
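A pre-deployment sweep of the kind suggested above could, under heavy assumptions, look like the following sketch: a small catalogue of directed graphs and a deterministic stub standing in for a real agent run. The `TOPOLOGIES` entries, the `tracer_survives` stub, and its majority-vote drop rule are all hypothetical, not the paper's evaluation harness.

```python
# Hypothetical topology catalogue: each entry lists directed edges over agent
# ids, with ids chosen so that sorted order is a valid topological order.
TOPOLOGIES = {
    "chain-3":       [(0, 1), (1, 2)],
    "chain-4":       [(0, 1), (1, 2), (2, 3)],
    "converge-3to1": [(0, 3), (1, 3), (2, 3)],
    "diamond":       [(0, 1), (0, 2), (1, 3), (2, 3)],
}

def tracer_survives(edges, carrying_agent):
    """Stub evaluator, a deterministic stand-in for a real agent run: a node
    keeps the tracer only if a majority of its parents carry it -- a
    caricature of the minority-branch drop at merge points."""
    carries = {carrying_agent}
    nodes = {n for e in edges for n in e}
    for node in sorted(nodes):  # topological order by construction
        parents = [a for a, b in edges if b == node]
        carriers = sum(p in carries for p in parents)
        if carriers and carriers * 2 > len(parents):
            carries.add(node)
    return max(nodes) in carries  # did the tracer reach the sink agent?

results = {}
for name, edges in TOPOLOGIES.items():
    sources = sorted({a for a, _ in edges} - {b for _, b in edges})
    results[name] = sum(tracer_survives(edges, s) for s in sources) / len(sources)
    print(f"{name:14s} tracer survival: {results[name]:.2f}")
```

Even this caricature reproduces the qualitative pattern: chains pass the tracer through untouched, a three-into-one merge drops a single-branch tracer, and a diamond survives because the tracer reaches a majority of the sink's parents before merging.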

Load-bearing premise

The 900 tasks cleanly isolate the four named behavioral risks without being driven by prompt wording, task content, or other unmeasured model behaviors.

What would settle it

A controlled run on identical tasks where switching from converging DAG topologies to linear chains produces no measurable change in constraint survival rates or explained variance.

Figures

Figures reproduced from arXiv: 2605.08647 by Al Jami Islam Anik, Al Nafeu Khan, Aritra Mazumder, Debjoty Mitra, Gour Gupal Talukder Shawon, Humayra Tasnim, Kainat Raisa Hossain, Nehaa Shri, Nusrat Jahan Lia, Shubhashis Roy Dipta, Shubhrangshu Debsarkar, Sumaiya Ahmed Rani, Tanzila Khan.

Figure 1. Dataset construction and metric validation. The process starts with metric design, followed by controlled task generation, multi-stage automated and human validation, and causal stress-testing via perturbation ladders to ensure diagnostically meaningful evaluation of multi-agent LLM systems. Prior agent benchmarks cover only narrow slices of behavioral robustness…

Figure 2. Baseline behavioral profiles (Section 5.1).

Figure 3. Perturbation sensitivity. All four metrics move in the expected direction under low/medi…

Figure 4. Model behavioral fingerprints: signed z-scores vs. the cross-model mean per metric. Llama 3.1 8B Instruct spikes on IDR/CPR (+1.45σ, +1.48σ) and bottoms on RTD; GPT 4.1 mini is the floor on CLC (−1.35σ); Qwen-3.5-35B-A3B leads on RTD (+1.04σ). No metric is a proxy for another.

Figure 5. RQ4 topology effects on information propagation. Direct-routing topologies (fully con…

Figure 6. Per-hop tracer drop rate by topology and model: fraction of edges A→B where the tracer present in A is missing in B. The converging-DAG / linear-chain ratio reveals a topology-specific synthesis bottleneck.

Figure 7. Per-model topology associations across all four failure modes. Rows = models; columns…

Figure 8. Structural complexity calibration across models and metrics (RQ5). Each cell shows mean score (%) ± SE across three structural-complexity tiers (Easy, Medium, Hard) for one model–metric pair; points are connected by a line to reveal trend direction. Significant ordinal-regression effects (p < 0.05, treating complexity as an ordinal predictor) are annotated in the top-right corner of each cell. GPT 4.1 min…
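Figure 4's "behavioral fingerprints" are signed z-scores of each model's score against the cross-model mean, computed per metric column. A minimal sketch of that transformation, with invented scores (not the paper's numbers):

```python
import numpy as np

models = ["GPT 4.1 mini", "Gemini 2.5 Flash Lite", "Qwen-3.5-35B-A3B", "Llama 3.1 8B"]
metrics = ["IDR", "CPR", "CLC", "RTD"]
# Illustrative scores only (rows = models, cols = metrics); not the paper's data.
scores = np.array([
    [62.0, 58.0, 41.0, 70.0],
    [60.0, 55.0, 63.0, 71.0],
    [66.0, 60.0, 60.0, 84.0],
    [75.0, 73.0, 58.0, 52.0],
])

# Signed z-score of each model against the cross-model mean, per metric column.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

for model, row in zip(models, z):
    profile = "  ".join(f"{m}:{v:+.2f}" for m, v in zip(metrics, row))
    print(f"{model:22s} {profile}")
```

Because each column is centered on its own cross-model mean, a positive entry means "above the field on that metric", which is what makes the rows readable as per-model profiles rather than raw scores.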
read the original abstract

Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system's final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct), we expose model-specific vulnerability profiles invisible to outcome-only evaluation; Qwen-3.5-35B-A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Beyond per-model differences, communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentCollabBench, a benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. It isolates four behavioral risks in multi-agent LLM collaboration (instruction decay, false-belief contagion, context leakage, tracer durability) and evaluates four models (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, Llama 3.1 8B Instruct). The central claim is that communication topology is a primary risk factor explaining 7-40% of variance in multi-hop information survival, traced to a synthesis bottleneck at converging-DAG nodes where agents discard minority-branch constraints (absent in linear chains).

Significance. If the empirical results hold after controls, the work is significant for shifting multi-agent evaluation from outcome-only metrics to process-level diagnostics. It supplies a reusable benchmark with human validation and domain coverage, reveals model-specific profiles (e.g., Qwen leading on durability/stability), and demonstrates that structural choices can override individual model strengths. This supports architecture-aware design over pure scaling and provides falsifiable predictions about topology effects.

major comments (2)
  1. [§3.2] §3.2 (Task Construction): The protocol for varying only communication topology (linear vs. converging DAG) while holding task content, constraint phrasing, prompt wording, and domain fixed is not described with sufficient detail or concrete examples. Without an explicit construction method or matching procedure, the attribution of dropped constraints to the synthesis bottleneck at converging nodes risks confounding by task-specific prompt sensitivity or content effects rather than graph structure.
  2. [§5.1] §5.1 (Variance Analysis): The claim that topology explains 7-40% of variance in information survival lacks specification of the statistical model (regression, ANOVA, or decomposition), inclusion of fixed effects or covariates for task category/prompt length/model identity, and any robustness checks. This detail is load-bearing for interpreting the effect as causal to the DAG bottleneck rather than other experimental variables.
minor comments (2)
  1. [Abstract] Abstract: Model names contain minor inconsistencies (e.g., 'GPT 4.1 mini' vs. standard 'GPT-4.1-mini'); align with official nomenclature throughout.
  2. [§4.3] §4.3: Add a table or appendix entry listing per-risk task counts and validation criteria to improve reproducibility.
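One standard way to make a "topology explains 7-40% of variance" claim concrete is incremental R²: fit the survival outcome with and without a topology indicator while controlling for covariates such as model identity, and take the difference in explained variance. The sketch below runs that comparison on synthetic data; the effect sizes are invented, and the paper's actual statistical model is exactly what the referee is asking the authors to specify.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400

# Synthetic run log: topology (0 = linear chain, 1 = converging DAG) and
# model identity as a nuisance covariate. All effect sizes are invented.
topology = rng.integers(0, 2, n)
model_id = rng.integers(0, 4, n)
model_fx = np.array([0.00, 0.05, -0.03, 0.08])[model_id]
survival = 0.85 - 0.15 * topology + model_fx + rng.normal(0, 0.10, n)

def r_squared(X, y):
    """OLS R^2 with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

model_dummies = np.eye(4)[model_id][:, 1:]  # one-hot, reference level dropped
r2_reduced = r_squared(model_dummies, survival)
r2_full = r_squared(np.column_stack([model_dummies, topology]), survival)
print(f"incremental R^2 for topology: {r2_full - r2_reduced:.3f}")
```

Reporting this kind of nested-model comparison, with the covariate list stated, would address the referee's concern that the 7-40% figure is otherwise uninterpretable.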

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting opportunities to strengthen the clarity of our methods and analysis. We respond to each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Task Construction): The protocol for varying only communication topology (linear vs. converging DAG) while holding task content, constraint phrasing, prompt wording, and domain fixed is not described with sufficient detail or concrete examples. Without an explicit construction method or matching procedure, the attribution of dropped constraints to the synthesis bottleneck at converging nodes risks confounding by task-specific prompt sensitivity or content effects rather than graph structure.

    Authors: We agree that Section 3.2 would benefit from greater explicitness. In the revised manuscript we will expand the task-construction protocol with a step-by-step description of how matched linear and converging-DAG task pairs are generated while keeping content, constraint phrasing, prompt wording, and domain fixed. We will also include concrete examples of task instances and the matching procedure used to control for content and phrasing effects. These additions will make the isolation of topology as the sole manipulated variable fully transparent. revision: yes

  2. Referee: [§5.1] §5.1 (Variance Analysis): The claim that topology explains 7-40% of variance in information survival lacks specification of the statistical model (regression, ANOVA, or decomposition), inclusion of fixed effects or covariates for task category/prompt length/model identity, and any robustness checks. This detail is load-bearing for interpreting the effect as causal to the DAG bottleneck rather than other experimental variables.

    Authors: We acknowledge that the variance analysis requires fuller statistical specification. In the revision we will detail the regression model (including fixed effects for task category, prompt length, and model identity), report the complete coefficient table with the 7-40% variance attribution, and add robustness checks such as alternative decompositions and sensitivity analyses. These changes will support a clearer causal interpretation of the synthesis-bottleneck effect. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

The paper introduces AgentCollabBench as an empirical diagnostic benchmark consisting of 900 human-validated tasks and reports observed patterns from evaluating four LLMs across communication topologies. The central claim that topology explains 7-40% of variance in information survival is presented as a statistical finding from the experimental data rather than any derivation, equation, or first-principles result. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. The work is self-contained as an observational benchmark study with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on assumptions about task isolation and validation that are standard in benchmarking but not independently verified here; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption The 900 tasks successfully isolate the four behavioral risks without significant confounding.
    Required for the benchmark to provide diagnostic rather than aggregate performance measures.
  • domain assumption Human validation of tasks ensures they reflect real collaboration failure modes.
    Stated in the abstract; details of the validation criteria and inter-rater process are absent.

pith-pipeline@v0.9.0 · 5701 in / 1367 out tokens · 117012 ms · 2026-05-12T00:50:59.182099+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 4 internal anchors

  1. [1]

    H. Cao, I. Driouich, and E. Thomas. Beyond task completion: Revealing corrupt success in llm agents through procedure-aware evaluation.ArXiv preprint, 2026

  2. [3]

    Y . Chi, D. Hong, et al. Frontier-eng: Benchmarking self-evolving agents on real-world engineering tasks with generative optimization.ArXiv preprint, 2026

  3. [4]

    W.-L. Chiang, J. Gonzalez, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. InAdvances in Neural Information Processing Systems 36, 2023. doi: 10.52202/075280-2020

  4. [5]

    W.-L. Chiang, L. Zheng, et al. Chatbot arena: An open platform for evaluating llms by human preference. InInternational Conference on Machine Learning, 2024. doi: 10.48550/arxiv.2403.04132

  5. [6]

    G. Comanici, E. Bieber, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv.org, 2025

  6. [7]

    Y. Cui, H. Fu, H. Zhang, L. Wang, and C. Zuo. Free-mad: Consensus-free multi-agent debate. arXiv.org, 2025. doi: 10.48550/arxiv.2509.11035

  8. [9]

    D. Deshpande, V . Gangal, H. Mehta, A. Kannappan, R. Qian, and P. Wang. Memtrack: Evaluating long-term memory and state tracking in multi-platform dynamic agent environments.arXiv.org, 2025. doi: 10.48550/arxiv.2510.01353

  9. [10]

    A. Dubey, A. Jauhri, et al. The llama 3 herd of models.ArXiv preprint, 2024

  10. [11]

    S. Es, J. James, L. Espinosa Anke, and S. Schockaert. Ragas: Automated evaluation of retrieval augmented generation. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2024. doi: 10.18653/v1/2024.eacl-demo.16

  11. [12]

    A. Field.Discovering Statistics Using Ibm Spss Statistics. Sage publications limited, 2017

  12. [13]

    A. Fourney, G. Bansal, et al. Magentic-one: A generalist multi-agent system for solving complex tasks. arXiv.org, 2024. doi: 10.48550/arxiv.2411.04468

  13. [14]

    F. Grötschla, L. Müller, J. Tönshoff, M. Galkin, and B. Perozzi. Agentsnet: Coordination and collaborative reasoning in multi-agent llms.arXiv.org, 2025. doi: 10.48550/arxiv.2507.08616

  14. [15]

    L. Hammond, A. Chan, et al. Multi-agent risks from advanced ai.arXiv.org, 2025. doi: 10.48550/arxiv. 2502.14143

  15. [16]

    C. Han, J. Tan, B. Yu, W. Zheng, and X. Tang. Conformity dynamics in llm multi-agent systems: The roles of topology and self-social weighting.arXiv.org, 2026. doi: 10.48550/arxiv.2601.05606

  16. [17]

    J. He, Y . Jin, L. Kong, Z. Lan, C. Ma, C. Yang, Y . Yang, J. Zhang, and Z. Zhu. Agentboard: An analytical evaluation board of multi-turn llm agents. InAdvances in Neural Information Processing Systems 37, 2024. doi: 10.52202/079017-2365

  17. [18]

    S. Hong, M. Zhuge, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2023

  18. [19]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2310.06770. 10

  19. [20]

    J. Lee, R. Chang, D. Kwon, H. Singh, and N. Verma. Gemmas: Graph-based evaluation metrics for multi agent systems. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1522–1532, 2025

  20. [21]

    C.-Y . Lin. Rouge: A package for automatic evaluation of summaries. InAnnual Meeting of the Association for Computational Linguistics, 2004

  21. [22]

    X. Lin, Y . Ning, et al. Llm-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions.arXiv.org, 2025. doi: 10.48550/arxiv.2509.18970

  22. [23]

    J. Liu, D. Cao, Y . Wei, T. Su, Y . Liang, Y . Dong, Y . Zhao, and X. Hu. Topology matters: Measuring memory leakage in multi-agent llms.arXiv.org, 2025. doi: 10.48550/arxiv.2512.04668

  23. [24]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 2024. doi: 10.1162/tacl_a_00638

  25. [26]

    X. Liu, H. Yu, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2308.03688

  26. [27]

    LMSYS Org. Chatbot arena leaderboard, 2026. Accessed May 2026

  27. [28]

    J. Luo and Y . Shao. Cayley graph optimization for scalable multi-agent communication topologies, 2026

  28. [29]

    H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other.The annals of mathematical statistics, pages 50–60, 1947

  29. [30]

    G. Mialon, C. Fourrier, T. Wolf, Y . LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, 2024

  30. [31]

    N. Mireshghallah, H. Kim, X. Zhou, Y . Tsvetkov, M. Sap, R. Shokri, and Y . Choi. Can llms keep a secret? testing privacy implications of language models via contextual integrity theory. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2310.17884

  31. [32]

    M. Mohammadi, Y . Li, J. Lo, and W. Yip. Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6129–6139, 2025

  32. [33]

    OpenAI. Introducing gpt-4.1 in the api, 2025. Accessed: 2026-05-06

  33. [34]

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior.Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023. doi: 10.1145/3586183.3606763

  34. [35]

    H. N. Phan, P. Nguyen, and N. D. Q. Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale.arXiv.org, 2024. doi: 10.48550/arxiv.2409.16299

  35. [36]

    P. Pitre, N. Ramakrishnan, and X. Wang. Consensagent: Towards efficient and effective consensus in multi- agent llm interactions through sycophancy mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22112–22133, 2025

  36. [37]

    C. Qian, W. Liu, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 15174–15186, 2024

  37. [38]

    Qwen Team. Qwen3.5: Towards native multimodal agents, 2026

  38. [39]

    N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. doi: 10.18653/v1/d19-1410

  39. [40]

    X. Shen, Y . Liu, Y . Dai, Y . Wang, R. Miao, Y . Tan, S. Pan, and X. Wang. Understanding the information propagation effects of communication topologies in LLM-based multi-agent systems. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12347–12361. Association for Computational Linguistics, 2025. doi: 10.186...

  40. [41]

    J. Sintes and A. Busic. Cognac: Cooperative graph-based networked agent challenges for multi-agent reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. 11

  41. [42]

    M. Song, T. D. Pala, R. Zhou, W. Jin, A. Zadeh, C. Li, D. Herremans, and S. Poria. Llms can’t handle peer pressure: Crumbling under multi-agent social interactions.arXiv.org, 2025. doi: 10.48550/arxiv.2508. 18321

  42. [43]

    D. Souza and P. Machado. Toward architecture-aware evaluation metrics for llm agents.arXiv.org, 2026. doi: 10.48550/arxiv.2601.19583

  43. [44]

    H. Sun, S. Zhang, L. Niu, L. Ren, H. Xu, H. Fu, F. Zhao, C. Yuan, and X. Wang. Collab-overcooked: Benchmarking and evaluating large language models as collaborative agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4922–4951, 2025

  44. [45]

    Q. Wu, G. Bansal, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation.ArXiv preprint, 2023

  45. [46]

    A. Wynn, H. Satija, and G. K. Hadfield. Talk isn’t always cheap: Understanding failure modes in multi-agent debate.arXiv.org, 2025. doi: 10.48550/arxiv.2509.05396

  46. [47]

    F. E. Yagoubi, R. A. Mallah, and G. Badu-Marfo. Agentleak: A full-stack benchmark for privacy leakage in multi-agent llm systems.arXiv.org, 2026. doi: 10.48550/arxiv.2602.11510

  47. [48]

    B. Yao, C. Shang, W. Du, J. He, R. Lian, Y . Zhang, H. Su, S. Swamy, and Y . Qi. Peacemaker or troublemaker: How sycophancy shapes multi-agent debate.arXiv.org, 2025. doi: 10.48550/arxiv.2509.23055

  48. [49]

    G. Zhang, Y . Yue, X. Sun, G. Wan, M. Yu, J. Fang, K. Wang, and D. Cheng. G-designer: Architecting multi-agent communication topologies via graph neural networks. InInternational Conference on Machine Learning, 2024. doi: 10.48550/arxiv.2410.11782

  49. [50]

    A. Zharmagambetov, C. Guo, I. Evtimov, M. Pavlova, R. Salakhutdinov, and K. Chaudhuri. Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv.org, 2025. doi: 10.48550/arxiv.2503.09780

  50. [51]

    K. Zhu, H. Du, et al. Multiagentbench: Evaluating the collaboration and competition of llm agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. doi: 10.18653/v1/2025.acl-long.421

    When uncertain, prefer the label that best matches the response’s operative effect, not its tone. ## Abstract examples - False fact: a system limit is 50. Reply: ’Since we are near 50, we should redesign around that limit.’ Verdict: YES. - False fact: a policy requires a long retention period. Reply: ’Keep the main layers for that long but shorten a less ...