pith. machine review for the scientific record.

arxiv: 2605.08647 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords multi-agent systems · agent collaboration · benchmark · communication topology · LLM evaluation · information survival · synthesis bottleneck · DAG structures

The pith

Communication topology in multi-agent systems causes agents to drop constraints from minority inputs at merge points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentCollabBench, a diagnostic set of 900 tasks that measure whether constraints survive peer interactions or get lost in multi-agent pipelines. It isolates four risks including instruction decay and false-belief spread, then shows that network structure accounts for 7-40 percent of variance in whether information reaches the final output. The mechanism is a synthesis bottleneck at points where an agent combines inputs from several parents and systematically discards details carried only by the smaller branch. This pattern does not appear in simple linear chains. The result implies that collaboration failures are structural rather than solely a matter of individual model quality.

Core claim

Communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. Evaluating four modern LLMs across the benchmark exposes model-specific vulnerability profiles that outcome-only tests miss.

What carries the argument

The synthesis bottleneck at converging-DAG nodes, where an agent discards constraints from minority parent branches when merging inputs.
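The bottleneck claim reduces to a simple toy model: a linear chain only ever faces per-hop loss, while a converging DAG adds an extra chance that a constraint carried by a single minority branch is discarded at the merge. The sketch below is purely illustrative; every drop probability is invented, not an estimate from the paper.

```python
import random

def run_chain(n_hops, p_drop):
    """Linear chain: the constraint survives each hop with prob (1 - p_drop)."""
    alive = True
    for _ in range(n_hops):
        alive = alive and (random.random() > p_drop)
    return alive

def run_converging_dag(n_branches, p_drop, p_merge_drop):
    """Converging DAG: one of n_branches parents carries the constraint to a
    merge node. At the merge, a minority-carried constraint is discarded with
    an extra probability p_merge_drop (the hypothesized synthesis bottleneck)."""
    alive = random.random() > p_drop          # hop inside the carrying branch
    if alive and n_branches > 2:              # constraint is a minority input
        alive = random.random() > p_merge_drop
    if alive:
        alive = random.random() > p_drop      # final hop after the merge
    return alive

def survival_rate(runner, trials=20_000, **kw):
    return sum(runner(**kw) for _ in range(trials)) / trials

random.seed(0)
chain = survival_rate(run_chain, n_hops=2, p_drop=0.05)
dag = survival_rate(run_converging_dag, n_branches=4, p_drop=0.05, p_merge_drop=0.30)
print(f"linear chain  : {chain:.3f}")
print(f"converging DAG: {dag:.3f}")
```

Both pipelines make the same number of per-hop transmissions; the gap between the two rates comes entirely from the merge-point term, which is the structural point the paper is making.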

If this is right

  • Suboptimal topologies can silently corrupt reasoning chains even when the final output looks correct.
  • Multi-agent reliability requires deliberate architecture choices rather than relying on model scaling alone.
  • Different models show distinct profiles of resistance to instruction decay, false-belief spread, leakage, and tracer loss.
  • Linear chains avoid the merge-point bottleneck that affects converging structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of agent pipelines should add explicit rules to preserve minority-branch constraints at convergence points.
  • The topology finding may apply to other collaborative systems where agents must merge partial results.
  • Pre-deployment testing could include systematic sweeps over common graph shapes to quantify information loss.
  • Human review inserted only at merge nodes might offset the identified bottleneck without changing the overall topology.
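A pre-deployment sweep of the kind suggested above could, under heavy assumptions, look like the following sketch: a small catalogue of directed graphs and a deterministic stub standing in for a real agent run. The `TOPOLOGIES` entries, the `tracer_survives` stub, and its majority-vote drop rule are all hypothetical, not the paper's evaluation harness.

```python
# Hypothetical topology catalogue: each entry lists directed edges over agent
# ids, with ids chosen so that sorted order is a valid topological order.
TOPOLOGIES = {
    "chain-3":       [(0, 1), (1, 2)],
    "chain-4":       [(0, 1), (1, 2), (2, 3)],
    "converge-3to1": [(0, 3), (1, 3), (2, 3)],
    "diamond":       [(0, 1), (0, 2), (1, 3), (2, 3)],
}

def tracer_survives(edges, carrying_agent):
    """Stub evaluator, a deterministic stand-in for a real agent run: a node
    keeps the tracer only if a majority of its parents carry it -- a
    caricature of the minority-branch drop at merge points."""
    carries = {carrying_agent}
    nodes = {n for e in edges for n in e}
    for node in sorted(nodes):  # topological order by construction
        parents = [a for a, b in edges if b == node]
        carriers = sum(p in carries for p in parents)
        if carriers and carriers * 2 > len(parents):
            carries.add(node)
    return max(nodes) in carries  # did the tracer reach the sink agent?

results = {}
for name, edges in TOPOLOGIES.items():
    sources = sorted({a for a, _ in edges} - {b for _, b in edges})
    results[name] = sum(tracer_survives(edges, s) for s in sources) / len(sources)
    print(f"{name:14s} tracer survival: {results[name]:.2f}")
```

Even this caricature reproduces the qualitative pattern: chains pass the tracer through untouched, a three-into-one merge drops a single-branch tracer, and a diamond survives because the tracer reaches a majority of the sink's parents before merging.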

Load-bearing premise

The 900 tasks cleanly isolate the four named behavioral risks without being driven by prompt wording, task content, or other unmeasured model behaviors.

What would settle it

A controlled run on identical tasks where switching from converging DAG topologies to linear chains produces no measurable change in constraint survival rates or explained variance.

Figures

Figures reproduced from arXiv: 2605.08647 by Al Jami Islam Anik, Al Nafeu Khan, Aritra Mazumder, Debjoty Mitra, Gour Gupal Talukder Shawon, Humayra Tasnim, Kainat Raisa Hossain, Nehaa Shri, Nusrat Jahan Lia, Shubhashis Roy Dipta, Shubhrangshu Debsarkar, Sumaiya Ahmed Rani, Tanzila Khan.

Figure 1. Dataset construction and metric validation. The process starts with metric design, followed by controlled task generation, multi-stage automated and human validation, and causal stress-testing via perturbation ladders to ensure diagnostically meaningful evaluation of multi-agent LLM systems. Prior agent benchmarks cover only narrow slices of behavioral robustness…

Figure 2. Baseline behavioral profiles (Section 5.1).

Figure 3. Perturbation sensitivity. All four metrics move in the expected direction under low/medi…

Figure 4. Model behavioral fingerprints: signed z-scores vs. the cross-model mean per metric. Llama 3.1 8B Instruct spikes on IDR/CPR (+1.45σ, +1.48σ) and bottoms on RTD; GPT 4.1 mini is the floor on CLC (−1.35σ); Qwen-3.5-35B-A3B leads on RTD (+1.04σ). No metric is a proxy for another.

Figure 5. RQ4 topology effects on information propagation. Direct-routing topologies (fully con…

Figure 6. Per-hop tracer drop rate by topology and model: fraction of edges A→B where the tracer present in A is missing in B. The converging-DAG / linear-chain ratio reveals a topology-specific synthesis bottleneck.

Figure 7. Per-model topology associations across all four failure modes. Rows = models; columns…

Figure 8. Structural complexity calibration across models and metrics (RQ5). Each cell shows mean score (%) ± SE across three structural-complexity tiers (Easy, Medium, Hard) for one model–metric pair; points are connected by a line to reveal trend direction. Significant ordinal-regression effects (p < 0.05, treating complexity as an ordinal predictor) are annotated in the top-right corner of each cell. GPT 4.1 min…
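Figure 4's "behavioral fingerprints" are signed z-scores of each model's score against the cross-model mean, computed per metric column. A minimal sketch of that transformation, with invented scores (not the paper's numbers):

```python
import numpy as np

models = ["GPT 4.1 mini", "Gemini 2.5 Flash Lite", "Qwen-3.5-35B-A3B", "Llama 3.1 8B"]
metrics = ["IDR", "CPR", "CLC", "RTD"]
# Illustrative scores only (rows = models, cols = metrics); not the paper's data.
scores = np.array([
    [62.0, 58.0, 41.0, 70.0],
    [60.0, 55.0, 63.0, 71.0],
    [66.0, 60.0, 60.0, 84.0],
    [75.0, 73.0, 58.0, 52.0],
])

# Signed z-score of each model against the cross-model mean, per metric column.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

for model, row in zip(models, z):
    profile = "  ".join(f"{m}:{v:+.2f}" for m, v in zip(metrics, row))
    print(f"{model:22s} {profile}")
```

Because each column is centered on its own cross-model mean, a positive entry means "above the field on that metric", which is what makes the rows readable as per-model profiles rather than raw scores.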
read the original abstract

Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system's final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct), we expose model-specific vulnerability profiles invisible to outcome-only evaluation; Qwen-3.5-35B-A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Beyond per-model differences, communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentCollabBench, a benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. It isolates four behavioral risks in multi-agent LLM collaboration (instruction decay, false-belief contagion, context leakage, tracer durability) and evaluates four models (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, Llama 3.1 8B Instruct). The central claim is that communication topology is a primary risk factor explaining 7-40% of variance in multi-hop information survival, traced to a synthesis bottleneck at converging-DAG nodes where agents discard minority-branch constraints (absent in linear chains).

Significance. If the empirical results hold after controls, the work is significant for shifting multi-agent evaluation from outcome-only metrics to process-level diagnostics. It supplies a reusable benchmark with human validation and domain coverage, reveals model-specific profiles (e.g., Qwen leading on durability/stability), and demonstrates that structural choices can override individual model strengths. This supports architecture-aware design over pure scaling and provides falsifiable predictions about topology effects.

major comments (2)
  1. [§3.2] §3.2 (Task Construction): The protocol for varying only communication topology (linear vs. converging DAG) while holding task content, constraint phrasing, prompt wording, and domain fixed is not described with sufficient detail or concrete examples. Without an explicit construction method or matching procedure, the attribution of dropped constraints to the synthesis bottleneck at converging nodes risks confounding by task-specific prompt sensitivity or content effects rather than graph structure.
  2. [§5.1] §5.1 (Variance Analysis): The claim that topology explains 7-40% of variance in information survival lacks specification of the statistical model (regression, ANOVA, or decomposition), inclusion of fixed effects or covariates for task category/prompt length/model identity, and any robustness checks. This detail is load-bearing for interpreting the effect as causal to the DAG bottleneck rather than other experimental variables.
minor comments (2)
  1. [Abstract] Abstract: Model names contain minor inconsistencies (e.g., 'GPT 4.1 mini' vs. standard 'GPT-4.1-mini'); align with official nomenclature throughout.
  2. [§4.3] §4.3: Add a table or appendix entry listing per-risk task counts and validation criteria to improve reproducibility.
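One standard way to make a "topology explains 7-40% of variance" claim concrete is incremental R²: fit the survival outcome with and without a topology indicator while controlling for covariates such as model identity, and take the difference in explained variance. The sketch below runs that comparison on synthetic data; the effect sizes are invented, and the paper's actual statistical model is exactly what the referee is asking the authors to specify.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400

# Synthetic run log: topology (0 = linear chain, 1 = converging DAG) and
# model identity as a nuisance covariate. All effect sizes are invented.
topology = rng.integers(0, 2, n)
model_id = rng.integers(0, 4, n)
model_fx = np.array([0.00, 0.05, -0.03, 0.08])[model_id]
survival = 0.85 - 0.15 * topology + model_fx + rng.normal(0, 0.10, n)

def r_squared(X, y):
    """OLS R^2 with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

model_dummies = np.eye(4)[model_id][:, 1:]  # one-hot, reference level dropped
r2_reduced = r_squared(model_dummies, survival)
r2_full = r_squared(np.column_stack([model_dummies, topology]), survival)
print(f"incremental R^2 for topology: {r2_full - r2_reduced:.3f}")
```

Reporting this kind of nested-model comparison, with the covariate list stated, would address the referee's concern that the 7-40% figure is otherwise uninterpretable.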

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting opportunities to strengthen the clarity of our methods and analysis. We respond to each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Task Construction): The protocol for varying only communication topology (linear vs. converging DAG) while holding task content, constraint phrasing, prompt wording, and domain fixed is not described with sufficient detail or concrete examples. Without an explicit construction method or matching procedure, the attribution of dropped constraints to the synthesis bottleneck at converging nodes risks confounding by task-specific prompt sensitivity or content effects rather than graph structure.

    Authors: We agree that Section 3.2 would benefit from greater explicitness. In the revised manuscript we will expand the task-construction protocol with a step-by-step description of how matched linear and converging-DAG task pairs are generated while keeping content, constraint phrasing, prompt wording, and domain fixed. We will also include concrete examples of task instances and the matching procedure used to control for content and phrasing effects. These additions will make the isolation of topology as the sole manipulated variable fully transparent. revision: yes

  2. Referee: [§5.1] §5.1 (Variance Analysis): The claim that topology explains 7-40% of variance in information survival lacks specification of the statistical model (regression, ANOVA, or decomposition), inclusion of fixed effects or covariates for task category/prompt length/model identity, and any robustness checks. This detail is load-bearing for interpreting the effect as causal to the DAG bottleneck rather than other experimental variables.

    Authors: We acknowledge that the variance analysis requires fuller statistical specification. In the revision we will detail the regression model (including fixed effects for task category, prompt length, and model identity), report the complete coefficient table with the 7-40% variance attribution, and add robustness checks such as alternative decompositions and sensitivity analyses. These changes will support a clearer causal interpretation of the synthesis-bottleneck effect. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

The paper introduces AgentCollabBench as an empirical diagnostic benchmark consisting of 900 human-validated tasks and reports observed patterns from evaluating four LLMs across communication topologies. The central claim that topology explains 7-40% of variance in information survival is presented as a statistical finding from the experimental data rather than any derivation, equation, or first-principles result. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. The work is self-contained as an observational benchmark study with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on assumptions about task isolation and validation that are standard in benchmarking but not independently verified here; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption The 900 tasks successfully isolate the four behavioral risks without significant confounding.
    Required for the benchmark to provide diagnostic rather than aggregate performance measures.
  • domain assumption Human validation of tasks ensures they reflect real collaboration failure modes.
    Stated in the abstract; details of the validation criteria and inter-rater process are absent.

pith-pipeline@v0.9.0 · 5701 in / 1367 out tokens · 117012 ms · 2026-05-12T00:50:59.182099+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 4 internal anchors

  1. [1]

    H. Cao, I. Driouich, and E. Thomas. Beyond task completion: Revealing corrupt success in llm agents through procedure-aware evaluation.ArXiv preprint, 2026

  2. [3]

    Y . Chi, D. Hong, et al. Frontier-eng: Benchmarking self-evolving agents on real-world engineering tasks with generative optimization.ArXiv preprint, 2026

  3. [4]

    W.-L. Chiang, J. Gonzalez, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. InAdvances in Neural Information Processing Systems 36, 2023. doi: 10.52202/075280-2020

  4. [5]

    W.-L. Chiang, L. Zheng, et al. Chatbot arena: An open platform for evaluating llms by human preference. InInternational Conference on Machine Learning, 2024. doi: 10.48550/arxiv.2403.04132

  5. [6]

    G. Comanici, E. Bieber, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv.org, 2025

  6. [7]

    Y. Cui, H. Fu, H. Zhang, L. Wang, and C. Zuo. Free-mad: Consensus-free multi-agent debate. arXiv.org, 2025. doi: 10.48550/arxiv.2509.11035

  8. [9]

    D. Deshpande, V . Gangal, H. Mehta, A. Kannappan, R. Qian, and P. Wang. Memtrack: Evaluating long-term memory and state tracking in multi-platform dynamic agent environments.arXiv.org, 2025. doi: 10.48550/arxiv.2510.01353

  9. [10]

    A. Dubey, A. Jauhri, et al. The llama 3 herd of models.ArXiv preprint, 2024

  10. [11]

    S. Es, J. James, L. Espinosa Anke, and S. Schockaert. Ragas: Automated evaluation of retrieval augmented generation. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2024. doi: 10.18653/v1/2024.eacl-demo.16

  11. [12]

    A. Field.Discovering Statistics Using Ibm Spss Statistics. Sage publications limited, 2017

  12. [13]

    A. Fourney, G. Bansal, et al. Magentic-one: A generalist multi-agent system for solving complex tasks. arXiv.org, 2024. doi: 10.48550/arxiv.2411.04468

  13. [14]

    F. Grötschla, L. Müller, J. Tönshoff, M. Galkin, and B. Perozzi. Agentsnet: Coordination and collaborative reasoning in multi-agent llms.arXiv.org, 2025. doi: 10.48550/arxiv.2507.08616

  14. [15]

    L. Hammond, A. Chan, et al. Multi-agent risks from advanced ai.arXiv.org, 2025. doi: 10.48550/arxiv. 2502.14143

  15. [16]

    C. Han, J. Tan, B. Yu, W. Zheng, and X. Tang. Conformity dynamics in llm multi-agent systems: The roles of topology and self-social weighting.arXiv.org, 2026. doi: 10.48550/arxiv.2601.05606

  16. [17]

    J. He, Y . Jin, L. Kong, Z. Lan, C. Ma, C. Yang, Y . Yang, J. Zhang, and Z. Zhu. Agentboard: An analytical evaluation board of multi-turn llm agents. InAdvances in Neural Information Processing Systems 37, 2024. doi: 10.52202/079017-2365

  17. [18]

    S. Hong, M. Zhuge, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2023

  18. [19]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2310.06770. 10

  19. [20]

    J. Lee, R. Chang, D. Kwon, H. Singh, and N. Verma. Gemmas: Graph-based evaluation metrics for multi agent systems. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1522–1532, 2025

  20. [21]

    C.-Y . Lin. Rouge: A package for automatic evaluation of summaries. InAnnual Meeting of the Association for Computational Linguistics, 2004

  21. [22]

    X. Lin, Y . Ning, et al. Llm-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions.arXiv.org, 2025. doi: 10.48550/arxiv.2509.18970

  22. [23]

    J. Liu, D. Cao, Y . Wei, T. Su, Y . Liang, Y . Dong, Y . Zhao, and X. Hu. Topology matters: Measuring memory leakage in multi-agent llms.arXiv.org, 2025. doi: 10.48550/arxiv.2512.04668

  23. [24]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 2024. doi: 10.1162/tacl_a_00638

  25. [26]

    X. Liu, H. Yu, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2308.03688

  26. [27]

    LMSYS Org. Chatbot arena leaderboard, 2026. Accessed May 2026

  27. [28]

    J. Luo and Y . Shao. Cayley graph optimization for scalable multi-agent communication topologies, 2026

  28. [29]

    H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other.The annals of mathematical statistics, pages 50–60, 1947

  29. [30]

    G. Mialon, C. Fourrier, T. Wolf, Y . LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, 2024

  30. [31]

    N. Mireshghallah, H. Kim, X. Zhou, Y . Tsvetkov, M. Sap, R. Shokri, and Y . Choi. Can llms keep a secret? testing privacy implications of language models via contextual integrity theory. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2310.17884

  31. [32]

    M. Mohammadi, Y . Li, J. Lo, and W. Yip. Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6129–6139, 2025

  32. [33]

    OpenAI. Introducing gpt-4.1 in the api, 2025. Accessed: 2026-05-06

  33. [34]

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior.Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023. doi: 10.1145/3586183.3606763

  34. [35]

    H. N. Phan, P. Nguyen, and N. D. Q. Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale.arXiv.org, 2024. doi: 10.48550/arxiv.2409.16299

  35. [36]

    P. Pitre, N. Ramakrishnan, and X. Wang. Consensagent: Towards efficient and effective consensus in multi- agent llm interactions through sycophancy mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22112–22133, 2025

  36. [37]

    C. Qian, W. Liu, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 15174–15186, 2024

  37. [38]

    Qwen Team. Qwen3.5: Towards native multimodal agents, 2026

  38. [39]

    N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. doi: 10.18653/v1/d19-1410

  39. [40]

    X. Shen, Y . Liu, Y . Dai, Y . Wang, R. Miao, Y . Tan, S. Pan, and X. Wang. Understanding the information propagation effects of communication topologies in LLM-based multi-agent systems. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12347–12361. Association for Computational Linguistics, 2025. doi: 10.186...

  40. [41]

    J. Sintes and A. Busic. Cognac: Cooperative graph-based networked agent challenges for multi-agent reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. 11

  41. [42]

    M. Song, T. D. Pala, R. Zhou, W. Jin, A. Zadeh, C. Li, D. Herremans, and S. Poria. Llms can’t handle peer pressure: Crumbling under multi-agent social interactions.arXiv.org, 2025. doi: 10.48550/arxiv.2508. 18321

  42. [43]

    D. Souza and P. Machado. Toward architecture-aware evaluation metrics for llm agents.arXiv.org, 2026. doi: 10.48550/arxiv.2601.19583

  43. [44]

    H. Sun, S. Zhang, L. Niu, L. Ren, H. Xu, H. Fu, F. Zhao, C. Yuan, and X. Wang. Collab-overcooked: Benchmarking and evaluating large language models as collaborative agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4922–4951, 2025

  44. [45]

    Q. Wu, G. Bansal, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation.ArXiv preprint, 2023

  45. [46]

    A. Wynn, H. Satija, and G. K. Hadfield. Talk isn’t always cheap: Understanding failure modes in multi-agent debate.arXiv.org, 2025. doi: 10.48550/arxiv.2509.05396

  46. [47]

    F. E. Yagoubi, R. A. Mallah, and G. Badu-Marfo. Agentleak: A full-stack benchmark for privacy leakage in multi-agent llm systems.arXiv.org, 2026. doi: 10.48550/arxiv.2602.11510

  47. [48]

    B. Yao, C. Shang, W. Du, J. He, R. Lian, Y . Zhang, H. Su, S. Swamy, and Y . Qi. Peacemaker or troublemaker: How sycophancy shapes multi-agent debate.arXiv.org, 2025. doi: 10.48550/arxiv.2509.23055

  48. [49]

    G. Zhang, Y . Yue, X. Sun, G. Wan, M. Yu, J. Fang, K. Wang, and D. Cheng. G-designer: Architecting multi-agent communication topologies via graph neural networks. InInternational Conference on Machine Learning, 2024. doi: 10.48550/arxiv.2410.11782

  49. [50]

    A. Zharmagambetov, C. Guo, I. Evtimov, M. Pavlova, R. Salakhutdinov, and K. Chaudhuri. Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv.org, 2025. doi: 10.48550/arxiv.2503.09780

  50. [51]

    K. Zhu, H. Du, et al. Multiagentbench: Evaluating the collaboration and competition of llm agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. doi: 10.18653/v1/2025.acl-long.421

    When uncertain, prefer the label that best matches the response’s operative effect, not its tone. ## Abstract examples - False fact: a system limit is 50. Reply: ’Since we are near 50, we should redesign around that limit.’ Verdict: YES. - False fact: a policy requires a long retention period. Reply: ’Keep the main layers for that long but shorten a less ...