VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

Bing Liu; Ee-Peng Lim; Guansong Pang; Hanghang Tong; Hezhe Qiao

arxiv: 2605.17467 · v1 · pith:UTWVZ5QQnew · submitted 2026-05-17 · 💻 cs.CL

Pith reviewed 2026-05-20 12:53 UTC · model grok-4.3

classification 💻 cs.CL

keywords: failure attribution, multi-agent systems, LLM, hypothesis verification, error taxonomy, trajectory analysis, agent localization

The pith

VerifyMAS attributes failures in LLM multi-agent systems by verifying hypotheses against full trajectories rather than predicting errors directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model multi-agent systems often fail in ways that only become visible across complete sequences of interactions. Prior methods predict faulty agents and error types directly from local logs, which overlooks patterns spanning multiple steps and creates an unmanageable number of agent-error combinations to examine. VerifyMAS instead generates candidate failure explanations drawn from a structured taxonomy and checks each one against the entire trajectory. This error-first strategy splits the task into validating the overall error at the trajectory level and then locating the responsible agents, which narrows the possibilities while surfacing coordination and consistency issues. Experiments on two benchmarks demonstrate gains for both open-source and API-based models without sacrificing inference efficiency.

Core claim

The paper claims that formulating failure hypotheses grounded in a structured error taxonomy and verifying them against full interaction trajectories yields an effective error-first attribution method. This approach captures global failure patterns such as cross-step inconsistencies and inter-agent coordination errors, decomposes the problem into trajectory-level validation followed by agent localization, and reduces the combinatorial search space compared with direct prediction of agent-error pairs.

What carries the argument

The hypothesis verification framework that generates candidates from a structured error taxonomy and checks them against complete trajectories using a fine-tuned LLM verifier.
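
The error-first flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Hypothesis` class, the `attribute_failures` function, the taxonomy labels, and the rule-based `toy_verify` and `toy_localize` stand-ins are all hypothetical. In the paper, a fine-tuned LLM verifier performs both the trajectory-level check and the agent attribution.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a trajectory is an ordered list of (agent, message) steps.
Trajectory = List[Tuple[str, str]]


@dataclass(frozen=True)
class Hypothesis:
    error_type: str  # taxonomy label (hypothetical names used below)
    statement: str   # natural-language claim checked against the trajectory


# A verifier returns "entail", "neutral", or "contradict".
Verifier = Callable[[Trajectory, Hypothesis], str]
# A localizer returns the agents blamed for an entailed error.
Localizer = Callable[[Trajectory, Hypothesis], List[str]]


def attribute_failures(
    trajectory: Trajectory,
    hypotheses: List[Hypothesis],
    verify: Verifier,
    localize: Localizer,
) -> List[Tuple[str, str]]:
    """Error-first attribution: validate each error, then localize agents."""
    pairs = []
    for hyp in hypotheses:
        # Stage 1: trajectory-level validation of the error hypothesis.
        if verify(trajectory, hyp) != "entail":
            continue
        # Stage 2: agent localization, only for entailed hypotheses.
        for agent in localize(trajectory, hyp):
            pairs.append((agent, hyp.error_type))
    return pairs


def _self_contradicting_agents(trajectory: Trajectory) -> List[str]:
    """Toy rule: an agent that reports two different 'answer=' values."""
    last_answer, blamed = {}, []
    for agent, msg in trajectory:
        if msg.startswith("answer="):
            if agent in last_answer and last_answer[agent] != msg:
                if agent not in blamed:
                    blamed.append(agent)
            last_answer[agent] = msg
    return blamed


def toy_verify(trajectory: Trajectory, hyp: Hypothesis) -> str:
    # Stand-in for the LLM verifier; only one error type is implemented.
    if hyp.error_type == "cross_step_inconsistency":
        return "entail" if _self_contradicting_agents(trajectory) else "contradict"
    return "neutral"


def toy_localize(trajectory: Trajectory, hyp: Hypothesis) -> List[str]:
    return _self_contradicting_agents(trajectory)


trajectory = [
    ("planner", "plan: solve then check"),
    ("solver", "answer=42"),
    ("checker", "looks fine"),
    ("solver", "answer=41"),
]
hypotheses = [
    Hypothesis("cross_step_inconsistency", "An agent contradicts its own earlier output."),
    Hypothesis("coordination_error", "Agents fail to hand off required information."),
]
predictions = attribute_failures(trajectory, hypotheses, toy_verify, toy_localize)
print(predictions)  # [('solver', 'cross_step_inconsistency')]
```

Passing the verifier and localizer as callables keeps the two stages separable, so a rule-based stub can be swapped for a model call without touching the control flow.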

If this is right

Attribution now handles global issues that span multiple steps and agents rather than remaining limited to local logs.
The reduced search space supports finer-grained localization without exhaustive enumeration of all agent-error pairs.
Both open-source models such as Qwen and API-based models such as GPT show consistent gains on Aegis-Bench and Who&When.
Inference remains efficient even when trajectories are long, preserving practicality for deployed systems.
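
One way to read the search-space point is as a simple count. The sketch below assumes a counting model that is not taken from the paper: direct prediction scores every agent-error pair, while an error-first pass makes one check per error type and then one check per agent for each entailed error. The function names and the example sizes are hypothetical.

```python
def direct_pair_candidates(num_agents: int, num_error_types: int) -> int:
    """Candidates when every (agent, error type) pair is scored directly."""
    return num_agents * num_error_types


def error_first_checks(num_agents: int, num_error_types: int, num_entailed: int) -> int:
    """Checks under the assumed error-first model.

    One trajectory-level check per error type, plus one agent-level
    check per agent for each error type that was entailed.
    """
    if not 0 <= num_entailed <= num_error_types:
        raise ValueError("num_entailed must lie between 0 and num_error_types")
    return num_error_types + num_entailed * num_agents


# Hypothetical sizes chosen only for illustration.
agents, error_types, entailed = 5, 14, 2
direct = direct_pair_candidates(agents, error_types)
error_first = error_first_checks(agents, error_types, entailed)
print(direct, error_first)  # 70 24
```

Under this model the saving depends on few error types being entailed per trajectory; if every error type were entailed, the error-first count would exceed the direct count.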

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same verification strategy could apply to other multi-step AI systems that produce extended action sequences.
An online version of the verifier might enable real-time detection during ongoing multi-agent runs.
The taxonomy could be updated iteratively from failures missed in deployment to improve coverage over time.

Load-bearing premise

The structured error taxonomy used to construct training hypotheses is complete enough to represent the global failures that actually occur in real multi-agent trajectories.

What would settle it

The central claim would be falsified by a test set of new multi-agent trajectories, containing coordination or cross-step errors absent from the original taxonomy, on which the fine-tuned verifier shows no improvement over direct-prediction baselines.

Figures

Figures reproduced from arXiv: 2605.17467 by Bing Liu, Ee-Peng Lim, Guansong Pang, Hanghang Tong, Hezhe Qiao.

**Figure 2.** Overview of the proposed VerifyMAS. (a) Zero-shot inference. A trajectory is paired with hypotheses on predefined error types, and then an LLM predicts whether the trajectory is “entail”, “neutral”, or “contradict” w.r.t. each hypothesis describing the presence of an error type. The entailed hypotheses are further examined for faulty agent attribution, producing the final error–agent predictions. (b) Hypo…

**Figure 3.** Per-class Pair-F1 score. failure analysis by examining performance across individual error types and comparing with the DPR and CoT-Agent implemented based on Qwen2.5-7B-Instruct. Specifically, we report per-class Pair-F1 scores for each failure subtype and organize them into three groups (Global, Local, and Hybrid errors on Aegis-Bench [14]) to provide a fine-grained comparison of model performance across d…

**Figure 4.** Running time comparison. We further evaluate the efficiency of our method by comparing the average processing time per sample with several existing local models. For a fair comparison, all the competing methods are implemented based on Qwen2.5-7B-Instruct and evaluated on Who&When [33]. As shown in Fig. 4, VerifyMAS remains computationally efficient while delivering strong performance on the benchmark da…

**Figure 5.** A case study of a trajectory failure attribution, DPR and VerifyMAS are based on Qwen2.5-

**Figure 6.** The category of error types. response; and Hybrid errors require both local behavioral evidence and global trajectory-level context. This grouping enables a more structured analysis of whether different methods are better at detecting local agent-level mistakes or context-dependent failures that emerge across the whole multi-agent trajectory.

**Figure 7.** Fine-grained failure analysis grouped by the global, local, and hybrid errors on Aegis

**Figure 8.** Fine-grained failure analysis grouped by task execution, communication & Coordination
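
The per-class Pair-F1 metric named in the Figure 3 caption is not defined on this page. The sketch below assumes it is an F1 score over predicted versus gold (agent, error type) pairs, restricted to one error type and pooled across samples; that reading, the function names, and the toy labels are assumptions rather than the benchmark's official scorer.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (agent, error_type)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_class_pair_f1(gold: List[Set[Pair]], pred: List[Set[Pair]], error_type: str) -> float:
    """Pooled F1 over pairs whose error type equals `error_type`."""
    tp = fp = fn = 0
    for gold_pairs, pred_pairs in zip(gold, pred):
        g = {p for p in gold_pairs if p[1] == error_type}
        q = {p for p in pred_pairs if p[1] == error_type}
        tp += len(g & q)   # pair matches on both agent and error type
        fp += len(q - g)   # predicted pair absent from gold
        fn += len(g - q)   # gold pair that was missed
    return f1_from_counts(tp, fp, fn)


# Toy labels, invented for illustration only.
gold = [
    {("solver", "inconsistency")},
    {("planner", "inconsistency"), ("checker", "coordination")},
]
pred = [
    {("solver", "inconsistency")},
    {("checker", "inconsistency"), ("checker", "coordination")},
]
inconsistency_f1 = per_class_pair_f1(gold, pred, "inconsistency")
coordination_f1 = per_class_pair_f1(gold, pred, "coordination")
print(inconsistency_f1, coordination_f1)  # 0.5 1.0
```

In the second sample the error type is right but the agent is wrong, so the pair counts as both a false positive and a false negative; this is what makes a pair-level score stricter than an error-only score.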

Original abstract

Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VerifyMAS introduces a hypothesis-verification split for failure attribution that targets global patterns in LLM multi-agent trajectories but rests on an error taxonomy whose coverage of real failures needs closer checking.

The main thing here is that VerifyMAS moves failure attribution away from direct prediction of agent-error pairs and instead generates hypotheses from a structured taxonomy, then verifies them against the full trajectory. This splits the job into trajectory-level error validation and then agent localization, which is meant to catch cross-step inconsistencies and coordination problems that local-log methods miss while shrinking the search space for long interactions.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VerifyMAS, a hypothesis verification framework for failure attribution in LLM multi-agent systems. Instead of directly predicting faulty agent-error pairs, it formulates failure hypotheses grounded in a structured error taxonomy, verifies them against full interaction trajectories to capture global patterns such as cross-step inconsistencies and inter-agent coordination errors, and then performs fine-grained agent localization. A hypothesis-based data construction strategy is used to fine-tune a specialized LLM verifier. Experiments on Aegis-Bench and Who&When report consistent improvements over prior methods across diverse backbone models (including Qwen and GPT) without sacrificing inference efficiency for long trajectories.

Significance. If the empirical claims hold under rigorous scrutiny, the work offers a meaningful advance in diagnosing failures within complex LLM-driven multi-agent systems. The error-first, verification-based decomposition reduces the combinatorial search space while addressing trajectory-level issues that local log-based methods miss, which could improve the practical reliability and debuggability of multi-agent AI deployments. The structured taxonomy and fine-tuning approach provide a reusable template for similar attribution tasks.

major comments (2)

[Hypothesis-based data construction and taxonomy] The central generalization claim rests on the structured error taxonomy being representative of real global failures. The hypothesis-based data construction (described in the methods) assumes this taxonomy covers cross-step inconsistencies and inter-agent coordination errors sufficiently to avoid overfitting to synthetically generated patterns; however, no ablation or coverage analysis of omitted failure modes is provided to substantiate that the fine-tuned verifier will generalize to unseen long-horizon LLM-MAS trajectories.
[Experiments] Table or results section reporting benchmark performance: the claim of consistent outperformance on Aegis-Bench and Who&When lacks accompanying details on statistical significance testing, run-to-run variance, exact baseline re-implementations, and data-construction rules. These omissions make it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices.

minor comments (2)

[Method details] Clarify the exact prompting templates and verification scoring function used by the fine-tuned verifier to improve reproducibility.
[Figures] Figure captions for efficiency and accuracy plots should explicitly label the backbone models and include error bars where applicable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful and constructive review of our manuscript. The comments have helped us identify areas where we can improve the presentation and rigor of our work. Below, we provide detailed responses to each major comment and indicate the revisions we plan to make in the updated version of the paper.

Referee: [Hypothesis-based data construction and taxonomy] The central generalization claim rests on the structured error taxonomy being representative of real global failures. The hypothesis-based data construction (described in the methods) assumes this taxonomy covers cross-step inconsistencies and inter-agent coordination errors sufficiently to avoid overfitting to synthetically generated patterns; however, no ablation or coverage analysis of omitted failure modes is provided to substantiate that the fine-tuned verifier will generalize to unseen long-horizon LLM-MAS trajectories.

Authors: We appreciate the referee highlighting the importance of validating the taxonomy's coverage for generalization claims. The taxonomy was derived from a review of LLM-MAS failure literature combined with empirical analysis of trajectories from the evaluation benchmarks. We agree that explicit coverage analysis and ablations would strengthen the manuscript. In the revised version, we will add a dedicated analysis subsection reporting the fraction of observed test-set failures covered by each taxonomy category, along with an ablation that systematically removes categories (such as inter-agent coordination errors) and measures resulting drops in verifier accuracy on long-horizon trajectories. This will provide direct evidence against overfitting to synthetic patterns. revision: yes
Referee: [Experiments] Table or results section reporting benchmark performance: the claim of consistent outperformance on Aegis-Bench and Who&When lacks accompanying details on statistical significance testing, run-to-run variance, exact baseline re-implementations, and data-construction rules. These omissions make it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices.

Authors: We thank the referee for this observation on experimental transparency. To improve reproducibility and allow rigorous assessment of robustness, the revised manuscript will expand the Experiments section with: statistical significance results (paired t-tests or Wilcoxon signed-rank tests with p-values across runs), mean performance and standard deviations over five independent runs using different random seeds, precise descriptions of baseline re-implementations including any multi-agent adaptations and hyperparameter settings, and explicit step-by-step rules plus illustrative examples for the hypothesis-based data construction procedure. Updated tables and text will reflect these additions. revision: yes
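
The reporting promised in this response (mean and standard deviation over five seeds, plus a paired test) could be computed along these lines. This is a sketch using only the Python standard library; the scores are placeholders invented for illustration and are not results from the paper, and the helper names are hypothetical.

```python
import math
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed scores."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# PLACEHOLDER scores for five seeds; not taken from the paper.
method_scores = [0.60, 0.62, 0.61, 0.63, 0.59]
baseline_scores = [0.55, 0.58, 0.56, 0.57, 0.56]

method_mean, method_std = summarize(method_scores)
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
print(f"mean={method_mean:.3f} std={method_std:.3f} t={t_stat:.2f} dof={dof}")
```

Turning the t statistic into a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` return the statistic and p-value directly for the two tests named above.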

Circularity Check

0 steps flagged

No circularity: independent method with external benchmarks and introduced taxonomy

The paper introduces VerifyMAS as a hypothesis verification framework that formulates and verifies failure hypotheses against full trajectories, decomposes attribution into trajectory-level error validation and agent localization, and uses a hypothesis-based data construction strategy grounded in a separately introduced structured error taxonomy to fine-tune a verifier. It evaluates generalization on external benchmarks Aegis-Bench and Who&When using diverse backbone models. No load-bearing step reduces by construction to fitted inputs, self-definitions, or self-citation chains; the taxonomy is presented as an input for data construction rather than derived from the results, and performance claims rest on empirical comparisons rather than tautological renaming or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a fixed error taxonomy can generate training hypotheses that enable a fine-tuned LLM to verify global failures; no new physical entities or free parameters beyond standard fine-tuning are introduced.

axioms (1)

domain assumption: A structured error taxonomy can generate representative hypotheses for training a trajectory verifier that generalizes to unseen multi-agent interactions.
The paper explicitly grounds its data-construction strategy in a structured error taxonomy.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We propose VerifyMAS, a hypothesis verification framework... formulates and verifies failure hypotheses against full trajectories... grounded in a structured error taxonomy

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat ≃ Nat recovery · unclear

decomposes attribution into trajectory-level error validation and fine-grained agent localization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 10 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Laksh Advani. Trajectory guard–a lightweight, sequence-aware model for real-time anomaly detection in agentic AI. arXiv preprint arXiv:2601.00516, 2026.
[3] Adi Banerjee, Anirudh Nair, and Tarik Borogovac. Where did it all go wrong? A hierarchical look into multi-agent error attribution. arXiv preprint arXiv:2510.04886, 2025.
[4] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025.
[5] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[6] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024.
[7] Yu Ge, Linna Xie, Zhong Li, Yu Pei, and Tian Zhang. Who is introducing the failure? Automatically attributing failures of multi-agent systems via spectrum analysis. arXiv preprint arXiv:2509.13782, 2025.
[8] Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. Aegis2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguist...
[9] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems, 37:8093–8131, 2024.
[10] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[11] Yeonjun In, Mehrab Tanjim, Jayakumar Subramanian, Sungchul Kim, Uttaran Bhattacharya, Wonjoong Kim, Sangwu Park, Somdeb Sarkhel, and Chanyoung Park. Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation. arXiv preprint arXiv:2603.25001, 2026.
[12] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
[13] Jin Jia, Zhiling Deng, Zhuangbin Chen, Yingqi Wang, and Zibin Zheng. Mas-fire: Fault injection and reliability evaluation for LLM-based multi-agent systems. arXiv preprint arXiv:2602.19843,
[14] Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems. arXiv preprint arXiv:2509.14295, 2025.
[15] Yuta Koreeda and Christopher D Manning. Contractnli: A dataset for document-level natural language inference for contracts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1907–1919, 2021.
[16] Ohjoon Kwon, Donghyeon Jeon, Nayoung Choi, Gyu-Hwung Cho, Hwiyeol Jo, Changbong Kim, Hyunwoo Lee, Inho Kang, Sun Kim, and Taiwoo Park. Slm as guardian: Pioneering AI safety with small language model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1333–1350, 2024.
[17] Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. AgentDoG: A diagnostic guardrail framework for AI agent safety and security. arXiv preprint arXiv:2601.18491, 2026.
[18] Yang Liu, Hongjiang Feng, Junsong Pu, and Zhuangbin Chen. MASPrism: Lightweight failure attribution for multi-agent systems using prefill-stage signals. arXiv preprint arXiv:2605.07509,
[19] Junjun Pan, Yixin Liu, Rui Miao, Kaize Ding, Yu Zheng, Quoc Viet Hung Nguyen, Alan Wee-Chung Liew, and Shirui Pan. Explainable and fine-grained safeguarding of LLM multi-agent systems via bi-level graph anomaly detection. arXiv preprint arXiv:2512.18733, 2025.
[20] Divya Pathak, Harshit Kumar, Anuska Roy, Felix George, Mudit Verma, and Pratibha Moogi. Detecting silent failures in multi-agentic AI trajectories. arXiv preprint arXiv:2511.04032, 2025.
[21] Hezhe Qiao, Hanghang Tong, Bo An, Irwin King, Charu Aggarwal, and Guansong Pang. Deep graph anomaly detection: A survey and new perspectives. IEEE Transactions on Knowledge and Data Engineering, 2025.
[22] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297,
[23] Qwen Team. Qwen3 technical report, 2025.
[24] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.
[25] Megan Ung, Jing Xu, and Y-Lan Boureau. Saferdialogues: Taking feedback gracefully after conversational safety failures. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6462–6481, 2022.
[26] Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, and Yang Wang. G-safeguard: A topology-guided security lens and treatment on LLM-based multi-agent systems. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7261–7276, 2025.
[27] Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard LLM agents via knowledge-enabled reasoning. In International Conference on Machine Learning, pages 68316–68342. PMLR,
[28] An Yang et al. Qwen2.5 technical report. CoRR, abs/2412.15115, 2024.
[29] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772, 2024.
[30] Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the LLM agentic systems? CoRR, abs/2509.03312,
[31] Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, Tianlong Chen, and Dawei Cheng. G-designer: Architecting multi-agent communication topologies via graph neural networks. In International Conference on Machine Learning, pages 76678–76692. PMLR, 2025.
[32] Heng Zhang, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Yilei Yuan, and Jin Huang. Graphtracer: Graph-guided failure tracing in LLM agents for robust multi-turn deep search. arXiv preprint arXiv:2510.10581, 2025.
[33] Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 202...
[34] Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng, and Yu Tian. Rethinking the reliability of multi-agent system: A perspective from byzantine fault tolerance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35012–35020,
[35] Jialong Zhou, Lichao Wang, and Xiao Yang. Guardian: Safeguarding LLM multi-agent collaborations with temporal graph modeling. arXiv preprint arXiv:2505.19234, 2025.
[36] Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025.
[37] Mingchen Zhuge, Changsheng Zhao, Dylan R Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. In International Conference on Machine Learning, pages 80569–80611. PMLR, 2025.
[38] Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, et al. Latent collaboration in multi-agent systems. arXiv preprint arXiv:2511.20639, 2025.

A Case Study

To provide a qualitative illustration of our model’s diagnostic capabilities, we present a case study of trajectory failure a...


work page arXiv 2025

[8] [8]

Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguist...

work page 2025

[9] [9]

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024. 3

work page 2024

[10] [10]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026

Yeonjun In, Mehrab Tanjim, Jayakumar Subramanian, Sungchul Kim, Uttaran Bhattacharya, Wonjoong Kim, Sangwu Park, Somdeb Sarkhel, and Chanyoung Park. Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026. 1

work page arXiv 2026

[12] [12]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Mas-fire: Fault injection and reliability evaluation for llm-based multi-agent systems.arXiv preprint arXiv:2602.19843,

Jin Jia, Zhiling Deng, Zhuangbin Chen, Yingqi Wang, and Zibin Zheng. Mas-fire: Fault injection and reliability evaluation for llm-based multi-agent systems.arXiv preprint arXiv:2602.19843,

work page arXiv

[14] [14]

Aegis: Automated Error Generation and Attribution for Multi-Agent Systems

Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems.arXiv preprint arXiv:2509.14295, 2025. 2, 3, 6, 7, 8, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Contractnli: A dataset for document-level natural language inference for contracts

Yuta Koreeda and Christopher D Manning. Contractnli: A dataset for document-level natural language inference for contracts. InFindings of the Association for Computational Linguistics: EMNLP 2021, pages 1907–1919, 2021. 2

work page 2021

[16] [16]

Slm as guardian: Pioneering ai safety with small language model

Ohjoon Kwon, Donghyeon Jeon, Nayoung Choi, Gyu-Hwung Cho, Hwiyeol Jo, Changbong Kim, Hyunwoo Lee, Inho Kang, Sun Kim, and Taiwoo Park. Slm as guardian: Pioneering ai safety with small language model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1333–1350, 2024. 3

work page 2024

[17] [17]

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. Agentdog: A diagnostic guardrail framework for ai agent safety and security.arXiv preprint arXiv:2601.18491, 2026. 1, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

Yang Liu, Hongjiang Feng, Junsong Pu, and Zhuangbin Chen. Masprism: Lightweight failure attribution for multi-agent systems using prefill-stage signals.arXiv preprint arXiv:2605.07509,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Explainable and fine-grained safeguarding of llm multi-agent systems via bi-level graph anomaly detection.arXiv preprint arXiv:2512.18733, 2025

Junjun Pan, Yixin Liu, Rui Miao, Kaize Ding, Yu Zheng, Quoc Viet Hung Nguyen, Alan Wee-Chung Liew, and Shirui Pan. Explainable and fine-grained safeguarding of llm multi-agent systems via bi-level graph anomaly detection.arXiv preprint arXiv:2512.18733, 2025. 3

work page arXiv 2025

[20] [20]

Detecting silent failures in multi-agentic ai trajectories.arXiv preprint arXiv:2511.04032, 2025

Divya Pathak, Harshit Kumar, Anuska Roy, Felix George, Mudit Verma, and Pratibha Moogi. Detecting silent failures in multi-agentic ai trajectories.arXiv preprint arXiv:2511.04032, 2025. 1

work page arXiv 2025

[21] [21]

Deep graph anomaly detection: A survey and new perspectives.IEEE Transactions on Knowledge and Data Engineering, 2025

Hezhe Qiao, Hanghang Tong, Bo An, Irwin King, Charu Aggarwal, and Guansong Pang. Deep graph anomaly detection: A survey and new perspectives.IEEE Transactions on Knowledge and Data Engineering, 2025. 3

work page 2025

[22] [22]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297,

work page

[23] [23]

Qwen3 technical report, 2025

Qwen Team. Qwen3 technical report, 2025. 7

work page 2025

[24] [24]

Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting.Advances in Neural Information Processing Systems, 36:74952–74965, 2023

Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting.Advances in Neural Information Processing Systems, 36:74952–74965, 2023. 3

work page 2023

[25] [25]

Saferdialogues: Taking feedback gracefully after conversational safety failures

Megan Ung, Jing Xu, and Y-Lan Boureau. Saferdialogues: Taking feedback gracefully after conversational safety failures. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6462–6481, 2022. 1

work page 2022

[26] [26]

G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems

Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, and Yang Wang. G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7261–7276, 2025. 3

work page 2025

[27] [27]

Guardagent: Safeguard llm agents via knowledge-enabled reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents via knowledge-enabled reasoning. InInternational Conference on Machine Learning, pages 68316–68342. PMLR,

work page

[28] [28]

Qwen2.5 Technical Report

An Yang et al. Qwen2.5 technical report.CoRR, abs/2412.15115, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772, 2024. 3

work page internal anchor Pith review arXiv 2024

[30] [30]

arXiv preprint arXiv:2509.03312 , year=

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the LLM agentic systems?CoRR, abs/2509.03312,

work page arXiv

[31] [31]

G-designer: Architecting multi-agent communication 11 topologies via graph neural networks

Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, Tianlong Chen, and Dawei Cheng. G-designer: Architecting multi-agent communication 11 topologies via graph neural networks. InInternational Conference on Machine Learning, pages 76678–76692. PMLR, 2025. 1

work page 2025

[32] [32]

Graphtracer: Graph-guided failure tracing in llm agents for robust multi-turn deep search.arXiv preprint arXiv:2510.10581, 2025

Heng Zhang, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Yilei Yuan, and Jin Huang. Graphtracer: Graph-guided failure tracing in llm agents for robust multi-turn deep search.arXiv preprint arXiv:2510.10581, 2025. 1

work page arXiv 2025

[33] [33]

Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems

Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 202...

work page 2025

[34] [34]

Re- thinking the reliability of multi-agent system: A perspective from byzantine fault tolerance

Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng, and Yu Tian. Re- thinking the reliability of multi-agent system: A perspective from byzantine fault tolerance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35012–35020,

work page

[35] [35]

Guardian: Safeguarding llm multi-agent collabora- tions with temporal graph modeling.arXiv preprint arXiv:2505.19234, 2025

Jialong Zhou, Lichao Wang, and Xiao Yang. Guardian: Safeguarding llm multi-agent collabora- tions with temporal graph modeling.arXiv preprint arXiv:2505.19234, 2025. 3

work page arXiv 2025

[36] [36]

Where llm agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370, 2025

Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where llm agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370, 2025. 1, 2, 3

work page arXiv 2025

[37] [37]

Agent-as-a-judge: Evaluate agents with agents

Mingchen Zhuge, Changsheng Zhao, Dylan R Ashley, Wenyi Wang, Dmitrii Khizbullin, Yun- yang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. InInternational Conference on Machine Learning, pages 80569–80611. PMLR, 2025. 3

work page 2025

[38] [38]

arXiv preprint arXiv:2511.20639 , year=

Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hang- hang Tong, Yejin Choi, Jingrui He, et al. Latent collaboration in multi-agent systems.arXiv preprint arXiv:2511.20639, 2025. 1 12 A Case Study To provide a qualitative illustration of our model’s diagnostic capabilities, we present a case study of trajectory failure a...

work page arXiv 2025