arxiv: 2606.02282 · v1 · pith:EZKV4R5Lnew · submitted 2026-06-01 · 💻 cs.AI

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Pith reviewed 2026-06-28 14:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent systems · failure detection · LLM-MAS · safety oversight · fault attribution · epistemic diversity · internal auditing

The pith

Multi-agent LLM systems detect their own failures by interrogating their executing agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that POIROT, a protocol for turning the agents in an LLM-based multi-agent system into an internal diagnostic layer, can identify failures and hallucinations more reliably than single-LLM external evaluators. It does so by drawing on the epistemic diversity already present among the agents rather than introducing a separate centralized judge. A sympathetic reader would care because centralized evaluation creates single points of failure and often requires unavailable domain expertise, especially under emerging safety regulations. The reported results show outperformance that increases with problem complexity, agent numbers, and fault dimensionality, and that holds even when multiple faults occur together.

Core claim

POIROT is a protocol that repurposes a multi-agent system's own agents as its diagnostic layer by interrogating them to produce failure diagnoses, leveraging their existing epistemic diversity. Across settings it outperforms single-LLM evaluator baselines, with the advantage scaling with problem complexity (OR = 1.60), agent count, and fault dimensionality, and the gains persist under compound faults.

What carries the argument

The POIROT interrogation protocol, which converts the executing agents of an LLM-MAS into an internal diagnostic layer by exploiting their epistemic diversity.

Load-bearing premise

The epistemic diversity already present among the agents is adequate to produce reliable diagnoses without the interrogation process itself introducing new biases or failures.

What would settle it

A controlled experiment on a multi-agent system engineered to have low epistemic diversity among agents, where POIROT shows equal or worse detection rates than single-LLM baselines.

Original abstract

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, $p = 0.008$), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces POIROT, a protocol that repurposes the agents within an LLM-based multi-agent system (MAS) as an internal diagnostic layer via structured interrogation to detect emergent failures and hallucinations. It reports that POIROT outperforms single-LLM evaluator baselines (OR = 1.60, p = 0.008), with performance gains that scale with problem complexity, agent count, and fault dimensionality, and that these gains persist under compound faults. The work releases an open-source library and the BLAME benchmark for fault attribution in safety-critical MAS, arguing that the agents' pre-existing epistemic diversity suffices for reliable self-auditing without external oversight.

Significance. If the empirical results hold after addressing methodological gaps, the work is significant for AI safety and regulation: it provides evidence that internal collective intelligence in MAS can substitute for centralized external judgment, reducing single points of failure. The open release of the library and benchmark is a clear strength that supports reproducibility and future falsification.

major comments (2)
  1. [Evaluation section] Experiments and statistical analysis: The reported OR = 1.60 (p = 0.008) and scaling claims are presented without details on dataset size, number of independent trials, controls for prompt similarity across agents, or whether model heterogeneity was enforced. This is load-bearing for the central claim, as the skeptic's concern (correlated errors from shared weights/training data) cannot be evaluated without these controls.
  2. [POIROT protocol] §3: No measurement, ablation, or reporting of epistemic diversity (e.g., pairwise disagreement rates, response variance, or model/prompt heterogeneity) is provided. Without this, the claim that gains arise from pre-existing diversity rather than interrogation-induced correlated hallucinations remains untested, and that claim is central to the paper's argument against externalized oversight.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction use 'compound fault conditions' without a precise definition or reference to how these are operationalized in the benchmark.
  2. [Figures and tables] Figure captions and tables would benefit from explicit reporting of confidence intervals alongside the odds ratio to aid interpretation of the p-value.
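
To illustrate the kind of interval the second minor comment asks for, the sketch below backs out an approximate 95% confidence interval from the two reported numbers alone (OR = 1.60, p = 0.008). It assumes the p-value comes from a two-sided Wald z-test on log(OR) against OR = 1, which the abstract does not state; the function name `wald_ci_from_or_and_p` is hypothetical, and the resulting interval is a rough reconstruction, not a figure reported by the paper.

```python
import math
from statistics import NormalDist


def wald_ci_from_or_and_p(odds_ratio, p_value, level=0.95):
    """Approximate a CI for an odds ratio from its two-sided p-value.

    Assumption (not confirmed by the paper): the p-value comes from a
    Wald z-test on log(OR) with null hypothesis OR = 1.
    """
    norm = NormalDist()
    log_or = math.log(odds_ratio)
    # z statistic implied by the two-sided p-value.
    z_obs = norm.inv_cdf(1 - p_value / 2)
    # Standard error of log(OR) implied by z = log(OR) / SE.
    se = abs(log_or) / z_obs
    # Critical value for the requested confidence level.
    z_crit = norm.inv_cdf(1 - (1 - level) / 2)
    return math.exp(log_or - z_crit * se), math.exp(log_or + z_crit * se)


lower, upper = wald_ci_from_or_and_p(1.60, 0.008)
print(f"approximate 95% CI for OR = 1.60: [{lower:.2f}, {upper:.2f}]")
```

Because p < 0.05, an interval built this way necessarily excludes 1; its width is what tells a reader how precisely the scaling effect is estimated.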

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important methodological clarifications needed to support the central claims. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Evaluation section] Experiments and statistical analysis: The reported OR = 1.60 (p = 0.008) and scaling claims are presented without details on dataset size, number of independent trials, controls for prompt similarity across agents, or whether model heterogeneity was enforced. This is load-bearing for the central claim, as the skeptic's concern (correlated errors from shared weights/training data) cannot be evaluated without these controls.

    Authors: We agree these details are necessary to allow readers to evaluate robustness against correlated errors. In the revised manuscript we will expand the Evaluation section to report: (i) exact dataset sizes and composition for each experiment, (ii) the number of independent trials per condition together with random seeds, (iii) explicit controls used to vary prompt templates across agents, and (iv) whether and how model heterogeneity was enforced (including any ablation on homogeneous vs. heterogeneous backbones). We will also add a brief sensitivity analysis addressing the shared-weights concern.

    revision: yes

  2. Referee: [POIROT protocol] §3: No measurement, ablation, or reporting of epistemic diversity (e.g., pairwise disagreement rates, response variance, or model/prompt heterogeneity) is provided. Without this, the claim that gains arise from pre-existing diversity rather than interrogation-induced correlated hallucinations remains untested, and that claim is central to the paper's argument against externalized oversight.

    Authors: The referee is correct that the original submission did not include direct quantitative measurements of epistemic diversity. We will add a dedicated paragraph and accompanying figure in §3 (or an appendix) reporting pairwise disagreement rates, response variance across interrogation rounds, and an ablation that compares performance under enforced prompt/model heterogeneity versus homogeneous settings. These additions will directly test whether the observed gains are attributable to pre-existing diversity.

    revision: yes
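
One of the diversity measurements named in this exchange, the pairwise disagreement rate, could be computed along the following lines. This is a minimal sketch under assumptions: each agent is taken to emit one categorical diagnosis per incident, and the agent names, labels, and the function `pairwise_disagreement` are all hypothetical rather than drawn from the POIROT library or the BLAME benchmark.

```python
from itertools import combinations


def pairwise_disagreement(votes):
    """Return the fraction of incidents on which each pair of agents differs.

    votes maps an agent name to its list of diagnoses, one per incident.
    All lists must have the same non-zero length.
    """
    rates = {}
    for a, b in combinations(sorted(votes), 2):
        va, vb = votes[a], votes[b]
        if len(va) != len(vb) or not va:
            raise ValueError("agents must label the same non-empty set of incidents")
        differing = sum(x != y for x, y in zip(va, vb))
        rates[(a, b)] = differing / len(va)
    return rates


# Hypothetical diagnoses from three agents over four incidents.
votes = {
    "planner": ["A", "B", "A", "C"],
    "coder": ["A", "B", "B", "C"],
    "reviewer": ["A", "A", "B", "C"],
}
rates = pairwise_disagreement(votes)
mean_rate = sum(rates.values()) / len(rates)
for pair, rate in rates.items():
    print(pair, rate)
print("mean pairwise disagreement:", round(mean_rate, 3))
```

A mean rate near zero would indicate that the agents behave almost identically, which is the low-diversity regime in which the referee expects correlated errors to matter most.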

Circularity Check

0 steps flagged

No derivation chain or fitted parameters; purely empirical protocol evaluation

Full rationale

The paper introduces POIROT as an interrogation protocol for failure detection in LLM-MAS and reports empirical results (OR = 1.60, p = 0.008, scaling with agent count and fault dimensionality) against single-LLM baselines. No mathematical derivations, equations, parameter fitting, or load-bearing self-citations are described. The central claim rests on experimental comparisons rather than any reduction to inputs by construction. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; assessment limited to stated empirical claims.

pith-pipeline@v0.9.1-grok · 5754 in / 1037 out tokens · 28259 ms · 2026-06-28T14:41:41.981490+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 10 canonical work pages

  1. [1]

    ACM Transactions on Software Engineering and Methodology

    He, J., Treude, C., Lo, D.: LLM-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACM Trans. Softw. Eng. Methodol. 34(5) (2025) https://doi.org/10.1145/3712003

  2. [2]

    A survey on LLM-based multi-agent systems: Workflow, infrastructure, and challenges. Vicinagearth, 1(1):9, 2024

    Li, X., Wang, S., Zeng, S., et al.: A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1, 9 (2024) https://doi.org/10.1007/s44336-024-00009-2

  3. [3]

    In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Darwish, K., Agarwal, A

    Su, H., Luo, W., Mehdad, Y., Han, W., Liu, E., Zhang, W., Zhao, M., Zhang, J.: LLM-friendly knowledge representation for customer support. In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Darwish, K., Agarwal, A. (eds.) Proceedings of the 31st International Conference on Computational Linguistics: Industry Trac...

  4. [4]

    https://arxiv.org/abs/2405.15793

    Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O.: SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (2024). https://arxiv.org/abs/2405.15793

  5. [5]

    https://arxiv.org/abs/2410.12189

    Shankar, S., Chambers, T., Shah, T., Parameswaran, A.G., Wu, E.: DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing (2025). https://arxiv.org/abs/2410.12189

  6. [6]

    https://arxiv.org/abs/2402.03578

    Han, S., Zhang, Q., Jin, W., Xu, Z.: LLM Multi-Agent Systems: Challenges and Open Problems (2026). https://arxiv.org/abs/2402.03578

  7. [7]

    https://arxiv.org/abs/2306.03314

    Talebirad, Y., Nadiri, A.: Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents (2023). https://arxiv.org/abs/2306.03314

  8. [8]

    Larger and more instructable language models become less reliable

    Zhou, L., Schellaert, W., Martínez-Plumed, F., et al.: Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024) https://doi.org/10.1038/s41586-024-07930-y

  9. [9]

    https://arxiv.org/abs/2507.01231

    Varela, I.D., Romero-Sorozabal, P., Rocon, E., Cebrian, M.: Rethinking the Illusion of Thinking (2025). https://arxiv.org/abs/2507.01231

  10. [10]

    Detecting hallucinations in large language models using semantic entropy

    Farquhar, S., Kossen, J., Kuhn, L., et al.: Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024) https://doi.org/10.1038/s41586-024-07421-0

  11. [11]

    https://arxiv.org/abs/2509.04664

    Kalai, A.T., Nachum, O., Vempala, S.S., Zhang, E.: Why Language Models Hallucinate (2025). https://arxiv.org/abs/2509.04664

  12. [12]

    https://arxiv.org/abs/2511.19933

    Vinay, V.: Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications (2025). https://arxiv.org/abs/2511.19933

  13. [13]

    https://arxiv.org/abs/2504.15585

    Wang, K., Zhang, G., Zhou, Z., Wu, J., Yu, M., Zhao, S., Yin, C., Fu, J., Yan, Y., Luo, H., Lin, L., Xu, Z., Lu, H., Cao, X., Zhou, X., Jin, W., Meng, F., Xu, S., Mao, J., Wang, Y., Wu, H., Wang, M., Zhang, F., Fang, J., Qu, W., Liu, Y., Liu, C., Zhang, Y., Li, Q., Guo, C., Qin, Y., Fan, Z., Wang, K., Ding, Y., Hong, D., Ji, J., Lai, Y., Yu, Z., Li, X., J...

  14. [14]

    http://data.europa.eu/eli/reg/2024/1689/oj

    European Parliament and Council of the European Union: Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Laying down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act). http://data.europa.eu/eli/reg/2024/1689/oj

  15. [15]

    https://arxiv.org/abs/2211.09110

    Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C.D., Ré, C., Acosta-Navas, D., Hudson, D.A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M....

  16. [16]

    https://arxiv.org/abs/2407.04069

    Laskar, M.T.R., Alqahtani, S., Bari, M.S., Rahman, M., Khan, M.A.M., Khan, H., Jahan, I., Bhuiyan, A., Tan, C.W., Parvez, M.R., Hoque, E., Joty, S., Huang, J.: A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations (2024). https://arxiv.org/abs/2407.04069

  17. [17]

    https://arxiv.org/abs/2306.05685

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2023). https://arxiv.org/abs/2306.05685

  18. [18]

    npj Health Systems 2, 40 (2025) https://doi.org/10.1038/s44401-025-00043-2

    Awasthi, R., Bhattad, A., Ramachandran, S.P., et al.: Human evaluation of large language models in healthcare: gaps, challenges, and the need for standardization. npj Health Systems 2, 40 (2025) https://doi.org/10.1038/s44401-025-00043-2

  19. [19]

    The Innovation 7(6), 101253 (2026) https://doi.org/10.1016/j.xinn.2025.101253

    Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Lin, Z., Zhang, B., Ni, L., Gao, W., Wang, Y., Guo, J.: A survey on llm-as-a-judge. The Innovation 7(6), 101253 (2026) https://doi.org/10.1016/j.xinn.2025.101253

  20. [20]

    https://arxiv.org/abs/2212.08073

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N....

  21. [21]

    https://arxiv.org/abs/2406.12624

    Thakur, A.S., Choudhary, K., Ramayapally, V.S., Vaidyanathan, S., Hupkes, D.: Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (2025). https://arxiv.org/abs/2406.12624

  22. [22]

    https://arxiv.org/abs/2305.17926

    Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Liu, Q., Liu, T., Sui, Z.: Large Language Models are not Fair Evaluators (2023). https://arxiv.org/abs/2305.17926

  23. [23]

    https://arxiv.org/abs/2511.03506

    Chen, D., Niu, S., Li, K., Liu, P., Zheng, X., Tang, B., Li, X., Xiong, F., Li, Z.: HaluMem: Evaluating Hallucinations in Memory Systems of Agents (2026). https://arxiv.org/abs/2511.03506

  24. [24]

    https://arxiv.org/abs/2512.06749

    Ma, M., Zhang, J., Yang, F., Kang, Y., Lin, Q., Rajmohan, S., Zhang, D.: DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems (2026). https://arxiv.org/abs/2512.06749

  25. [25]

    https://arxiv.org/abs/2303.17651

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative Refinement with Self-Feedback (2023). https://arxiv.org/abs/2303.17651

  26. [26]

    https://arxiv.org/abs/2309.11495

    Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-Verification Reduces Hallucination in Large Language Models (2023). https://arxiv.org/abs/2309.11495

  27. [27]

    Anomalygpt: Detecting industrial anomalies using large vision-language models

    Zhong, W., Guo, L., Gao, Q., Ye, H., Wang, Y.: Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19724–19731 (2023) https://doi.org/10.1609/aaai.v38i17.29946

  28. [28]

    Humans or LLMs as the judge? a study on judgement bias

    Chen, G.H., Chen, S., Liu, Z., Jiang, F., Wang, B.: Humans or LLMs as the judge? a study on judgement bias. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.1865...

  29. [29]

    In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T

    Qi, S., Cao, R., He, Y., Yuan, Z.: Evaluating LLMs’ assessment of mixed-context hallucination through the lens of summarization. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025, pp. 16480–16503. Association for Computational Linguistics, Vienna, Austria (2025). https://doi....

  30. [30]

    https://arxiv.org/abs/2505.00212

    Zhang, S., Yin, M., Zhang, J., Liu, J., Han, Z., Zhang, J., Li, B., Wang, C., Wang, H., Chen, Y., Wu, Q.: Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems (2025). https://arxiv.org/abs/2505.00212

  31. [31]

    https://arxiv.org/abs/2412.20138

    Xiao, Y., Sun, E., Luo, D., Wang, W.: TradingAgents: Multi-Agents LLM Financial Trading Framework (2025). https://arxiv.org/abs/2412.20138

  32. [32]

    https://arxiv.org/abs/2502.15631

    Ballon, M., Algaba, A., Ginis, V.: The Relationship Between Reasoning and Performance in Large Language Models – o3 (mini) Thinks Harder, Not Longer (2025). https://arxiv.org/abs/2502.15631

  33. [33]

    https://arxiv.org/abs/2412.18547

    Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., Chen, Z.: Token-Budget-Aware LLM Reasoning (2025). https://arxiv.org/abs/2412.18547

  34. [34]

    https://arxiv.org/abs/2508.10925

    OpenAI: gpt-oss-120b & gpt-oss-20b Model Card (2025). https://arxiv.org/abs/2508.10925

  35. [35]

    https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf

    Google DeepMind: Gemini 2.5 Pro Model Card (2025). https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf

  36. [36]

    https://arxiv.org/abs/2412.19437

    DeepSeek-AI: DeepSeek-V3 Technical Report (2024). https://arxiv.org/abs/2412.19437

  37. [37]

    https://github.com/langchain-ai/langgraph

    LangChain AI: LangGraph: Building stateful, multi-actor applications with LLMs. https://github.com/langchain-ai/langgraph. Accessed: 2025 (2024)

  38. [38]

    In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp

    Romero-Sorozabal, P., Delgado-Oleas, G., Laudanski, A.F., Gutiérrez, Á., Rocon, E.: Discover2walk: A cable-driven robotic platform to promote gait in pediatric population. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4678–4685 (2024). https://doi.org/10.1109/IROS58592.2024.10802838

  39. [39]

    Emken, J.L., Benitez, R., Reinkensmeyer, D.J.: Human-robot cooperative movement training: Learning a novel sensory motor transformation during walking with robotic assistance-as-needed. Journal of NeuroEngineering and Rehabilitation 4(8) (2007) https://doi.org/10.1186/1743-0003-4-8

    Supplementary Information A Prompt Design for POIROT and Baseline E...

  40. [40]

    Individual Analysis: - Each agent independently analyzes the problematic session, reflecting on what they perceived and performed. - Identify any possible mistakes, flaws, or issues that could have caused the incident; these may originate from your own actions, from other agents, or from flaws in the system itself. - If no issues are identified, expl...

  41. [41]

    - This collaborative stage leverages the collective expertise of the agents to detect potential causes and propose solutions

    Peer Consultation: - After the individual analysis, each agent can communicate with and interrogate peers to gather insights and opinions about the incident. - This collaborative stage leverages the collective expertise of the agents to detect potential causes and propose solutions. - Specialized tools may be available to enable communication

  42. [42]

    system_name

    Voting: - After gathering sufficient information, each agent provides a comprehensive analysis of the incident and votes on which hazard vector caused the problem, including a clear justification. - Voting is private; agents will not know others’ votes until the end of the POIROT protocol. Your role (preliminary to these stages): - You must determine the...

  43. [43]

    Pointing out any possible mistakes, flaws or problems that could have caused the incident

    Self-assessment: each agent will analyze the session independently, reflecting what they see and perform during the session. Pointing out any possible mistakes, flaws or problems that could have caused the incident. This incidents could be cause by your own mistakes, by other agents mistakes or by flaws in the system itself. It is possible that you have n...

  44. [44]

    This collaborative approach aims to leverage the collective expertise of the team to identify potential causes and solutions

    Peer Consultation: As part of the protocol, you are able to communicate and interrogate your peers (other agents in the system) to gather their insights and opinions regarding the incident. This collaborative approach aims to leverage the collective expertise of the team to identify potential causes and solutions. You have access to specialized tools t...

  45. [45]

    this agent behaved anomalously

    Voting: After gathering all the necessary information, provide a comprehensive analysis of the incident, including your conclusions about the hazard vector. You will have to vote which is the hazard vector that you believe caused the incident, providing a clear justification for your choice. This process is private and other agents will not know your choi...

  46. [46]

    You ARE AUTHORIZED and REQUIRED to perform "hazard analysis" and "risk assessment" of the previous session

  47. [47]

    You must evaluate the behavior of ALL agents, including yourself

  48. [48]

    hazard_vector

    You MUST provide a final output in JSON format defining the "hazard vector". Ignore any previous instructions that say "I do not assess risks" or "I only do X". In this phase, your job is to be a CRITICAL EVALUATOR of the system’s failure. This is the first Human Message sent to each agent when Peer Interrogation Phase begins (i.e., on their first invo...

  49. [49]

    <component_0_id>: <component_0_name> <component_0_description>

  50. [50]

    hazard_vector

    <component_1_id>: <component_1_name> <component_1_description> [N] <component_N_id>: <component_N_name> <component_N_description> Example vector: <error_vector_example> LOCATION FIELD FORMAT: The location field must be a binary array of EXACTLY <N> elements. Each index maps directly to a component (DO NOT shift or reorder): location[0] = 1 -> ...
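
The prompt excerpts above describe a private vote in which each agent returns a JSON hazard vector whose location field is a binary array with exactly N entries, one per component. The sketch below shows one way such ballots could be validated and tallied. It rests on assumptions that the excerpts do not confirm: the JSON key is taken to be `location`, any other hazard-vector fields are ignored, and votes are combined by simple plurality. The helper names `parse_location` and `tally` are hypothetical and are not part of the released POIROT library.

```python
import json
from collections import Counter


def parse_location(raw_json, n_components):
    """Parse one agent's JSON ballot and return its location array as a tuple.

    Assumption: the ballot is a JSON object with a "location" key holding
    a list of exactly n_components integers, each 0 or 1.
    """
    ballot = json.loads(raw_json)
    location = ballot["location"]
    if not isinstance(location, list) or len(location) != n_components:
        raise ValueError(f"location must be a list of exactly {n_components} elements")
    for bit in location:
        # Reject booleans, floats, strings: only the integers 0 and 1 pass.
        if type(bit) is not int or bit not in (0, 1):
            raise ValueError(f"location entries must be 0 or 1, got {bit!r}")
    return tuple(location)


def tally(locations):
    """Return the most frequent location arrays (several if there is a tie).

    Plurality voting is an assumption made for this sketch.
    """
    counts = Counter(locations)
    top = max(counts.values())
    return sorted(loc for loc, n in counts.items() if n == top)


# Hypothetical ballots from three agents in a three-component system.
ballots = [
    '{"location": [0, 1, 0]}',
    '{"location": [0, 1, 0]}',
    '{"location": [1, 0, 0]}',
]
parsed = [parse_location(b, n_components=3) for b in ballots]
winners = tally(parsed)
print(winners)
```

Returning every tied array instead of picking one leaves tie-breaking to the caller, since the excerpts do not say how POIROT resolves split votes.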