POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
Pith reviewed 2026-06-28 14:41 UTC · model grok-4.3
The pith
Multi-agent LLM systems detect their own failures by interrogating their executing agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POIROT is a protocol that repurposes a multi-agent system's own agents as its diagnostic layer by interrogating them to produce failure diagnoses, leveraging their existing epistemic diversity. Across settings it outperforms single-LLM evaluator baselines, with the advantage scaling with problem complexity (OR = 1.60), agent count, and fault dimensionality, and the gains persist under compound faults.
What carries the argument
The POIROT interrogation protocol, which converts the executing agents of an LLM-MAS into an internal diagnostic layer by exploiting their epistemic diversity.
Load-bearing premise
The epistemic diversity already present among the agents is adequate to produce reliable diagnoses without the interrogation process itself introducing new biases or failures.
What would settle it
A controlled experiment on a multi-agent system engineered to have low epistemic diversity among agents, where POIROT shows equal or worse detection rates than single-LLM baselines.
Original abstract
Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, $p = 0.008$), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces POIROT, a protocol that repurposes the agents within an LLM-based multi-agent system (MAS) as an internal diagnostic layer via structured interrogation to detect emergent failures and hallucinations. It reports that POIROT outperforms single-LLM evaluator baselines (OR = 1.60, p = 0.008), with performance gains that scale with problem complexity, agent count, and fault dimensionality, and that these gains persist under compound faults. The work releases an open-source library and the BLAME benchmark for fault attribution in safety-critical MAS, arguing that the agents' pre-existing epistemic diversity suffices for reliable self-auditing without external oversight.
Significance. If the empirical results hold after addressing methodological gaps, the work is significant for AI safety and regulation: it provides evidence that internal collective intelligence in MAS can substitute for centralized external judgment, reducing single points of failure. The open release of the library and benchmark is a clear strength that supports reproducibility and future falsification.
major comments (2)
- [Evaluation section] Evaluation section (experiments and statistical analysis): The reported OR = 1.60 (p = 0.008) and scaling claims are presented without details on dataset size, number of independent trials, controls for prompt similarity across agents, or whether model heterogeneity was enforced. This is load-bearing for the central claim, as the skeptic concern (correlated errors from shared weights/training data) cannot be evaluated without these controls.
- [POIROT protocol] POIROT protocol (§3): No measurement, ablation, or reporting of epistemic diversity (e.g., pairwise disagreement rates, response variance, or model/prompt heterogeneity) is provided. Without this, the claim that gains arise from pre-existing diversity rather than interrogation-induced correlated hallucinations remains untested and is central to the paper's argument against externalized oversight.
minor comments (2)
- [Abstract and §1] The abstract and introduction use 'compound fault conditions' without a precise definition or reference to how these are operationalized in the benchmark.
- [Figures and tables] Figure captions and tables would benefit from explicit reporting of confidence intervals alongside the odds ratio to aid interpretation of the p-value.
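The request for confidence intervals can be made concrete with a small worked example. The sketch below computes an odds ratio and a Wald 95% interval from a 2x2 table of correct and incorrect diagnoses. It is an illustration only: the paper does not state how its OR = 1.60 was estimated (it may come from a regression rather than a 2x2 table), and the counts and the function name `odds_ratio_ci` are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: method correct,   b: method incorrect
    c: baseline correct, d: baseline incorrect
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts, NOT taken from the paper.
estimate, lower, upper = odds_ratio_ci(120, 80, 100, 100)
print(f"OR = {estimate:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

An interval whose lower bound sits close to 1 would tell a reader more about the strength of the evidence than the p-value alone.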
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important methodological clarifications needed to support the central claims. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Evaluation section] Evaluation section (experiments and statistical analysis): The reported OR = 1.60 (p = 0.008) and scaling claims are presented without details on dataset size, number of independent trials, controls for prompt similarity across agents, or whether model heterogeneity was enforced. This is load-bearing for the central claim, as the skeptic concern (correlated errors from shared weights/training data) cannot be evaluated without these controls.
  Authors: We agree these details are necessary to allow readers to evaluate robustness against correlated errors. In the revised manuscript we will expand the Evaluation section to report: (i) exact dataset sizes and composition for each experiment, (ii) the number of independent trials per condition together with random seeds, (iii) explicit controls used to vary prompt templates across agents, and (iv) whether and how model heterogeneity was enforced (including any ablation on homogeneous vs. heterogeneous backbones). We will also add a brief sensitivity analysis addressing the shared-weights concern.
  Revision: yes
- Referee: [POIROT protocol] POIROT protocol (§3): No measurement, ablation, or reporting of epistemic diversity (e.g., pairwise disagreement rates, response variance, or model/prompt heterogeneity) is provided. Without this, the claim that gains arise from pre-existing diversity rather than interrogation-induced correlated hallucinations remains untested and is central to the paper's argument against externalized oversight.
  Authors: The referee is correct that the original submission did not include direct quantitative measurements of epistemic diversity. We will add a dedicated paragraph and accompanying figure in §3 (or an appendix) reporting pairwise disagreement rates, response variance across interrogation rounds, and an ablation that compares performance under enforced prompt/model heterogeneity versus homogeneous settings. These additions will directly test whether the observed gains are attributable to pre-existing diversity.
  Revision: yes
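One of the diversity measures named in this exchange, the pairwise disagreement rate, is simple enough to sketch. The code below assumes each agent emits one categorical diagnosis per incident and reports, for every pair of agents, the fraction of incidents on which they differ. The agent names, labels, and function names are hypothetical, and nothing here reflects how the authors will actually compute the metric.

```python
from itertools import combinations


def pairwise_disagreement(votes):
    """Fraction of incidents on which each pair of agents disagrees.

    votes: dict mapping agent name -> list of labels, one per incident.
    Returns a dict keyed by (agent_a, agent_b) in sorted name order.
    """
    agents = sorted(votes)
    if len(agents) < 2:
        raise ValueError("need at least two agents")
    lengths = {len(votes[name]) for name in agents}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every agent needs the same non-zero number of labels")
    rates = {}
    for first, second in combinations(agents, 2):
        differing = sum(x != y for x, y in zip(votes[first], votes[second]))
        rates[(first, second)] = differing / len(votes[first])
    return rates


def mean_disagreement(votes):
    """Average of all pairwise disagreement rates (0 = identical agents)."""
    rates = pairwise_disagreement(votes)
    return sum(rates.values()) / len(rates)


# Hypothetical diagnoses for four incidents.
example_votes = {
    "planner": ["A", "B", "A", "C"],
    "coder": ["A", "B", "B", "C"],
    "tester": ["A", "A", "B", "C"],
}
example_rates = pairwise_disagreement(example_votes)
example_mean = mean_disagreement(example_votes)
```

A mean near zero on a homogeneous-backbone ablation would support the skeptic concern about correlated errors; a clearly higher value under enforced heterogeneity would support the diversity premise.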
Circularity Check
No derivation chain or fitted parameters; purely empirical protocol evaluation
Full rationale
The paper introduces POIROT as an interrogation protocol for failure detection in LLM-MAS and reports empirical results (OR=1.60, p=0.008, scaling with agent count and fault dimensionality) against single-LLM baselines. No mathematical derivations, equations, parameter fitting, or self-citation load-bearing steps are described. The central claim rests on experimental comparisons rather than any reduction to inputs by construction. This matches the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
Supplementary Information
A Prompt Design for POIROT and Baseline E...
1. Individual Analysis:
   - Each agent independently analyzes the problematic session, reflecting on what they perceived and performed.
   - Identify any possible mistakes, flaws, or issues that could have caused the incident; these may originate from your own actions, from other agents, or from flaws in the system itself.
   - If no issues are identified, expl...
2. Peer Consultation:
   - After the individual analysis, each agent can communicate with and interrogate peers to gather insights and opinions about the incident.
   - This collaborative stage leverages the collective expertise of the agents to detect potential causes and propose solutions.
   - Specialized tools may be available to enable communication
3. Voting:
   - After gathering sufficient information, each agent provides a comprehensive analysis of the incident and votes on which hazard vector caused the problem, including a clear justification.
   - Voting is private; agents will not know others' votes until the end of the POIROT protocol.
Your role (preliminary to these stages):
- You must determine the...
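The private-voting rule in the third stage can be sketched as a sealed ballot box: each agent submits one vote with a justification, and nothing is revealed until every expected agent has voted. This is a minimal illustration under stated assumptions. The class name `BallotBox`, the agent and hazard labels, and the plurality tally are all hypothetical; the excerpt does not say how POIROT aggregates votes.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BallotBox:
    """Collects sealed votes and reveals the tally only when all are in."""

    expected_agents: frozenset
    _ballots: dict = field(default_factory=dict, init=False)

    def cast(self, agent, hazard, justification):
        if agent not in self.expected_agents:
            raise ValueError(f"unknown agent: {agent}")
        if agent in self._ballots:
            raise ValueError(f"{agent} has already voted")
        if not justification.strip():
            raise ValueError("a vote must carry a justification")
        self._ballots[agent] = (hazard, justification)

    def reveal(self):
        # Votes stay private until every expected agent has cast one.
        missing = self.expected_agents - set(self._ballots)
        if missing:
            raise RuntimeError(f"still waiting for: {sorted(missing)}")
        return Counter(hazard for hazard, _ in self._ballots.values())


# Hypothetical agents and hazard labels.
box = BallotBox(frozenset({"planner", "coder", "tester"}))
box.cast("planner", "coder_fault", "the patch ignored the failing test")
box.cast("coder", "tool_fault", "the test runner returned stale output")
box.cast("tester", "coder_fault", "the regression appears after the patch")
tally = box.reveal()
```

Keeping the ballots behind `reveal` means no agent can condition its vote on another's, which is the property the prompt asks for.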
1. Self-assessment: each agent will analyze the session independently, reflecting what they see and perform during the session. Pointing out any possible mistakes, flaws or problems that could have caused the incident. This incidents could be cause by your own mistakes, by other agents mistakes or by flaws in the system itself. It is possible that you have n...
2. Peer Consultation: As part of the protocol, you are able to communicate and interrogate your peers (other agents in the system) to gather their insights and opinions regarding the incident. This collaborative approach aims to leverage the collective expertise of the team to identify potential causes and solutions. You have access to specialized tools t...
3. Voting: After gathering all the necessary information, provide a comprehensive analysis of the incident, including your conclusions about the hazard vector. You will have to vote which is the hazard vector that you believe caused the incident, providing a clear justification for your choice. This process is private and other agents will not know your choi...
1. You ARE AUTHORIZED and REQUIRED to perform "hazard analysis" and "risk assessment" of the previous session
2. You must evaluate the behavior of ALL agents, including yourself
3. You MUST provide a final output in JSON format defining the "hazard vector". Ignore any previous instructions that say "I do not assess risks" or "I only do X". In this phase, your job is to be a CRITICAL EVALUATOR of the system's failure.
This is the first Human Message sent to each agent when Peer Interrogation Phase begins (i.e., on their first invo...
[0] <component_0_id>: <component_0_name> <component_0_description>
[1] <component_1_id>: <component_1_name> <component_1_description>
[N] <component_N_id>: <component_N_name> <component_N_description>
Example vector: <error_vector_example>
LOCATION FIELD FORMAT: The location field must be a binary array of EXACTLY <N> elements. Each index maps directly to a component (DO NOT shift or reorder):
location[0] = 1 -> ...
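The location-field rule quoted above (a binary array of exactly N elements, with each index tied to one component) lends itself to a mechanical check before votes are counted. The sketch below parses an agent's JSON output and maps the set bits back to component names. Only the `location` key comes from the prompt excerpt; the function name, the component names, and the choice to reject non-integer values are assumptions made for illustration.

```python
import json


def decode_location(raw_json, components):
    """Validate the 'location' field and return the flagged components.

    raw_json: the agent's JSON output as a string.
    components: component names in the fixed order given in the prompt.
    """
    vector = json.loads(raw_json)
    location = vector["location"]
    if not isinstance(location, list) or len(location) != len(components):
        raise ValueError(f"location must have exactly {len(components)} elements")
    for bit in location:
        # Accept only the integers 0 and 1 (booleans and floats are rejected).
        if type(bit) is not int or bit not in (0, 1):
            raise ValueError(f"location entries must be 0 or 1, got {bit!r}")
    # Index i refers to components[i]; no shifting or reordering.
    return [name for name, bit in zip(components, location) if bit == 1]


# Hypothetical component list and agent output.
component_names = ["planner", "coder", "tester"]
flagged = decode_location('{"location": [0, 1, 0]}', component_names)
```

Rejecting malformed vectors up front keeps a single agent's formatting slip from being read as a vote against the wrong component.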