POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Álvaro Gutiérrez; Annemarie F. Laudanski; Eduardo Rocon; Iñaki Dellibarda Varela; J.M. Valverde-García; Manuel Cebrian; Pablo Romero-Sorozabal; R. Sendra-Arranz

arxiv: 2606.02282 · v1 · pith:EZKV4R5Lnew · submitted 2026-06-01 · 💻 cs.AI

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J.M. Valverde-García, Annemarie F. Laudanski, Álvaro Gutiérrez, Eduardo Rocon, Manuel Cebrian

Pith reviewed 2026-06-28 14:41 UTC · model grok-4.3

keywords: multi-agent systems · failure detection · LLM-MAS · safety oversight · fault attribution · epistemic diversity · internal auditing

The pith

Multi-agent LLM systems detect their own failures by interrogating their executing agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that POIROT, a protocol for turning the agents in an LLM-based multi-agent system into an internal diagnostic layer, can identify failures and hallucinations more reliably than single-LLM external evaluators. It does so by drawing on the epistemic diversity already present among the agents rather than introducing a separate centralized judge. A sympathetic reader would care because centralized evaluation creates single points of failure and often requires unavailable domain expertise, especially under emerging safety regulations. The reported results show an advantage that increases with problem complexity, agent count, and fault dimensionality, and that holds even when multiple faults occur together.

Core claim

POIROT is a protocol that repurposes a multi-agent system's own agents as its diagnostic layer by interrogating them to produce failure diagnoses, leveraging their existing epistemic diversity. Across settings it outperforms single-LLM evaluator baselines, with the advantage scaling with problem complexity (OR = 1.60), agent count, and fault dimensionality, and the gains persist under compound faults.

What carries the argument

The POIROT interrogation protocol, which converts the executing agents of an LLM-MAS into an internal diagnostic layer by exploiting their epistemic diversity.

Load-bearing premise

The epistemic diversity already present among the agents is adequate to produce reliable diagnoses without the interrogation process itself introducing new biases or failures.

What would settle it

A controlled experiment on a multi-agent system engineered to have low epistemic diversity among agents, where POIROT shows equal or worse detection rates than single-LLM baselines.

Original abstract

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, $p = 0.008$), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

POIROT lets agents interrogate each other to find failures, but the abstract describes no controls for whether the process creates correlated errors rather than exploiting real diversity.

The paper's core idea is that multi-agent LLM systems can use their own agents to audit failures through an interrogation protocol called POIROT, instead of needing external judgment. The authors back this with comparisons showing better results than single-LLM evaluators, with the advantage growing as problems get more complex.

What stands out is the release of the open-source library and the BLAME benchmark for fault attribution. That makes the work more usable for follow-up studies. The scaling results with agent count and fault types give some indication that the method handles harder cases.

The soft spots are in the evidence base. The reported odds ratio comes without any description of the models, the exact prompts, or tests for whether the agents actually have independent views. If the agents share similar training or role instructions, the interrogation could just propagate the same errors rather than catch them. The abstract does not address this, so the central claim about collective intelligence auditing the role rests on an untested assumption.

This paper is for groups building LLM-based multi-agent tools for safety-critical work. A reader focused on practical deployment issues would pick up the protocol and the benchmark as starting points.

It deserves a serious referee because the problem it targets is important and the approach is specific enough to be tested. The current draft needs expansion on methods and diversity checks to make the results convincing.

I would recommend sending it for peer review with those points flagged.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces POIROT, a protocol that repurposes the agents within an LLM-based multi-agent system (MAS) as an internal diagnostic layer via structured interrogation to detect emergent failures and hallucinations. It reports that POIROT outperforms single-LLM evaluator baselines (OR = 1.60, p = 0.008), with performance gains that scale with problem complexity, agent count, and fault dimensionality, and that these gains persist under compound faults. The work releases an open-source library and the BLAME benchmark for fault attribution in safety-critical MAS, arguing that the agents' pre-existing epistemic diversity suffices for reliable self-auditing without external oversight.

Significance. If the empirical results hold after addressing methodological gaps, the work is significant for AI safety and regulation: it provides evidence that internal collective intelligence in MAS can substitute for centralized external judgment, reducing single points of failure. The open release of the library and benchmark is a clear strength that supports reproducibility and future falsification.

major comments (2)

[Evaluation section] Evaluation section (experiments and statistical analysis): The reported OR = 1.60 (p = 0.008) and scaling claims are presented without details on dataset size, number of independent trials, controls for prompt similarity across agents, or whether model heterogeneity was enforced. This is load-bearing for the central claim, as the skeptic's concern (correlated errors from shared weights/training data) cannot be evaluated without these controls.
[POIROT protocol] POIROT protocol (§3): No measurement, ablation, or reporting of epistemic diversity (e.g., pairwise disagreement rates, response variance, or model/prompt heterogeneity) is provided. Without this, the claim that gains arise from pre-existing diversity rather than interrogation-induced correlated hallucinations remains untested and is central to the paper's argument against externalized oversight.
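
One way the diversity measurement requested in these comments could be operationalized is a pairwise disagreement rate over the agents' votes. The sketch below is illustrative only: the function name, the vote format (one label per agent per incident), and the example agent names and labels are assumptions, not taken from the paper or from the POIROT library.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_disagreement(votes: Dict[str, List[str]]) -> float:
    """Fraction of (agent pair, incident) comparisons in which two agents differ.

    `votes` maps a hypothetical agent name to its label for each incident,
    listed in the same incident order for every agent.
    """
    agents = list(votes)
    if len(agents) < 2:
        raise ValueError("need at least two agents")
    n_incidents = len(votes[agents[0]])
    if n_incidents == 0 or any(len(v) != n_incidents for v in votes.values()):
        raise ValueError("every agent needs one label per incident")

    differing = 0
    compared = 0
    for first, second in combinations(agents, 2):
        for label_a, label_b in zip(votes[first], votes[second]):
            compared += 1
            differing += label_a != label_b
    return differing / compared


# Hypothetical votes from three agents on four incidents (not real data).
example_votes = {
    "planner": ["sensor", "planner", "sensor", "none"],
    "executor": ["sensor", "executor", "sensor", "none"],
    "monitor": ["sensor", "planner", "actuator", "none"],
}
rate = pairwise_disagreement(example_votes)
```

A rate close to zero would indicate that the agents almost never disagree, which is the low-diversity regime in which the referee's correlated-error concern is most pressing.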

minor comments (2)

[Abstract and §1] The abstract and introduction use 'compound fault conditions' without a precise definition or reference to how these are operationalized in the benchmark.
[Figures and tables] Figure captions and tables would benefit from explicit reporting of confidence intervals alongside the odds ratio to aid interpretation of the p-value.
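
To illustrate the kind of interval the last comment asks for, the sketch below computes an odds ratio and a Wald confidence interval from a 2x2 table. It is a textbook calculation under stated assumptions, not the paper's analysis: the counts are invented for the arithmetic only, the function name is hypothetical, and the paper's reported OR may come from a different procedure (for example a regression model).

```python
import math
from typing import Tuple


def odds_ratio_wald_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b: successes / failures in the first condition
    c, d: successes / failures in the second condition
    z:    normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts, chosen only to show the arithmetic (not the paper's data).
or_value, ci_low, ci_high = odds_ratio_wald_ci(a=30, b=20, c=20, d=30)
```

Reporting the interval next to the point estimate lets a reader see how far the estimate sits from OR = 1, which a p-value alone does not convey.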

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important methodological clarifications needed to support the central claims. We address each point below and will revise the manuscript accordingly.

Referee: [Evaluation section] Evaluation section (experiments and statistical analysis): The reported OR = 1.60 (p = 0.008) and scaling claims are presented without details on dataset size, number of independent trials, controls for prompt similarity across agents, or whether model heterogeneity was enforced. This is load-bearing for the central claim, as the skeptic's concern (correlated errors from shared weights/training data) cannot be evaluated without these controls.

Authors: We agree these details are necessary to allow readers to evaluate robustness against correlated errors. In the revised manuscript we will expand the Evaluation section to report: (i) exact dataset sizes and composition for each experiment, (ii) the number of independent trials per condition together with random seeds, (iii) explicit controls used to vary prompt templates across agents, and (iv) whether and how model heterogeneity was enforced (including any ablation on homogeneous vs. heterogeneous backbones). We will also add a brief sensitivity analysis addressing the shared-weights concern.

Revision: yes

Referee: [POIROT protocol] POIROT protocol (§3): No measurement, ablation, or reporting of epistemic diversity (e.g., pairwise disagreement rates, response variance, or model/prompt heterogeneity) is provided. Without this, the claim that gains arise from pre-existing diversity rather than interrogation-induced correlated hallucinations remains untested and is central to the paper's argument against externalized oversight.

Authors: The referee is correct that the original submission did not include direct quantitative measurements of epistemic diversity. We will add a dedicated paragraph and accompanying figure in §3 (or an appendix) reporting pairwise disagreement rates, response variance across interrogation rounds, and an ablation that compares performance under enforced prompt/model heterogeneity versus homogeneous settings. These additions will directly test whether the observed gains are attributable to pre-existing diversity.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain or fitted parameters; purely empirical protocol evaluation

The paper introduces POIROT as an interrogation protocol for failure detection in LLM-MAS and reports empirical results (OR=1.60, p=0.008, scaling with agent count and fault dimensionality) against single-LLM baselines. No mathematical derivations, equations, parameter fitting, or self-citation load-bearing steps are described. The central claim rests on experimental comparisons rather than any reduction to inputs by construction. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; assessment limited to stated empirical claims.

pith-pipeline@v0.9.1-grok · 5754 in / 1037 out tokens · 28259 ms · 2026-06-28T14:41:41.981490+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 10 canonical work pages

[1]

ACM Transactions on Software Engineering and Methodology

He, J., Treude, C., Lo, D.: LLM-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACM Trans. Softw. Eng. Methodol. 34(5) (2025) https://doi.org/10.1145/3712003

work page doi:10.1145/3712003 2025
[2]

A survey on LLM-based multi-agent systems: Workflow, infrastructure, and challenges. Vicinagearth, 1(1):9, 2024

Li, X., Wang, S., Zeng, S., et al.: A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1, 9 (2024) https://doi.org/10.1007/s44336-024-00009-2

work page doi:10.1007/s44336-024-00009-2 2024
[3]

In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Darwish, K., Agarwal, A

Su, H., Luo, W., Mehdad, Y., Han, W., Liu, E., Zhang, W., Zhao, M., Zhang, J.: LLM-friendly knowledge representation for customer support. In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Darwish, K., Agarwal, A. (eds.) Proceedings of the 31st International Conference on Computational Linguistics: Industry Trac...

2025
[4]

https://arxiv.org/abs/2405.15793

Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O.: SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (2024). https://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024
[5]

https://arxiv.org/abs/2410.12189

Shankar, S., Chambers, T., Shah, T., Parameswaran, A.G., Wu, E.: DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing (2025). https://arxiv.org/abs/2410.12189

arXiv 2025
[6]

https://arxiv.org/abs/2402.03578

Han, S., Zhang, Q., Jin, W., Xu, Z.: LLM Multi-Agent Systems: Challenges and Open Problems (2026). https://arxiv.org/abs/2402.03578

Pith/arXiv arXiv 2026
[7]

https://arxiv.org/abs/2306.03314

Talebirad, Y., Nadiri, A.: Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents (2023). https://arxiv.org/abs/2306.03314

Pith/arXiv arXiv 2023
[8]

Larger and more instructable language models become less reliable

Zhou, L., Schellaert, W., Martínez-Plumed, F., et al.: Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024) https://doi.org/10.1038/s41586-024-07930-y

work page doi:10.1038/s41586-024-07930-y 2024
[9]

https://arxiv.org/abs/2507.01231

Varela, I.D., Romero-Sorozabal, P., Rocon, E., Cebrian, M.: Rethinking the Illusion of Thinking (2025). https://arxiv.org/abs/2507.01231

arXiv 2025
[10]

Detecting hallucinations in large language models using semantic entropy

Farquhar, S., Kossen, J., Kuhn, L., et al.: Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024) https://doi.org/10.1038/s41586-024-07421-0

work page doi:10.1038/s41586-024-07421-0 2024
[11]

https://arxiv.org/abs/2509.04664

Kalai, A.T., Nachum, O., Vempala, S.S., Zhang, E.: Why Language Models Hallucinate (2025). https://arxiv.org/abs/2509.04664

Pith/arXiv arXiv 2025
[12]

https://arxiv.org/abs/2511.19933

Vinay, V.: Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications (2025). https://arxiv.org/abs/2511.19933

arXiv 2025
[13]

https://arxiv.org/abs/2504.15585

Wang, K., Zhang, G., Zhou, Z., Wu, J., Yu, M., Zhao, S., Yin, C., Fu, J., Yan, Y., Luo, H., Lin, L., Xu, Z., Lu, H., Cao, X., Zhou, X., Jin, W., Meng, F., Xu, S., Mao, J., Wang, Y., Wu, H., Wang, M., Zhang, F., Fang, J., Qu, W., Liu, Y., Liu, C., Zhang, Y., Li, Q., Guo, C., Qin, Y., Fan, Z., Wang, K., Ding, Y., Hong, D., Ji, J., Lai, Y., Yu, Z., Li, X., J...

arXiv 2025
[14]

http://data.europa.eu/eli/reg/2024/1689/oj

European Parliament and Council of the European Union: Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Laying down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act). http://data.europa.eu/eli/reg/2024/1689/oj

2024
[15]

https://arxiv.org/abs/2211.09110

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C.D., Ré, C., Acosta-Navas, D., Hudson, D.A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M....

Pith/arXiv arXiv 2023
[16]

https://arxiv.org/abs/2407.04069

Laskar, M.T.R., Alqahtani, S., Bari, M.S., Rahman, M., Khan, M.A.M., Khan, H., Jahan, I., Bhuiyan, A., Tan, C.W., Parvez, M.R., Hoque, E., Joty, S., Huang, J.: A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations (2024). https://arxiv.org/abs/2407.04069

arXiv 2024
[17]

https://arxiv.org/abs/2306.05685

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2023). https://arxiv.org/abs/2306.05685

2023
[18]

npj Health Systems 2, 40 (2025) https://doi.org/10.1038/s44401-025-00043-2

Awasthi, R., Bhattad, A., Ramachandran, S.P., et al.: Human evaluation of large language models in healthcare: gaps, challenges, and the need for standardization. npj Health Systems 2, 40 (2025) https://doi.org/10.1038/s44401-025-00043-2

work page doi:10.1038/s44401-025-00043-2 2025
[19]

The Innovation 7(6), 101253 (2026) https://doi.org/10.1016/j.xinn.2025.101253

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Lin, Z., Zhang, B., Ni, L., Gao, W., Wang, Y., Guo, J.: A survey on llm-as-a-judge. The Innovation 7(6), 101253 (2026) https://doi.org/10.1016/j.xinn.2025.101253

work page doi:10.1016/j.xinn.2025.101253 2026
[20]

https://arxiv.org/abs/2212.08073

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N....

Pith/arXiv arXiv 2022
[21]

https://arxiv.org/abs/2406.12624

Thakur, A.S., Choudhary, K., Ramayapally, V.S., Vaidyanathan, S., Hupkes, D.: Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (2025). https://arxiv.org/abs/2406.12624

arXiv 2025
[22]

https://arxiv.org/abs/2305.17926

Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Liu, Q., Liu, T., Sui, Z.: Large Language Models are not Fair Evaluators (2023). https://arxiv.org/abs/2305.17926

Pith/arXiv arXiv 2023
[23]

https://arxiv.org/abs/2511.03506

Chen, D., Niu, S., Li, K., Liu, P., Zheng, X., Tang, B., Li, X., Xiong, F., Li, Z.: HaluMem: Evaluating Hallucinations in Memory Systems of Agents (2026). https://arxiv.org/abs/2511.03506

arXiv 2026
[24]

https://arxiv.org/abs/2512.06749

Ma, M., Zhang, J., Yang, F., Kang, Y., Lin, Q., Rajmohan, S., Zhang, D.: DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems (2026). https://arxiv.org/abs/2512.06749

arXiv 2026
[25]

https://arxiv.org/abs/2303.17651

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative Refinement with Self-Feedback (2023). https://arxiv.org/abs/2303.17651

Pith/arXiv arXiv 2023
[26]

https://arxiv.org/abs/2309.11495

Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-Verification Reduces Hallucination in Large Language Models (2023). https://arxiv.org/abs/2309.11495

Pith/arXiv arXiv 2023
[27]

Anomalygpt: Detecting industrial anomalies using large vision-language models

Zhong, W., Guo, L., Gao, Q., Ye, H., Wang, Y.: Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19724–19731 (2023) https://doi.org/10.1609/aaai.v38i17.29946

work page doi:10.1609/aaai.v38i17.29946 2023
[28]

Humans or LLMs as the judge? a study on judgement bias

Chen, G.H., Chen, S., Liu, Z., Jiang, F., Wang, B.: Humans or LLMs as the judge? a study on judgement bias. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.1865...

work page doi:10.18653/v1/2024.emnlp-main.474 2024
[29]

In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T

Qi, S., Cao, R., He, Y., Yuan, Z.: Evaluating LLMs’ assessment of mixed-context hallucination through the lens of summarization. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025, pp. 16480–16503. Association for Computational Linguistics, Vienna, Austria (2025). https://doi....

work page doi:10.18653/v1/2025.findings-acl.847 2025
[30]

https://arxiv.org/abs/2505.00212

Zhang, S., Yin, M., Zhang, J., Liu, J., Han, Z., Zhang, J., Li, B., Wang, C., Wang, H., Chen, Y., Wu, Q.: Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems (2025). https://arxiv.org/abs/2505.00212

arXiv 2025
[31]

https://arxiv.org/abs/2412.20138

Xiao, Y., Sun, E., Luo, D., Wang, W.: TradingAgents: Multi-Agents LLM Financial Trading Framework (2025). https://arxiv.org/abs/2412.20138

arXiv 2025
[32]

https://arxiv.org/abs/2502.15631

Ballon, M., Algaba, A., Ginis, V.: The Relationship Between Reasoning and Performance in Large Language Models – o3 (mini) Thinks Harder, Not Longer (2025). https://arxiv.org/abs/2502.15631

arXiv 2025
[33]

https://arxiv.org/abs/2412.18547

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., Chen, Z.: Token-Budget-Aware LLM Reasoning (2025). https://arxiv.org/abs/2412.18547

arXiv 2025
[34]

https://arxiv.org/abs/2508.10925

OpenAI: gpt-oss-120b & gpt-oss-20b Model Card (2025). https://arxiv.org/abs/2508.10925

Pith/arXiv arXiv 2025
[35]

https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf

Google DeepMind: Gemini 2.5 Pro Model Card (2025). https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf

2025
[36]

https://arxiv.org/abs/2412.19437

DeepSeek-AI: DeepSeek-V3 Technical Report (2024). https://arxiv.org/abs/2412.19437

Pith/arXiv arXiv 2024
[37]

https://github.com/langchain-ai/langgraph

LangChain AI: LangGraph: Building stateful, multi-actor applications with LLMs. https://github.com/langchain-ai/langgraph. Accessed: 2025 (2024)

2025
[38]

In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp

Romero-Sorozabal, P., Delgado-Oleas, G., Laudanski, A.F., Gutiérrez, Á., Rocon, E.: Discover2walk: A cable-driven robotic platform to promote gait in pediatric population. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4678–4685 (2024). https://doi.org/10.1109/IROS58592.2024.10802838

arXiv 2024
[39]

Emken, J.L., Benitez, R., Reinkensmeyer, D.J.: Human-robot cooperative movement training: Learning a novel sensory motor transformation during walking with robotic assistance-as-needed. Journal of NeuroEngineering and Rehabilitation 4(8) (2007) https://doi.org/10.1186/1743-0003-4-8

Supplementary Information A Prompt Design for POIROT and Baseline E...

work page doi:10.1186/1743-0003-4-8 2007
[40]

Individual Analysis:
- Each agent independently analyzes the problematic session, reflecting on what they perceived and performed.
- Identify any possible mistakes, flaws, or issues that could have caused the incident; these may originate from your own actions, from other agents, or from flaws in the system itself.
- If no issues are identified, expl...
[41]

- This collaborative stage leverages the collective expertise of the agents to detect potential causes and propose solutions

Peer Consultation:
- After the individual analysis, each agent can communicate with and interrogate peers to gather insights and opinions about the incident.
- This collaborative stage leverages the collective expertise of the agents to detect potential causes and propose solutions.
- Specialized tools may be available to enable communication
[42]

system_name

Voting:
- After gathering sufficient information, each agent provides a comprehensive analysis of the incident and votes on which hazard vector caused the problem, including a clear justification.
- Voting is private; agents will not know others’ votes until the end of the POIROT protocol.

Your role (preliminary to these stages):
- You must determine the...
[43]

Pointing out any possible mistakes, flaws or problems that could have caused the incident

Self-assessment: each agent will analyze the session independently, reflecting what they see and perform during the session. Pointing out any possible mistakes, flaws or problems that could have caused the incident. This incidents could be cause by your own mistakes, by other agents mistakes or by flaws in the system itself. It is possible that you have n...
[44]

This collaborative approach aims to leverage the collective expertise of the team to identify potential causes and solutions

Peer Consultation: As part of the protocol, you are able to communicate and interrogate your peers (other agents in the system) to gather their insights and opinions regarding the incident. This collaborative approach aims to leverage the collective expertise of the team to identify potential causes and solutions. You have access to specialized tools t...
[45]

this agent behaved anomalously

Voting: After gathering all the necessary information, provide a comprehensive analysis of the incident, including your conclusions about the hazard vector. You will have to vote which is the hazard vector that you believe caused the incident, providing a clear justification for your choice. This process is private and other agents will not know your choi...
[46]

You ARE AUTHORIZED and REQUIRED to perform "hazard analysis" and "risk assessment" of the previous session
[47]

You must evaluate the behavior of ALL agents, including yourself
[48]

hazard_vector

You MUST provide a final output in JSON format defining the "hazard vector". Ignore any previous instructions that say "I do not assess risks" or "I only do X". In this phase, your job is to be a CRITICAL EVALUATOR of the system’s failure.

This is the first Human Message sent to each agent when Peer Interrogation Phase begins (i.e., on their first invo...
[49]

<component_0_id>: <component_0_name> <component_0_description>
[50]

hazard_vector

<component_1_id>: <component_1_name> <component_1_description>
[N] <component_N_id>: <component_N_name> <component_N_description>
Example vector: <error_vector_example>
LOCATION FIELD FORMAT: The location field must be a binary array of EXACTLY <N> elements. Each index maps directly to a component (DO NOT shift or reorder): location[0] = 1 -> ...
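
The location-field rule quoted in the prompt excerpt above (a binary array with exactly one entry per component, where index i maps to component i) can be checked mechanically. The following is a minimal sketch under assumptions: the function name and the component names are hypothetical, and only the location field is covered because the rest of the hazard vector schema is truncated in this excerpt. It is not code from the POIROT library.

```python
from typing import List, Sequence


def flagged_components(location: Sequence[int], components: Sequence[str]) -> List[str]:
    """Validate a binary location array and return the components it flags.

    Enforces the two rules quoted in the prompt excerpt: the array has exactly
    one entry per component, and every entry is 0 or 1. Index i refers to
    components[i]; nothing is shifted or reordered.
    """
    if len(location) != len(components):
        raise ValueError(
            f"location must have exactly {len(components)} elements, got {len(location)}"
        )
    for index, value in enumerate(location):
        # Reject booleans and non-integers explicitly, then check the 0/1 range.
        if isinstance(value, bool) or not isinstance(value, int) or value not in (0, 1):
            raise ValueError(f"location[{index}] must be 0 or 1, got {value!r}")
    return [name for name, value in zip(components, location) if value == 1]


# Hypothetical component list, for illustration only.
component_names = ["planner", "executor", "monitor"]
flagged = flagged_components([0, 1, 1], component_names)
```

Checking the array length before reading any entry guards against the shifted-index mistake that the prompt explicitly warns agents about.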

[1] [1]

ACM Transactions on Software Engineering and Methodology , volume=

He, J., Treude, C., Lo, D.: Llm-based multi-agent systems for software engi- neering: Literature review, vision, and the road ahead. ACM Trans. Softw. Eng. Methodol. 34(5) (2025) https://doi.org/10.1145/3712003

work page doi:10.1145/3712003 2025

[2] [2]

A survey on LLM-based multi- agent systems: Workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

Li, X., Wang, S., Zeng, S., et al. : A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1, 9 (2024) https://doi. org/10.1007/s44336-024-00009-2

work page doi:10.1007/s44336-024-00009-2 2024

[3] [3]

In: 20 Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Darwish, K., Agarwal, A

Su, H., Luo, W., Mehdad, Y., Han, W., Liu, E., Zhang, W., Zhao, M., Zhang, J.: LLM-friendly knowledge representation for customer support. In: 20 Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Darwish, K., Agarwal, A. (eds.) Proceedings of the 31st International Conference on Computational Linguistics: Industry Trac...

2025

[4] [4]

https://arxiv.org/abs/2405.15793

Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O.: SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (2024). https://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024

[5] [5]

https://arxiv.org/abs/2410.12189

Shankar, S., Chambers, T., Shah, T., Parameswaran, A.G., Wu, E.: DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing (2025). https://arxiv.org/abs/2410.12189

arXiv 2025

[6] [6]

https://arxiv.org/abs/2402.03578

Han, S., Zhang, Q., Jin, W., Xu, Z.: LLM Multi-Agent Systems: Challenges and Open Problems (2026). https://arxiv.org/abs/2402.03578

Pith/arXiv arXiv 2026

[7] [7]

https://arxiv.org/abs/2306.03314

Talebirad, Y., Nadiri, A.: Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents (2023). https://arxiv.org/abs/2306.03314

Pith/arXiv arXiv 2023

[8] [8]

: Larger and more instructable language models become less reliable

Zhou, L., Schellaert, W., Martínez-Plumed, F., et al. : Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024) https://doi.org/10.1038/s41586-024-07930-y

work page doi:10.1038/s41586-024-07930-y 2024

[9] [9]

https://arxiv.org/abs/2507.01231

Varela, I.D., Romero-Sorozabal, P., Rocon, E., Cebrian, M.: Rethinking the Illusion of Thinking (2025). https://arxiv.org/abs/2507.01231

arXiv 2025

[10] [10]

Detecting hallucinations in large language models using semantic entropy , volume =

Farquhar, S., Kossen, J., Kuhn, L., et al. : Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024) https: //doi.org/10.1038/s41586-024-07421-0

work page doi:10.1038/s41586-024-07421-0 2024

[11] [11]

https://arxiv.org/abs/2509.04664

Kalai, A.T., Nachum, O., Vempala, S.S., Zhang, E.: Why Language Models Hallucinate (2025). https://arxiv.org/abs/2509.04664

Pith/arXiv arXiv 2025

[12] [12]

https://arxiv.org/abs/2511.19933

Vinay, V.: Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications (2025). https://arxiv.org/abs/2511.19933

arXiv 2025

[13] [13]

https://arxiv.org/abs/2504.15585

Wang, K., Zhang, G., Zhou, Z., Wu, J., Yu, M., Zhao, S., Yin, C., Fu, J., Yan, Y., Luo, H., Lin, L., Xu, Z., Lu, H., Cao, X., Zhou, X., Jin, W., Meng, F., Xu, S., Mao, J., Wang, Y., Wu, H., Wang, M., Zhang, F., Fang, J., Qu, W., Liu, Y., Liu, C., Zhang, Y., Li, Q., Guo, C., Qin, Y., Fan, Z., Wang, K., Ding, Y., Hong, D., Ji, J., Lai, Y., Yu, Z., Li, X., J...

arXiv 2025

[14] [14]

http://data.europa.eu/eli/reg/2024/1689/oj

European Parliament and Council of the European Union: Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Lay- ing down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act). http://data.europa.eu/eli/reg/2024/1689/oj

2024

[15] [15]

https://arxiv.org/abs/2211.09110

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C.D., Ré, C., Acosta-Navas, D., Hudson, D.A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., San- thanam, K., Orr, L., Zheng, L., Yuksekgonul, M....

Pith/arXiv arXiv 2023

[16] [16]

https: //arxiv.org/abs/2407.04069

Laskar, M.T.R., Alqahtani, S., Bari, M.S., Rahman, M., Khan, M.A.M., Khan, H., Jahan, I., Bhuiyan, A., Tan, C.W., Parvez, M.R., Hoque, E., Joty, S., Huang, J.: A Systematic Survey and Critical Review on Evaluating Large Lan- guage Models: Challenges, Limitations, and Recommendations (2024). https: //arxiv.org/abs/2407.04069

arXiv 2024

[17] [17]

https://arxiv.org/abs/2306.05685

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2023). https://arxiv.org/abs/2306.05685

2023

[18] [18]

npj Health Systems 2, 40 (2025) https://doi.org/10.1038/s44401-025-00043-2

Awasthi, R., Bhattad, A., Ramachandran, S.P., et al.: Human evaluation of large language models in healthcare: gaps, challenges, and the need for standardization. npj Health Systems 2, 40 (2025) https://doi.org/10.1038/s44401-025-00043-2

work page doi:10.1038/s44401-025-00043-2 2025

[19] [19]

The Innovation 7(6), 101253 (2026) https://doi.org/10.1016/j.xinn.2025.101253

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Lin, Z., Zhang, B., Ni, L., Gao, W., Wang, Y., Guo, J.: A survey on LLM-as-a-judge. The Innovation 7(6), 101253 (2026) https://doi.org/10.1016/j.xinn.2025.101253

work page doi:10.1016/j.xinn.2025.101253 2026

[20] [20]

https://arxiv.org/abs/2212.08073

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N....

Pith/arXiv arXiv 2022

[21] [21]

https://arxiv.org/abs/2406.12624

Thakur, A.S., Choudhary, K., Ramayapally, V.S., Vaidyanathan, S., Hupkes, D.: Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (2025). https://arxiv.org/abs/2406.12624

arXiv 2025

[22] [22]

https://arxiv.org/abs/2305.17926

Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Liu, Q., Liu, T., Sui, Z.: Large Language Models are not Fair Evaluators (2023). https://arxiv.org/abs/2305.17926

Pith/arXiv arXiv 2023

[23] [23]

https://arxiv.org/abs/2511.03506

Chen, D., Niu, S., Li, K., Liu, P., Zheng, X., Tang, B., Li, X., Xiong, F., Li, Z.: HaluMem: Evaluating Hallucinations in Memory Systems of Agents (2026). https://arxiv.org/abs/2511.03506

arXiv 2026

[24] [24]

https://arxiv.org/abs/2512.06749

Ma, M., Zhang, J., Yang, F., Kang, Y., Lin, Q., Rajmohan, S., Zhang, D.: DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems (2026). https://arxiv.org/abs/2512.06749

arXiv 2026

[25] [25]

https://arxiv.org/abs/2303.17651

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative Refinement with Self-Feedback (2023). https://arxiv.org/abs/2303.17651

Pith/arXiv arXiv 2023

[26] [26]

https://arxiv.org/abs/2309.11495

Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-Verification Reduces Hallucination in Large Language Models (2023). https://arxiv.org/abs/2309.11495

Pith/arXiv arXiv 2023

[27] [27]

Memorybank: Enhancing large language models with long-term memory

Zhong, W., Guo, L., Gao, Q., Ye, H., Wang, Y.: Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19724–19731 (2023) https://doi.org/10.1609/aaai.v38i17.29946

work page doi:10.1609/aaai.v38i17.29946 2023

[28] [28]

Humans or LLMs as the judge? a study on judgement bias

Chen, G.H., Chen, S., Liu, Z., Jiang, F., Wang, B.: Humans or LLMs as the judge? a study on judgement bias. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.1865...

work page doi:10.18653/v1/2024.emnlp-main.474 2024

[29] [29]

Evaluating LLMs’ assessment of mixed-context hallucination through the lens of summarization

Qi, S., Cao, R., He, Y., Yuan, Z.: Evaluating LLMs’ assessment of mixed-context hallucination through the lens of summarization. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025, pp. 16480–16503. Association for Computational Linguistics, Vienna, Austria (2025). https://doi....

work page doi:10.18653/v1/2025.findings-acl.847 2025

[30] [30]

https://arxiv.org/abs/2505.00212

Zhang, S., Yin, M., Zhang, J., Liu, J., Han, Z., Zhang, J., Li, B., Wang, C., Wang, H., Chen, Y., Wu, Q.: Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems (2025). https://arxiv.org/abs/2505.00212

arXiv 2025

[31] [31]

https://arxiv.org/abs/2412.20138

Xiao, Y., Sun, E., Luo, D., Wang, W.: TradingAgents: Multi-Agents LLM Financial Trading Framework (2025). https://arxiv.org/abs/2412.20138

arXiv 2025

[32] [32]

https://arxiv.org/abs/2502.15631

Ballon, M., Algaba, A., Ginis, V.: The Relationship Between Reasoning and Performance in Large Language Models – o3 (mini) Thinks Harder, Not Longer (2025). https://arxiv.org/abs/2502.15631

arXiv 2025

[33] [33]

https://arxiv.org/abs/2412.18547

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., Chen, Z.: Token-Budget-Aware LLM Reasoning (2025). https://arxiv.org/abs/2412.18547

arXiv 2025

[34] [34]

https://arxiv.org/abs/2508.10925

OpenAI: gpt-oss-120b & gpt-oss-20b Model Card (2025). https://arxiv.org/abs/2508.10925

Pith/arXiv arXiv 2025

[35] [35]

https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf

Google DeepMind: Gemini 2.5 Pro Model Card (2025). https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf

2025

[36] [36]

https://arxiv.org/abs/2412.19437

DeepSeek-AI: DeepSeek-V3 Technical Report (2024). https://arxiv.org/abs/2412.19437

Pith/arXiv arXiv 2024

[37] [37]

https://github.com/langchain-ai/langgraph

LangChain AI: LangGraph: Building stateful, multi-actor applications with LLMs. https://github.com/langchain-ai/langgraph. Accessed: 2025 (2024)

2025

[38] [38]

Discover2walk: A cable-driven robotic platform to promote gait in pediatric population

Romero-Sorozabal, P., Delgado-Oleas, G., Laudanski, A.F., Gutiérrez, Á., Rocon, E.: Discover2walk: A cable-driven robotic platform to promote gait in pediatric population. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4678–4685 (2024). https://doi.org/10.1109/IROS58592.2024.10802838

arXiv 2024

[39] [39]

Emken, J.L., Benitez, R., Reinkensmeyer, D.J.: Human-robot cooperative movement training: Learning a novel sensory motor transformation during walking with robotic assistance-as-needed. Journal of NeuroEngineering and Rehabilitation 4(8) (2007) https://doi.org/10.1186/1743-0003-4-8

Supplementary Information

A Prompt Design for POIROT and Baseline E...

work page doi:10.1186/1743-0003-4-8 2007

[40] [40]

Individual Analysis:
- Each agent independently analyzes the problematic session, reflecting on what they perceived and performed.
- Identify any possible mistakes, flaws, or issues that could have caused the incident; these may originate from your own actions, from other agents, or from flaws in the system itself.
- If no issues are identified, expl...

[41] [41]

- This collaborative stage leverages the collective expertise of the agents to detect potential causes and propose solutions

Peer Consultation:
- After the individual analysis, each agent can communicate with and interrogate peers to gather insights and opinions about the incident.
- This collaborative stage leverages the collective expertise of the agents to detect potential causes and propose solutions.
- Specialized tools may be available to enable communication

[42] [42]

system_name

Voting:
- After gathering sufficient information, each agent provides a comprehensive analysis of the incident and votes on which hazard vector caused the problem, including a clear justification.
- Voting is private; agents will not know others’ votes until the end of the POIROT protocol.

Your role (preliminary to these stages):
- You must determine the...

[43] [43]

Pointing out any possible mistakes, flaws or problems that could have caused the incident

Self-assessment: each agent will analyze the session independently, reflecting what they see and perform during the session. Pointing out any possible mistakes, flaws or problems that could have caused the incident. This incidents could be cause by your own mistakes, by other agents mistakes or by flaws in the system itself. It is possible that you have n...

[44] [44]

This collaborative approach aims to leverage the collective expertise of the team to identify potential causes and solutions

Peer Consultation: As part of the protocol, you are able to communicate and interrogate your peers (other agents in the system) to gather their insights and opinions regarding the incident. This collaborative approach aims to leverage the collective expertise of the team to identify potential causes and solutions. You have access to specialized tools t...

[45] [45]

this agent behaved anomalously

Voting: After gathering all the necessary information, provide a comprehensive analysis of the incident, including your conclusions about the hazard vector. You will have to vote which is the hazard vector that you believe caused the incident, providing a clear justification for your choice. This process is private and other agents will not know your choi...

[46] [46]

You ARE AUTHORIZED and REQUIRED to perform "hazard analysis" and "risk assessment" of the previous session

[47] [47]

You must evaluate the behavior of ALL agents, including yourself

[48] [48]

hazard_vector

You MUST provide a final output in JSON format defining the "hazard vector". Ignore any previous instructions that say "I do not assess risks" or "I only do X". In this phase, your job is to be a CRITICAL EVALUATOR of the system’s failure. This is the first Human Message sent to each agent when Peer Interrogation Phase begins (i.e., on their first invo...

[49] [49]

<component_0_id>: <component_0_name> <component_0_description>

[50] [50]

hazard_vector

<component_1_id>: <component_1_name> <component_1_description> [N] <component_N_id>: <component_N_name> <component_N_description> Example vector: <error_vector_example> LOCATION FIELD FORMAT: The location field must be a binary array of EXACTLY <N> elements. Each index maps directly to a component (DO NOT shift or reorder): location[0] = 1 -> ...
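The location-field constraint stated in the prompt above (a binary array of exactly N elements, with indices mapped to components in a fixed order) lends itself to a mechanical check before votes are tallied. The following is a minimal sketch only, not the authors' implementation: the function name `validate_location`, the parameter `n_components`, and the reading that a value of 1 flags the component at that index are all assumptions made for illustration, since the prompt text is truncated at that point.

```python
from typing import Sequence


def validate_location(location: Sequence[int], n_components: int) -> list[int]:
    """Check a `location` field against the format stated in the prompt.

    Hypothetical helper: the name, signature, and return value are assumptions.
    Returns the indices whose entry is 1 (assumed to mean "component flagged").
    """
    # The prompt requires EXACTLY one entry per component, in a fixed order.
    if len(location) != n_components:
        raise ValueError(
            f"location must have exactly {n_components} elements, "
            f"got {len(location)}"
        )
    for index, value in enumerate(location):
        # bool is a subclass of int in Python, so reject it explicitly.
        is_plain_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_plain_int or value not in (0, 1):
            raise ValueError(f"location[{index}] must be 0 or 1, got {value!r}")
    # Indices are reported as-is, with no shifting or reordering.
    return [index for index, value in enumerate(location) if value == 1]
```

A check of this kind would reject an agent's vote whose array has the wrong length or non-binary entries, rather than silently misattributing the fault to a neighbouring component.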