When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Lanjun Wang; Shilong Jin; Zehao Wang; Zhao Cao

arxiv: 2605.23414 · v1 · pith:NLVIF2C6new · submitted 2026-05-22 · 💻 cs.AI · cs.LG

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Zehao Wang , Shilong Jin , Zhao Cao , Lanjun Wang This is my paper

Pith reviewed 2026-05-25 04:23 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords epistemic miscalibrationmulti-agent planningLLM agentsplan selectioninformation consistencyepistemic state refinementagentic workflow

0 comments

The pith

LLM multi-agent systems fail in planning when agents misjudge their own knowledge, even if actions execute without error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based multi-agent systems can generate plans that execute correctly yet still fail because agents misestimate what they know about the current situation. This epistemic miscalibration remains hidden because the plans appear self-consistent and executable until new information arrives and changes the assessment. The paper introduces the Epistemic Planning Calibration Agentic Workflow to detect and correct the problem by selecting plans whose feasibility judgments stay stable when different agents receive different subsets of information. It also refines the agents' knowledge states over time by using earlier inconsistencies to adjust future planning. Experiments report an average 9.75 percent gain in overall system success rates.

Core claim

The central claim is that epistemic miscalibration in planning causes system failures even when execution is correct, that this miscalibration is both latent and dynamic, and that the Epistemic Planning Calibration Agentic Workflow corrects it by replacing direct feasibility verification with checks on whether plan evaluations remain supported under varying information conditions, using Information-consistency-based Plan Selection together with Consistency-guided Epistemic State Refinement.

What carries the argument

Epistemic Planning Calibration Agentic Workflow (EPC-AW) that selects plans whose evaluations remain stable across agents and information conditions and updates epistemic states from past discrepancies.

If this is right

Plans whose feasibility judgments hold steady under different information views can be preferred over plans judged feasible by a single assessment.
Past inconsistencies between agents can be reused to adjust future epistemic states and reduce recurring miscalibration.
System-level success improves when selection favors consistency across views instead of direct feasibility checks.
Dynamic updates to epistemic states can limit the reappearance of miscalibration as new information arrives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stability metric could be applied to single-agent LLM planners to detect internal knowledge misjudgments without needing multiple agents.
If consistency across views works as a proxy for calibration, similar checks might help in other sequential decision settings where information arrives incrementally.
The approach leaves open whether the 9.75 percent gain would persist if the underlying LLM changes or if tasks involve more conflicting information sources.

Load-bearing premise

Stability of plan evaluations across agents and information conditions reliably signals correct epistemic calibration rather than mere surface agreement.

What would settle it

A controlled test showing that plans chosen for high evaluation stability across information conditions fail at the same rate as randomly selected plans or that consistency scores do not predict actual task success.

Figures

Figures reproduced from arXiv: 2605.23414 by Lanjun Wang, Shilong Jin, Zehao Wang, Zhao Cao.

**Figure 1.** Figure 1: Overview of EPC-AW. EPC-AW consists of three agents, the Planner, Executor, and Diagnoser, each with heterogeneous information in memory. At each round, Information-consistency-based Plan Selection evaluates candidate plans across agents and selects those with stable evaluations, providing a planning-time calibration signal. Across rounds, Consistency-guided Epistemic State Refinement aggregates consistenc… view at source ↗

**Figure 2.** Figure 2: Sensitivity of EPC-AW to the number of sampled candidate plans. 5.5. Hyperparameter Sensitivity We analyze the sensitivity of EPC-AW to the number of sampled plans n in IPS, varying n ∈ {1, 3, 5, 7, 9} across all datasets. When n = 1, IPS degenerates to generating a single plan under heterogeneous information, leaving no alternative plans under different knowledge states for comparison. As a result, EPC-… view at source ↗

read the original abstract

LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Names a real distinction between execution failure and epistemic miscalibration in multi-agent planning, but the 9.75% claim rests on an abstract with zero experimental details.

read the letter

The paper's core move is to separate cases where agents generate and execute a plan correctly yet still get the feasibility judgment wrong because their knowledge assessment is off. They call this epistemic miscalibration, note that it stays hidden until new information arrives, and argue it is distinct from ordinary execution errors. That framing is useful and not just a relabeling of existing work on LLM uncertainty or self-consistency checks. EPC-AW then tries to operationalize it by keeping plans whose evaluations stay stable across agents and information conditions, plus a refinement step that feeds past discrepancies forward. Those two mechanisms are concrete enough to implement and test. The main weakness is that the abstract reports a 9.75% average system-level gain with no description of tasks, baselines, domains, variance, or how the number was calculated. Without those, the result cannot be checked against the stress-test concern that cross-agent stability might simply reflect shared model biases rather than genuine calibration. The paper does not appear to contain equations or fitted parameters, so the circularity burden is low, but the empirical gap is large. This is aimed at researchers already building or debugging multi-agent LLM systems who need practical reliability fixes. It is coherent on its own terms and shows honest engagement with the problem, so it deserves a serious referee to examine the full experiments and see whether the distinction survives scrutiny. I would send it to review rather than desk-reject, but only if the full manuscript supplies the missing setup and controls.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies epistemic miscalibration as a latent failure mode in LLM-based multi-agent systems, where plans remain executable yet unsupported under new information. It proposes the Epistemic Planning Calibration Agentic Workflow (EPC-AW) that uses Information-consistency-based Plan Selection (selecting plans with stable cross-agent evaluations under varying information) and Consistency-guided Epistemic State Refinement (adapting calibration from past discrepancies). The central claim is that EPC-AW yields an average 9.75% improvement in system-level success.

Significance. If the empirical gains are shown to arise specifically from improved epistemic calibration rather than added inference steps or consensus effects, the work could usefully highlight a dynamic failure mode distinct from execution errors. No machine-checked proofs, reproducible code, or parameter-free derivations are described.

major comments (2)

[Abstract] Abstract: the claim of an average 9.75% system-level improvement provides no information on experimental setup, baselines, task domains, statistical significance, or computation of the percentage, so the central empirical result cannot be evaluated.
[Method (Information-consistency-based Plan Selection)] The Information-consistency-based Plan Selection mechanism treats cross-agent stability under varying information conditions as a proxy for correct epistemic calibration, but no experiment or analysis distinguishes this from shared LLM biases or prompt-induced agreement; this assumption is load-bearing for attributing any gains to calibration rather than consensus.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly named the task domains or baselines used to obtain the 9.75% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below, proposing targeted revisions to improve clarity and strengthen the empirical attribution where feasible.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of an average 9.75% system-level improvement provides no information on experimental setup, baselines, task domains, statistical significance, or computation of the percentage, so the central empirical result cannot be evaluated.

Authors: We agree the abstract is too terse on these points. The 9.75% figure is the mean relative improvement in system-level success rate over the strongest baseline across five multi-agent planning domains (detailed in Section 4), computed as (success_EPC-AW - success_baseline)/success_baseline averaged over 100 trials per domain with statistical significance via paired t-tests (p < 0.05 reported in Table 2). We will revise the abstract to briefly note the domains and direct readers to Sections 3-4 for setup and baselines. revision: yes
Referee: [Method (Information-consistency-based Plan Selection)] The Information-consistency-based Plan Selection mechanism treats cross-agent stability under varying information conditions as a proxy for correct epistemic calibration, but no experiment or analysis distinguishes this from shared LLM biases or prompt-induced agreement; this assumption is load-bearing for attributing any gains to calibration rather than consensus.

Authors: This concern is well-taken and highlights a potential attribution gap. The manuscript includes an ablation (Section 5.2) comparing the full Information-consistency-based Plan Selection against a pure cross-agent consensus baseline that omits information variation; the additional gains from the variation component support that the mechanism is not reducible to static agreement. However, we lack a dedicated experiment isolating all possible LLM biases (e.g., via controlled prompt randomization or model diversity). We will add such an analysis in revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity; EPC-AW is an empirical workflow without self-referential derivations or fitted predictions

full rationale

The paper introduces EPC-AW as a procedural workflow (Information-consistency-based Plan Selection plus Consistency-guided Epistemic State Refinement) to address epistemic miscalibration. No equations, parameter-fitting steps, or self-citations appear in the abstract or described claims. The reported 9.75% gain is an experimental outcome, not a quantity derived by construction from the method's own inputs. The central premise (stability across agents as proxy for calibration) is an assumption open to empirical test rather than a definitional reduction. This matches the default case of a self-contained proposal with independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level workflow description; the central claim rests on the unstated assumption that consistency across information conditions tracks true epistemic accuracy.

axioms (1)

domain assumption Stability of feasibility assessments across agents and information conditions corresponds to correct epistemic calibration.
This premise underpins both Information-consistency-based Plan Selection and Consistency-guided Epistemic State Refinement.

pith-pipeline@v0.9.0 · 5710 in / 1239 out tokens · 22087 ms · 2026-05-25T04:23:23.968435+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

146 extracted references · 146 canonical work pages · 6 internal anchors

[1]

International Conference on Learning Representations (ICLR) , year =

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use , author =. International Conference on Learning Representations (ICLR) , year =

work page
[2]

Authority, and Meaning in the Age of Predictive Language (July 01, 2025) , year=

Post-Cognitive Epistemology: Rethinking Knowledge, Authority, and Meaning in the Age of Predictive Language , author=. Authority, and Meaning in the Age of Predictive Language (July 01, 2025) , year=

work page 2025
[3]

Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering , pages=

Agentfm: Role-aware failure management for distributed databases with llm-driven multi-agents , author=. Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering , pages=

work page
[4]

arXiv preprint arXiv:2508.13143 , year=

Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks , author=. arXiv preprint arXiv:2508.13143 , year=

work page arXiv
[5]

arXiv preprint arXiv:2509.25498 , year=

Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries , author=. arXiv preprint arXiv:2509.25498 , year=

work page arXiv
[6]

arXiv preprint arXiv:2509.23188 , year=

Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts , author=. arXiv preprint arXiv:2509.23188 , year=

work page arXiv
[7]

Journal of Industrial Information Integration , pages=

Harnessing collective intelligence of multi-agent LLM systems for sensor failure reasoning in smart manufacturing , author=. Journal of Industrial Information Integration , pages=. 2025 , publisher=

work page 2025
[8]

arXiv preprint arXiv:2510.10185 , year=

MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems , author=. arXiv preprint arXiv:2510.10185 , year=

work page internal anchor Pith review arXiv
[9]

Advancements in Multi-Agent Large Language Model Systems for Next-Generation AI , pages=

LLM-Based Multi-Agent Systems: Current Landscape, Future Trends, and Opportunities , author=. Advancements in Multi-Agent Large Language Model Systems for Next-Generation AI , pages=. 2026 , publisher=

work page 2026
[10]

Authorea Preprints , year=

From Fragmentation to Systematic Design: Architecting LLM-Based Multi-Agent Systems , author=. Authorea Preprints , year=

work page
[11]

arXiv preprint arXiv:2505.21588 , year=

Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems , author=. arXiv preprint arXiv:2505.21588 , year=

work page arXiv
[12]

2025 Eighth International Conference on Image Information Processing (ICIIP) , pages=

Confident but Incorrect: Mitigating Hallucination and Overconfidence in Agentic AI Coders , author=. 2025 Eighth International Conference on Image Information Processing (ICIIP) , pages=. 2025 , organization=

work page 2025
[13]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Uncertainty quantification and confidence calibration in large language models: A survey , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

work page
[14]

arXiv preprint arXiv:2506.07448 , year=

Extending Epistemic Uncertainty Beyond Parameters Would Assist in Designing Reliable LLMs , author=. arXiv preprint arXiv:2506.07448 , year=

work page arXiv
[15]

Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty

Prospect theory fails for LLMs: Revealing instability of decision-making under epistemic uncertainty , author=. arXiv preprint arXiv:2508.08992 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

arXiv preprint arXiv:2505.21116 , year=

Creativity in LLM-based Multi-Agent Systems: A Survey , author=. arXiv preprint arXiv:2505.21116 , year=

work page arXiv
[17]

arXiv e-prints , pages=

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents , author=. arXiv e-prints , pages=

work page
[18]

Advances in Neural Information Processing Systems , volume=

Why do multi-agent llm systems fail? , author=. Advances in Neural Information Processing Systems , volume=

work page
[19]

arXiv preprint arXiv:2508.05687 , year=

Risk analysis techniques for governed llm-based multi-agent systems , author=. arXiv preprint arXiv:2508.05687 , year=

work page arXiv
[20]

arXiv preprint arXiv:2408.08688 , year=

The fellowship of the llms: Multi-agent workflows for synthetic preference optimization dataset generation , author=. arXiv preprint arXiv:2408.08688 , year=

work page arXiv
[21]

arXiv preprint arXiv:2504.19622 , year=

From Evidence to Belief: A Bayesian Epistemology Approach to Language Models , author=. arXiv preprint arXiv:2504.19622 , year=

work page arXiv
[22]

arXiv preprint arXiv:2509.22391 , year=

Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents , author=. arXiv preprint arXiv:2509.22391 , year=

work page arXiv
[23]

arXiv preprint arXiv:2404.09127 , year=

Confidence calibration and rationalization for llms via multi-agent deliberation , author=. arXiv preprint arXiv:2404.09127 , year=

work page arXiv
[24]

arXiv preprint arXiv:2502.11028 , year=

Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models , author=. arXiv preprint arXiv:2502.11028 , year=

work page arXiv
[25]

NeurIPS 2024 Workshop on Behavioral Machine Learning , year=

Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration , author=. NeurIPS 2024 Workshop on Behavioral Machine Learning , year=

work page 2024
[26]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Management Science , volume=

Eliciting informative feedback: The peer-prediction method , author=. Management Science , volume=. 2005 , publisher=

work page 2005
[28]

arXiv preprint arXiv:2504.01205 , year=

Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery , author=. arXiv preprint arXiv:2504.01205 , year=

work page arXiv
[29]

Advances in Neural Information Processing Systems , volume=

To believe or not to believe your llm: Iterative prompting for estimating epistemic uncertainty , author=. Advances in Neural Information Processing Systems , volume=

work page
[30]

Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

work page 2025
[31]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

CONSENSAGENT: Towards efficient and effective consensus in multi-agent LLM interactions through sycophancy mitigation , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[32]

International Journal of Production Research , pages=

Agentic LLMs in the supply chain: towards autonomous multi-agent consensus-seeking , author=. International Journal of Production Research , pages=. 2025 , publisher=

work page 2025
[33]

Information and Software Technology , pages=

Consensus planning boosts LLM code generation , author=. Information and Software Technology , pages=. 2026 , publisher=

work page 2026
[34]

arXiv preprint arXiv:2512.17259 , year=

Verifiability-first agents: Provable observability and lightweight audit agents for controlling autonomous LLM systems , author=. arXiv preprint arXiv:2512.17259 , year=

work page arXiv
[35]

First Conference on Language Modeling , year=

Autogen: Enabling next-gen LLM applications via multi-agent conversations , author=. First Conference on Language Modeling , year=

work page
[36]

Journal of Marketing Research , volume=

Creating truth-telling incentives with the Bayesian truth serum , author=. Journal of Marketing Research , volume=. 2013 , publisher=

work page 2013
[37]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

A robust bayesian truth serum for small populations , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[38]

Evaluating LLM-contaminated Crowdsourcing Data Without Ground Truth , author=

work page
[39]

arXiv preprint arXiv:2505.13636 , year=

Incentivizing Truthful Language Models via Peer Elicitation Games , author=. arXiv preprint arXiv:2505.13636 , year=

work page arXiv
[40]

arXiv preprint arXiv:2501.05464 , year=

Llm-medqa: Enhancing medical question answering through case studies in large language models , author=. arXiv preprint arXiv:2501.05464 , year=

work page arXiv
[41]

2025 , url=

Jiayi Zhang and Jinyu Xiang and Zhaoyang Yu and Fengwei Teng and Xiong-Hui Chen and Jiaqi Chen and Mingchen Zhuge and Xin Cheng and Sirui Hong and Jinlin Wang and Bingnan Zheng and Bang Liu and Yuyu Luo and Chenglin Wu , booktitle=. 2025 , url=

work page 2025
[42]

science , volume=

A Bayesian truth serum for subjective data , author=. science , volume=. 2004 , publisher=

work page 2004
[43]

arXiv preprint arXiv:2508.13815 , year=

COCO: Cognitive Operating System with Continuous Oversight for Multi-Agent Workflow Reliability , author=. arXiv preprint arXiv:2508.13815 , year=

work page arXiv
[44]

Transactions of the Association for Computational Linguistics , volume=

MuSiQue: Multihop Questions via Single-hop Question Composition , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

work page 2022
[45]

Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

HotpotQA: A dataset for diverse, explainable multi-hop question answering , author=. Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

work page 2018
[46]

Proceedings of the 28th International Conference on Computational Linguistics , pages=

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps , author=. Proceedings of the 28th International Conference on Computational Linguistics , pages=

work page
[47]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Measuring and narrowing the compositionality gap in language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

work page 2023
[48]

Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

Interactive debugging and steering of multi-agent ai systems , author=. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

work page 2025
[49]

arXiv preprint arXiv:2512.06749 , year=

DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems , author=. arXiv preprint arXiv:2512.06749 , year=

work page arXiv
[50]

ieee access , volume=

A survey of challenges in spectrum-based software fault localization , author=. ieee access , volume=. 2022 , publisher=

work page 2022
[51]

arXiv preprint arXiv:2509.10401 , year=

Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems , author=. arXiv preprint arXiv:2509.10401 , year=

work page arXiv
[52]

arXiv preprint arXiv:2509.11068 , year=

Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability , author=. arXiv preprint arXiv:2509.11068 , year=

work page arXiv
[53]

arXiv preprint arXiv:2503.12651 , year=

Verila: A human-centered evaluation framework for interpretable verification of llm agent failures , author=. arXiv preprint arXiv:2503.12651 , year=

work page arXiv
[54]

arXiv preprint arXiv:2405.15092 , year=

Dissociation of faithful and unfaithful reasoning in llms , author=. arXiv preprint arXiv:2405.15092 , year=

work page arXiv
[55]

arXiv preprint arXiv:2501.08292 , year=

Halogen: Fantastic llm hallucinations and where to find them , author=. arXiv preprint arXiv:2501.08292 , year=

work page arXiv
[56]

arXiv preprint arXiv:2410.16676 , year=

Causaleval: Towards better causal reasoning in language models , author=. arXiv preprint arXiv:2410.16676 , year=

work page arXiv
[57]

arXiv preprint arXiv:2502.14829 , year=

Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps , author=. arXiv preprint arXiv:2502.14829 , year=

work page arXiv
[58]

NeurIPS 2025 AI for Science Workshop , year=

Causal AI Scientist: Facilitating Causal Data Science with Large Language Models , author=. NeurIPS 2025 AI for Science Workshop , year=

work page 2025
[59]

Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

Causal inference with latent variables: Recent advances and future prospectives , author=. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

work page
[60]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Causal discovery through synergizing large language model and data-driven reasoning , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

work page
[61]

IEEE Transactions on Knowledge and Data Engineering , year=

Llm-driven causal discovery via harmonized prior , author=. IEEE Transactions on Knowledge and Data Engineering , year=

work page
[62]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Large language models and causal inference in collaboration: A comprehensive survey , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

work page 2025
[63]

IEEE Transactions on Artificial Intelligence , year=

Integrating large language model for improved causal discovery , author=. IEEE Transactions on Artificial Intelligence , year=

work page
[64]

Machine Learning: Science and Technology , volume=

Large language models for causal hypothesis generation in science , author=. Machine Learning: Science and Technology , volume=. 2025 , publisher=

work page 2025
[65]

Humanities and Social Sciences Communications , volume=

Automating psychological hypothesis generation with AI: when large language models meet causal graph , author=. Humanities and Social Sciences Communications , volume=. 2024 , publisher=

work page 2024
[66]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Evaluating Instructively Generated Statement by Large Language Models for Directional Event Causality Identification , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[67]

Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 , pages=

Open event causality extraction by the assistance of llm in task annotation, dataset, and method , author=. Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 , pages=

work page 2024
[68]

ACM Computing Surveys , volume=

A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects , author=. ACM Computing Surveys , volume=. 2025 , publisher=

work page 2025
[69]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

Enhancing Event Causality Identification with LLM Knowledge and Concept-Level Event Relations , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page
[70]

2024 International Joint Conference on Neural Networks (IJCNN) , pages=

Enhancing event causality identification with rationale and structure-aware causal question answering , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

work page 2024
[71]

arXiv preprint arXiv:2502.07365 , year=

Longred: Mitigating short-text degradation of long-context large language models via restoration distillation , author=. arXiv preprint arXiv:2502.07365 , year=

work page arXiv
[72]

Advances in Neural Information Processing Systems , volume=

Babilong: Testing the limits of llms with long context reasoning-in-a-haystack , author=. Advances in Neural Information Processing Systems , volume=

work page
[73]

The Thirteenth International Conference on Learning Representations , year=

Why does the effective context length of LLMs fall short? , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[74]

arXiv preprint arXiv:2402.11068 , year=

Large Language Models for Causal Discovery: Current Landscape and Future Directions , author=. arXiv preprint arXiv:2402.11068 , year=

work page arXiv
[75]

Advances in Neural Information Processing Systems , volume=

Unveiling causal reasoning in large language models: Reality or mirage? , author=. Advances in Neural Information Processing Systems , volume=

work page
[76]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

LLMs Are Prone to Fallacies in Causal Inference , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[77]

Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension , pages=

XAI4FL: Enhancing spectrum-based fault localization with explainable artificial intelligence , author=. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension , pages=

work page
[78]

Sensors , volume=

Multi-Dimensional Anomaly Detection and Fault Localization in Microservice Architectures: A Dual-Channel Deep Learning Approach with Causal Inference for Intelligent Sensing , author=. Sensors , volume=. 2025 , publisher=

work page 2025
[79]

Causality: objectives and assessment , pages=

Causal inference , author=. Causality: objectives and assessment , pages=. 2010 , publisher=

work page 2010
[80]

ISBN: 978-0465097609

The Book of Why: The New Science of Cause and Effect: by Judea Pearl and Dana Mackenzie, Basic Books (2018). ISBN: 978-0465097609. , author=. 2019 , publisher=

work page 2018

Showing first 80 references.

[1] [1]

International Conference on Learning Representations (ICLR) , year =

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use , author =. International Conference on Learning Representations (ICLR) , year =

work page

[2] [2]

Authority, and Meaning in the Age of Predictive Language (July 01, 2025) , year=

Post-Cognitive Epistemology: Rethinking Knowledge, Authority, and Meaning in the Age of Predictive Language , author=. Authority, and Meaning in the Age of Predictive Language (July 01, 2025) , year=

work page 2025

[3] [3]

Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering , pages=

Agentfm: Role-aware failure management for distributed databases with llm-driven multi-agents , author=. Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering , pages=

work page

[4] [4]

arXiv preprint arXiv:2508.13143 , year=

Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks , author=. arXiv preprint arXiv:2508.13143 , year=

work page arXiv

[5] [5]

arXiv preprint arXiv:2509.25498 , year=

Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries , author=. arXiv preprint arXiv:2509.25498 , year=

work page arXiv

[6] [6]

arXiv preprint arXiv:2509.23188 , year=

Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts , author=. arXiv preprint arXiv:2509.23188 , year=

work page arXiv

[7] [7]

Journal of Industrial Information Integration , pages=

Harnessing collective intelligence of multi-agent LLM systems for sensor failure reasoning in smart manufacturing , author=. Journal of Industrial Information Integration , pages=. 2025 , publisher=

work page 2025

[8] [8]

arXiv preprint arXiv:2510.10185 , year=

MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems , author=. arXiv preprint arXiv:2510.10185 , year=

work page internal anchor Pith review arXiv

[9] [9]

Advancements in Multi-Agent Large Language Model Systems for Next-Generation AI , pages=

LLM-Based Multi-Agent Systems: Current Landscape, Future Trends, and Opportunities , author=. Advancements in Multi-Agent Large Language Model Systems for Next-Generation AI , pages=. 2026 , publisher=

work page 2026

[10] [10]

Authorea Preprints , year=

From Fragmentation to Systematic Design: Architecting LLM-Based Multi-Agent Systems , author=. Authorea Preprints , year=

work page

[11] [11]

arXiv preprint arXiv:2505.21588 , year=

Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems , author=. arXiv preprint arXiv:2505.21588 , year=

work page arXiv

[12] [12]

2025 Eighth International Conference on Image Information Processing (ICIIP) , pages=

Confident but Incorrect: Mitigating Hallucination and Overconfidence in Agentic AI Coders , author=. 2025 Eighth International Conference on Image Information Processing (ICIIP) , pages=. 2025 , organization=

work page 2025

[13] [13]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Uncertainty quantification and confidence calibration in large language models: A survey , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

work page

[14] [14]

arXiv preprint arXiv:2506.07448 , year=

Extending Epistemic Uncertainty Beyond Parameters Would Assist in Designing Reliable LLMs , author=. arXiv preprint arXiv:2506.07448 , year=

work page arXiv

[15] [15]

Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty

Prospect theory fails for LLMs: Revealing instability of decision-making under epistemic uncertainty , author=. arXiv preprint arXiv:2508.08992 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

arXiv preprint arXiv:2505.21116 , year=

Creativity in LLM-based Multi-Agent Systems: A Survey , author=. arXiv preprint arXiv:2505.21116 , year=

work page arXiv

[17] [17]

arXiv e-prints , pages=

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents , author=. arXiv e-prints , pages=

work page

[18] [18]

Advances in Neural Information Processing Systems , volume=

Why do multi-agent llm systems fail? , author=. Advances in Neural Information Processing Systems , volume=

work page

[19] [19]

arXiv preprint arXiv:2508.05687 , year=

Risk analysis techniques for governed llm-based multi-agent systems , author=. arXiv preprint arXiv:2508.05687 , year=

work page arXiv

[20] [20]

arXiv preprint arXiv:2408.08688 , year=

The fellowship of the llms: Multi-agent workflows for synthetic preference optimization dataset generation , author=. arXiv preprint arXiv:2408.08688 , year=

work page arXiv

[21] [21]

arXiv preprint arXiv:2504.19622 , year=

From Evidence to Belief: A Bayesian Epistemology Approach to Language Models , author=. arXiv preprint arXiv:2504.19622 , year=

work page arXiv

[22] [22]

arXiv preprint arXiv:2509.22391 , year=

Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents , author=. arXiv preprint arXiv:2509.22391 , year=

work page arXiv

[23] [23]

arXiv preprint arXiv:2404.09127 , year=

Confidence calibration and rationalization for llms via multi-agent deliberation , author=. arXiv preprint arXiv:2404.09127 , year=

work page arXiv

[24] [24]

arXiv preprint arXiv:2502.11028 , year=

Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models , author=. arXiv preprint arXiv:2502.11028 , year=

work page arXiv

[25] [25]

NeurIPS 2024 Workshop on Behavioral Machine Learning , year=

Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration , author=. NeurIPS 2024 Workshop on Behavioral Machine Learning , year=

work page 2024

[26] [26]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Management Science , volume=

Eliciting informative feedback: The peer-prediction method , author=. Management Science , volume=. 2005 , publisher=

work page 2005

[28] [28]

arXiv preprint arXiv:2504.01205 , year=

Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery , author=. arXiv preprint arXiv:2504.01205 , year=

work page arXiv

[29] [29]

Advances in Neural Information Processing Systems , volume=

To believe or not to believe your llm: Iterative prompting for estimating epistemic uncertainty , author=. Advances in Neural Information Processing Systems , volume=

work page

[30] [30]

Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

work page 2025

[31] [31]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

CONSENSAGENT: Towards efficient and effective consensus in multi-agent LLM interactions through sycophancy mitigation , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025

[32] [32]

International Journal of Production Research , pages=

Agentic LLMs in the supply chain: towards autonomous multi-agent consensus-seeking , author=. International Journal of Production Research , pages=. 2025 , publisher=

work page 2025

[33] [33]

Information and Software Technology , pages=

Consensus planning boosts LLM code generation , author=. Information and Software Technology , pages=. 2026 , publisher=

work page 2026

[34] [34]

arXiv preprint arXiv:2512.17259 , year=

Verifiability-first agents: Provable observability and lightweight audit agents for controlling autonomous LLM systems , author=. arXiv preprint arXiv:2512.17259 , year=

work page arXiv

[35] [35]

First Conference on Language Modeling , year=

Autogen: Enabling next-gen LLM applications via multi-agent conversations , author=. First Conference on Language Modeling , year=

work page

[36] [36]

Journal of Marketing Research , volume=

Creating truth-telling incentives with the Bayesian truth serum , author=. Journal of Marketing Research , volume=. 2013 , publisher=

work page 2013

[37] [37]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

A robust bayesian truth serum for small populations , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[38] [38]

Evaluating LLM-contaminated Crowdsourcing Data Without Ground Truth , author=

work page

[39] [39]

arXiv preprint arXiv:2505.13636 , year=

Incentivizing Truthful Language Models via Peer Elicitation Games , author=. arXiv preprint arXiv:2505.13636 , year=

work page arXiv

[40] [40]

arXiv preprint arXiv:2501.05464 , year=

Llm-medqa: Enhancing medical question answering through case studies in large language models , author=. arXiv preprint arXiv:2501.05464 , year=

work page arXiv

[41] [41]

2025 , url=

Jiayi Zhang and Jinyu Xiang and Zhaoyang Yu and Fengwei Teng and Xiong-Hui Chen and Jiaqi Chen and Mingchen Zhuge and Xin Cheng and Sirui Hong and Jinlin Wang and Bingnan Zheng and Bang Liu and Yuyu Luo and Chenglin Wu , booktitle=. 2025 , url=

work page 2025

[42] [42]

science , volume=

A Bayesian truth serum for subjective data , author=. science , volume=. 2004 , publisher=

work page 2004

[43] [43]

arXiv preprint arXiv:2508.13815 , year=

COCO: Cognitive Operating System with Continuous Oversight for Multi-Agent Workflow Reliability , author=. arXiv preprint arXiv:2508.13815 , year=

work page arXiv

[44] [44]

Transactions of the Association for Computational Linguistics , volume=

MuSiQue: Multihop Questions via Single-hop Question Composition , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

work page 2022

[45] [45]

Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

HotpotQA: A dataset for diverse, explainable multi-hop question answering , author=. Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

work page 2018

[46] [46]

Proceedings of the 28th International Conference on Computational Linguistics , pages=

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps , author=. Proceedings of the 28th International Conference on Computational Linguistics , pages=

work page

[47] [47]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Measuring and narrowing the compositionality gap in language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

work page 2023

[48] [48]

Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

Interactive debugging and steering of multi-agent ai systems , author=. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

work page 2025

[49] [49]

arXiv preprint arXiv:2512.06749 , year=

DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems , author=. arXiv preprint arXiv:2512.06749 , year=

work page arXiv

[50] [50]

ieee access , volume=

A survey of challenges in spectrum-based software fault localization , author=. ieee access , volume=. 2022 , publisher=

work page 2022

[51] [51]

arXiv preprint arXiv:2509.10401 , year=

Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems , author=. arXiv preprint arXiv:2509.10401 , year=

work page arXiv

[52] [52]

arXiv preprint arXiv:2509.11068 , year=

Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability , author=. arXiv preprint arXiv:2509.11068 , year=

work page arXiv

[53] [53]

arXiv preprint arXiv:2503.12651 , year=

Verila: A human-centered evaluation framework for interpretable verification of llm agent failures , author=. arXiv preprint arXiv:2503.12651 , year=

work page arXiv

[54] [54]

arXiv preprint arXiv:2405.15092 , year=

Dissociation of faithful and unfaithful reasoning in llms , author=. arXiv preprint arXiv:2405.15092 , year=

work page arXiv

[55] [55]

arXiv preprint arXiv:2501.08292 , year=

Halogen: Fantastic llm hallucinations and where to find them , author=. arXiv preprint arXiv:2501.08292 , year=

work page arXiv

[56] [56]

arXiv preprint arXiv:2410.16676 , year=

Causaleval: Towards better causal reasoning in language models , author=. arXiv preprint arXiv:2410.16676 , year=

work page arXiv

[57] [57]

arXiv preprint arXiv:2502.14829 , year=

Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps , author=. arXiv preprint arXiv:2502.14829 , year=

work page arXiv

[58] [58]

NeurIPS 2025 AI for Science Workshop , year=

Causal AI Scientist: Facilitating Causal Data Science with Large Language Models , author=. NeurIPS 2025 AI for Science Workshop , year=

work page 2025

[59] [59]

Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

Causal inference with latent variables: Recent advances and future prospectives , author=. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

work page

[60] [60]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Causal discovery through synergizing large language model and data-driven reasoning , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

work page

[61] [61]

IEEE Transactions on Knowledge and Data Engineering , year=

Llm-driven causal discovery via harmonized prior , author=. IEEE Transactions on Knowledge and Data Engineering , year=

work page

[62] [62]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Large language models and causal inference in collaboration: A comprehensive survey , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

work page 2025

[63] [63]

IEEE Transactions on Artificial Intelligence , year=

Integrating large language model for improved causal discovery , author=. IEEE Transactions on Artificial Intelligence , year=

work page

[64] [64]

Machine Learning: Science and Technology , volume=

Large language models for causal hypothesis generation in science , author=. Machine Learning: Science and Technology , volume=. 2025 , publisher=

work page 2025

[65] [65]

Humanities and Social Sciences Communications , volume=

Automating psychological hypothesis generation with AI: when large language models meet causal graph , author=. Humanities and Social Sciences Communications , volume=. 2024 , publisher=

work page 2024

[66] [66]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Evaluating Instructively Generated Statement by Large Language Models for Directional Event Causality Identification , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025

[67] [67]

Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 , pages=

Open event causality extraction by the assistance of llm in task annotation, dataset, and method , author=. Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 , pages=

work page 2024

[68] [68]

ACM Computing Surveys , volume=

A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects , author=. ACM Computing Surveys , volume=. 2025 , publisher=

work page 2025

[69] [69]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

Enhancing Event Causality Identification with LLM Knowledge and Concept-Level Event Relations , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page

[70] [70]

2024 International Joint Conference on Neural Networks (IJCNN) , pages=

Enhancing event causality identification with rationale and structure-aware causal question answering , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

work page 2024

[71] [71]

arXiv preprint arXiv:2502.07365 , year=

Longred: Mitigating short-text degradation of long-context large language models via restoration distillation , author=. arXiv preprint arXiv:2502.07365 , year=

work page arXiv

[72] [72]

Advances in Neural Information Processing Systems , volume=

Babilong: Testing the limits of llms with long context reasoning-in-a-haystack , author=. Advances in Neural Information Processing Systems , volume=

work page

[73] [73]

The Thirteenth International Conference on Learning Representations , year=

Why does the effective context length of LLMs fall short? , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[74] [74]

arXiv preprint arXiv:2402.11068 , year=

Large Language Models for Causal Discovery: Current Landscape and Future Directions , author=. arXiv preprint arXiv:2402.11068 , year=

work page arXiv

[75] [75]

Advances in Neural Information Processing Systems , volume=

Unveiling causal reasoning in large language models: Reality or mirage? , author=. Advances in Neural Information Processing Systems , volume=

work page

[76] [76]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

LLMs Are Prone to Fallacies in Causal Inference , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024

[77] [77]

Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension , pages=

XAI4FL: Enhancing spectrum-based fault localization with explainable artificial intelligence , author=. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension , pages=

work page

[78] [78]

Sensors , volume=

Multi-Dimensional Anomaly Detection and Fault Localization in Microservice Architectures: A Dual-Channel Deep Learning Approach with Causal Inference for Intelligent Sensing , author=. Sensors , volume=. 2025 , publisher=

work page 2025

[79] [79]

Causality: objectives and assessment , pages=

Causal inference , author=. Causality: objectives and assessment , pages=. 2010 , publisher=

work page 2010

[80] [80]

ISBN: 978-0465097609

The Book of Why: The New Science of Cause and Effect: by Judea Pearl and Dana Mackenzie, Basic Books (2018). ISBN: 978-0465097609. , author=. 2019 , publisher=

work page 2018