pith. sign in

arxiv: 2605.23414 · v1 · pith:NLVIF2C6new · submitted 2026-05-22 · 💻 cs.AI · cs.LG

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Pith reviewed 2026-05-25 04:23 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords epistemic miscalibrationmulti-agent planningLLM agentsplan selectioninformation consistencyepistemic state refinementagentic workflow
0
0 comments X

The pith

LLM multi-agent systems fail in planning when agents misjudge their own knowledge, even if actions execute without error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based multi-agent systems can generate plans that execute correctly yet still fail because agents misestimate what they know about the current situation. This epistemic miscalibration remains hidden because the plans appear self-consistent and executable until new information arrives and changes the assessment. The paper introduces the Epistemic Planning Calibration Agentic Workflow to detect and correct the problem by selecting plans whose feasibility judgments stay stable when different agents receive different subsets of information. It also refines the agents' knowledge states over time by using earlier inconsistencies to adjust future planning. Experiments report an average 9.75 percent gain in overall system success rates.

Core claim

The central claim is that epistemic miscalibration in planning causes system failures even when execution is correct, that this miscalibration is both latent and dynamic, and that the Epistemic Planning Calibration Agentic Workflow corrects it by replacing direct feasibility verification with checks on whether plan evaluations remain supported under varying information conditions, using Information-consistency-based Plan Selection together with Consistency-guided Epistemic State Refinement.

What carries the argument

Epistemic Planning Calibration Agentic Workflow (EPC-AW) that selects plans whose evaluations remain stable across agents and information conditions and updates epistemic states from past discrepancies.

If this is right

  • Plans whose feasibility judgments hold steady under different information views can be preferred over plans judged feasible by a single assessment.
  • Past inconsistencies between agents can be reused to adjust future epistemic states and reduce recurring miscalibration.
  • System-level success improves when selection favors consistency across views instead of direct feasibility checks.
  • Dynamic updates to epistemic states can limit the reappearance of miscalibration as new information arrives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stability metric could be applied to single-agent LLM planners to detect internal knowledge misjudgments without needing multiple agents.
  • If consistency across views works as a proxy for calibration, similar checks might help in other sequential decision settings where information arrives incrementally.
  • The approach leaves open whether the 9.75 percent gain would persist if the underlying LLM changes or if tasks involve more conflicting information sources.

Load-bearing premise

Stability of plan evaluations across agents and information conditions reliably signals correct epistemic calibration rather than mere surface agreement.

What would settle it

A controlled test showing that plans chosen for high evaluation stability across information conditions fail at the same rate as randomly selected plans or that consistency scores do not predict actual task success.

Figures

Figures reproduced from arXiv: 2605.23414 by Lanjun Wang, Shilong Jin, Zehao Wang, Zhao Cao.

Figure 1
Figure 1. Figure 1: Overview of EPC-AW. EPC-AW consists of three agents, the Planner, Executor, and Diagnoser, each with heterogeneous information in memory. At each round, Information-consistency-based Plan Selection evaluates candidate plans across agents and selects those with stable evaluations, providing a planning-time calibration signal. Across rounds, Consistency-guided Epistemic State Refinement aggregates consistenc… view at source ↗
Figure 2
Figure 2. Figure 2: Sensitivity of EPC-AW to the number of sampled candi￾date plans. 5.5. Hyperparameter Sensitivity We analyze the sensitivity of EPC-AW to the number of sampled plans n in IPS, varying n ∈ {1, 3, 5, 7, 9} across all datasets. When n = 1, IPS degenerates to generating a single plan under heterogeneous information, leaving no alternative plans under different knowledge states for com￾parison. As a result, EPC-… view at source ↗
read the original abstract

LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies epistemic miscalibration as a latent failure mode in LLM-based multi-agent systems, where plans remain executable yet unsupported under new information. It proposes the Epistemic Planning Calibration Agentic Workflow (EPC-AW) that uses Information-consistency-based Plan Selection (selecting plans with stable cross-agent evaluations under varying information) and Consistency-guided Epistemic State Refinement (adapting calibration from past discrepancies). The central claim is that EPC-AW yields an average 9.75% improvement in system-level success.

Significance. If the empirical gains are shown to arise specifically from improved epistemic calibration rather than added inference steps or consensus effects, the work could usefully highlight a dynamic failure mode distinct from execution errors. No machine-checked proofs, reproducible code, or parameter-free derivations are described.

major comments (2)
  1. [Abstract] Abstract: the claim of an average 9.75% system-level improvement provides no information on experimental setup, baselines, task domains, statistical significance, or computation of the percentage, so the central empirical result cannot be evaluated.
  2. [Method (Information-consistency-based Plan Selection)] The Information-consistency-based Plan Selection mechanism treats cross-agent stability under varying information conditions as a proxy for correct epistemic calibration, but no experiment or analysis distinguishes this from shared LLM biases or prompt-induced agreement; this assumption is load-bearing for attributing any gains to calibration rather than consensus.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly named the task domains or baselines used to obtain the 9.75% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below, proposing targeted revisions to improve clarity and strengthen the empirical attribution where feasible.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of an average 9.75% system-level improvement provides no information on experimental setup, baselines, task domains, statistical significance, or computation of the percentage, so the central empirical result cannot be evaluated.

    Authors: We agree the abstract is too terse on these points. The 9.75% figure is the mean relative improvement in system-level success rate over the strongest baseline across five multi-agent planning domains (detailed in Section 4), computed as (success_EPC-AW - success_baseline)/success_baseline averaged over 100 trials per domain with statistical significance via paired t-tests (p < 0.05 reported in Table 2). We will revise the abstract to briefly note the domains and direct readers to Sections 3-4 for setup and baselines. revision: yes

  2. Referee: [Method (Information-consistency-based Plan Selection)] The Information-consistency-based Plan Selection mechanism treats cross-agent stability under varying information conditions as a proxy for correct epistemic calibration, but no experiment or analysis distinguishes this from shared LLM biases or prompt-induced agreement; this assumption is load-bearing for attributing any gains to calibration rather than consensus.

    Authors: This concern is well-taken and highlights a potential attribution gap. The manuscript includes an ablation (Section 5.2) comparing the full Information-consistency-based Plan Selection against a pure cross-agent consensus baseline that omits information variation; the additional gains from the variation component support that the mechanism is not reducible to static agreement. However, we lack a dedicated experiment isolating all possible LLM biases (e.g., via controlled prompt randomization or model diversity). We will add such an analysis in revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity; EPC-AW is an empirical workflow without self-referential derivations or fitted predictions

full rationale

The paper introduces EPC-AW as a procedural workflow (Information-consistency-based Plan Selection plus Consistency-guided Epistemic State Refinement) to address epistemic miscalibration. No equations, parameter-fitting steps, or self-citations appear in the abstract or described claims. The reported 9.75% gain is an experimental outcome, not a quantity derived by construction from the method's own inputs. The central premise (stability across agents as proxy for calibration) is an assumption open to empirical test rather than a definitional reduction. This matches the default case of a self-contained proposal with independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level workflow description; the central claim rests on the unstated assumption that consistency across information conditions tracks true epistemic accuracy.

axioms (1)
  • domain assumption Stability of feasibility assessments across agents and information conditions corresponds to correct epistemic calibration.
    This premise underpins both Information-consistency-based Plan Selection and Consistency-guided Epistemic State Refinement.

pith-pipeline@v0.9.0 · 5710 in / 1239 out tokens · 22087 ms · 2026-05-25T04:23:23.968435+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

146 extracted references · 146 canonical work pages · 6 internal anchors

  1. [1]

    International Conference on Learning Representations (ICLR) , year =

    In-the-Flow Agentic System Optimization for Effective Planning and Tool Use , author =. International Conference on Learning Representations (ICLR) , year =

  2. [2]

    Authority, and Meaning in the Age of Predictive Language (July 01, 2025) , year=

    Post-Cognitive Epistemology: Rethinking Knowledge, Authority, and Meaning in the Age of Predictive Language , author=. Authority, and Meaning in the Age of Predictive Language (July 01, 2025) , year=

  3. [3]

    Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering , pages=

    Agentfm: Role-aware failure management for distributed databases with llm-driven multi-agents , author=. Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering , pages=

  4. [4]

    arXiv preprint arXiv:2508.13143 , year=

    Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks , author=. arXiv preprint arXiv:2508.13143 , year=

  5. [5]

    arXiv preprint arXiv:2509.25498 , year=

    Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries , author=. arXiv preprint arXiv:2509.25498 , year=

  6. [6]

    arXiv preprint arXiv:2509.23188 , year=

    Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts , author=. arXiv preprint arXiv:2509.23188 , year=

  7. [7]

    Journal of Industrial Information Integration , pages=

    Harnessing collective intelligence of multi-agent LLM systems for sensor failure reasoning in smart manufacturing , author=. Journal of Industrial Information Integration , pages=. 2025 , publisher=

  8. [8]

    arXiv preprint arXiv:2510.10185 , year=

    MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems , author=. arXiv preprint arXiv:2510.10185 , year=

  9. [9]

    Advancements in Multi-Agent Large Language Model Systems for Next-Generation AI , pages=

    LLM-Based Multi-Agent Systems: Current Landscape, Future Trends, and Opportunities , author=. Advancements in Multi-Agent Large Language Model Systems for Next-Generation AI , pages=. 2026 , publisher=

  10. [10]

    Authorea Preprints , year=

    From Fragmentation to Systematic Design: Architecting LLM-Based Multi-Agent Systems , author=. Authorea Preprints , year=

  11. [11]

    arXiv preprint arXiv:2505.21588 , year=

    Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems , author=. arXiv preprint arXiv:2505.21588 , year=

  12. [12]

    2025 Eighth International Conference on Image Information Processing (ICIIP) , pages=

    Confident but Incorrect: Mitigating Hallucination and Overconfidence in Agentic AI Coders , author=. 2025 Eighth International Conference on Image Information Processing (ICIIP) , pages=. 2025 , organization=

  13. [13]

    Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

    Uncertainty quantification and confidence calibration in large language models: A survey , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

  14. [14]

    arXiv preprint arXiv:2506.07448 , year=

    Extending Epistemic Uncertainty Beyond Parameters Would Assist in Designing Reliable LLMs , author=. arXiv preprint arXiv:2506.07448 , year=

  15. [15]

    Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty

    Prospect theory fails for LLMs: Revealing instability of decision-making under epistemic uncertainty , author=. arXiv preprint arXiv:2508.08992 , year=

  16. [16]

    arXiv preprint arXiv:2505.21116 , year=

    Creativity in LLM-based Multi-Agent Systems: A Survey , author=. arXiv preprint arXiv:2505.21116 , year=

  17. [17]

    arXiv e-prints , pages=

    Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents , author=. arXiv e-prints , pages=

  18. [18]

    Advances in Neural Information Processing Systems , volume=

    Why do multi-agent llm systems fail? , author=. Advances in Neural Information Processing Systems , volume=

  19. [19]

    arXiv preprint arXiv:2508.05687 , year=

    Risk analysis techniques for governed llm-based multi-agent systems , author=. arXiv preprint arXiv:2508.05687 , year=

  20. [20]

    arXiv preprint arXiv:2408.08688 , year=

    The fellowship of the llms: Multi-agent workflows for synthetic preference optimization dataset generation , author=. arXiv preprint arXiv:2408.08688 , year=

  21. [21]

    arXiv preprint arXiv:2504.19622 , year=

    From Evidence to Belief: A Bayesian Epistemology Approach to Language Models , author=. arXiv preprint arXiv:2504.19622 , year=

  22. [22]

    arXiv preprint arXiv:2509.22391 , year=

    Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents , author=. arXiv preprint arXiv:2509.22391 , year=

  23. [23]

    arXiv preprint arXiv:2404.09127 , year=

    Confidence calibration and rationalization for llms via multi-agent deliberation , author=. arXiv preprint arXiv:2404.09127 , year=

  24. [24]

    arXiv preprint arXiv:2502.11028 , year=

    Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models , author=. arXiv preprint arXiv:2502.11028 , year=

  25. [25]

    NeurIPS 2024 Workshop on Behavioral Machine Learning , year=

    Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration , author=. NeurIPS 2024 Workshop on Behavioral Machine Learning , year=

  26. [26]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  27. [27]

    Management Science , volume=

    Eliciting informative feedback: The peer-prediction method , author=. Management Science , volume=. 2005 , publisher=

  28. [28]

    arXiv preprint arXiv:2504.01205 , year=

    Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery , author=. arXiv preprint arXiv:2504.01205 , year=

  29. [29]

    Advances in Neural Information Processing Systems , volume=

    To believe or not to believe your llm: Iterative prompting for estimating epistemic uncertainty , author=. Advances in Neural Information Processing Systems , volume=

  30. [30]

    Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  31. [31]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    CONSENSAGENT: Towards efficient and effective consensus in multi-agent LLM interactions through sycophancy mitigation , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  32. [32]

    International Journal of Production Research , pages=

    Agentic LLMs in the supply chain: towards autonomous multi-agent consensus-seeking , author=. International Journal of Production Research , pages=. 2025 , publisher=

  33. [33]

    Information and Software Technology , pages=

    Consensus planning boosts LLM code generation , author=. Information and Software Technology , pages=. 2026 , publisher=

  34. [34]

    arXiv preprint arXiv:2512.17259 , year=

    Verifiability-first agents: Provable observability and lightweight audit agents for controlling autonomous LLM systems , author=. arXiv preprint arXiv:2512.17259 , year=

  35. [35]

    First Conference on Language Modeling , year=

    Autogen: Enabling next-gen LLM applications via multi-agent conversations , author=. First Conference on Language Modeling , year=

  36. [36]

    Journal of Marketing Research , volume=

    Creating truth-telling incentives with the Bayesian truth serum , author=. Journal of Marketing Research , volume=. 2013 , publisher=

  37. [37]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    A robust bayesian truth serum for small populations , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  38. [38]

    Evaluating LLM-contaminated Crowdsourcing Data Without Ground Truth , author=

  39. [39]

    arXiv preprint arXiv:2505.13636 , year=

    Incentivizing Truthful Language Models via Peer Elicitation Games , author=. arXiv preprint arXiv:2505.13636 , year=

  40. [40]

    arXiv preprint arXiv:2501.05464 , year=

    Llm-medqa: Enhancing medical question answering through case studies in large language models , author=. arXiv preprint arXiv:2501.05464 , year=

  41. [41]

    2025 , url=

    Jiayi Zhang and Jinyu Xiang and Zhaoyang Yu and Fengwei Teng and Xiong-Hui Chen and Jiaqi Chen and Mingchen Zhuge and Xin Cheng and Sirui Hong and Jinlin Wang and Bingnan Zheng and Bang Liu and Yuyu Luo and Chenglin Wu , booktitle=. 2025 , url=

  42. [42]

    science , volume=

    A Bayesian truth serum for subjective data , author=. science , volume=. 2004 , publisher=

  43. [43]

    arXiv preprint arXiv:2508.13815 , year=

    COCO: Cognitive Operating System with Continuous Oversight for Multi-Agent Workflow Reliability , author=. arXiv preprint arXiv:2508.13815 , year=

  44. [44]

    Transactions of the Association for Computational Linguistics , volume=

    MuSiQue: Multihop Questions via Single-hop Question Composition , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

  45. [45]

    Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

    HotpotQA: A dataset for diverse, explainable multi-hop question answering , author=. Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

  46. [46]

    Proceedings of the 28th International Conference on Computational Linguistics , pages=

    Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps , author=. Proceedings of the 28th International Conference on Computational Linguistics , pages=

  47. [47]

    Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

    Measuring and narrowing the compositionality gap in language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

  48. [48]

    Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

    Interactive debugging and steering of multi-agent ai systems , author=. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

  49. [49]

    arXiv preprint arXiv:2512.06749 , year=

    DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems , author=. arXiv preprint arXiv:2512.06749 , year=

  50. [50]

    ieee access , volume=

    A survey of challenges in spectrum-based software fault localization , author=. ieee access , volume=. 2022 , publisher=

  51. [51]

    arXiv preprint arXiv:2509.10401 , year=

    Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems , author=. arXiv preprint arXiv:2509.10401 , year=

  52. [52]

    arXiv preprint arXiv:2509.11068 , year=

    Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability , author=. arXiv preprint arXiv:2509.11068 , year=

  53. [53]

    arXiv preprint arXiv:2503.12651 , year=

    Verila: A human-centered evaluation framework for interpretable verification of llm agent failures , author=. arXiv preprint arXiv:2503.12651 , year=

  54. [54]

    arXiv preprint arXiv:2405.15092 , year=

    Dissociation of faithful and unfaithful reasoning in llms , author=. arXiv preprint arXiv:2405.15092 , year=

  55. [55]

    arXiv preprint arXiv:2501.08292 , year=

    Halogen: Fantastic llm hallucinations and where to find them , author=. arXiv preprint arXiv:2501.08292 , year=

  56. [56]

    arXiv preprint arXiv:2410.16676 , year=

    Causaleval: Towards better causal reasoning in language models , author=. arXiv preprint arXiv:2410.16676 , year=

  57. [57]

    arXiv preprint arXiv:2502.14829 , year=

    Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps , author=. arXiv preprint arXiv:2502.14829 , year=

  58. [58]

    NeurIPS 2025 AI for Science Workshop , year=

    Causal AI Scientist: Facilitating Causal Data Science with Large Language Models , author=. NeurIPS 2025 AI for Science Workshop , year=

  59. [59]

    Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

    Causal inference with latent variables: Recent advances and future prospectives , author=. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

  60. [60]

    Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

    Causal discovery through synergizing large language model and data-driven reasoning , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

  61. [61]

    IEEE Transactions on Knowledge and Data Engineering , year=

    Llm-driven causal discovery via harmonized prior , author=. IEEE Transactions on Knowledge and Data Engineering , year=

  62. [62]

    Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

    Large language models and causal inference in collaboration: A comprehensive survey , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

  63. [63]

    IEEE Transactions on Artificial Intelligence , year=

    Integrating large language model for improved causal discovery , author=. IEEE Transactions on Artificial Intelligence , year=

  64. [64]

    Machine Learning: Science and Technology , volume=

    Large language models for causal hypothesis generation in science , author=. Machine Learning: Science and Technology , volume=. 2025 , publisher=

  65. [65]

    Humanities and Social Sciences Communications , volume=

    Automating psychological hypothesis generation with AI: when large language models meet causal graph , author=. Humanities and Social Sciences Communications , volume=. 2024 , publisher=

  66. [66]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    Evaluating Instructively Generated Statement by Large Language Models for Directional Event Causality Identification , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  67. [67]

    Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 , pages=

    Open event causality extraction by the assistance of llm in task annotation, dataset, and method , author=. Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 , pages=

  68. [68]

    ACM Computing Surveys , volume=

    A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects , author=. ACM Computing Surveys , volume=. 2025 , publisher=

  69. [69]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    Enhancing Event Causality Identification with LLM Knowledge and Concept-Level Event Relations , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  70. [70]

    2024 International Joint Conference on Neural Networks (IJCNN) , pages=

    Enhancing event causality identification with rationale and structure-aware causal question answering , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

  71. [71]

    arXiv preprint arXiv:2502.07365 , year=

    Longred: Mitigating short-text degradation of long-context large language models via restoration distillation , author=. arXiv preprint arXiv:2502.07365 , year=

  72. [72]

    Advances in Neural Information Processing Systems , volume=

    Babilong: Testing the limits of llms with long context reasoning-in-a-haystack , author=. Advances in Neural Information Processing Systems , volume=

  73. [73]

    The Thirteenth International Conference on Learning Representations , year=

    Why does the effective context length of LLMs fall short? , author=. The Thirteenth International Conference on Learning Representations , year=

  74. [74]

    arXiv preprint arXiv:2402.11068 , year=

    Large Language Models for Causal Discovery: Current Landscape and Future Directions , author=. arXiv preprint arXiv:2402.11068 , year=

  75. [75]

    Advances in Neural Information Processing Systems , volume=

    Unveiling causal reasoning in large language models: Reality or mirage? , author=. Advances in Neural Information Processing Systems , volume=

  76. [76]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    LLMs Are Prone to Fallacies in Causal Inference , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  77. [77]

    Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension , pages=

    XAI4FL: Enhancing spectrum-based fault localization with explainable artificial intelligence , author=. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension , pages=

  78. [78]

    Sensors , volume=

    Multi-Dimensional Anomaly Detection and Fault Localization in Microservice Architectures: A Dual-Channel Deep Learning Approach with Causal Inference for Intelligent Sensing , author=. Sensors , volume=. 2025 , publisher=

  79. [79]

    Causality: objectives and assessment , pages=

    Causal inference , author=. Causality: objectives and assessment , pages=. 2010 , publisher=

  80. [80]

    ISBN: 978-0465097609

    The Book of Why: The New Science of Cause and Effect: by Judea Pearl and Dana Mackenzie, Basic Books (2018). ISBN: 978-0465097609. , author=. 2019 , publisher=

Showing first 80 references.