pith. sign in

arxiv: 2602.08939 · v2 · pith:F5VCCCJFnew · submitted 2026-02-09 · 💻 cs.AI

CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs

classification 💻 cs.AI
keywords causalfailuremodespressurerefusalaccuracyacrossaggregate
0
0 comments X
read the original abstract

Large language models increasingly produce fluent causal explanations, yet they often fail in ways aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when evidence is underdetermined. We introduce CTK, a diagnostic benchmark of 5,147 cases and growing, across 10 domains and all three levels of Pearl's Ladder of Causation. Unlike benchmarks that only score correctness, CTK reveals why a model failed by annotating causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information needed before endorsing a claim. CTK exposes failure modes hidden by aggregate accuracy: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, it provides the diagnostic substrate for studying causal-reasoning failure profiles.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

    cs.LG 2026-05 conditional novelty 6.0

    SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.

  2. The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.