CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs

Andy Ouyang; Ankit Rai; Daphne Barretto; Edward Y. Chang; Gia Ancone; Longling Geng; Matthew John Hayes; Matthew Wolfman; Patrick Flanagan; Rachael Cooper

arxiv: 2602.08939 · v2 · pith:F5VCCCJFnew · submitted 2026-02-09 · 💻 cs.AI

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang

keywords: causal, failure, modes, pressure, refusal, accuracy, across, aggregate

Large language models increasingly produce fluent causal explanations, yet they often fail in ways aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when evidence is underdetermined. We introduce CTK, a diagnostic benchmark of 5,147 cases and growing, across 10 domains and all three levels of Pearl's Ladder of Causation. Unlike benchmarks that only score correctness, CTK reveals why a model failed by annotating causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information needed before endorsing a claim. CTK exposes failure modes hidden by aggregate accuracy: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, it provides the diagnostic substrate for studying causal-reasoning failure profiles.
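
The abstract names Bad Flip Rate as the measure of sycophantic drift across paired neutral/pressure variants but does not give its formula. The sketch below shows one plausible reading, offered purely as an assumption: among cases the model answers correctly in the neutral variant, the fraction it answers incorrectly in the pressure variant. The `PairedResult` type, its field names, and the choice of denominator are hypothetical and may differ from the paper's actual definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PairedResult:
    """Hypothetical record for one case scored under both prompt variants."""
    case_id: str
    neutral_correct: bool   # model was correct with the neutral prompt
    pressure_correct: bool  # model was correct with the pressure prompt


def bad_flip_rate(results: Iterable[PairedResult]) -> float:
    """Assumed definition: share of neutral-correct cases that turn
    incorrect under pressure. Returns 0.0 when no case is eligible."""
    eligible = [r for r in results if r.neutral_correct]
    if not eligible:
        return 0.0
    flips = sum(1 for r in eligible if not r.pressure_correct)
    return flips / len(eligible)


# Toy data, invented for illustration only (not drawn from the benchmark).
toy_results = [
    PairedResult("case-1", neutral_correct=True, pressure_correct=True),
    PairedResult("case-2", neutral_correct=True, pressure_correct=False),
    PairedResult("case-3", neutral_correct=False, pressure_correct=False),
    PairedResult("case-4", neutral_correct=True, pressure_correct=True),
    PairedResult("case-5", neutral_correct=True, pressure_correct=False),
]

toy_rate = bad_flip_rate(toy_results)  # 2 flips out of 4 eligible cases
```

Conditioning on neutral-correct cases keeps the rate from being diluted by cases the model already gets wrong without pressure; dividing by all pairs instead would be an equally defensible convention, so the paper itself should be consulted for the exact definition.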

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
cs.LG 2026-05 conditional novelty 6.0

SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.

The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory
cs.LG 2026-05 unverdicted novelty 6.0

Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.