METER benchmark reveals LLMs decline sharply in causal reasoning proficiency from association to intervention to counterfactual levels due to distraction by irrelevant facts and loss of faithfulness to provided context.
For Causal Discovery, the templates consist of two categories: those inquiring about causes and those inquiring about effects.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models