CounterBench: Evaluating and Improving Counterfactual Reasoning in Large Language Models
Pith reviewed 2026-05-23 03:07 UTC · model grok-4.3
The pith
Large language models perform near random guessing on formal counterfactual reasoning but improve with iterative guidance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Counterfactual reasoning using formal rules remains difficult for large language models, with performance often matching random guessing on the CounterBench dataset of one thousand questions featuring varied causal structures and nonsensical names. The CoIn paradigm, which prompts models to perform iterative reasoning with backtracking, leads to significant and consistent improvements across different large language models.
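To make "counterfactual reasoning using formal rules" concrete, the sketch below works one toy question with the standard abduction, action, prediction procedure on a deterministic structural causal model. The variable names (Blaf, Ziklo, Trune), the graph, and the equations are invented for illustration; the source does not show CounterBench's actual question format, so this is an assumption about the kind of inference being tested, not a sample from the dataset.

```python
from itertools import product

# Hypothetical toy model: Blaf -> Ziklo -> Trune, with nonsensical names.
# Every equation here is an illustrative assumption, not taken from CounterBench.

def solve(noise, do=None):
    """Evaluate the model in topological order; `do` overrides equations."""
    do = do or {}
    values = {}
    # Blaf is a root node set by its own noise term.
    values["Blaf"] = do.get("Blaf", noise["u_blaf"])
    # Ziklo copies Blaf.
    values["Ziklo"] = do.get("Ziklo", values["Blaf"])
    # Trune holds only if Ziklo holds and Trune's noise term allows it.
    values["Trune"] = do.get("Trune", values["Ziklo"] and noise["u_trune"])
    return values

def counterfactual(observed, do, query):
    """Return the set of values `query` can take had `do` been applied."""
    # Abduction: keep every noise setting consistent with what was observed.
    consistent = []
    for u_blaf, u_trune in product((False, True), repeat=2):
        noise = {"u_blaf": u_blaf, "u_trune": u_trune}
        factual = solve(noise)
        if all(factual[name] == value for name, value in observed.items()):
            consistent.append(noise)
    # Action and prediction: re-solve each consistent world under the intervention.
    return {solve(noise, do)[query] for noise in consistent}

# "Blaf and Trune both occurred. Would Trune have occurred had Ziklo not?"
answer = counterfactual({"Blaf": True, "Trune": True}, {"Ziklo": False}, "Trune")
print(answer)  # {False}
```

A returned set with two elements means the observation does not pin down the answer, which is one reason a benchmark of this kind needs fully specified causal structures.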
What carries the argument
The CoIn method, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions.
If this is right
- Counterfactual performance can be boosted in existing models without additional training.
- Results hold across multiple different large language models.
- The benchmark design with nonsensical names and varied graphs helps isolate formal inference.
- Models struggle more on complex graph structures and certain question types.
Where Pith is reading between the lines
- CoIn might extend to other structured reasoning tasks beyond counterfactuals.
- Combining CoIn with retrieval or external tools could further enhance causal reasoning.
- Testing on real-world scenarios could reveal whether the gains transfer outside the benchmark.
Load-bearing premise
The questions in CounterBench test only formal counterfactual inference and do not allow models to succeed through patterns learned during pretraining.
What would settle it
Running the models on a fresh set of counterfactual questions with novel causal graphs and names, and checking whether accuracy stays near random without CoIn and rises with it.
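One way to make "near random" testable is an exact two-sided binomial test of the observed accuracy against the chance rate. The sketch below assumes each question has two answer options, so chance is 0.5; the source does not state the answer format, and the counts in the usage lines are hypothetical placeholders rather than reported results.

```python
from math import exp, lgamma, log

def binomial_two_sided_p(correct, total, chance=0.5):
    """Exact two-sided binomial p-value against a fixed chance rate.

    Sums the probability of every outcome that is no more likely than the
    observed count. Requires 0 < chance < 1.
    """
    if not 0 < chance < 1:
        raise ValueError("chance must be strictly between 0 and 1")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")

    def pmf(k):
        # Work in log space so large totals do not overflow or underflow.
        log_coeff = lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
        return exp(log_coeff + k * log(chance) + (total - k) * log(1 - chance))

    observed = pmf(correct)
    # A small relative tolerance keeps floating-point ties on the inclusive side.
    tail = sum(p for p in map(pmf, range(total + 1)) if p <= observed * (1 + 1e-7))
    return min(1.0, tail)

# Hypothetical counts on a 1,000-question set, for illustration only.
p_baseline = binomial_two_sided_p(520, 1000)
p_guided = binomial_two_sided_p(600, 1000)
print(p_baseline, p_guided)
```

A large p-value for the baseline together with a small one for the guided condition would match the pattern described above; the test says nothing about why accuracy moved.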
Original abstract
Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLM's counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs.Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CounterBench, a benchmark dataset of 1K counterfactual reasoning questions using formal rules, varying difficulty levels, diverse causal graph structures, distinct question types, and multiple nonsensical name variants to evaluate LLMs on formal counterfactual inference (distinct from commonsense causal reasoning). Experiments indicate that most LLMs perform at levels comparable to random guessing on these tasks. The authors propose CoIn, a reasoning paradigm that guides LLMs through iterative reasoning and backtracking, and report that it significantly improves performance across different LLMs. The dataset is released publicly.
Significance. If the benchmark successfully isolates formal counterfactual inference without pretraining leakage or surface heuristics, the near-random baseline performance and consistent gains from CoIn would highlight a core limitation in current LLMs' causal reasoning and provide a practical improvement method with relevance to AI safety and robust reasoning systems. The public dataset release supports reproducibility and further work.
major comments (2)
- [Abstract] Abstract: The central claims that LLMs perform at levels comparable to random guessing and that CoIn significantly improves performance lack any quantitative results, error bars, statistical tests, or details on validation of difficulty levels and graph structures. This is load-bearing for the empirical findings.
- [Abstract] Abstract / Dataset Design: The design uses nonsensical name variants and varied causal graph structures to isolate formal counterfactual rules from pretraining leakage or surface patterns, but no control experiments (e.g., accuracy delta on real vs. nonsensical names, performance on held-out graph structures, or comparison to post-cutoff models) are reported to verify this isolation. This directly affects whether the near-random results can be interpreted as evidence of failure on formal inference.
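The "accuracy delta on real vs. nonsensical names" control suggested in the comments above could be reported as a paired difference with a bootstrap interval. This is a minimal sketch, assuming each question exists in both name variants and that per-question correctness flags are available; the function name and inputs are hypothetical and not part of the paper or its dataset.

```python
import random

def paired_accuracy_delta(real_correct, nonsense_correct, n_boot=2000, seed=0):
    """Accuracy on real names minus accuracy on nonsensical names.

    Returns (delta, (low, high)), where the pair is a 95% percentile
    bootstrap interval obtained by resampling questions with replacement.
    """
    if len(real_correct) != len(nonsense_correct):
        raise ValueError("inputs must be paired per question")
    n = len(real_correct)
    if n == 0:
        raise ValueError("need at least one question")
    # Per-question difference: +1, 0, or -1.
    diffs = [int(r) - int(s) for r, s in zip(real_correct, nonsense_correct)]
    delta = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boots = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    low = boots[int(0.025 * n_boot)]
    high = boots[int(0.975 * n_boot) - 1]
    return delta, (low, high)

# Hypothetical correctness flags for six paired questions.
real = [True, True, False, True, True, False]
nonsense = [True, False, False, True, False, False]
print(paired_accuracy_delta(real, nonsense))
```

An interval that excludes zero would suggest that familiar names help the model, which bears on whether nonsensical names remove a usable cue; an interval around zero would leave the isolation claim unverified rather than confirmed.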
minor comments (2)
- [Abstract] Abstract contains a missing space: 'LLMs.Our dataset' should be 'LLMs. Our dataset'.
- [Abstract] Abstract: 'enhance LLM's counterfactual reasoning ability' should read 'enhance LLMs' counterfactual reasoning abilities' for grammatical consistency.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and validation of the benchmark design.
- Referee: [Abstract] Abstract: The central claims that LLMs perform at levels comparable to random guessing and that CoIn significantly improves performance lack any quantitative results, error bars, statistical tests, or details on validation of difficulty levels and graph structures. This is load-bearing for the empirical findings.
  Authors: We agree that the abstract should include quantitative support for the claims. In the revised manuscript we will update the abstract to report key metrics such as average LLM accuracy (near random baseline levels), the magnitude of CoIn gains, reference to error bars from repeated runs, and brief notes on how difficulty levels and graph structures were constructed and validated.
  Revision: yes
- Referee: [Abstract] Abstract / Dataset Design: The design uses nonsensical name variants and varied causal graph structures to isolate formal counterfactual rules from pretraining leakage or surface patterns, but no control experiments (e.g., accuracy delta on real vs. nonsensical names, performance on held-out graph structures, or comparison to post-cutoff models) are reported to verify this isolation. This directly affects whether the near-random results can be interpreted as evidence of failure on formal inference.
  Authors: We acknowledge the value of explicit controls. While nonsensical names and diverse structures were chosen to minimize leakage, we did not report direct comparisons. We will add control analyses in the revision, including accuracy differences between real and nonsensical name variants and results on held-out graph structures, to better substantiate the isolation claim.
  Revision: yes
Circularity Check
Empirical benchmark and method evaluation with no self-referential derivations or fitted predictions
The paper introduces CounterBench as an external dataset of 1K questions and evaluates LLMs plus the CoIn method on it. No equations, parameters, or first-principles derivations are present that reduce any claimed result to its own inputs by construction. The design choices (nonsensical names, graph variants) are presented as methodological safeguards rather than fitted quantities renamed as predictions. Self-citations, if any, are not load-bearing for the central empirical claims. The work is therefore self-contained against its stated external benchmark.
Forward citations
Cited by 2 Pith papers
- Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation
  Fine-tuned LLMs produce plausible counterfactuals for health interventions and recover 20% F1 via data augmentation in label-scarce sensor datasets.
- DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
  DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.