CounterBench: Evaluating and Improving Counterfactual Reasoning in Large Language Models

Jing Ma; Ruixiang Tang; Vivek K. Singh; Yuefei Chen

arxiv: 2502.11008 · v2 · submitted 2025-02-16 · 💻 cs.CL

Yuefei Chen, Vivek K. Singh, Jing Ma, Ruixiang Tang

Pith reviewed 2026-05-23 03:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords counterfactual reasoning · large language models · benchmark · causal reasoning · iterative reasoning · backtracking

The pith

Large language models perform near random guessing on formal counterfactual reasoning but improve with iterative guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can follow explicit formal rules to reason about what would happen in alternative scenarios. It presents CounterBench, a collection of one thousand questions built on diverse causal graphs and invented names to reduce reliance on memorized facts. Experiments reveal that most models score at chance levels. The authors introduce the CoIn method, which directs models to reason iteratively and backtrack when needed, raising performance on the benchmark for multiple models.

Core claim

Counterfactual reasoning using formal rules remains difficult for large language models, with performance often matching random guessing on the CounterBench dataset of one thousand questions featuring varied causal structures and nonsensical names. The CoIn paradigm, which prompts models to perform iterative reasoning with backtracking, leads to significant and consistent improvements across different large language models.
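One way to make "matching random guessing" concrete is an exact binomial test against the chance rate. The sketch below assumes binary yes/no questions, so that chance accuracy is 0.5; that assumption and the function name `binomial_two_sided_p` are illustrative and not taken from the paper, and the counts in the usage lines are made up rather than reported results.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `correct` out of `total` answers.

    Sums the probability of every outcome that is no more likely than the
    observed one under pure guessing at rate `chance` (0 < chance < 1).
    """

    def pmf(k: int) -> float:
        # Log-space binomial coefficient avoids overflow for large totals.
        log_coeff = lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
        return exp(log_coeff + k * log(chance) + (total - k) * log(1 - chance))

    observed = pmf(correct)
    # Small relative tolerance so symmetric outcomes are treated as ties.
    tail = sum(p for p in map(pmf, range(total + 1)) if p <= observed * (1 + 1e-7))
    return min(1.0, tail)


# Illustrative counts only (assumed, not from the paper).
print(binomial_two_sided_p(510, 1000))  # close to chance: large p-value
print(binomial_two_sided_p(600, 1000))  # clearly above chance: tiny p-value
```

A large p-value means the observed accuracy cannot be distinguished from guessing at the assumed chance rate; it does not by itself show that a model is guessing.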

What carries the argument

The CoIn method, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions.
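The general idea of iterating over reasoning steps and backtracking when a step fails a check can be sketched as a depth-first search. This is not the authors' CoIn prompt or implementation; it is a generic illustration in which `propose`, `is_valid`, and `is_complete` are hypothetical callbacks that, in a real setup, would presumably wrap model calls. Here they are stubbed with a toy ordering task so the block runs on its own.

```python
from typing import Callable, List, Optional, Sequence


def solve_with_backtracking(
    propose: Callable[[List[str]], Sequence[str]],
    is_valid: Callable[[List[str]], bool],
    is_complete: Callable[[List[str]], bool],
    max_depth: int = 8,
) -> Optional[List[str]]:
    """Build a chain of steps one at a time, undoing any step that fails a check."""
    chain: List[str] = []

    def extend() -> bool:
        if is_complete(chain):
            return True
        if len(chain) >= max_depth:
            return False
        for step in propose(chain):
            chain.append(step)
            if is_valid(chain) and extend():
                return True
            chain.pop()  # backtrack: discard the step and try the next candidate
        return False

    return list(chain) if extend() else None


# Toy stand-ins (assumptions): the "right" chain is a fixed three-step order,
# and the proposer deliberately offers candidates in a misleading order.
ORDER = ["abduction", "action", "prediction"]
result = solve_with_backtracking(
    propose=lambda chain: ["prediction", "action", "abduction"],
    is_valid=lambda chain: chain == ORDER[: len(chain)],
    is_complete=lambda chain: chain == ORDER,
)
print(result)
```

The search rejects the first two candidates at each level before finding the valid one, which is the backtracking behavior the summary describes in prose.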

If this is right

Counterfactual performance can be boosted in existing models without additional training.
Results hold across multiple different large language models.
The benchmark design with nonsensical names and varied graphs helps isolate formal inference.
Models struggle more on complex graph structures and certain question types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

CoIn might extend to other structured reasoning tasks beyond counterfactuals.
Combining CoIn with retrieval or external tools could further enhance causal reasoning.
Testing on real-world scenarios could reveal whether the gains transfer outside the benchmark.

Load-bearing premise

The questions in CounterBench test only formal counterfactual inference and do not allow models to succeed through patterns learned during pretraining.

What would settle it

Running the models on a fresh set of counterfactual questions with novel causal graphs and names, confirming that accuracy stays near random without CoIn and rises with it.
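A fresh test set of this kind needs questions whose ground truth follows from formal rules alone. The sketch below is a hypothetical generator, not the CounterBench construction procedure: it assumes a deterministic boolean structural causal model in which each non-root variable is the AND or OR of its parents, invents random four-letter names, and computes the counterfactual answer by holding the observed root values fixed, overriding one variable, and recomputing downstream values.

```python
import random
import string
from typing import Dict, List, Optional, Tuple

# (variable names in topological order, parents per variable, operator per variable)
Scm = Tuple[List[str], Dict[str, List[str]], Dict[str, str]]


def make_scm(rng: random.Random, n_vars: int = 5) -> Scm:
    """Random acyclic boolean SCM with invented variable names."""
    names: List[str] = []
    while len(names) < n_vars:
        candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(4))
        if candidate not in names:
            names.append(candidate)
    parents: Dict[str, List[str]] = {}
    ops: Dict[str, str] = {}
    for i, var in enumerate(names):
        # Parents are drawn only from earlier variables, so the graph is acyclic.
        parents[var] = rng.sample(names[:i], rng.randint(0, min(2, i)))
        ops[var] = rng.choice(["and", "or"])
    return names, parents, ops


def evaluate(scm: Scm, roots: Dict[str, bool],
             do: Optional[Dict[str, bool]] = None) -> Dict[str, bool]:
    """Compute every variable from the root values, honoring interventions in `do`."""
    names, parents, ops = scm
    do = do or {}
    values: Dict[str, bool] = {}
    for var in names:  # creation order is a topological order
        if var in do:
            values[var] = do[var]
        elif not parents[var]:
            values[var] = roots[var]
        else:
            inputs = [values[p] for p in parents[var]]
            values[var] = all(inputs) if ops[var] == "and" else any(inputs)
    return values


def counterfactual(scm: Scm, roots: Dict[str, bool], target: str, outcome: str) -> bool:
    """Value of `outcome` had `target` taken the opposite of its factual value."""
    factual = evaluate(scm, roots)
    altered = evaluate(scm, roots, do={target: not factual[target]})
    return altered[outcome]


rng = random.Random(0)
scm = make_scm(rng)
names, parents, _ = scm
roots = {v: rng.choice([True, False]) for v in names if not parents[v]}
print(names, counterfactual(scm, roots, names[0], names[-1]))
```

Because the names and graph are sampled at generation time, such questions cannot have appeared verbatim in pretraining data, although structural patterns could still recur.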

Figures

Figures reproduced from arXiv: 2502.11008 by Jing Ma, Ruixiang Tang, Vivek K. Singh, Yuefei Chen.

**Figure 2.** Illustration of our framework. We create CounterBench, a dataset featuring four types of counterfactual…

**Figure 3.** Error Analysis of CausalCoT.

**Figure 4.** Accuracy comparison between Standard, CoIn, and CausalCoT in Anticommonsense and Commonsense Dataset.

**Figure 5.** The prompt design of CoIn.

**Figure 7.** Anti-commonsense Example. Question: Imagine a self-contained, hypothetical world with only the following conditions, and without any unmentioned factors or causal relationships: The man in the room has a direct effect on room. The candle has a direct effect on room. We know that blowing out the candle and candle with wax causes dark room. We observed the candle has wax. Would the room is dark if not blowi…

**Figure 6.** Error Analysis comparison between Our Method and CausalCoT.

**Figure 10.** Error Analysis for Babbage-002 in Causal…

**Figure 11.** CausalCoT Instruction Example.

**Figure 12.** CoIn Instruction Example.

**Figure 13.** Conclusion Error Example. Response: The correct answer is (0, 1, 0, 0, 0, …

**Figure 14.** Type Mismatch.

Original abstract

Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLM's counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs.Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New benchmark for formal counterfactuals in LLMs plus an iterative prompting method, but the isolation from pretraining leakage rests on unverified design choices.

The paper's core contribution is CounterBench, a 1K-question dataset built around formal causal graphs and rules rather than commonsense knowledge, plus the CoIn prompting approach that adds iteration and backtracking to standard chain-of-thought. The authors use nonsensical name variants and varied graph structures to try to force genuine rule application instead of pattern matching. That distinction from prior commonsense causality work is the main novelty, and the dataset construction with explicit difficulty levels and question types is a concrete step that could be reused.

CoIn is presented as a lightweight way to improve performance across models, which aligns with known benefits of structured prompting on reasoning tasks. The abstract claims most LLMs sit near random guessing and that CoIn lifts results, which is a reasonable hypothesis given how LLMs handle out-of-distribution inference. The work is honest about focusing on formal rules instead of everyday causality, and the public release on Hugging Face is a plus for anyone who wants to test the claims themselves.

The soft spot is the missing evidence that the design actually blocks leakage or surface heuristics. The abstract and stress-test note both flag the absence of control experiments, such as accuracy deltas between real and nonsensical names, held-out graph structures, or checks against post-cutoff models. Without those, the near-random baseline and CoIn gains could still come from residual memorization rather than pure formal reasoning. No quantitative results, error bars, or validation details appear in the provided abstract, which leaves the central claims hard to assess.

This is for people running LLM evaluations or building causal reasoning systems who need a controlled test set. A reader looking for a ready-to-use benchmark with some prompting ideas would find value in the construction details even if the isolation claim needs more support.

It deserves peer review because a properly validated benchmark in this narrow area would be worth having for the field, provided the authors add the missing controls on leakage.

Referee Report

2 major / 2 minor

Summary. The paper introduces CounterBench, a benchmark dataset of 1K counterfactual reasoning questions using formal rules, varying difficulty levels, diverse causal graph structures, distinct question types, and multiple nonsensical name variants to evaluate LLMs on formal counterfactual inference (distinct from commonsense causal reasoning). Experiments indicate that most LLMs perform at levels comparable to random guessing on these tasks. The authors propose CoIn, a reasoning paradigm that guides LLMs through iterative reasoning and backtracking, and report that it significantly improves performance across different LLMs. The dataset is released publicly.

Significance. If the benchmark successfully isolates formal counterfactual inference without pretraining leakage or surface heuristics, the near-random baseline performance and consistent gains from CoIn would highlight a core limitation in current LLMs' causal reasoning and provide a practical improvement method with relevance to AI safety and robust reasoning systems. The public dataset release supports reproducibility and further work.

major comments (2)

[Abstract] Abstract: The central claims that LLMs perform at levels comparable to random guessing and that CoIn significantly improves performance lack any quantitative results, error bars, statistical tests, or details on validation of difficulty levels and graph structures. This is load-bearing for the empirical findings.
[Abstract] Abstract / Dataset Design: The design uses nonsensical name variants and varied causal graph structures to isolate formal counterfactual rules from pretraining leakage or surface patterns, but no control experiments (e.g., accuracy delta on real vs. nonsensical names, performance on held-out graph structures, or comparison to post-cutoff models) are reported to verify this isolation. This directly affects whether the near-random results can be interpreted as evidence of failure on formal inference.
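The control the referee asks for, an accuracy delta between two question variants, reduces to comparing two proportions. The sketch below uses a normal-approximation (Wald) 95% interval for the difference; the function name `accuracy_delta` and the counts in the usage line are illustrative assumptions, not figures from the paper.

```python
from math import sqrt
from typing import Tuple


def accuracy_delta(
    correct_a: int, total_a: int, correct_b: int, total_b: int, z: float = 1.96
) -> Tuple[float, Tuple[float, float]]:
    """Difference in accuracy (A minus B) with a Wald confidence interval.

    A could be questions with real names and B the same questions with
    nonsensical names; z = 1.96 gives an approximate 95% interval.
    """
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    delta = p_a - p_b
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    return delta, (delta - z * se, delta + z * se)


# Made-up counts for illustration only.
delta, (low, high) = accuracy_delta(80, 100, 50, 100)
print(f"delta={delta:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

An interval that excludes zero would suggest that name familiarity changes accuracy, which is the kind of leakage signal the comment is concerned with. The Wald interval is a rough choice for small samples or accuracies near 0 or 1, and it treats the two sets as independent; if the same questions are used in both variants, a paired analysis would be more appropriate.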

minor comments (2)

[Abstract] Abstract contains a missing space: 'LLMs.Our dataset' should be 'LLMs. Our dataset'.
[Abstract] Abstract: 'enhance LLM's counterfactual reasoning ability' should read 'enhance LLMs' counterfactual reasoning abilities' for grammatical consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and validation of the benchmark design.

Referee: [Abstract] Abstract: The central claims that LLMs perform at levels comparable to random guessing and that CoIn significantly improves performance lack any quantitative results, error bars, statistical tests, or details on validation of difficulty levels and graph structures. This is load-bearing for the empirical findings.

Authors: We agree that the abstract should include quantitative support for the claims. In the revised manuscript we will update the abstract to report key metrics such as average LLM accuracy (near random baseline levels), the magnitude of CoIn gains, reference to error bars from repeated runs, and brief notes on how difficulty levels and graph structures were constructed and validated. revision: yes
Referee: [Abstract] Abstract / Dataset Design: The design uses nonsensical name variants and varied causal graph structures to isolate formal counterfactual rules from pretraining leakage or surface patterns, but no control experiments (e.g., accuracy delta on real vs. nonsensical names, performance on held-out graph structures, or comparison to post-cutoff models) are reported to verify this isolation. This directly affects whether the near-random results can be interpreted as evidence of failure on formal inference.

Authors: We acknowledge the value of explicit controls. While nonsensical names and diverse structures were chosen to minimize leakage, we did not report direct comparisons. We will add control analyses in the revision, including accuracy differences between real and nonsensical name variants and results on held-out graph structures, to better substantiate the isolation claim. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark and method evaluation with no self-referential derivations or fitted predictions

full rationale

The paper introduces CounterBench as an external dataset of 1K questions and evaluates LLMs plus the CoIn method on it. No equations, parameters, or first-principles derivations are present that reduce any claimed result to its own inputs by construction. The design choices (nonsensical names, graph variants) are presented as methodological safeguards rather than fitted quantities renamed as predictions. Self-citations, if any, are not load-bearing for the central empirical claims. The work is therefore self-contained against its stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work is an empirical benchmark and prompting study.

pith-pipeline@v0.9.0 · 5742 in / 1040 out tokens · 15796 ms · 2026-05-23T03:07:31.937032+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation
cs.LG · 2026-01 · conditional · novelty 6.0
Fine-tuned LLMs produce plausible counterfactuals for health interventions and recover 20% F1 via data augmentation in label-scarce sensor datasets.

DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
cs.CL · 2026-04 · unverdicted · novelty 5.0
DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1] Alwin. 2023. Understanding causal ai: Bridging the gap between correlation and causation. https://www.alwin.io/causal-ai. Accessed: 2025-01-06
[2] Anthropic. 2024. Claude. https://www.anthropic.com/api. Accessed: 2025-01-06
[3] Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Dushyant Singh Sengar, Mayank Jindal, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, and Aman Chadha. 2024. Cause and effect: Can large language models truly understand causality? In Proceedings of the AAAI Symposium Series, volume 4, pages 2--9
[4] Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, and Manuel Gomez-Rodriguez. 2024. Counterfactual token generation in large language models. arXiv preprint arXiv:2409.17027
[5] Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. 2020. Causalml: Python package for causal machine learning. arXiv preprint arXiv:2002.11631
[6] DeepSeek. 2024. DeepSeek: AI-Powered Search Engine. https://www.deepseek.com/. Accessed: 2025-02-15
[7] Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138--1158
[8] Google. 2024. Gemini. https://gemini.google.com/. Accessed: 2025-01-06
[9] Ian D Gow, David F Larcker, and Peter C Reiss. 2016. Causal inference in accounting research. Journal of Accounting Research, 54(2):477--523
[10] Emilia Gvozdenović, Lucio Malvisi, Elisa Cinconze, Stijn Vansteelandt, Phoebe Nakanwagi, Emmanuel Aris, and Dominique Rosillon. 2021. Causal inference concepts applied to three observational studies in the context of vaccine development: from theory to practice. BMC Medical Research Methodology, 21:1--10
[11] Kairong Han, Kun Kuang, Ziyu Zhao, Junjian Ye, and Fei Wu. 2024. Causal agent based on large language model. arXiv preprint arXiv:2408.06849
[12] Paul W Holland. 1986. Statistics and causal inference. Journal of the American statistical Association, 81(396):945--960
[13] Zhenyang Hua, Shuyue Xing, Huixing Jiang, Chen Wei, and Xiaojie Wang. 2024. Improving causal inference of large language models with scm tools. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 3--14. Springer
[14] Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, LYU Zhiheng, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. 2023. Cladder: Assessing causal reasoning in language models. In Thirty-seventh conference on neural information processing systems
[15] Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine learning, pages 3020--3029. PMLR
[16] Atoosa Kasirzadeh and Andrew Smart. 2021. The use and misuse of counterfactuals in ethical machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 228--236
[17] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050
[18] Lisa Koonce, Karen K Nelson, and Catherine M Shakespeare. 2011. Judging the relevance of fair value for financial instruments. The Accounting Review, 86(6):2075--2098
[19] Parthasarathy Krishnamurthy and Anuradha Sivaraman. 2002. Counterfactual thinking and advertising responses. Journal of Consumer Research, 28(4):650--658
[20] Evangelia Kyrimi, Somayyeh Mossadegh, Jared M Wohlgemut, Rebecca S Stoner, Nigel RM Tai, and William Marsh. 2025. Counterfactual reasoning using causal bayesian networks as a healthcare governance tool. International Journal of Medical Informatics, 193:105681
[21] Jia Li and Xiang Li. 2024. Relation-first modeling paradigm for causal representation learning toward the development of agi. Preprint, arXiv:2307.16387. https://arxiv.org/abs/2307.16387
[22] Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, et al. 2024. Mapping the increasing use of llms in scientific papers. arXiv preprint arXiv:2404.01268
[23] Jinxin Liu, Shulin Cao, Jiaxin Shi, Tingjian Zhang, Lunyiu Nie, Linmei Hu, Lei Hou, and Juanzi Li. 2024a. How proficient are large language models in formal languages? an in-depth insight for knowledge base question answering. In Findings of the Association for Computational Linguistics ACL 2024, pages 792--815
[24] Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, et al. 2024b. Large language models and causal inference in collaboration: A comprehensive survey. arXiv preprint arXiv:2403.09606
[25] Massimo Loi and Margarida Rodrigues. 2012. A note on the impact evaluation of public policies: the counterfactual analysis
[26] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30
[27] Jing Ma. 2024. Causal inference with large language model: A survey. arXiv preprint arXiv:2409.09822
[28] SL Morgan. 2015. Counterfactuals and causal inference. Cambridge University Press
[29] Elena Musi and Rudi Palmieri. 2024. The fallacy of explainable generative ai: evidence from argumentative prompting in two domains. In CEUR Workshop Proceedings, volume 3769, pages 59--69
[30] Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. 2024. Skeleton-of-thought: Prompting llms for efficient parallel generation. In The Twelfth International Conference on Learning Representations
[31] OpenAI. 2024. Models. https://platform.openai.com/docs/models. Accessed: 2025-01-06
[32] Judea Pearl. 2009. Causality. Cambridge university press
[33] Judea Pearl and Dana Mackenzie. 2018. The book of why: the new science of cause and effect. Basic books
[34] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066
[35] Neil Sahota. 2023. Causal ai: Bridging the gap between correlation and causation. https://www.neilsahota.com/causal-ai-bridging-the-gap-between-correlation-and-causation/. Accessed: 2025-01-06
[36] Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2023. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379
[37] Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning, pages 3076--3085. PMLR
[38] Amit Sharma and Emre Kiciman. 2020. Dowhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216
[39] Artur Tarassow. 2023. The potential of llms for coding with low-resource and domain-specific programming languages. arXiv preprint arXiv:2307.13018
[40] Vicuna. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90\ https://vicuna.lmsys.org/. Accessed: 2023
[41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837
[42] Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. 2018. Ganite: Estimation of individualized treatment effects using generative adversarial nets. In International conference on learning representations
[43] Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal parrots: Large language models may talk causality but are not causal. arXiv preprint arXiv:2308.13067
[44] Cheng Zhang, Stefan Bauer, Paul Bennett, Jiangfeng Gao, Wenbo Gong, Agrin Hilmkil, Joel Jennings, Chao Ma, Tom Minka, Nick Pawlowski, et al. 2023. Understanding causality with large language models: Feasibility and opportunities. arXiv preprint arXiv:2304.05524
[45]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page
[46]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Alwin. 2023. Understanding causal ai: Bridging the gap between correlation and causation. https://www.alwin.io/causal-ai. Accessed: 2025-01-06

work page 2023

[2] [2]

Anthropic. 2024. Claude. https://www.anthropic.com/api. Accessed: 2025-01-06

work page 2024

[3] [3]

Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Dushyant Singh Sengar, Mayank Jindal, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, and Aman Chadha. 2024. Cause and effect: Can large language models truly understand causality? In Proceedings of the AAAI Symposium Series, volume 4, pages 2--9

work page 2024

[4] [4]

Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, and Manuel Gomez-Rodriguez. 2024. Counterfactual token generation in large language models. arXiv preprint arXiv:2409.17027

work page arXiv 2024

[5] [5]

Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. 2020. Causalml: Python package for causal machine learning. arXiv preprint arXiv:2002.11631

work page arXiv 2020

[6] [6]

DeepSeek . 2024. https://www.deepseek.com/ DeepSeek: AI-Powered Search Engine . Accessed: 2025-02-15

work page 2024

[7] [7]

Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138--1158

work page 2022

[8] [8]

Google. 2024. Gemini. https://gemini.google.com/. Accessed: 2025-01-06

work page 2024

[9] [9]

Ian D Gow, David F Larcker, and Peter C Reiss. 2016. Causal inference in accounting research. Journal of Accounting Research, 54(2):477--523

work page 2016

[10] [10]

Emilia Gvozdenovi \'c , Lucio Malvisi, Elisa Cinconze, Stijn Vansteelandt, Phoebe Nakanwagi, Emmanuel Aris, and Dominique Rosillon. 2021. Causal inference concepts applied to three observational studies in the context of vaccine development: from theory to practice. BMC Medical Research Methodology, 21:1--10

work page 2021

[11] [11]

Kairong Han, Kun Kuang, Ziyu Zhao, Junjian Ye, and Fei Wu. 2024. Causal agent based on large language model. arXiv preprint arXiv:2408.06849

work page arXiv 2024

[12] [12]

Paul W Holland. 1986. Statistics and causal inference. Journal of the American statistical Association, 81(396):945--960

work page 1986

[13] [13]

Zhenyang Hua, Shuyue Xing, Huixing Jiang, Chen Wei, and Xiaojie Wang. 2024. Improving causal inference of large language models with scm tools. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 3--14. Springer

work page 2024

[14] [14]

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, LYU Zhiheng, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. 2023. Cladder: Assessing causal reasoning in language models. In Thirty-seventh conference on neural information processing systems

work page 2023

[15] [15]

Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine learning, pages 3020--3029. PMLR

work page 2016

[16] [16]

Atoosa Kasirzadeh and Andrew Smart. 2021. The use and misuse of counterfactuals in ethical machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 228--236

work page 2021

[17] [17]

Emre K c man, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050

work page arXiv 2023

[18] [18]

Lisa Koonce, Karen K Nelson, and Catherine M Shakespeare. 2011. Judging the relevance of fair value for financial instruments. The Accounting Review, 86(6):2075--2098

work page 2011

[19] [19]

Parthasarathy Krishnamurthy and Anuradha Sivaraman. 2002. Counterfactual thinking and advertising responses. Journal of Consumer Research, 28(4):650--658

work page 2002

[20] [20]

Evangelia Kyrimi, Somayyeh Mossadegh, Jared M Wohlgemut, Rebecca S Stoner, Nigel RM Tai, and William Marsh. 2025. Counterfactual reasoning using causal bayesian networks as a healthcare governance tool. International Journal of Medical Informatics, 193:105681

work page 2025

[21] [21]

Jia Li and Xiang Li. 2024. https://arxiv.org/abs/2307.16387 Relation-first modeling paradigm for causal representation learning toward the development of agi . Preprint, arXiv:2307.16387

work page arXiv 2024

[22] [22]

Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, et al. 2024. Mapping the increasing use of llms in scientific papers. arXiv preprint arXiv:2404.01268

work page arXiv 2024

[23] [23]

Jinxin Liu, Shulin Cao, Jiaxin Shi, Tingjian Zhang, Lunyiu Nie, Linmei Hu, Lei Hou, and Juanzi Li. 2024 a . How proficient are large language models in formal languages? an in-depth insight for knowledge base question answering. In Findings of the Association for Computational Linguistics ACL 2024, pages 792--815

work page 2024

[24] [24]

Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, et al. 2024 b . Large language models and causal inference in collaboration: A comprehensive survey. arXiv preprint arXiv:2403.09606

work page arXiv 2024

[25] [25]

Massimo Loi and Margarida Rodrigues. 2012. A note on the impact evaluation of public policies: the counterfactual analysis

work page 2012

[26] [26]

Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30

[27]

Jing Ma. 2024. Causal inference with large language model: A survey. arXiv preprint arXiv:2409.09822

[28]

SL Morgan. 2015. Counterfactuals and causal inference. Cambridge University Press

[29]

Elena Musi and Rudi Palmieri. 2024. The fallacy of explainable generative ai: evidence from argumentative prompting in two domains. In CEUR Workshop Proceedings, volume 3769, pages 59--69

[30]

Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. 2024. Skeleton-of-thought: Prompting llms for efficient parallel generation. In The Twelfth International Conference on Learning Representations

[31]

OpenAI. 2024. Models. https://platform.openai.com/docs/models. Accessed: 2025-01-06

[32]

Judea Pearl. 2009. Causality. Cambridge university press

[33]

Judea Pearl and Dana Mackenzie. 2018. The book of why: the new science of cause and effect. Basic books

[34]

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066

[35]

Neil Sahota. 2023. Causal AI: Bridging the gap between correlation and causation. https://www.neilsahota.com/causal-ai-bridging-the-gap-between-correlation-and-causation/. Accessed: 2025-01-06

[36]

Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2023. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379

[37]

Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning, pages 3076--3085. PMLR

[38]

Amit Sharma and Emre Kiciman. 2020. Dowhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216

[39]

Artur Tarassow. 2023. The potential of llms for coding with low-resource and domain-specific programming languages. arXiv preprint arXiv:2307.13018

[40]

Vicuna. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org/. Accessed: 2023

[41]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

[42]

Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. 2018. Ganite: Estimation of individualized treatment effects using generative adversarial nets. In International conference on learning representations

[43]

Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal parrots: Large language models may talk causality but are not causal. arXiv preprint arXiv:2308.13067

[44]

Cheng Zhang, Stefan Bauer, Paul Bennett, Jiangfeng Gao, Wenbo Gong, Agrin Hilmkil, Joel Jennings, Chao Ma, Tom Minka, Nick Pawlowski, et al. 2023. Understanding causality with large language models: Feasibility and opportunities. arXiv preprint arXiv:2304.05524
