Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

Shuigeng Zhou; Zhenyu Yu

arxiv: 2606.04915 · v1 · pith:TNBDJFV7new · submitted 2026-06-03 · 💻 cs.CL · cs.IR

Pith reviewed 2026-06-28 05:50 UTC · model grok-4.3

keywords causal reasoning · large language models · lexical anchors · structural reasoning · perturbation method · CLadder · instruction tuning · benchmarks

The pith

Replacing semantic names with placeholders in causal questions causes large accuracy drops in instruction-tuned LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Caliper to test whether LLMs achieve causal reasoning through structural understanding or through lexical pattern matching on familiar variable names. Caliper keeps the causal graphs and probability details the same but swaps out real names for neutral tokens. When this change is applied to standard benchmarks, accuracy falls consistently across model sizes and families. The pattern holds in zero-shot evaluation and is only partly offset by scaffolding or few-shot examples. The work therefore concludes that current models show little structural causal reasoning once the lexical anchors disappear.

Core claim

By applying Caliper to replace semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification, the evaluation reveals robust accuracy drops on causal reasoning benchmarks. Of 40 model-by-benchmark combinations, 39 exhibit positive gaps, with the effect diminishing on pseudoword subsets. This demonstrates that current instruction-tuned LLMs evaluated zero-shot exhibit little evidence of structural causal reasoning once lexical anchors are removed.

What carries the argument

Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question.
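
The substitution described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `anonymize`, the `VAR1`/`VAR2` placeholder scheme (borrowed from the simulated rebuttal further down this page), and the example question are all assumptions made for the sketch. Only the variable names change; the causal statements and any numbers in the question are left untouched.

```python
import re

def anonymize(question: str, variables: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each named variable with a neutral placeholder (VAR1, VAR2, ...).

    `variables` lists the semantic names used in the question; placeholder
    numbering follows the order of this list (an assumption of this sketch).
    """
    if not variables:
        return question, {}
    mapping = {name: f"VAR{i + 1}" for i, name in enumerate(variables)}
    # Try longer names first so a multi-word name wins over a shorter overlap.
    ordered = sorted(variables, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {name.lower(): placeholder for name, placeholder in mapping.items()}
    anonymized = pattern.sub(lambda m: lookup[m.group(1).lower()], question)
    return anonymized, mapping

# Hypothetical example question, not taken from any benchmark.
question = (
    "Smoking causes tar deposits, and tar deposits cause lung cancer. "
    "If smoking is prevented, is lung cancer less likely?"
)
anonymized, mapping = anonymize(question, ["smoking", "tar deposits", "lung cancer"])
print(anonymized)
```

Returning the mapping alongside the rewritten question makes it possible to check afterwards that every original name was replaced and that the graph structure can be recovered from the anonymized text.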

If this is right

Lexical anonymization produces accuracy drops between 7.6 and 29.6 percentage points on CLadder, CRASS, and e-CARE across nine models.
The accuracy gap shrinks by a factor of 17 on CLadder's pseudoword subset, which is consistent with dependence on recognizable lexical items.
Structured scaffolding and few-shot learning reduce the gap mainly by lowering performance on the original named questions rather than restoring performance on the anonymized versions.
The drop appears in 39 of 40 model-by-benchmark cells and holds from 3.8B to 671B parameter models.
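
The gaps quoted in this list are differences in accuracy between the original questions (P0 in the abstract's notation) and their anonymized counterparts (P1), expressed in percentage points. A minimal sketch of that calculation, assuming per-item correctness flags are available; the function names and the toy data are illustrative and not drawn from the paper:

```python
def accuracy(correct: list[bool]) -> float:
    """Accuracy in percent over a list of per-item correctness flags."""
    return 100.0 * sum(correct) / len(correct)

def lexical_gap(p0_correct: list[bool], p1_correct: list[bool]) -> float:
    """Gap in percentage points: P0 (named) accuracy minus P1 (anonymized) accuracy.

    A positive value means accuracy fell after anonymization.
    """
    if len(p0_correct) != len(p1_correct):
        raise ValueError("P0 and P1 must cover the same items")
    return accuracy(p0_correct) - accuracy(p1_correct)

# Toy data for one hypothetical model-by-benchmark cell.
p0 = [True, True, True, False]
p1 = [True, False, False, False]
gap = lexical_gap(p0, p1)
print(gap)  # 50.0
```

Counting the cells where this value is positive gives the "39 of 40" style tally reported above.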

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Causal benchmarks may need systematic use of abstract or randomized labels to isolate genuine structural reasoning.
Training or fine-tuning regimes that deliberately expose models to anonymized causal graphs could be required to develop non-lexical reasoning.
The same anonymization approach could be applied to other reasoning domains to check for hidden lexical dependencies.

Load-bearing premise

That replacing semantic variable names with placeholder tokens preserves the causal graph and probabilistic specification of each question without introducing unrelated difficulties that affect model performance.

What would settle it

A direct comparison in which models achieve statistically identical accuracy on the original and Caliper-perturbed versions of the same questions after any added difficulty from placeholders has been controlled.

Figures

Figures reproduced from arXiv: 2606.04915 by Shuigeng Zhou, Zhenyu Yu.

**Figure 2.** CLadder interventional gap by model size.

**Figure 3.** Confidence shift (P1 − P0) on stable-correct items (CC, green) and flipped items (CW, red) by rung. On the interventional panel the CW drop is consistently larger than the CC drop.

Original abstract

Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lexical anonymization produces consistent accuracy drops on causal benchmarks, but the claim of absent structural reasoning rests on an untested assumption that placeholders introduce no extra parsing or tokenization costs.

The main result here is straightforward: swapping semantic variable names for placeholders in causal questions cuts LLM accuracy by 8 to 30 points across nine models and three benchmarks, with the gap appearing in 39 of 40 cells. The 17-fold collapse on CLadder's pseudoword subset is a useful internal check.

The work does a solid job documenting the pattern at scale. It covers instruction-tuned models from 3.8B to 671B, includes both local and frontier systems, and shows that scaffolding or few-shot examples shrink the gap mainly by hurting the original scores rather than lifting the anonymized ones. That empirical regularity is new and worth having on record.

The soft spot is the causal interpretation. The central claim requires that the placeholder replacement leaves the graph, directions, and probabilities untouched while only removing lexical cues. The abstract gives no detail on placeholder selection, tokenization boundaries, or any direct test that the anonymized format itself is not an out-of-distribution hit. The pseudoword control helps but does not address format or parsing side effects, so the link from drop to "little evidence of structural causal reasoning" stays provisional.

This is for people who build or audit causal reasoning benchmarks and want a simple probe for lexical reliance. A reader focused on LLM evaluation would find the quantified gaps useful even if they end up disagreeing with the stronger conclusion. The paper is coherent on its own terms and the measurements look reproducible enough to merit referee time, though the methods section will need expansion on the perturbation controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces Caliper, a perturbation that replaces semantic variable names with placeholder tokens in causal reasoning benchmarks (CLadder, CRASS, e-CARE) while claiming to preserve the underlying causal graph and probabilistic specifications. Across nine instruction-tuned LLMs (3.8B–671B parameters) and three benchmarks, it reports consistent accuracy drops of +7.6 to +29.6 pp in 39 of 40 model-benchmark cells under zero-shot evaluation, with the gap collapsing 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot learning narrow the gap mainly by reducing baseline accuracy rather than recovering performance on anonymized items. The central claim is that current LLMs show little evidence of structural causal reasoning once lexical anchors are removed.

Significance. If the perturbation cleanly isolates lexical effects, the results would provide direct empirical evidence that LLM performance on these benchmarks is driven primarily by lexical pattern matching rather than structural causal reasoning. The scale (nine models spanning 2024–2026 generations, three benchmarks, 40 cells) and the near-universal positive gap strengthen the finding. The work is purely empirical with no free parameters or circular derivations; the pseudoword control is a useful internal check. This would inform both capability assessment and benchmark construction in causal reasoning for LLMs.

major comments (2)

[Methods / perturbation description] Perturbation validity (Methods / §3): the claim that placeholder replacement 'preserves the causal graph and probabilistic specification' is load-bearing for attributing accuracy drops to absence of structural reasoning. The manuscript provides insufficient detail on (a) how placeholders were selected to avoid tokenization boundary shifts or OOD format effects, (b) explicit controls comparing token-level statistics or parsing difficulty between P0 and P1 versions, and (c) whether the anonymized templates remain within the models' training distribution. Without these, drops could arise from non-lexical confounds even if models perform structural reasoning when anchors are present.
[Results] Statistical reporting (Results / abstract): the abstract states 'consistent positive gaps in 39 of 40 cells' and specific deltas (+7.6, +27.0, +29.6 pp) but does not report per-cell standard errors, p-values, or confidence intervals. Given that the central claim rests on the robustness of these gaps across model sizes and benchmarks, the absence of significance testing or variance estimates weakens the ability to rule out noise, especially on smaller models where variance is typically higher.
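
One coarse check that bears on the second comment is a sign test on the 39-of-40 tally: if positive and negative gaps were equally likely in each cell, how probable would 39 or more positive cells be? The sketch below is an illustration only, not an analysis from the paper. It assumes the 40 cells are independent, which they are not (cells share models and benchmarks), so it does not replace the per-cell intervals the referee asks for.

```python
from math import comb

def sign_test_p_value(positive: int, total: int) -> float:
    """One-sided P(X >= positive) for X ~ Binomial(total, 0.5)."""
    favourable = sum(comb(total, k) for k in range(positive, total + 1))
    return favourable / 2 ** total

p_value = sign_test_p_value(39, 40)
print(f"{p_value:.2e}")  # 3.73e-11
```

The probability equals 41 / 2^40, since only the outcomes with 39 or 40 positive cells count. A tally this lopsided is unlikely to be a coin-flip artifact, but it says nothing about the size or stability of the gap in any single cell.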

minor comments (2)

[Methods] Clarify the exact placeholder vocabulary and sampling procedure (e.g., whether placeholders are drawn from a fixed set or generated per question) to allow replication.
[Results] A figure or table showing per-model, per-benchmark accuracy for both P0 and P1 conditions would improve readability; the current aggregate deltas make it hard to inspect outliers.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments. We respond point-by-point to the major comments below.

Referee: [Methods / perturbation description] Perturbation validity (Methods / §3): the claim that placeholder replacement 'preserves the causal graph and probabilistic specification' is load-bearing for attributing accuracy drops to absence of structural reasoning. The manuscript provides insufficient detail on (a) how placeholders were selected to avoid tokenization boundary shifts or OOD format effects, (b) explicit controls comparing token-level statistics or parsing difficulty between P0 and P1 versions, and (c) whether the anonymized templates remain within the models' training distribution. Without these, drops could arise from non-lexical confounds even if models perform structural reasoning when anchors are present.

Authors: We agree that expanded documentation of the perturbation would strengthen the manuscript. In revision we will augment §3 with: (a) explicit description of placeholder selection (neutral tokens such as VAR1/VAR2 chosen to avoid boundary shifts and format changes), and (b) quantitative comparisons of token length, frequency, and parsing difficulty between P0 and P1 versions. For (c), we will add a limitations paragraph noting that direct verification against proprietary training corpora is not possible and will report any available proxies (e.g., perplexity).

revision: partial

Referee: [Results] Statistical reporting (Results / abstract): the abstract states 'consistent positive gaps in 39 of 40 cells' and specific deltas (+7.6, +27.0, +29.6 pp) but does not report per-cell standard errors, p-values, or confidence intervals. Given that the central claim rests on the robustness of these gaps across model sizes and benchmarks, the absence of significance testing or variance estimates weakens the ability to rule out noise, especially on smaller models where variance is typically higher.

Authors: We accept that variance estimates would improve transparency. In the revised manuscript we will add per-cell standard errors (or bootstrap intervals where feasible) to the Results section and will note them in the abstract. Because the evaluations are deterministic zero-shot on fixed items, traditional run-to-run variance is limited, but we will also report stability across prompt paraphrases and data subsets to address the concern.

revision: yes
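
The bootstrap intervals mentioned in this response could take the form of a paired percentile bootstrap over items. The following is a sketch under stated assumptions rather than the authors' procedure: the function name, the default of 2000 resamples, and the toy data are hypothetical. Items are resampled jointly for P0 and P1 so that each question stays paired with its anonymized version.

```python
import random

def paired_bootstrap_ci(p0_correct, p1_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the P0 - P1 accuracy gap (percentage points)."""
    if len(p0_correct) != len(p1_correct):
        raise ValueError("P0 and P1 must cover the same items")
    rng = random.Random(seed)
    n = len(p0_correct)
    gaps = []
    for _ in range(n_resamples):
        # Draw item indices with replacement; use the same draw for both conditions.
        idx = [rng.randrange(n) for _ in range(n)]
        acc0 = sum(p0_correct[i] for i in idx) / n
        acc1 = sum(p1_correct[i] for i in idx) / n
        gaps.append(100.0 * (acc0 - acc1))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy data for one hypothetical cell: 60% correct on P0, 40% on P1.
p0 = [True] * 60 + [False] * 40
p1 = [True] * 40 + [False] * 60
lower, upper = paired_bootstrap_ci(p0, p1)
print(lower, upper)
```

Because the evaluation is deterministic on fixed items, an interval of this kind reflects uncertainty from the finite item sample, not run-to-run noise.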

standing simulated objections not resolved

Direct assessment of whether anonymized templates lie inside each model's training distribution, which cannot be performed without access to proprietary training data.

Circularity Check

0 steps flagged

No circularity: purely empirical perturbation study with no derivations or self-referential quantities

The paper introduces Caliper as a controlled lexical perturbation on existing causal benchmarks (CLadder, CRASS, e-CARE) and reports accuracy drops across models. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on direct zero-shot measurements and the observation that the gap collapses on the pseudoword subset; these are falsifiable empirical results, not reductions to the paper's own inputs by construction. The perturbation's validity is an assumption open to external testing rather than a self-definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the perturbation cleanly isolates lexical anchors; no free parameters or invented entities are introduced in the abstract description.

axioms (1)

Domain assumption: The placeholder substitution preserves the causal graph and probabilistic specification exactly.
This is the defining property of the Caliper method stated in the abstract.

Reference graph

Works this paper leans on

68 extracted references · 6 canonical work pages

[2]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, and 1 others

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, and 1 others. 2020. https://arxiv.org/abs/2005.14165 Language models are few-shot learners . In Advances in Neural Information Processing Systems (NeurIPS)

Pith/arXiv arXiv 2020
[3]

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tram \`e r, and Chiyuan Zhang. 2023. https://openreview.net/forum?id=TatRHT_1cK Quantifying memorization across neural language models . In International Conference on Learning Representations (ICLR)

2023
[4]

Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. 2025. https://aclanthology.org/2025.emnlp-main.511 Benchmarking large language models under data contamination: A survey from static to dynamic evaluation . In Proceedings of the 2025 Conference on Empirical Methods in Natu...

2025
[5]

Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, and Bo Han. 2024. https://proceedings.neurips.cc/paper_files/paper/2024/hash/af2bb2b2280d36f8842e440b4e275152-Abstract-Conference.html Unveiling causal reasoning in large language models: Reality or mirage? In Advances in Neural Information Processing Systems (NeurIPS)

2024
[6]

DeepSeek-AI . 2024. https://arxiv.org/abs/2412.19437 DeepSeek-V3 technical report . Preprint, arXiv:2412.19437

Pith/arXiv arXiv 2024
[7]

Zahra Dehghanighobadi, Asja Fischer, and Muhammad Bilal Zafar. 2025. https://aclanthology.org/2025.emnlp-main.396 Can LLM s explain themselves counterfactually? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2025
[8]

Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. 2022. https://doi.org/10.18653/v1/2022.acl-long.33 e- CARE : A new dataset for exploring explainable causal reasoning . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 432--446

work page doi:10.18653/v1/2022.acl-long.33 2022
[9]

Tibshirani

Bradley Efron and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall/CRC

1993
[10]

Yanai Elazar, Hongming Zhang, Yoav Goldberg, and Dan Roth. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.819 Back to square one: Artifact detection, training and commonsense disentanglement in the W inograd schema . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)

work page doi:10.18653/v1/2021.emnlp-main.819 2021
[11]

Tao Feng, Lizhen Qu, Niket Tandon, Zhuang Li, Xiaoxi Kang, and Gholamreza Haffari. 2025. https://aclanthology.org/2025.acl-long.471 On the reliability of large language models for causal discovery . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)

2025
[12]

J \"o rg Frohberg and Frank Binder. 2022. https://aclanthology.org/2022.lrec-1.229 CRASS : A novel data set and benchmark to test counterfactual reasoning of large language models . In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 2126--2140

2022
[13]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, and 1 others. 2024. https://arxiv.org/abs/2407.21783 The Llama 3 herd of models . arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024
[14]

Choquette-Choo, Katherine Lee, and A

Jamie Hayes, Marika Swanberg, Harsh Chaudhari, Itay Yona, Ilia Shumailov, Milad Nasr, Christopher A. Choquette-Choo, Katherine Lee, and A. Feder Cooper. 2025. https://aclanthology.org/2025.naacl-long.469 Measuring memorization in language models via probabilistic extraction . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of ...

2025
[15]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, and 1 others. 2023. https://arxiv.org/abs/2310.06825 Mistral 7B . arXiv preprint arXiv:2310.06825

Pith/arXiv arXiv 2023
[16]

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Sch \"o lkopf. 2023. https://arxiv.org/abs/2312.04350 CL adder: Assessing causal reasoning in language models . In Advances in Neural Information Processing Systems (NeurIPS)

arXiv 2023
[18]

Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, Julian McAuley, Wei Ai, and Furong Huang. 2025. https://aclanthology.org/2025.findings-naacl.427 Large language models and causal inference in collaboration: A comprehensive survey . In Findings of the Association for Computational Lin...

2025
[19]

Inbal Magar and Roy Schwartz. 2022. https://doi.org/10.18653/v1/2022.acl-short.18 Data contamination: From memorization to exploitation . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers, pages 157--165

work page doi:10.18653/v1/2022.acl-short.18 2022
[20]

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. https://doi.org/10.18653/v1/P19-1334 Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference . In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3428--3448

work page doi:10.18653/v1/p19-1334 2019
[21]

Tarun Ram Menta, Susmit Agrawal, and Chirag Agarwal. 2025. https://aclanthology.org/2025.naacl-long.535 Analyzing memorization in large language models through the lens of model attribution . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)

2025
[22]

Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, and Alessandro Lenci. 2025. https://aclanthology.org/2025.findings-acl.891 ExpliCa : Evaluating explicit causal reasoning in large language models . In Findings of the Association for Computational Linguistics: ACL 2025

2025
[23]

Nous Research . 2024. https://arxiv.org/abs/2408.11857 Hermes 3 technical report . Preprint, arXiv:2408.11857

arXiv 2024
[24]

Judea Pearl. 2009. Causality: Models, Reasoning, and Inference, 2 edition. Cambridge University Press, Cambridge, UK

2009
[25]

Qwen Team . 2025. https://arxiv.org/abs/2412.15115 Qwen2.5 technical report . Preprint, arXiv:2412.15115

Pith/arXiv arXiv 2025
[26]

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. https://doi.org/10.18653/v1/2020.acl-main.442 Beyond accuracy: Behavioral testing of NLP models with C heck L ist . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4902--4912

work page doi:10.18653/v1/2020.acl-main.442 2020
[27]

Rahul Babu Shrestha, Simon Malberg, and Georg Groh. 2025. https://aclanthology.org/2025.nlp4dh-1.29 From causal parrots to causal prophets? T owards sound causal reasoning with large language models . In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities (NLP4DH)

2025
[28]

Longxuan Yu, Delin Chen, Siheng Xiong, Qingyang Wu, Dawei Li, Zhikai Chen, Xiaoze Liu, and Liangming Pan. 2025. https://aclanthology.org/2025.naacl-long.622 CausalEval : Towards better causal reasoning in language models . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)

2025
[29]

Matej Ze c evi \'c , Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. https://openreview.net/forum?id=tv46tCzs83 Causal parrots: Large language models may talk causality but are not causal . Transactions on Machine Learning Research

2023
[30]

Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, and Kay Chen Tan. 2024. https://arxiv.org/abs/2404.06349 CausalBench : A comprehensive benchmark for causal learning capability of LLM s . arXiv preprint arXiv:2404.06349

arXiv 2024
[31]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[32]

Publications Manual , year = "1983", publisher =

1983
[33]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[34]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[35]

Dan Gusfield , title =. 1997

1997
[36]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[37]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[38]

arXiv preprint arXiv:2404.14219 , year=

Phi-3 technical report: A highly capable language model locally on your phone , author=. arXiv preprint arXiv:2404.14219 , year=

Pith/arXiv arXiv
[39]

International Conference on Learning Representations (ICLR) , year=

Synthetic and natural noise both break neural machine translation , author=. International Conference on Learning Representations (ICLR) , year=
[40]

International Conference on Learning Representations (ICLR) , year=

Quantifying memorization across neural language models , author=. International Conference on Learning Representations (ICLR) , year=
[41]

Back to square one: Artifact detection, training and commonsense disentanglement in the

Elazar, Yanai and Zhang, Hongming and Goldberg, Yoav and Roth, Dan , booktitle=. Back to square one: Artifact detection, training and commonsense disentanglement in the. 2021 , doi=

2021
[42]

Nature Machine Intelligence , volume=

Shortcut learning in deep neural networks , author=. Nature Machine Intelligence , volume=. 2020 , doi=

2020
[43]

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) , pages=

Annotation artifacts in natural language inference data , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) , pages=. 2018 , doi=

2018
[44]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Jin, Zhijing and Chen, Yuen and Leeb, Felix and Gresele, Luigi and Kamal, Ojasv and Lyu, Zhiheng and Blin, Kevin and Adauto, Fernando Gonzalez and Kleiman-Weiner, Max and Sachan, Mrinmaya and Sch. Advances in Neural Information Processing Systems (NeurIPS) , year=
[45]

arXiv preprint arXiv:2305.00050 , year=

Causal reasoning and large language models: Opening a new frontier for causality , author=. arXiv preprint arXiv:2305.00050 , year=

arXiv
[46]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers , pages=

Data contamination: From memorization to exploitation , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers , pages=. 2022 , doi=

2022
[47]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) , pages=

Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) , pages=. 2019 , doi=

2019
[48]

Beyond accuracy: Behavioral testing of

Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer , booktitle=. Beyond accuracy: Behavioral testing of. 2020 , doi=

2020
[49]

Transactions on Machine Learning Research , year=

Causal parrots: Large language models may talk causality but are not causal , author=. Transactions on Machine Learning Research , year=
[50]

Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC) , pages=

Frohberg, J. Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC) , pages=. 2022 , url=

2022
[51]

Du, Li and Ding, Xiao and Xiong, Kai and Liu, Ting and Qin, Bing , booktitle=. e-. 2022 , doi=

2022
[52]

Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane , year=
[53]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Language models are few-shot learners , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
[54]

2024 , eprint=

Hermes 3 technical report , author=. 2024 , eprint=

2024
[55]

2009 , publisher =

Causality: Models, Reasoning, and Inference , author =. 2009 , publisher =

2009
[56]

Biometrics Bulletin , volume =

Wilcoxon, Frank , title =. Biometrics Bulletin , volume =. 1945 , doi =

1945
[57]

, title =

Efron, Bradley and Tibshirani, Robert J. , title =. 1993 , isbn =

1993
[58]

Jiang, Albert Q. and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and others , journal =. 2023 , url =

2023
[59]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and others , journal =. The. 2024 , url =

2024
[60]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[61]

2025 , url =

Yu, Longxuan and Chen, Delin and Xiong, Siheng and Wu, Qingyang and Li, Dawei and Chen, Zhikai and Liu, Xiaoze and Pan, Liangming , booktitle =. 2025 , url =

2025
[62]

From Causal Parrots to Causal Prophets?

Shrestha, Rahul Babu and Malberg, Simon and Groh, Georg , booktitle =. From Causal Parrots to Causal Prophets?. 2025 , url =

2025
[63]

Findings of the Association for Computational Linguistics: NAACL 2025 , year =

Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey , author =. Findings of the Association for Computational Linguistics: NAACL 2025 , year =

2025
[64]

2024 , url =

Zhou, Yu and Wu, Xingyu and Huang, Beicheng and Wu, Jibin and Feng, Liang and Tan, Kay Chen , journal =. 2024 , url =

2024
[65]

2025 , url =

Miliani, Martina and Auriemma, Serena and Bondielli, Alessandro and Chersoni, Emmanuele and Passaro, Lucia and Sucameli, Irene and Lenci, Alessandro , booktitle =. 2025 , url =

2025
[66]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2025
[67]

Dehghanighobadi, Zahra and Fischer, Asja and Zafar, Muhammad Bilal , booktitle =. Can. 2025 , url =

2025
[68]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

On the Reliability of Large Language Models for Causal Discovery , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , year =
[69]

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) , year =

Analyzing Memorization in Large Language Models through the Lens of Model Attribution , author =. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) , year =

2025
[70]

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) , year =

Measuring memorization in language models via probabilistic extraction , author =. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) , year =

2025

[1] [2]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, and 1 others

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, and 1 others. 2020. https://arxiv.org/abs/2005.14165 Language models are few-shot learners . In Advances in Neural Information Processing Systems (NeurIPS)

Pith/arXiv arXiv 2020

[2] [3]

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tram \`e r, and Chiyuan Zhang. 2023. https://openreview.net/forum?id=TatRHT_1cK Quantifying memorization across neural language models . In International Conference on Learning Representations (ICLR)

2023

[3] [4]

Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. 2025. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natu... https://aclanthology.org/2025.emnlp-main.511

[4] [5]

Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, and Bo Han. 2024. Unveiling causal reasoning in large language models: Reality or mirage? In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2024/hash/af2bb2b2280d36f8842e440b4e275152-Abstract-Conference.html

[5] [6]

DeepSeek-AI. 2024. DeepSeek-V3 technical report. Preprint, arXiv:2412.19437. https://arxiv.org/abs/2412.19437

[6] [7]

Zahra Dehghanighobadi, Asja Fischer, and Muhammad Bilal Zafar. 2025. Can LLMs explain themselves counterfactually? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://aclanthology.org/2025.emnlp-main.396

[7] [8]

Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. 2022. e-CARE: A new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 432–446. https://doi.org/10.18653/v1/2022.acl-long.33

[8] [9]

Bradley Efron and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall/CRC.

[9] [10]

Yanai Elazar, Hongming Zhang, Yoav Goldberg, and Dan Roth. 2021. Back to square one: Artifact detection, training and commonsense disentanglement in the Winograd schema. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.18653/v1/2021.emnlp-main.819

[10] [11]

Tao Feng, Lizhen Qu, Niket Tandon, Zhuang Li, Xiaoxi Kang, and Gholamreza Haffari. 2025. On the reliability of large language models for causal discovery. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). https://aclanthology.org/2025.acl-long.471

[11] [12]

Jörg Frohberg and Frank Binder. 2022. CRASS: A novel data set and benchmark to test counterfactual reasoning of large language models. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 2126–2140. https://aclanthology.org/2022.lrec-1.229

[12] [13]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. https://arxiv.org/abs/2407.21783

[13] [14]

Jamie Hayes, Marika Swanberg, Harsh Chaudhari, Itay Yona, Ilia Shumailov, Milad Nasr, Christopher A. Choquette-Choo, Katherine Lee, and A. Feder Cooper. 2025. Measuring memorization in language models via probabilistic extraction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of ... https://aclanthology.org/2025.naacl-long.469

[14] [15]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, and 1 others. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825. https://arxiv.org/abs/2310.06825

[15] [16]

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. 2023. CLadder: Assessing causal reasoning in language models. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2312.04350

[16] [18]

Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, Julian McAuley, Wei Ai, and Furong Huang. 2025. Large language models and causal inference in collaboration: A comprehensive survey. In Findings of the Association for Computational Lin... https://aclanthology.org/2025.findings-naacl.427

[17] [19]

Inbal Magar and Roy Schwartz. 2022. Data contamination: From memorization to exploitation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers, pages 157–165. https://doi.org/10.18653/v1/2022.acl-short.18

[18] [20]

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3428–3448. https://doi.org/10.18653/v1/P19-1334

[19] [21]

Tarun Ram Menta, Susmit Agrawal, and Chirag Agarwal. 2025. Analyzing memorization in large language models through the lens of model attribution. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). https://aclanthology.org/2025.naacl-long.535

[20] [22]

Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, and Alessandro Lenci. 2025. ExpliCa: Evaluating explicit causal reasoning in large language models. In Findings of the Association for Computational Linguistics: ACL 2025. https://aclanthology.org/2025.findings-acl.891

[21] [23]

Nous Research. 2024. Hermes 3 technical report. Preprint, arXiv:2408.11857. https://arxiv.org/abs/2408.11857

[22] [24]

Judea Pearl. 2009. Causality: Models, Reasoning, and Inference, 2nd edition. Cambridge University Press, Cambridge, UK.

[23] [25]

Qwen Team. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115. https://arxiv.org/abs/2412.15115

[24] [26]

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4902–4912. https://doi.org/10.18653/v1/2020.acl-main.442

[25] [27]

Rahul Babu Shrestha, Simon Malberg, and Georg Groh. 2025. From causal parrots to causal prophets? Towards sound causal reasoning with large language models. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities (NLP4DH). https://aclanthology.org/2025.nlp4dh-1.29

[26] [28]

Longxuan Yu, Delin Chen, Siheng Xiong, Qingyang Wu, Dawei Li, Zhikai Chen, Xiaoze Liu, and Liangming Pan. 2025. CausalEval: Towards better causal reasoning in language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). https://aclanthology.org/2025.naacl-long.622

[27] [29]

Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal parrots: Large language models may talk causality but are not causal. Transactions on Machine Learning Research. https://openreview.net/forum?id=tv46tCzs83

[28] [30]

Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, and Kay Chen Tan. 2024. CausalBench: A comprehensive benchmark for causal learning capability of LLMs. arXiv preprint arXiv:2404.06349. https://arxiv.org/abs/2404.06349

[36] [38]

Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.

[37] [39]

Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations (ICLR).

[40] [42]

2020. Shortcut learning in deep neural networks. Nature Machine Intelligence.

[41] [43]

2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

[43] [45]

Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050.

[50] [52]

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd.

[54] [56]

Frank Wilcoxon. 1945. Biometrics Bulletin.
