pith. machine review for the scientific record.

arxiv: 2605.09079 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 1 theorem link

· Lean Theorem

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords causal reasoning · structural causal models · large language models · causal simulators · data augmentation · curriculum scaling · self-improvement · causal queries

The pith

CauSim lets LLMs incrementally build executable causal simulators that scale while keeping query answers verifiable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CauSim to solve the scarcity of labeled data for training large language models on causal reasoning. It does so by having the models themselves construct executable structural causal models step by step, growing their complexity while preserving the ability to verify answers to causal questions. This process converts between non-executable causal descriptions and code, which supports data augmentation from existing knowledge and supervision in representations that were previously difficult to use. Experiments demonstrate that performance improves with greater simulator complexity and data volume, that models can improve themselves using the data they generate, and that the approach generalizes across different forms of causal knowledge.

Core claim

CauSim constructs increasingly complex causal simulators as executable structural causal models built incrementally by LLMs. These simulators reach globally complex systems while maintaining verifiable ground-truth answers to causal queries. The framework formalizes non-executable causal knowledge into code for augmentation and translates the simulators back into natural language to enable supervision where it was previously scarce.

What carries the argument

CauSim, a framework that has LLMs incrementally construct executable structural causal models to produce scalable simulators with verifiable causal query answers.
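
To make the load-bearing object concrete, here is a minimal sketch of what one of these executable SCMs could look like; the node names, the run_once driver, and the interventions argument are illustrative stand-ins, not the authors' generated code.

    # Toy executable SCM: exogenous samplers (U_*), structural mechanisms (f_*),
    # and a driver that answers interventional queries by direct execution.
    import random

    def U_rain(rng):                       # exogenous sampler
        return rng.random() < 0.3

    def U_sprinkler(rng):                  # exogenous sampler
        return rng.random() < 0.5

    def f_wet(rain, sprinkler):            # structural mechanism
        return rain or sprinkler

    def run_once(seed, interventions=None):
        # interventions maps a node name to a forced value (the do-operator).
        rng = random.Random(seed)
        interventions = interventions or {}
        rain = interventions.get("rain", U_rain(rng))
        sprinkler = interventions.get("sprinkler", U_sprinkler(rng))
        wet = interventions.get("wet", f_wet(rain, sprinkler))
        return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

    # Reusing the same seed fixes the exogenous noise, so the second call gives the
    # outcome under do(sprinkler := True), checkable by execution alone.
    factual = run_once(seed=0)
    intervened = run_once(seed=0, interventions={"sprinkler": True})

Because the label comes from running code rather than from a model's reasoning, the answer stays verifiable however large the simulator grows.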

If this is right

  • Training LLMs on data generated from these simulators produces consistent gains in causal reasoning performance.
  • Performance continues to improve as simulator complexity increases through curriculum ordering and as data volume grows (a minimal curriculum sketch follows this list).
  • LLMs achieve self-improvement on causal tasks by using simulators they have generated themselves.
  • Non-executable domain knowledge can be formalized into executable simulators to augment training data.
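
The curriculum-ordering bullet above can be read as staging fine-tuning data by simulator size. A minimal sketch under that reading, where the Example record and the fine_tune call are hypothetical placeholders rather than the paper's API:

    # Complexity curriculum over simulator-derived training examples (illustrative).
    from dataclasses import dataclass

    @dataclass
    class Example:
        prompt: str        # natural-language causal query rendered from an SCM
        answer: str        # ground-truth label obtained by executing that SCM
        num_nodes: int     # simulator complexity used for curriculum ordering

    def curriculum_batches(examples, levels):
        # Yield batches in order of increasing simulator complexity.
        for level in sorted(levels):
            batch = [ex for ex in examples if ex.num_nodes == level]
            if batch:
                yield level, batch

    # for level, batch in curriculum_batches(examples, levels=range(1, 11)):
    #     fine_tune(model, batch)    # hypothetical fine-tuning step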

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same incremental construction process could be tested as a way to bootstrap causal models from observational datasets in specific application areas.
  • Generated simulators could serve as controlled environments for measuring the limits of current causal reasoning techniques before applying them to real systems.
  • Detecting and correcting construction errors during the incremental build would allow the method to reach even larger scales.

Load-bearing premise

Large language models can incrementally construct these complex simulators without introducing errors that invalidate the ground-truth causal relationships or the correctness of query answers.

What would settle it

Direct verification that answers to causal queries computed from the generated simulator match the answers derived from its underlying causal graph structure, particularly as the number of variables and relations grows.
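
One concrete form such a check could take, assuming the toy run_once driver sketched earlier and an explicit adjacency structure for the generated graph: under do(X := x) with fixed exogenous noise, every node that is neither X nor a descendant of X must keep its factual value. The helper names below are hypothetical.

    # Locality check: an intervention may only change the intervened node and its
    # descendants; graph maps each node to its list of children.
    def descendants(graph, node):
        seen, stack = set(), list(graph.get(node, []))
        while stack:
            child = stack.pop()
            if child not in seen:
                seen.add(child)
                stack.extend(graph.get(child, []))
        return seen

    def check_intervention_locality(run_once, graph, node, value, seed=0):
        factual = run_once(seed)
        intervened = run_once(seed, interventions={node: value})
        affected = descendants(graph, node) | {node}
        return all(factual[v] == intervened[v] for v in factual if v not in affected)

    # With the toy SCM above: graph = {"rain": ["wet"], "sprinkler": ["wet"], "wet": []}
    # check_intervention_locality(run_once, graph, "sprinkler", True) should be True.

Running such a battery while the number of variables and relations grows is exactly the fidelity evidence the referee asks for below.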

Figures

Figures reproduced from arXiv: 2605.09079 by Anita Kriz, Mihaela van der Schaar, Nicolás Astorga.

Figure 1. Overview of CauSim. (Left) Causal reasoning faces three challenges: [C1] scale, [C2] non-executable representations, and [C3] lack of ground-truth supervision. (Right) CauSim addresses these through increasingly complex causal simulators: SCMs are built incrementally to scale complexity [C1]; non-executable causal knowledge is formalized into executable code and executable SCMs are informalized into natural…
Figure 2. Node-by-node generation improves SCM generation at scale. Method. We compare two methods for constructing an SCM with n structural mechanisms and n corresponding exogenous samplers as executable code: (i) one-shot generation, where the LLM is prompted to output the full SCM in a single specification step, and (ii) incremental generation, where the LLM first specifies a 1-node SCM and then incrementally grows it by adding…
Figure 3. Improving Causal Reasoning Across Distributions. General setup. Unless otherwise noted, the following setup is held constant across experiments. Data. We use our incremental SCM generation process (Sec. 4) to create a training set of Python-executable SCMs over 4-letter meaningless variables, which we term the NONSENSE-code dataset. We use GPT-5 to generate SCMs and enforce three topologies: inverted stars,…
Figure 4. Curriculum Matters. [Plot: pass@1 (0.0-0.4) versus complexity levels 4-9 for intervention and counterfactual queries; remaining legend text: base, 5, 15, 45, 135.]
Figure 6. Self-Improvement. Motivation. A natural question is whether a model can use the framework to bootstrap its own causal reasoning ability by training on its own generated data. A related question is how much performance depends on the quality of the generator used to produce SCMs, and whether self-generated SCMs are as effective as those produced by stronger models. Method. We study self-improvement by varying…
Figure 7. Data Augmentation. Motivation. In many real-world settings, data is generated by an underlying, unobserved SCM. When such domain knowledge can be formalized, even approximately, it can simulate interventional and counterfactual data, providing a principled way to expand causal supervision beyond the original dataset. Method. We construct a controlled "original" dataset derived from the NIH Stroke Scale (NI…
Figure 8. Training on formal causal queries. Pass@1 accuracy by causal query type for models trained with and without simulator-generated causal queries. Fine-tuning with causal-simulator data consistently improves performance over the base model. (The extracted caption runs into appendix section M.2, Evaluation of Causal-Simulator Data Augmentation: we evaluate the value of transforming an existing informal SCM into executable form, automatically generating addition…)
Figure 9. Joint results across all causal query types. Results for deduction, abduction, intervention, and counterfactual queries.
Figure 10. Ablation varying the amount of training data. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 1-10.
Figure 11. Ablation varying the amount of training data. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 4-9.
Figure 12. Ablation varying the curriculum. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 1-10.
Figure 13. Ablation varying the curriculum. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 4-9.
read the original abstract

Despite surpassing human performance across mathematics, coding, and other knowledge-intensive tasks, large language models (LLMs) continue to struggle with causal reasoning. A core obstacle is the target data itself: causal systems are complex and often expressed in non-executable forms, while ground-truth answers to causal queries are inherently scarce. We introduce CauSim, a framework that turns causal reasoning from a scarce-label problem into a scalable supervised one. CauSim constructs increasingly complex causal simulators: executable structural causal models (SCMs), incrementally built by LLMs, that scale to globally complex systems while maintaining verifiable answers to causal queries. CauSim operates across representations by formalizing non-executable causal knowledge into code, enabling data augmentation, and translating executable SCMs into natural language, enabling supervision in previously difficult-to-supervise representations. We structure our research into two parts: (1) how to construct increasingly complex causal simulators, and (2) a systematic study of what CauSim enables, demonstrating generalization across representations, consistent gains from curriculum scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CauSim, a framework that uses LLMs to incrementally construct executable structural causal models (SCMs) as simulators for causal reasoning. These simulators scale in complexity while providing verifiable ground-truth answers to causal queries. The approach formalizes non-executable causal knowledge into code for data augmentation and translates SCMs back to natural language for supervision. Research is structured in two parts: (1) methods for building increasingly complex causal simulators and (2) empirical studies demonstrating generalization across representations, gains from curriculum scaling and data volume, LLM self-improvement via self-generated simulators, and augmentation from domain knowledge.

Significance. If the central claims hold, CauSim would address the scarcity of causal data and supervision signals for LLMs by converting causal reasoning into a scalable supervised learning problem with independent executable ground truth. The curriculum scaling, cross-representation generalization, and self-improvement results could represent a meaningful advance. The use of executable SCMs is a strength for verifiability, though the self-improvement component carries a noted risk of circularity that requires careful validation.

major comments (2)
  1. The abstract and the Part (1) description of incremental SCM construction by LLMs claim scaling to globally complex systems while maintaining verifiable causal query answers, but without detailed algorithms for error detection, propagation analysis, or empirical fidelity metrics in the provided text, it is unclear whether construction errors invalidate ground-truth relationships. This is load-bearing for the core claim of reliable simulators.
  2. In the self-improvement study (Part 2), the loop of LLMs generating and training on their own simulators risks circularity if verification of causal correctness ultimately depends on the same model capabilities rather than fully independent executable checks. Concrete ablation results separating model-generated data from external validation would be needed to support the self-improvement claim.
minor comments (1)
  1. The abstract would benefit from explicit definitions or metrics for 'increasingly complex' and 'globally complex systems' to make the scaling claims more precise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on the mechanisms that support reliable simulator construction and independent verification.

read point-by-point responses
  1. Referee: The abstract and the Part (1) description of incremental SCM construction by LLMs claim scaling to globally complex systems while maintaining verifiable causal query answers, but without detailed algorithms for error detection, propagation analysis, or empirical fidelity metrics in the provided text, it is unclear whether construction errors invalidate ground-truth relationships. This is load-bearing for the core claim of reliable simulators.

    Authors: We agree that the manuscript would benefit from greater explicitness on these aspects. The incremental construction procedure incorporates execution-based consistency checks after each addition of a variable or edge: a battery of causal queries is run on the updated executable SCM and compared against results from the prior verified state. Inconsistencies arising from construction errors are rejected before the simulator is accepted. While the current text describes this process at a high level, we will add a dedicated subsection with pseudocode for the error-detection routine, a formal propagation analysis, and quantitative fidelity metrics (e.g., query-consistency rates across build steps) in the revised version. revision: yes

  2. Referee: In the self-improvement study (Part 2), the loop of LLMs generating and training on their own simulators risks circularity if verification of causal correctness ultimately depends on the same model capabilities rather than fully independent executable checks. Concrete ablation results separating model-generated data from external validation would be needed to support the self-improvement claim.

    Authors: Verification of each generated simulator occurs exclusively through direct execution of its code, which supplies ground-truth answers independently of any LLM reasoning. The self-improvement experiments already include controls that train on simulators built from external domain knowledge and compare against purely self-generated ones; performance gains remain when evaluation is restricted to execution-derived labels. To further address the circularity concern, we will expand the results with additional ablations that explicitly isolate model-generated data from any model-assisted validation steps. revision: partial
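
A minimal sketch of the execution-only grading and per-step consistency checking described in these responses; execution_label, grade, consistent_with_previous, and the query dictionaries are hypothetical names, not the authors' implementation.

    # Ground truth comes from running the SCM code, independent of any LLM output.
    def execution_label(run_once, query):
        out = run_once(query["seed"], interventions=query.get("do"))
        return out[query["target"]]

    def grade(llm_answer, run_once, query):
        return llm_answer == execution_label(run_once, query)

    # Incremental-build check: after a node is added, the new SCM version must agree
    # with the previously verified version on a battery of queries over shared nodes.
    def consistent_with_previous(run_prev, run_new, queries):
        return all(
            run_prev(q["seed"], interventions=q.get("do"))[q["target"]]
            == run_new(q["seed"], interventions=q.get("do"))[q["target"]]
            for q in queries
        )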

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided manuscript text consists only of the abstract and a placeholder for the full paper, with no equations, self-citations, or derivation steps that reduce any claim to its own inputs by construction. The framework's core elements—LLM-driven construction of executable SCMs, formalization across representations, and curriculum scaling—are presented as independent mechanisms that generate verifiable ground truth via code execution rather than through self-referential fitting or imported uniqueness theorems. No load-bearing step matches the enumerated circularity patterns, and the self-improvement aspect is described as enabled by the external executability of the simulators, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that structural causal models can be made executable to yield verifiable answers, and that LLMs can build them incrementally without breaking this property. No free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: Structural causal models can be represented as executable programs that provide verifiable ground-truth answers to causal queries.
    This assumption enables the conversion of causal knowledge into scalable supervised data.
invented entities (1)
  • CauSim framework · no independent evidence
    purpose: To construct and scale complex causal simulators for training
    Newly introduced method; no independent evidence provided beyond the framework description itself.

pith-pipeline@v0.9.0 · 5504 in / 1302 out tokens · 48227 ms · 2026-05-12T02:29:23.746767+00:00 · methodology

discussion (0)

