pith. machine review for the scientific record.

arxiv: 2605.09079 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 1 theorem link

· Lean Theorem

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords causal reasoning · structural causal models · large language models · causal simulators · data augmentation · curriculum scaling · self-improvement · causal queries

The pith

CauSim lets LLMs incrementally build executable causal simulators that scale while keeping query answers verifiable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CauSim to solve the scarcity of labeled data for training large language models on causal reasoning. It does so by having the models themselves construct executable structural causal models step by step, growing their complexity while preserving the ability to verify answers to causal questions. This process converts between non-executable causal descriptions and code, which supports data augmentation from existing knowledge and supervision in representations that were previously difficult to use. Experiments demonstrate that performance improves with greater simulator complexity and data volume, that models can improve themselves using the data they generate, and that the approach generalizes across different forms of causal knowledge.

Core claim

CauSim constructs increasingly complex causal simulators as executable structural causal models built incrementally by LLMs. These simulators reach globally complex systems while maintaining verifiable ground-truth answers to causal queries. The framework formalizes non-executable causal knowledge into code for augmentation and translates the simulators back into natural language to enable supervision where it was previously scarce.

What carries the argument

CauSim, a framework that has LLMs incrementally construct executable structural causal models to produce scalable simulators with verifiable causal query answers.
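
To make the load-bearing object concrete, here is a minimal sketch of what one of these executable SCMs could look like; the node names, the run_once driver, and the interventions argument are illustrative stand-ins, not the authors' generated code.

    # Toy executable SCM: exogenous samplers (U_*), structural mechanisms (f_*),
    # and a driver that answers interventional queries by direct execution.
    import random

    def U_rain(rng):                       # exogenous sampler
        return rng.random() < 0.3

    def U_sprinkler(rng):                  # exogenous sampler
        return rng.random() < 0.5

    def f_wet(rain, sprinkler):            # structural mechanism
        return rain or sprinkler

    def run_once(seed, interventions=None):
        # interventions maps a node name to a forced value (the do-operator).
        rng = random.Random(seed)
        interventions = interventions or {}
        rain = interventions.get("rain", U_rain(rng))
        sprinkler = interventions.get("sprinkler", U_sprinkler(rng))
        wet = interventions.get("wet", f_wet(rain, sprinkler))
        return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

    # Reusing the same seed fixes the exogenous noise, so the second call gives the
    # outcome under do(sprinkler := True), checkable by execution alone.
    factual = run_once(seed=0)
    intervened = run_once(seed=0, interventions={"sprinkler": True})

Because the label comes from running code rather than from a model's reasoning, the answer stays verifiable however large the simulator grows.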

If this is right

  • Training LLMs on data generated from these simulators produces consistent gains in causal reasoning performance.
  • Performance continues to improve as simulator complexity increases through curriculum ordering and as data volume grows (a minimal curriculum sketch follows this list).
  • LLMs achieve self-improvement on causal tasks by using simulators they have generated themselves.
  • Non-executable domain knowledge can be formalized into executable simulators to augment training data.
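
The curriculum-ordering bullet above can be read as staging fine-tuning data by simulator size. A minimal sketch under that reading, where the Example record and the fine_tune call are hypothetical placeholders rather than the paper's API:

    # Complexity curriculum over simulator-derived training examples (illustrative).
    from dataclasses import dataclass

    @dataclass
    class Example:
        prompt: str        # natural-language causal query rendered from an SCM
        answer: str        # ground-truth label obtained by executing that SCM
        num_nodes: int     # simulator complexity used for curriculum ordering

    def curriculum_batches(examples, levels):
        # Yield batches in order of increasing simulator complexity.
        for level in sorted(levels):
            batch = [ex for ex in examples if ex.num_nodes == level]
            if batch:
                yield level, batch

    # for level, batch in curriculum_batches(examples, levels=range(1, 11)):
    #     fine_tune(model, batch)    # hypothetical fine-tuning step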

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same incremental construction process could be tested as a way to bootstrap causal models from observational datasets in specific application areas.
  • Generated simulators could serve as controlled environments for measuring the limits of current causal reasoning techniques before applying them to real systems.
  • Detecting and correcting construction errors during the incremental build would allow the method to reach even larger scales.

Load-bearing premise

Large language models can incrementally construct these complex simulators without introducing errors that invalidate the ground-truth causal relationships or the correctness of query answers.

What would settle it

Direct verification that answers to causal queries computed from the generated simulator match the answers derived from its underlying causal graph structure, particularly as the number of variables and relations grows.
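
One concrete form such a check could take, assuming the toy run_once driver sketched earlier and an explicit adjacency structure for the generated graph: under do(X := x) with fixed exogenous noise, every node that is neither X nor a descendant of X must keep its factual value. The helper names below are hypothetical.

    # Locality check: an intervention may only change the intervened node and its
    # descendants; graph maps each node to its list of children.
    def descendants(graph, node):
        seen, stack = set(), list(graph.get(node, []))
        while stack:
            child = stack.pop()
            if child not in seen:
                seen.add(child)
                stack.extend(graph.get(child, []))
        return seen

    def check_intervention_locality(run_once, graph, node, value, seed=0):
        factual = run_once(seed)
        intervened = run_once(seed, interventions={node: value})
        affected = descendants(graph, node) | {node}
        return all(factual[v] == intervened[v] for v in factual if v not in affected)

    # With the toy SCM above: graph = {"rain": ["wet"], "sprinkler": ["wet"], "wet": []}
    # check_intervention_locality(run_once, graph, "sprinkler", True) should be True.

Running such a battery while the number of variables and relations grows is exactly the fidelity evidence the referee asks for below.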

Figures

Figures reproduced from arXiv: 2605.09079 by Anita Kriz, Mihaela van der Schaar, Nicolás Astorga.

Figure 1. Overview of CauSim. (Left) Causal reasoning faces three challenges: [C1] scale, [C2] non-executable representations, and [C3] lack of ground-truth supervision. (Right) CauSim addresses these through increasingly complex causal simulators: SCMs are built incrementally to scale complexity [C1]; non-executable causal knowledge is formalized into executable code and executable SCMs are informalized into natural…
Figure 2. Node-by-node generation improves SCM generation at scale. Method. We compare two methods for constructing an SCM with n structural mechanisms and n corresponding exogenous samplers as executable code: (i) one-shot generation, where the LLM is prompted to output the full SCM in a single specification step, and (ii) incremental generation, where the LLM first specifies a 1-node SCM and then incrementally grows it by adding…
Figure 3. Improving Causal Reasoning Across Distributions. General setup. Unless otherwise noted, the following setup is held constant across experiments. Data. We use our incremental SCM generation process (Sec. 4) to create a training set of Python-executable SCMs over 4-letter meaningless variables, which we term the NONSENSE-code dataset. We use GPT-5 to generate SCMs and enforce three topologies: inverted stars,…
Figure 4. Curriculum Matters. [Plot: pass@1 (0.0-0.4) versus complexity levels 4-9 for intervention and counterfactual queries; remaining legend text: base, 5, 15, 45, 135.]
Figure 6. Self-Improvement. Motivation. A natural question is whether a model can use the framework to bootstrap its own causal reasoning ability by training on its own generated data. A related question is how much performance depends on the quality of the generator used to produce SCMs, and whether self-generated SCMs are as effective as those produced by stronger models. Method. We study self-improvement by varying…
Figure 7. Data Augmentation. Motivation. In many real-world settings, data is generated by an underlying, unobserved SCM. When such domain knowledge can be formalized, even approximately, it can simulate interventional and counterfactual data, providing a principled way to expand causal supervision beyond the original dataset. Method. We construct a controlled "original" dataset derived from the NIH Stroke Scale (NI…
Figure 8. Training on formal causal queries. Pass@1 accuracy by causal query type for models trained with and without simulator-generated causal queries. Fine-tuning with causal-simulator data consistently improves performance over the base model. (The extracted caption runs into appendix section M.2, Evaluation of Causal-Simulator Data Augmentation: we evaluate the value of transforming an existing informal SCM into executable form, automatically generating addition…)
Figure 9. Joint results across all causal query types. Results for deduction, abduction, intervention, and counterfactual queries.
Figure 10. Ablation varying the amount of training data. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 1-10.
Figure 11. Ablation varying the amount of training data. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 4-9.
Figure 12. Ablation varying the curriculum. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 1-10.
Figure 13. Ablation varying the curriculum. Test results on NONSENSE-formal SCMs for intervention and counterfactual queries, averaged across structures. Models are trained jointly on counterfactual and intervention causal queries. We show levels 4-9.
read the original abstract

Despite surpassing human performance across mathematics, coding, and other knowledge-intensive tasks, large language models (LLMs) continue to struggle with causal reasoning. A core obstacle is the target data itself: causal systems are complex and often expressed in non-executable forms, while ground-truth answers to causal queries are inherently scarce. We introduce CauSim, a framework that turns causal reasoning from a scarce-label problem into a scalable supervised one. CauSim constructs increasingly complex causal simulators: executable structural causal models (SCMs), incrementally built by LLMs, that scale to globally complex systems while maintaining verifiable answers to causal queries. CauSim operates across representations by formalizing non-executable causal knowledge into code, enabling data augmentation, and translating executable SCMs into natural language, enabling supervision in previously difficult-to-supervise representations. We structure our research into two parts: (1) how to construct increasingly complex causal simulators, and (2) a systematic study of what CauSim enables, demonstrating generalization across representations, consistent gains from curriculum scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CauSim, a framework that uses LLMs to incrementally construct executable structural causal models (SCMs) as simulators for causal reasoning. These simulators scale in complexity while providing verifiable ground-truth answers to causal queries. The approach formalizes non-executable causal knowledge into code for data augmentation and translates SCMs back to natural language for supervision. Research is structured in two parts: (1) methods for building increasingly complex causal simulators and (2) empirical studies demonstrating generalization across representations, gains from curriculum scaling and data volume, LLM self-improvement via self-generated simulators, and augmentation from domain knowledge.

Significance. If the central claims hold, CauSim would address the scarcity of causal data and supervision signals for LLMs by converting causal reasoning into a scalable supervised learning problem with independent executable ground truth. The curriculum scaling, cross-representation generalization, and self-improvement results could represent a meaningful advance. The use of executable SCMs is a strength for verifiability, though the self-improvement component carries a noted risk of circularity that requires careful validation.

major comments (2)
  1. The abstract and the Part (1) description of incremental SCM construction by LLMs claim scaling to globally complex systems while maintaining verifiable causal query answers, but without detailed algorithms for error detection, propagation analysis, or empirical fidelity metrics in the provided text, it is unclear whether construction errors invalidate ground-truth relationships. This is load-bearing for the core claim of reliable simulators.
  2. In the self-improvement study (Part 2), the loop of LLMs generating and training on their own simulators risks circularity if verification of causal correctness ultimately depends on the same model capabilities rather than fully independent executable checks. Concrete ablation results separating model-generated data from external validation would be needed to support the self-improvement claim.
minor comments (1)
  1. The abstract would benefit from explicit definitions or metrics for 'increasingly complex' and 'globally complex systems' to make the scaling claims more precise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on the mechanisms that support reliable simulator construction and independent verification.

read point-by-point responses
  1. Referee: The abstract and the Part (1) description of incremental SCM construction by LLMs claim scaling to globally complex systems while maintaining verifiable causal query answers, but without detailed algorithms for error detection, propagation analysis, or empirical fidelity metrics in the provided text, it is unclear whether construction errors invalidate ground-truth relationships. This is load-bearing for the core claim of reliable simulators.

    Authors: We agree that the manuscript would benefit from greater explicitness on these aspects. The incremental construction procedure incorporates execution-based consistency checks after each addition of a variable or edge: a battery of causal queries is run on the updated executable SCM and compared against results from the prior verified state. Inconsistencies arising from construction errors are rejected before the simulator is accepted. While the current text describes this process at a high level, we will add a dedicated subsection with pseudocode for the error-detection routine, a formal propagation analysis, and quantitative fidelity metrics (e.g., query-consistency rates across build steps) in the revised version. revision: yes

  2. Referee: In the self-improvement study (Part 2), the loop of LLMs generating and training on their own simulators risks circularity if verification of causal correctness ultimately depends on the same model capabilities rather than fully independent executable checks. Concrete ablation results separating model-generated data from external validation would be needed to support the self-improvement claim.

    Authors: Verification of each generated simulator occurs exclusively through direct execution of its code, which supplies ground-truth answers independently of any LLM reasoning. The self-improvement experiments already include controls that train on simulators built from external domain knowledge and compare against purely self-generated ones; performance gains remain when evaluation is restricted to execution-derived labels. To further address the circularity concern, we will expand the results with additional ablations that explicitly isolate model-generated data from any model-assisted validation steps. revision: partial
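
A minimal sketch of the execution-only grading and per-step consistency checking described in these responses; execution_label, grade, consistent_with_previous, and the query dictionaries are hypothetical names, not the authors' implementation.

    # Ground truth comes from running the SCM code, independent of any LLM output.
    def execution_label(run_once, query):
        out = run_once(query["seed"], interventions=query.get("do"))
        return out[query["target"]]

    def grade(llm_answer, run_once, query):
        return llm_answer == execution_label(run_once, query)

    # Incremental-build check: after a node is added, the new SCM version must agree
    # with the previously verified version on a battery of queries over shared nodes.
    def consistent_with_previous(run_prev, run_new, queries):
        return all(
            run_prev(q["seed"], interventions=q.get("do"))[q["target"]]
            == run_new(q["seed"], interventions=q.get("do"))[q["target"]]
            for q in queries
        )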

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided manuscript text consists only of the abstract and a placeholder for the full paper, with no equations, self-citations, or derivation steps that reduce any claim to its own inputs by construction. The framework's core elements—LLM-driven construction of executable SCMs, formalization across representations, and curriculum scaling—are presented as independent mechanisms that generate verifiable ground truth via code execution rather than through self-referential fitting or imported uniqueness theorems. No load-bearing step matches the enumerated circularity patterns, and the self-improvement aspect is described as enabled by the external executability of the simulators, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that structural causal models can be made executable to yield verifiable answers, and that LLMs can build them incrementally without breaking this property. No free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: Structural causal models can be represented as executable programs that provide verifiable ground-truth answers to causal queries.
    This assumption enables the conversion of causal knowledge into scalable supervised data.
invented entities (1)
  • CauSim framework · no independent evidence
    purpose: To construct and scale complex causal simulators for training
    Newly introduced method; no independent evidence provided beyond the framework description itself.

pith-pipeline@v0.9.0 · 5504 in / 1302 out tokens · 48227 ms · 2026-05-12T02:29:23.746767+00:00 · methodology

discussion (0)

