CasualSynth: Generating Structurally Sound Synthetic Data
Pith reviewed 2026-05-20 14:12 UTC · model grok-4.3
The pith
CausalSynth generates causally valid synthetic data by decoupling structure generation from LLM realization and using iterative verification to correct violations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CausalSynth decouples causal structure generation from semantic realization. A Structural Causal Model generates causal skeletons that satisfy the Global Markov Property via ancestral sampling. An LLM acts as a constrained realizer that maps each skeleton to high-dimensional observations. An Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement. The framework identifies the Semantic Backdoor problem, in which LLMs override imposed causal facts with pre-training priors, and shows that the iterative mechanism reduces the resulting selection bias relative to standard rejection sampling.
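As a minimal sketch of the first phase, ancestral sampling visits the nodes of a DAG in topological order and draws each variable given its already-sampled parents. The three-variable chain, the probabilities, and the function names below are illustrative assumptions, not taken from the paper.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG (not from the paper): smoking -> tar -> cancer.
# Each entry maps a node to the list of its parents.
PARENTS = {"smoking": [], "tar": ["smoking"], "cancer": ["tar"]}

# Hypothetical conditional probabilities P(node = 1 | parent values).
CPT = {
    "smoking": {(): 0.3},
    "tar": {(0,): 0.05, (1,): 0.9},
    "cancer": {(0,): 0.02, (1,): 0.4},
}

def ancestral_sample(rng):
    """Draw one skeleton by sampling every node after its parents."""
    skeleton = {}
    # TopologicalSorter takes node -> predecessors, so parents come first.
    for node in TopologicalSorter(PARENTS).static_order():
        parent_values = tuple(skeleton[p] for p in PARENTS[node])
        p_one = CPT[node][parent_values]
        skeleton[node] = 1 if rng.random() < p_one else 0
    return skeleton

rng = random.Random(0)
skeletons = [ancestral_sample(rng) for _ in range(1000)]
```

Because each variable is drawn only from its own conditional distribution given its parents, the joint distribution of the samples factorizes according to the graph, which is the property the core claim relies on.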
What carries the argument
The Iterative Consistency Verification module, which performs deterministic extraction of structural violations from LLM outputs and feeds targeted corrections back to the LLM to close the refinement loop.
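The closed loop described above can be sketched as follows. This is only an illustration under stated assumptions: `stub_realizer` stands in for an LLM call, the "Name: VALUE" text format and the regex extractor are hypothetical, and the paper's actual extraction rules and correction prompts are not given in this review.

```python
import re

# Hypothetical skeleton: the causal facts the generated text must state.
SKELETON = {"blood_pressure": "HIGH", "heart_rate": "NORMAL"}

def extract(text):
    """Deterministic extraction: pull 'Name: VALUE' facts with a regex."""
    found = {}
    for name, value in re.findall(r"([A-Za-z ]+): ([A-Z]+)", text):
        found[name.strip().lower().replace(" ", "_")] = value
    return found

def stub_realizer(skeleton, corrections):
    """Stand-in for an LLM call (assumption): it ignores the imposed blood
    pressure unless a correction for that variable is supplied."""
    bp = skeleton["blood_pressure"] if "blood_pressure" in corrections else "LOW"
    return f"Blood Pressure: {bp}. Heart Rate: {skeleton['heart_rate']}."

def find_violations(skeleton, text):
    """Return the skeleton facts that the text fails to reproduce."""
    extracted = extract(text)
    return {k: v for k, v in skeleton.items() if extracted.get(k) != v}

def realize_with_verification(skeleton, realizer, max_rounds=3):
    """Generate, verify, and feed violations back until the text is consistent."""
    corrections = {}
    for round_idx in range(1, max_rounds + 1):
        text = realizer(skeleton, corrections)
        violations = find_violations(skeleton, text)
        if not violations:
            return text, round_idx
        corrections.update(violations)  # targeted feedback for the next round
    return None, max_rounds  # unrealizable within the round budget

text, rounds = realize_with_verification(SKELETON, stub_realizer)
# With this stub the first draft is rejected and the second is accepted.
```

The key design point is that verification is deterministic: the same text always yields the same extracted facts, so a sample is accepted only when every imposed value is recovered.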
If this is right
- Preserves conditional independencies with false-positive rates near the nominal α=0.05 level across ASIA, ALARM, and MIMIC-Struct benchmarks.
- Achieves realizability rates above 96 percent with 70B-parameter LLM backbones.
- Enables principled interventional and counterfactual data generation through noise retention and graph mutilation.
- Reduces selection bias arising from the Semantic Backdoor relative to standard rejection sampling.
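To make the interventional and counterfactual point concrete, here is a small sketch on a toy linear SCM. The equations, variable names, and helper functions are assumptions chosen for illustration. Graph mutilation is modeled by replacing a variable's structural equation with a constant, and noise retention by reusing the exogenous noise of the factual sample.

```python
import random

# Hypothetical linear SCM (not from the paper): x := u_x, y := 2*x + u_y.
EQUATIONS = {
    "x": lambda v, u: u["x"],
    "y": lambda v, u: 2 * v["x"] + u["y"],
}
ORDER = ["x", "y"]  # a topological order of the toy DAG

def sample_noise(rng):
    """Draw independent exogenous noise for every variable."""
    return {name: rng.gauss(0.0, 1.0) for name in ORDER}

def evaluate(noise, do=None):
    """Solve the SCM for the given noise. Variables listed in `do` have their
    structural equation replaced by a constant (graph mutilation)."""
    do = do or {}
    values = {}
    for name in ORDER:
        values[name] = do[name] if name in do else EQUATIONS[name](values, noise)
    return values

rng = random.Random(0)
noise = sample_noise(rng)                          # retained for the counterfactual
factual = evaluate(noise)
counterfactual = evaluate(noise, do={"x": 1.0})    # same noise, x forced to 1
interventional = evaluate(sample_noise(rng), do={"x": 1.0})  # fresh noise
```

The counterfactual answers "what would this particular sample have looked like had x been 1", whereas the interventional draw describes a new sample from the mutilated model.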
Where Pith is reading between the lines
- The same skeleton-plus-verification pattern could be reused to create large synthetic datasets for causal discovery algorithms in domains such as economics or genomics.
- If the verification loop remains efficient at scale, it might allow privacy-preserving training corpora for causal reasoning models that would otherwise require restricted real-world records.
- Testing the method on DAGs that are themselves estimated from data rather than supplied in advance would reveal whether the approach tolerates uncertainty in the underlying structure.
Load-bearing premise
The Iterative Consistency Verification module can reliably detect structural violations through deterministic extraction and reduce selection bias by feeding corrections back to the LLM without introducing new unmeasured distortions.
What would settle it
If false-positive rates for conditional-independence tests on the ALARM benchmark rise well above the nominal 0.05 level or if realizability falls below 90 percent under 70B-parameter backbones, the claim that the framework reliably produces causally sound data would be undermined.
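One way to operationalize "well above the nominal level" is an exact one-sided binomial check on the number of rejected tests among pairs known to be independent. The sketch below is an assumption about how such a check could look; the counts in the usage lines are made up and do not come from the paper.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def fpr_exceeds_nominal(rejections, n_tests, alpha=0.05, level=0.01):
    """Flag an empirical false-positive rate that would be implausibly high
    if the true rate were alpha (one-sided exact binomial test at `level`)."""
    return binom_upper_tail(rejections, n_tests, alpha) < level

# Hypothetical counts: rejections among 100 truly independent variable pairs.
near_nominal = fpr_exceeds_nominal(6, 100)    # 6% observed
clearly_high = fpr_exceeds_nominal(15, 100)   # 15% observed
```

Under these made-up counts, 6 rejections out of 100 is compatible with a true rate of 0.05, while 15 out of 100 is not.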
Original abstract
Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem, the systematic tendency of LLMs to override imposed causal facts with pre-training priors, and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal α=0.05 level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CausalSynth, a framework that decouples causal structure generation from semantic realization for producing synthetic data. It uses a Structural Causal Model (SCM) over a DAG to generate causal skeletons via ancestral sampling that satisfy the Global Markov Property, an LLM as a constrained realizer to map skeletons to high-dimensional outputs such as clinical notes, and an Iterative Consistency Verification module that performs deterministic extraction to detect violations and feeds targeted corrections back to the LLM in a closed loop. The authors identify the Semantic Backdoor problem (LLMs overriding causal facts with pre-training priors) and claim their mechanism reduces selection bias relative to rejection sampling. On ASIA, ALARM, and MIMIC-Struct benchmarks, the method preserves conditional independencies with false-positive rates near nominal α=0.05 and achieves realizability rates above 96% with 70B-parameter LLMs, while also supporting interventional and counterfactual generation via noise retention and graph mutilation.
Significance. If the empirical claims hold under rigorous validation, the work could be significant for generating causally consistent synthetic data in domains like healthcare and causal discovery, where preserving conditional independencies matters for downstream inference. The modular separation of SCM-based skeletons from LLM realization, the explicit treatment of the Semantic Backdoor, and the closed-loop refinement mechanism are conceptually useful contributions. The reported results on standard causal benchmarks provide an initial evaluation point, and the support for interventions/counterfactuals via graph operations is a practical strength.
major comments (2)
- [Iterative Consistency Verification module] Iterative Consistency Verification module (as described in the abstract and methods): the manuscript supplies no implementation details on the deterministic extraction process, the exact correction prompts, or whether false-positive rates incorporate multiple-testing corrections. This is load-bearing for the central claim, because preservation of conditional independencies at α=0.05 on MIMIC-Struct (free-text clinical notes) depends on reliable detection of structural violations; any incompleteness in extraction would inflate apparent success rates.
- [Abstract and empirical results] Abstract and empirical results: the claim that the iterative mechanism 'proves' bias reduction versus rejection sampling is asserted without quantitative bias metrics, ablation studies on extraction accuracy, or evidence that corrections reduce Semantic Backdoor effects without introducing new unmeasured distortions. This directly affects the assertion that the closed-loop approach outperforms standard rejection sampling.
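The multiple-testing question in the first major comment matters because the same p-values can lead to different conclusions with and without correction. The sketch below uses a plain Bonferroni adjustment on hypothetical p-values; it is not known from this review whether the paper applies this or any other correction.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m,
    where m is the number of tests in the family."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def uncorrected_reject(p_values, alpha=0.05):
    """Reject each hypothesis at level alpha with no adjustment."""
    return [p <= alpha for p in p_values]

# Hypothetical p-values from four conditional-independence tests.
p_values = [0.001, 0.02, 0.04, 0.30]
corrected = bonferroni_reject(p_values)      # [True, False, False, False]
uncorrected = uncorrected_reject(p_values)   # [True, True, True, False]
```

A reported false-positive rate therefore needs to state which of the two procedures produced it.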
minor comments (2)
- Add a concrete example of skeleton-to-realization mapping and one full iteration of the verification loop to improve clarity of the three-phase pipeline.
- Clarify how realizability rate is operationalized (e.g., exact criteria for a valid high-dimensional observation) and report per-benchmark breakdowns rather than aggregate figures.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas for improving the clarity and empirical support of our framework. We address each major comment below and describe the revisions planned for the next version of the manuscript.
- Referee: [Iterative Consistency Verification module] Iterative Consistency Verification module (as described in the abstract and methods): the manuscript supplies no implementation details on the deterministic extraction process, the exact correction prompts, or whether false-positive rates incorporate multiple-testing corrections. This is load-bearing for the central claim, because preservation of conditional independencies at α=0.05 on MIMIC-Struct (free-text clinical notes) depends on reliable detection of structural violations; any incompleteness in extraction would inflate apparent success rates.
  Authors: We agree that the current manuscript does not provide sufficient implementation details on the Iterative Consistency Verification module. In the revised version we will add a dedicated subsection in the Methods that specifies the deterministic extraction rules for detecting violations of the Global Markov Property, the exact correction prompt templates passed back to the LLM, and the statistical procedure used to compute false-positive rates (including whether any multiple-testing correction such as Bonferroni was applied). We will also report the accuracy of the extraction step on a validation subset of MIMIC-Struct to address concerns about potential incompleteness.
  revision: yes
- Referee: [Abstract and empirical results] Abstract and empirical results: the claim that the iterative mechanism 'proves' bias reduction versus rejection sampling is asserted without quantitative bias metrics, ablation studies on extraction accuracy, or evidence that corrections reduce Semantic Backdoor effects without introducing new unmeasured distortions. This directly affects the assertion that the closed-loop approach outperforms standard rejection sampling.
  Authors: The manuscript contains a theoretical argument that the iterative correction loop reduces selection bias relative to rejection sampling by retaining and repairing samples instead of discarding them. We acknowledge, however, that the current version lacks explicit quantitative bias metrics and dedicated ablation studies. In the revision we will add an ablation comparing the iterative method against rejection sampling on the ASIA and ALARM benchmarks, reporting direct measures of Semantic Backdoor incidence before and after correction as well as additional causal-consistency metrics to check for new distortions. We maintain that the reported conditional-independence preservation and realizability rates provide supporting evidence, but we will strengthen the empirical section with the requested quantitative comparisons.
  revision: partial
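The intuition behind "retaining and repairing samples instead of discarding them" can be illustrated with a toy calculation. This is not the paper's Theorem 2: the acceptance probabilities are invented, and repair rounds are modeled as independent retries, which is a simplifying assumption that ignores the effect of targeted corrections.

```python
def kept_fraction(p_one, accept_zero, accept_one):
    """Share of accepted samples with X = 1 when skeletons with X = 0 and
    X = 1 are accepted with the given probabilities."""
    kept_one = p_one * accept_one
    kept_zero = (1 - p_one) * accept_zero
    return kept_one / (kept_one + kept_zero)

def accept_after_retries(single_try, rounds):
    """Probability that at least one of `rounds` independent attempts passes."""
    return 1 - (1 - single_try) ** rounds

P_ONE = 0.5       # hypothetical P(X = 1) in the SCM
PASS_ZERO = 1.0   # X = 0 agrees with the model's prior: always realized
PASS_ONE = 0.4    # X = 1 contradicts the prior: often overridden

# Rejection sampling: one attempt, failures are discarded.
rejection = kept_fraction(P_ONE, PASS_ZERO, PASS_ONE)

# Iterative repair: up to five attempts before a sample is given up.
iterative = kept_fraction(
    P_ONE,
    accept_after_retries(PASS_ZERO, 5),
    accept_after_retries(PASS_ONE, 5),
)
```

In this toy setting rejection sampling keeps X = 1 in 2/7 of the accepted samples, far from the intended 0.5, while five repair rounds bring the share to roughly 0.48. Any residual gap comes from samples that still fail after the last round.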
Circularity Check
No significant circularity; empirical results are direct measurements
The paper reports empirical performance on fixed external benchmarks (ASIA, ALARM, MIMIC-Struct) as direct measurements of conditional independence preservation (FPR near α=0.05) and realizability (>96%). These quantities are not derived from fitted parameters or self-referential predictions. The Iterative Consistency Verification module and claimed proof of bias reduction versus rejection sampling are described procedurally without equations that reduce the reported rates to the inputs by construction. No self-citation chain or ansatz is invoked as load-bearing for the central claims. The derivation chain remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can be constrained to respect externally supplied causal facts when given targeted corrections.
- standard math: Ancestral sampling from an SCM produces variable assignments that satisfy the Global Markov Property.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery / orbit embedding (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Phase I constructs the causal skeleton v by drawing samples from the joint distribution implied by the SCM, P_M(V). ... ancestral sampling ... produces samples whose joint distribution factorizes according to G by construction.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel / Jcost uniqueness (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We identify the Semantic Backdoor problem ... and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling (Theorem 2).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.