TabSCM: A Practical Framework for Generating Realistic Tabular Data
Pith reviewed 2026-05-08 12:30 UTC · model grok-4.3
The pith
TabSCM generates tabular data by learning conditional distributions along a discovered causal graph rather than matching statistics alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TabSCM orients a CPDAG into a DAG, estimates root-node marginals with kernel density estimation or frequency counts, and fits topologically ordered structural assignments: conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical child nodes. Ancestral sampling then yields causally coherent records and enables precise interventional queries.
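The pipeline above can be sketched end to end. This is a minimal illustration, not the paper's implementation: toy closed-form mechanisms stand in for the conditional diffusion and gradient-boosted-tree learners, and the column names and DAG are invented.

```python
import random
from graphlib import TopologicalSorter  # yields a topological order of the DAG

# Toy DAG over three columns (edges are illustrative): age -> income, age -> employed.
parents = {"age": [], "income": ["age"], "employed": ["age"]}

# Structural assignments. TabSCM fits conditional diffusion models for continuous
# children and gradient-boosted trees for categorical children; simple stand-ins
# are used here so the ancestral-sampling logic stays visible.
def f_age(rng):                      # root node: draw from its marginal
    return rng.gauss(40, 10)

def f_income(age, rng):              # continuous child: noisy function of its parent
    return 1000 * age + rng.gauss(0, 5000)

def f_employed(age, rng):            # categorical child: deterministic threshold
    return int(18 <= age <= 65)

assignments = {"age": f_age, "income": f_income, "employed": f_employed}

def sample_record(rng, do=None):
    """Ancestral sampling in topological order; `do` clamps intervened nodes."""
    order = TopologicalSorter(parents).static_order()
    rec = {}
    for node in order:
        if do and node in do:
            rec[node] = do[node]     # do-intervention: clamp, skip the mechanism
        else:
            args = [rec[p] for p in parents[node]]
            rec[node] = assignments[node](*args, rng)
    return rec

rng = random.Random(0)
obs = sample_record(rng)                       # observational draw
cf = sample_record(rng, do={"age": 70.0})      # interventional draw under do(age=70)
```

Because each node is sampled once given its already-sampled parents, generation cost scales with the number of columns rather than with iterative denoising over the full joint, which is the source of the claimed speedup over diffusion-only models.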
What carries the argument
A CPDAG-derived DAG that decomposes the joint distribution into explicit conditional structural assignments learned by mixed-type models.
If this is right
- Generated data exhibits lower rates of rule violations than non-causal baselines.
- Downstream models trained on the data achieve higher utility than those trained on outputs from GAN, diffusion, or LLM generators.
- Privacy risk metrics remain comparable or better while causal interventions stay robust.
- Generation runs up to 583 times faster than pure diffusion models because sampling decomposes into independent conditional steps.
- The explicit equations expose interpretable parameters for auditing fairness and simulating policy changes.
Where Pith is reading between the lines
- Substituting domain-expert graphs for the off-the-shelf discovery step would let TabSCM serve as a policy simulator in regulated fields such as healthcare or finance.
- The same decomposition could be applied to other structured generative tasks, for example time-series or graph data, by replacing the tabular conditionals with appropriate sequence or graph learners.
- Sensitivity tests that vary the accuracy of the input CPDAG would quantify how much TabSCM's gains depend on causal discovery quality.
Load-bearing premise
The causal graph recovered by standard discovery algorithms is close enough to the true structure that the fitted conditionals remain valid under intervention.
What would settle it
Generate data with TabSCM on a synthetic dataset whose ground-truth DAG is known, then repeat the experiment after feeding the discovery algorithm a deliberately corrupted version of that DAG and check whether the fidelity, rule-violation, and intervention advantages disappear.
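The corruption step of that experiment can be sketched as follows: delete or reverse a fraction of edges in the known ground-truth DAG before handing it to the pipeline, then report degradation against the structural Hamming distance. This is an illustrative sketch, not the paper's protocol; all names are invented.

```python
import random

def corrupt_dag(parents, frac, seed=0):
    """Delete or reverse a fraction of edges; reversals that would cycle are skipped,
    so the result is still a DAG. `parents` maps each node to its parent list."""
    rng = random.Random(seed)
    edges = [(p, c) for c, ps in parents.items() for p in ps]
    out = {n: list(ps) for n, ps in parents.items()}
    for p, c in rng.sample(edges, max(1, round(frac * len(edges)))):
        out[c].remove(p)                              # drop edge p -> c
        if rng.random() < 0.5 and not _is_ancestor(out, p, c):
            out[p].append(c)                          # reverse it to c -> p
    return out

def _is_ancestor(parents, a, node):
    """True if a directed path a -> ... -> node exists."""
    stack, seen = [node], set()
    while stack:
        for p in parents[stack.pop()]:
            if p == a:
                return True
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False

def shd(g1, g2):
    """Structural Hamming distance: edges present in one graph but not the other."""
    e1 = {(p, c) for c, ps in g1.items() for p in ps}
    e2 = {(p, c) for c, ps in g2.items() for p in ps}
    return len(e1 ^ e2)

truth = {"a": [], "b": ["a"], "c": ["a", "b"]}
noisy = corrupt_dag(truth, frac=0.34, seed=1)
damage = shd(truth, noisy)   # at least 1 edge differs from the ground truth
```

Sweeping `frac` and plotting fidelity, rule-violation, and intervention metrics against `shd` would show directly how fast the claimed advantages decay with discovery error.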
Figures
Original abstract
Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves those causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns topologically ordered structural assignments. Such assignments are achieved using conditional diffusion models for continuous variables as child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets, encompassing healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583× faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TabSCM, a mixed-type tabular data generator that incorporates causal structure. It takes a CPDAG from any off-the-shelf causal discovery algorithm, orients it to a DAG, models root marginals via KDE or frequencies, and learns topologically ordered structural assignments (conditional diffusion for continuous children, gradient-boosted trees for categorical). Ancestral sampling produces records; the approach is claimed to match or exceed GAN/diffusion/LLM baselines on seven public datasets in statistical fidelity, downstream utility, privacy, rule-violation rates, and causal intervention quality, while running up to 583× faster due to its explicit decomposition.
Significance. If the reported empirical gains hold under rigorous verification and the causal assumptions prove robust, TabSCM would provide a practical, interpretable alternative to black-box generators for domains such as healthcare and finance. The explicit structural decomposition enabling fast sampling, counterfactual queries, and fairness auditing is a clear strength over purely implicit models.
Major comments (3)
- [§4 and §5] §4 (Experimental Setup) and §5 (Results): the abstract and introduction assert superiority on seven datasets in fidelity, utility, and rule-violation metrics, yet no tables, error bars, statistical significance tests, or ablation studies isolating the causal component versus non-causal baselines are referenced; without these, the central empirical claim cannot be evaluated.
- [§3.1] §3.1 (Causal Structure Learning): the validity of learned conditionals and 'causally meaningful interventions' rests on the supplied CPDAG being close to the true structure, but the manuscript reports no sensitivity analysis, no comparison across multiple discovery algorithms, and no controlled edge-error injection experiments to quantify degradation in fidelity or intervention quality.
- [§3.3] §3.3 (Structural Assignments): the claim that diffusion and GBT assignments remain valid under ancestral sampling and do-calculus interventions is load-bearing for the causal soundness argument, yet no formal justification or empirical check against ground-truth DAGs is supplied when the discovery step errs.
Minor comments (2)
- [§3.2] Notation for the oriented DAG after CPDAG processing is introduced without an explicit equation or pseudocode step; a small diagram or numbered procedure would improve clarity.
- [§5] The 583× speedup claim is stated without a corresponding table listing wall-clock times against each baseline on identical hardware.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing TabSCM's potential as a practical, interpretable alternative to black-box generators. We address each major comment below, committing to revisions that strengthen the empirical claims and robustness analysis while preserving the manuscript's core contributions.
Point-by-point responses
-
Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the abstract and introduction assert superiority on seven datasets in fidelity, utility, and rule-violation metrics, yet no tables, error bars, statistical significance tests, or ablation studies isolating the causal component versus non-causal baselines are referenced; without these, the central empirical claim cannot be evaluated.
Authors: We agree that the presentation of results can be strengthened for clarity and rigor. The current manuscript includes comparative evaluations across the seven datasets in §5, but we acknowledge the absence of consolidated tables with means and standard deviations, error bars on figures, formal statistical significance tests, and explicit ablations isolating the causal orientation step. In the revision we will add: (i) a summary table reporting all metrics with standard deviations over 5 random seeds and paired t-test p-values against baselines; (ii) error bars on all fidelity and utility plots; and (iii) a dedicated ablation subsection comparing TabSCM against a non-causal variant that uses the same diffusion/GBT components but ignores the learned DAG topology. These additions will make the central claims directly verifiable. revision: yes
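The promised significance test is straightforward to specify: one metric value per seed for TabSCM and for a baseline on the same splits, then a paired t-test on the per-seed differences. A minimal stdlib sketch, with made-up metric values purely for illustration:

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic: mean of per-seed differences over its standard error."""
    d = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# One value per seed (illustrative numbers, e.g. downstream AUC on a fixed split).
tabscm   = [0.91, 0.89, 0.92, 0.90, 0.93]
baseline = [0.85, 0.86, 0.84, 0.87, 0.85]

t = paired_t(tabscm, baseline)
# Two-sided 5% critical value of Student's t with df = 4 (5 seeds) is ~2.776.
significant = abs(t) > 2.776
```

Pairing by seed removes the shared split-to-split variance, which matters when only 5 seeds are reported.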
-
Referee: [§3.1] §3.1 (Causal Structure Learning): the validity of learned conditionals and 'causally meaningful interventions' rests on the supplied CPDAG being close to the true structure, but the manuscript reports no sensitivity analysis, no comparison across multiple discovery algorithms, and no controlled edge-error injection experiments to quantify degradation in fidelity or intervention quality.
Authors: This concern is well-founded. The manuscript treats the CPDAG as an input from any off-the-shelf algorithm and focuses on the subsequent generation pipeline. To quantify robustness we will add a new subsection in §5 that (a) runs TabSCM with CPDAGs produced by both the PC algorithm and NOTEARS on the same datasets, (b) reports fidelity, utility, and intervention metrics for each, and (c) includes a controlled edge-error injection study on synthetic data with known ground-truth DAGs, measuring degradation as a function of flipped or missing edges. These experiments will be presented with the same metrics used in the main results. revision: yes
-
Referee: [§3.3] §3.3 (Structural Assignments): the claim that diffusion and GBT assignments remain valid under ancestral sampling and do-calculus interventions is load-bearing for the causal soundness argument, yet no formal justification or empirical check against ground-truth DAGs is supplied when the discovery step errs.
Authors: We partially agree. When the oriented DAG is correct, ancestral sampling is valid by the topological ordering and the explicit structural equations; do-interventions are realized by clamping the intervened node and resampling its descendants, which follows standard causal semantics. However, the manuscript does not supply a formal proof of validity under discovery errors nor ground-truth checks on erroneous CPDAGs. In revision we will (i) add a short discussion in §3.3 clarifying the assumption that the supplied CPDAG is sufficiently accurate and (ii) include an empirical study on synthetic ground-truth DAGs that injects controlled errors and reports the resulting drop in fidelity and intervention quality. This provides the requested empirical check without overstating robustness. revision: partial
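The clamp-and-resample semantics described above imply that only the intervened node's descendants need new draws; every other column keeps its observed value. A small sketch of the bookkeeping, with an invented parent map:

```python
from collections import defaultdict

def descendants(parents, node):
    """Columns to resample under do(node = value): all descendants of `node`.
    `parents` maps each column to its parent list."""
    children = defaultdict(list)          # invert the parent map
    for c, ps in parents.items():
        for p in ps:
            children[p].append(c)
    out, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

dag = {"age": [], "income": ["age"], "employed": ["age"], "loan": ["income", "employed"]}
to_resample = descendants(dag, "age")     # everything downstream of the intervention
```

Resampling exactly this set, in topological order with the intervened value clamped, is the standard realization of a do-intervention in a structural causal model.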
Circularity Check
No circularity: procedural generative method with external structure input and empirical evaluation
Full rationale
TabSCM is defined as a pipeline that ingests an externally supplied CPDAG (from any off-the-shelf discovery algorithm), orients it to a DAG, fits root marginals via KDE/frequencies, and learns conditional structural assignments (conditional diffusion for continuous, GBTs for categorical) in topological order. Generation proceeds by ancestral sampling. All performance claims (fidelity, utility, privacy, rule violations, interventions) are obtained by running the fitted generator on held-out public datasets and comparing to external baselines. No equation or claim reduces a derived quantity to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled; the causal soundness claim rests on the (explicitly stated) assumption that the input CPDAG is sufficiently accurate, which is an external modeling choice rather than an internal tautology.
Axiom & Free-Parameter Ledger
Free parameters (2)
- KDE bandwidth for root marginals
- Diffusion noise schedule and tree hyperparameters
Axioms (2)
- domain assumption: The input CPDAG from any causal discovery algorithm is sufficiently accurate to support valid conditional distributions under intervention
- standard math: Topological ordering permits sequential ancestral sampling without cycles
Forward citations
Cited by 1 Pith paper
-
Tabular Foundation Model for Generative Modelling
TabFORGE generates high-quality synthetic tabular data by leveraging pretrained causality-aware representations in a two-stage diffusion-decoder architecture that mitigates latent distribution shifts.
Reference graph
Works this paper leans on
- [1] Alaa, A., Van Breugel, B., Saveliev, E.S., Van Der Schaar, M.: How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In: International Conference on Machine Learning. pp. 290–306. PMLR (2022)
- [2] Barbierato, E., Vedova, M.L.D., Tessera, D., Toti, D., Vanoli, N.: A methodology for controlling bias and fairness in synthetic data generation. Applied Sciences 12(9), 4619 (2022)
- [3] Becker, B., Kohavi, R.: Adult. UCI Machine Learning Repository (1996). DOI: https://doi.org/10.24432/C5XW20
- [4] Bock, R.: MAGIC Gamma Telescope. UCI Machine Learning Repository (2004). DOI: https://doi.org/10.24432/C52C8B
- [5] Borisov, V., Sessler, K., Leemann, T., Pawelczyk, M., Kasneci, G.: Language models are realistic tabular data generators. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=cEygmQNOeI
- [6] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
- [7] Chen, R.J., Lu, M.Y., Chen, T.Y., Williamson, D.F., Mahmood, F.: Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering 5(6), 493–497 (2021)
- [8] Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 785–794 (2016)
- [9] Chickering, D.M.: Optimal structure identification with greedy search. Journal of Machine Learning Research 3(Nov), 507–554 (2002)
- [10] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
- [11]
- [12] Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., Greenspan, H.: GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321, 321–331 (2018)
- [13] Friedman, J.H.: Greedy function approximation: A gradient boosting machine. Annals of Statistics, pp. 1189–1232 (2001)
- [14] Goncalves, A., Ray, P., Soper, B., Stevens, J., Coyle, L., Sales, A.P.: Generation and evaluation of synthetic patient data. BMC Medical Research Methodology 20, 1–40 (2020)
- [15] He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). pp. 1322–1328. IEEE (2008)
- [16] Islam, M.M.F., Ferdousi, R., Rahman, S., Bushra, H.Y.: Likelihood prediction of diabetes at early stage using data mining techniques. Computer Vision and Machine Intelligence in Medical Image Analysis (2019). https://api.semanticscholar.org/CorpusID:202835967
- [17] Johansson, F., Shalit, U., Sontag, D.: Learning representations for counterfactual inference. In: International Conference on Machine Learning. pp. 3020–3029. PMLR (2016)
- [18] Jordon, J., Yoon, J., Van Der Schaar, M.: PATE-GAN: Generating synthetic data with differential privacy guarantees. In: International Conference on Learning Representations (2018)
- [19] Kocaoglu, M., Snyder, C., Dimakis, A.G., Vishwanath, S.: CausalGAN: Learning causal implicit generative models with adversarial training. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=BJE-4xW0W
- [20] Kotelnikov, A., Baranchuk, D., Rubachev, I., Babenko, A.: TabDDPM: Modelling tabular data with diffusion models. In: International Conference on Machine Learning. pp. 17564–17579. PMLR (2023)
- [21] Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., Chen, S.X.: Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 471(2182), 20150257 (2015)
- [22] Liu, T., Qian, Z., Berrevoets, J., van der Schaar, M.: GOGGLE: Generative modelling for tabular data by learning relational structure. In: The Eleventh International Conference on Learning Representations (2023)
- [23] Louizos, C., Shalit, U., Mooij, J.M., Sontag, D., Zemel, R., Welling, M.: Causal effect inference with deep latent-variable models. Advances in Neural Information Processing Systems 30 (2017)
- [24] Mothilal, R.K., Sharma, A., Tan, C.: Explaining machine learning classifiers through diverse counterfactual explanations. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. pp. 607–617 (2020)
- [25] Pace, R.K., Barry, R.: Sparse spatial autoregressions. Statistics & Probability Letters 33(3), 291–297 (1997)
- [26] Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018). https://doi.org/10.14778/3231751.3231757
- [27] Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, New York (2000)
- [28] Pearl, J.: The do-calculus revisited. In: Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence. pp. 3–11. UAI'12, AUAI Press, Arlington, Virginia, USA (2012)
- [29] Peters, J., Janzing, D., Schölkopf, B.: Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press (2017)
- [30] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
- [31] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)
- [32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
- [33] Shi, J., Xu, M., Hua, H., Zhang, H., Ermon, S., Leskovec, J.: TabDiff: A mixed-type diffusion model for tabular data generation. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=swvURjrt8z
- [34] Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction, and Search. MIT Press (2000)
- [35] Veale, M., Binns, R.: Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society 4(2), 2053951717743530 (2017)
- [36] Wen, B., Cao, Y., Yang, F., Subbalakshmi, K., Chandramouli, R.: Causal-TGAN: Modeling tabular data using causally-aware GAN. In: ICLR Workshop on Deep Generative Models for Highly Structured Data (2022). https://openreview.net/forum?id=BEhxCh4dvW5
- [37] Xu, D., Yuan, S., Zhang, L., Wu, X.: FairGAN: Fairness-aware generative adversarial networks. In: 2018 IEEE International Conference on Big Data (Big Data). pp. 570–575. IEEE (2018)
- [38] Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. Advances in Neural Information Processing Systems 32 (2019)
- [39] Zhang, H., Zhang, J., Shen, Z., Srinivasan, B., Qin, X., Faloutsos, C., Rangwala, H., Karypis, G.: Mixed-type tabular data synthesis with score-based diffusion in latent space. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=4Ay23yeuz0
- [40] Zheng, X., Aragam, B., Ravikumar, P.K., Xing, E.P.: DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems 31 (2018)