Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities
Pith reviewed 2026-05-12 02:09 UTC · model grok-4.3
The pith
Fully generative synthetic tabular data often preserves predictive performance while distorting average treatment effect estimates, because prediction loss only weakly constrains the treatment contrast.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We demonstrate that fully generative tabular synthesizers preserve predictive utility while distorting ATE estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results and propose a hybrid synthetic-data framework that generates covariates while modeling treatment and outcome mechanisms separately, allowing causal-purpose treatment assignment such as randomized synthetic assignment. Across synthetic and ACTG experiments, hybrid synthesis reduces
What carries the argument
A hybrid synthetic-data framework that produces covariates with a generative model while modeling the treatment-assignment and outcome mechanisms separately, so that treatment can be assigned for a causal purpose (for example, randomized synthetic assignment).
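The separation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `hybrid_synthesize` and its helpers are hypothetical names, row resampling with Gaussian jitter stands in for a GAN- or LLM-based covariate generator, and per-arm linear regressions stand in for whatever outcome models the authors use.

```python
import numpy as np

def fit_linear(features, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict_linear(coef, features):
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

def hybrid_synthesize(x, t, y, n_synth, rng, randomized=True):
    """Hypothetical hybrid synthesizer (every component is an illustrative stand-in)."""
    # 1. Covariates: resampled rows plus small jitter (assumed stand-in generator).
    idx = rng.integers(0, len(x), size=n_synth)
    x_s = x[idx] + 0.05 * rng.standard_normal((n_synth, x.shape[1]))

    # 2. Treatment: randomized assignment, or a crude clipped linear propensity fit.
    if randomized:
        t_s = rng.binomial(1, 0.5, size=n_synth)
    else:
        p = np.clip(predict_linear(fit_linear(x, t), x_s), 0.05, 0.95)
        t_s = rng.binomial(1, p)

    # 3. Outcome: one linear model per arm, fitted on the real data.
    coef1 = fit_linear(x[t == 1], y[t == 1])
    coef0 = fit_linear(x[t == 0], y[t == 0])
    mu1, mu0 = predict_linear(coef1, x_s), predict_linear(coef0, x_s)
    fitted = np.where(t == 1, predict_linear(coef1, x), predict_linear(coef0, x))
    resid_sd = np.std(y - fitted)
    y_s = np.where(t_s == 1, mu1, mu0) + resid_sd * rng.standard_normal(n_synth)

    # Model-implied ATE on the synthetic covariates.
    return x_s, t_s, y_s, float(np.mean(mu1 - mu0))

# Toy confounded data with a true ATE of 2.0 (an assumption of this example).
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal((n, 2))
e = 1 / (1 + np.exp(-1.5 * x[:, 0]))
t = rng.binomial(1, e)
y = 1.0 + x[:, 0] + 0.5 * x[:, 1] + 2.0 * t + rng.standard_normal(n)

x_s, t_s, y_s, synth_ate = hybrid_synthesize(x, t, y, 5000, rng)
diff_means = y_s[t_s == 1].mean() - y_s[t_s == 0].mean()
print(f"model-implied ATE: {synth_ate:.2f}, synthetic diff-in-means: {diff_means:.2f}")
```

Because treatment in the synthetic table is randomized here, a plain difference in means on the synthetic data targets the model-implied ATE without any propensity adjustment. The toy outcome model is correctly specified, so the sketch says nothing about behavior under misspecification.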
If this is right
- Hybrid synthesis supports targeted augmentation to mitigate practical positivity violations in observational studies.
- The resulting synthetic tables can function as simulation engines for comparing causal estimators such as OR, IPW, AIPW, and TMLE before real-data analysis.
- LLM-based versions of the hybrid approach often achieve higher ATE fidelity than CTGAN-based versions.
- The structural mismatch between prediction loss and ATE preservation holds for both GAN and LLM generators.
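As an illustration of the estimator-comparison use case, the sketch below computes outcome-regression (OR), IPW, and AIPW estimates on a toy dataset with a known ATE. It is a hedged example rather than the paper's benchmark: the function names, the Newton-Raphson logistic fit, the linear outcome models, and the clipping threshold are all assumptions of this example, and TMLE is omitted for brevity.

```python
import numpy as np

def add_intercept(x):
    return np.column_stack([np.ones(len(x)), x])

def fit_logistic(x, t, n_iter=25):
    """Unregularized logistic regression via Newton-Raphson."""
    d = add_intercept(x)
    beta = np.zeros(d.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-d @ beta))
        grad = d.T @ (t - p)
        hess = d.T @ (d * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def fit_ols(x, y):
    coef, *_ = np.linalg.lstsq(add_intercept(x), y, rcond=None)
    return coef

def ate_estimates(x, t, y, clip=0.01):
    """Return OR, IPW (Horvitz-Thompson form), and AIPW estimates of the ATE."""
    d = add_intercept(x)
    # Estimated propensity, clipped away from 0 and 1 (assumed threshold).
    e = np.clip(1 / (1 + np.exp(-d @ fit_logistic(x, t))), clip, 1 - clip)
    # Per-arm linear outcome models, evaluated on all rows.
    mu1 = d @ fit_ols(x[t == 1], y[t == 1])
    mu0 = d @ fit_ols(x[t == 0], y[t == 0])
    or_est = np.mean(mu1 - mu0)
    ipw_est = np.mean(t * y / e - (1 - t) * y / (1 - e))
    aipw_est = np.mean(
        mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)
    )
    return {"OR": float(or_est), "IPW": float(ipw_est), "AIPW": float(aipw_est)}

# Toy "simulation engine" with a true ATE of 2.0 (an assumption of this example).
rng = np.random.default_rng(1)
n = 4000
x = rng.standard_normal((n, 2))
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y = x[:, 0] + 0.5 * x[:, 1] + 2.0 * t + rng.standard_normal(n)

estimates = ate_estimates(x, t, y)
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

In a benchmarking workflow, one would repeat the draw many times and compare bias and variance per estimator against the known ATE; a single draw, as here, only shows the mechanics.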
Where Pith is reading between the lines
- Causal analyses that rely on synthetic data should explicitly verify treatment-effect contrast preservation rather than depending only on distributional similarity or predictive metrics.
- The hybrid separation could be adapted to preserve other quantities such as conditional average treatment effects by adjusting the separate outcome model step.
- In data-scarce settings the method may allow preliminary estimator validation and sensitivity checks without exhausting limited real observations.
Load-bearing premise
Separating covariate generation from treatment and outcome modeling can be performed without introducing new biases or distortions that would make the synthetic data unrealistic for causal inference.
What would settle it
A controlled simulation in which the true ATE is known shows that hybrid synthetic data recovers the ATE within sampling error while fully generative synthetic data produces estimates that deviate by more than two standard errors.
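One way to operationalize this criterion is sketched below. The helper `within_two_se` is a hypothetical name, and it assumes that "standard error" means the Monte Carlo standard error of the mean ATE estimate across simulation replicates; the source does not specify which standard error is intended. The replicate values are made-up inputs chosen only to exercise the check.

```python
import numpy as np

def within_two_se(estimates, true_ate):
    """Check whether the mean estimate lies within two standard errors of the truth.

    Assumes `estimates` holds one ATE estimate per simulation replicate and
    uses the Monte Carlo standard error of their mean.
    """
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()
    se = est.std(ddof=1) / np.sqrt(len(est))
    return bool(abs(mean - true_ate) <= 2 * se), float(mean), float(se)

# Illustrative inputs only: replicate-level ATE estimates around a true ATE of 2.0.
hybrid_reps = [1.9, 2.1, 2.0, 1.95, 2.05]
generative_reps = [1.4, 1.6, 1.5, 1.45, 1.55]

hybrid_ok, _, _ = within_two_se(hybrid_reps, true_ate=2.0)
generative_ok, _, _ = within_two_se(generative_reps, true_ate=2.0)
print(hybrid_ok, generative_ok)
```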
Original abstract
Synthetic tabular data are often evaluated by distributional similarity, privacy distance, or train-on-synthetic-test-on-real predictive performance, but these criteria do not ensure validity for causal inference. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can preserve predictive utility while distorting average treatment effect (ATE) estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results, and identify an analogous decomposition in block-level next-token prediction under log loss. Motivated by the tabular causal analysis, we propose a hybrid synthetic-data framework that generates covariates while modeling treatment and outcome mechanisms separately, allowing causal-purpose treatment assignment such as randomized synthetic assignment. We evaluate this framework in three settings: ATE preservation under fully generative versus hybrid synthesis, targeted augmentation for practical positivity problems, and synthetic simulation engines for comparing OR, IPW, AIPW, and TMLE before real-data analysis. Across synthetic and ACTG experiments, hybrid synthesis improves causal fidelity relative to fully generative baselines; LLM-based hybrid synthesis is often more faithful than CTGAN for ATE preservation and finite-sample estimator benchmarking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fully generative tabular synthesizers (GAN- and LLM-based) can preserve predictive utility while distorting ATE estimates due to a structural mismatch: ATE requires realistic p(X) and accurate treatment-effect contrast, but prediction loss penalizes treatment-effect error only via an overlap-weighted term. It formalizes this via sensitivity and loss-decomposition results (including an analogous decomposition for block-level next-token prediction), proposes a hybrid framework that generates covariates separately while modeling treatment/outcome mechanisms (allowing randomized synthetic assignment), and reports improved ATE preservation, targeted augmentation for positivity issues, and utility for estimator benchmarking across synthetic and ACTG experiments, with LLM-hybrid often outperforming CTGAN.
Significance. If the hybrid construction can be shown to avoid new biases while delivering the claimed ATE fidelity, the work would be significant for causal inference practice: it supplies a concrete remedy for using synthetic data in privacy-sensitive or positivity-challenged settings and for pre-analysis simulation of estimators (OR, IPW, AIPW, TMLE). The loss-decomposition insight usefully separates predictive from causal objectives and is grounded in standard loss functions rather than ad-hoc quantities.
Major comments (3)
- [hybrid framework] Hybrid framework description: No theorem or proposition establishes that the separated modeling of p(X), p(T|X), and p(Y|X,T) (or randomized assignment) preserves the original ATE up to o_p(1) error. The loss-decomposition result shows that even separate fits can under-penalize tau errors in low-overlap regions; without a formal bound or convergence argument, the claim that hybrid synthesis remedies the structural mismatch remains unproven.
- [experiments] Evaluation on ACTG data and synthetic settings: The reported improvements in causal fidelity for hybrid vs. fully generative baselines do not include sensitivity checks to outcome-model misspecification or to the finite-sample behavior of the separate logistic/ML/LLM fits. Such checks are needed because the same overlap-weighted penalization identified in the decomposition can reappear in the hybrid components.
- [hybrid framework] Randomized synthetic assignment option: When the empirical propensity is replaced by randomized assignment, the synthetic joint no longer matches the original data-generating process; any ATE computed on the synthetic data then reflects an interventional contrast rather than the associational quantity in the real data. The manuscript does not delineate the conditions under which this substitution is valid for the intended causal-inference use cases.
Minor comments (2)
- [formalization] Clarify whether the loss-decomposition result is stated for the population or for finite samples, and whether it applies directly to the LLM next-token objective or only by analogy.
- [experiments] In the tables or figures reporting ATE estimates, include standard errors or confidence intervals so that the magnitude of improvement can be assessed against sampling variability.
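A percentile bootstrap is one common way to attach standard errors and confidence intervals to an ATE estimate, as the second minor comment requests. The sketch below is illustrative only: `bootstrap_ci` and `diff_in_means` are hypothetical helpers, the difference-in-means estimator assumes randomized treatment, and the paper may use a different interval construction.

```python
import numpy as np

def diff_in_means(t, y):
    """ATE estimate under randomized treatment (assumption of this example)."""
    return y[t == 1].mean() - y[t == 0].mean()

def bootstrap_ci(t, y, estimator, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap: returns (bootstrap SE, (lower, upper))."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        stats[b] = estimator(t[idx], y[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(stats.std(ddof=1)), (float(lower), float(upper))

# Toy randomized data with a true ATE of 2.0 (an assumption of this example).
rng = np.random.default_rng(2)
t = rng.binomial(1, 0.5, size=1000)
y = 2.0 * t + rng.standard_normal(1000)

point = diff_in_means(t, y)
se, (lower, upper) = bootstrap_ci(t, y, diff_in_means)
print(f"ATE = {point:.3f}, SE = {se:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval next to each point estimate lets a reader judge whether a gap between hybrid and fully generative synthesis exceeds sampling variability.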
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which identify key areas where the theoretical justification and empirical robustness of the hybrid framework can be strengthened. We address each major comment below, indicating where revisions will be incorporated.
Point-by-point responses
Referee: [hybrid framework] Hybrid framework description: No theorem or proposition establishes that the separated modeling of p(X), p(T|X), and p(Y|X,T) (or randomized assignment) preserves the original ATE up to o_p(1) error. The loss-decomposition result shows that even separate fits can under-penalize tau errors in low-overlap regions; without a formal bound or convergence argument, the claim that hybrid synthesis remedies the structural mismatch remains unproven.
Authors: We acknowledge that the manuscript does not contain a formal theorem establishing o_p(1) preservation of the ATE under separated modeling of p(X), p(T|X), and p(Y|X,T). The loss-decomposition result indeed applies componentwise and does not automatically guarantee that separate fits eliminate under-penalization of tau errors in low-overlap regions. Our position is that the hybrid construction remedies the structural mismatch by allowing direct optimization of the treatment-effect contrast (rather than an indirect overlap-weighted term), which is supported by the reported experiments. In the revision we will add a dedicated paragraph in the discussion section explicitly noting the absence of a general convergence bound, stating the additional assumptions (consistent estimation of the conditional models and correct specification of the covariate marginal) under which the hybrid ATE would match the original, and clarifying that the current claims rest on the decomposition insight plus empirical evidence rather than a formal guarantee.
Revision: partial
Referee: [experiments] Evaluation on ACTG data and synthetic settings: The reported improvements in causal fidelity for hybrid vs. fully generative baselines do not include sensitivity checks to outcome-model misspecification or to the finite-sample behavior of the separate logistic/ML/LLM fits. Such checks are needed because the same overlap-weighted penalization identified in the decomposition can reappear in the hybrid components.
Authors: We agree that the current experiments lack systematic sensitivity checks for outcome-model misspecification and finite-sample behavior of the separate fits. The reported gains are based on standard logistic, ML, and LLM implementations, but the overlap-weighted penalization identified in the decomposition could indeed reappear if the conditional models are poorly specified or estimated from small samples. In the revised manuscript we will add a new subsection containing (i) controlled misspecification experiments (e.g., omitting interactions or using linear models when the true conditional expectation is nonlinear) and (ii) finite-sample curves showing ATE fidelity as a function of training size for each hybrid component. These additions will directly address the referee's concern.
Revision: yes
Referee: [hybrid framework] Randomized synthetic assignment option: When the empirical propensity is replaced by randomized assignment, the synthetic joint no longer matches the original data-generating process; any ATE computed on the synthetic data then reflects an interventional contrast rather than the associational quantity in the real data. The manuscript does not delineate the conditions under which this substitution is valid for the intended causal-inference use cases.
Authors: The referee correctly notes that randomized synthetic assignment produces an interventional rather than associational joint distribution. This option is intended only for specific use cases: (a) targeted augmentation to alleviate practical positivity violations while preserving marginal covariate and outcome distributions, and (b) creation of synthetic simulation engines with a known interventional ATE for pre-analysis estimator benchmarking. When the goal is to preserve the original associational ATE, the empirical propensity is retained. We will revise the manuscript by adding a short subsection that explicitly distinguishes the two regimes, states the conditions under which each is appropriate, and includes a summary table of intended use cases. This will remove ambiguity about when the interventional contrast is substituted.
Revision: partial
Circularity Check
No significant circularity; derivations are self-contained from standard losses
Full rationale
The paper's core loss-decomposition and sensitivity results are explicitly derived from standard causal inference quantities (e.g., overlap-weighted prediction loss) and machine-learning objectives rather than from any fitted parameters or self-defined quantities internal to the paper. The hybrid synthesis proposal is motivated by the identified mismatch and evaluated via experiments on synthetic and ACTG data; it does not rely on a theorem that reduces to the paper's own inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work appear in the abstract or in the described chain of reasoning. The central structural claim about ATE distortion versus predictive utility is therefore independent and can be tested against external benchmarks such as the IPW, AIPW, and TMLE estimators.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: prediction loss penalizes treatment-effect error only through an overlap-weighted term.
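For intuition, the overlap-weighted term can be written out under squared loss. The identity below is a standard one and is offered as a sketch: the notation ($\mu_t$, $e$, $\delta_t$, $\delta_\tau$) is introduced here for illustration, and the paper's own statement of the decomposition may differ in form or generality.

```latex
% Assumed notation (not taken from the paper):
%   \mu_t(x) = E[Y | X = x, T = t],   e(x) = P(T = 1 | X = x),
%   \delta_t(x) = \hat\mu_t(x) - \mu_t(x),   \delta_\tau(x) = \delta_1(x) - \delta_0(x).
\begin{align*}
\mathbb{E}\big[(\hat\mu_T(X) - \mu_T(X))^2\big]
  &= \mathbb{E}\big[e(X)\,\delta_1(X)^2 + (1 - e(X))\,\delta_0(X)^2\big] \\
  &= \mathbb{E}\Big[\big(e(X)\,\delta_1(X) + (1 - e(X))\,\delta_0(X)\big)^2\Big]
   + \mathbb{E}\big[e(X)\,(1 - e(X))\,\delta_\tau(X)^2\big], \\
\widehat{\mathrm{ATE}} - \mathrm{ATE}
  &= \mathbb{E}\big[\delta_\tau(X)\big]
  \quad \text{(holding the covariate law fixed)}.
\end{align*}
```

The second line follows from the mean-plus-variance decomposition of $\delta_T(X)$ given $X$, with $T \mid X \sim \mathrm{Bernoulli}(e(X))$. The contrast error $\delta_\tau$ enters the excess prediction risk only with weight $e(X)(1 - e(X))$, which vanishes where overlap is poor, whereas it enters the ATE error unweighted.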