Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities

Yichen Xu

arXiv: 2604.23904 · v2 · submitted 2026-04-26 · stat.ME · cs.AI · stat.ML

Pith reviewed 2026-05-12 02:09 UTC · model grok-4.3

classification stat.ME · cs.AI · stat.ML

keywords synthetic data, causal inference, average treatment effect, generative models, hybrid synthesis, tabular data, GAN, LLM

The pith

Fully generative synthetic tabular data often preserves predictive performance while distorting average treatment effect estimates, because prediction loss only weakly constrains the treatment contrast.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard generative models for tabular data, whether GAN-based or LLM-based, can produce synthetic tables that work well for ordinary prediction yet yield biased estimates of average treatment effects. The mismatch arises because accurate ATE recovery needs both a faithful distribution of covariates and an undistorted difference in outcomes between treated and untreated units, whereas typical training losses penalize errors in the treatment contrast only through an overlap-weighted term. To address this, the authors introduce a hybrid generation procedure that creates covariates generatively but models the treatment assignment and outcome processes separately, which permits deliberate choices such as random treatment assignment in the synthetic sample. Experiments on both simulated and real ACTG data indicate that the hybrid method recovers ATE values more faithfully than fully generative baselines and supports reliable benchmarking of estimators like IPW and TMLE.
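One way to see where an overlap-weighted term can come from, as a sketch under squared-error loss rather than the paper's exact statement: write the outcome regression as $m(x,t)=\mu_0(x)+t\,\tau(x)$, a fitted version as $\hat m$, and the propensity as $e(x)=P(T=1\mid X=x)$. The symbols $\delta_0$ and $\delta_\tau$ are notation assumed here for illustration.

```latex
\begin{aligned}
\delta_0(x) &= \hat\mu_0(x)-\mu_0(x), \qquad \delta_\tau(x)=\hat\tau(x)-\tau(x),\\
\mathbb{E}\big[(\hat m(X,T)-m(X,T))^2\big]
 &= \mathbb{E}\big[(1-e(X))\,\delta_0(X)^2 + e(X)\,(\delta_0(X)+\delta_\tau(X))^2\big]\\
 &= \mathbb{E}\big[(\delta_0(X)+e(X)\,\delta_\tau(X))^2\big]
  + \mathbb{E}\big[e(X)\,(1-e(X))\,\delta_\tau(X)^2\big].
\end{aligned}
```

In this identity the contrast error $\delta_\tau$ is penalized on its own only with weight $e(X)(1-e(X))$, which shrinks toward zero where overlap is poor, and the first term can vanish even when $\delta_\tau \neq 0$ (take $\delta_0 = -e\,\delta_\tau$). A small prediction loss therefore does not force a small treatment-contrast error.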

Core claim

We demonstrate that fully generative tabular synthesizers preserve predictive utility while distorting ATE estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results and propose a hybrid synthetic-data framework that generates covariates while modeling treatment and outcome mechanisms separately, allowing causal-purpose treatment assignment such as randomized synthetic assignment. Across synthetic and ACTG experiments, hybrid synthesis reduces

What carries the argument

Hybrid synthetic-data framework that generates covariates generatively while modeling treatment assignment and outcome mechanisms separately to support causal-purpose assignments.
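The three-step structure can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions, not the author's implementation: the covariate "generator" is a jittered bootstrap standing in for a CTGAN or LLM synthesizer, the outcome model is a linear regression with treatment interactions, and the names `fit_outcome_model`, `predict_outcome`, and `hybrid_synthesize` are hypothetical.

```python
import numpy as np

def fit_outcome_model(X, T, Y):
    """Least-squares fit of Y on [1, X, T, T*X]; a stand-in for any outcome learner."""
    design = np.column_stack([np.ones(len(X)), X, T, T[:, None] * X])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid_sd = np.std(Y - design @ coef)
    return coef, resid_sd

def predict_outcome(coef, X, T):
    design = np.column_stack([np.ones(len(X)), X, T, T[:, None] * X])
    return design @ coef

def hybrid_synthesize(X, T, Y, n_synth, rng, p_treat=0.5):
    # Step 1: covariates. ASSUMPTION: jittered bootstrap replaces a real generator.
    idx = rng.integers(0, len(X), size=n_synth)
    jitter = 0.05 * X.std(axis=0) * rng.standard_normal((n_synth, X.shape[1]))
    X_s = X[idx] + jitter
    # Step 2: treatment. Randomized synthetic assignment, independent of X_s.
    T_s = rng.binomial(1, p_treat, size=n_synth).astype(float)
    # Step 3: outcomes drawn from a separately fitted outcome model.
    coef, sd = fit_outcome_model(X, T, Y)
    Y_s = predict_outcome(coef, X_s, T_s) + sd * rng.standard_normal(n_synth)
    # ATE implied by the synthetic generator, known by construction.
    ate_s = np.mean(predict_outcome(coef, X_s, np.ones(n_synth))
                    - predict_outcome(coef, X_s, np.zeros(n_synth)))
    return X_s, T_s, Y_s, ate_s

# Toy confounded data with a true ATE of 2.0 (illustrative only).
rng = np.random.default_rng(0)
n = 5000
X = rng.standard_normal((n, 2))
e = 1.0 / (1.0 + np.exp(-1.5 * X[:, 0]))
T = rng.binomial(1, e).astype(float)
Y = 1.0 + X[:, 0] + 0.5 * X[:, 1] + 2.0 * T + rng.standard_normal(n)

X_s, T_s, Y_s, ate_s = hybrid_synthesize(X, T, Y, n_synth=4000, rng=rng)
naive_real = Y[T == 1].mean() - Y[T == 0].mean()
naive_synth = Y_s[T_s == 1].mean() - Y_s[T_s == 0].mean()
print(f"naive real: {naive_real:.2f}, naive synthetic: {naive_synth:.2f}, implied ATE: {ate_s:.2f}")
```

Because assignment is randomized in the synthetic sample, a plain difference in means on the synthetic table targets the generator's implied ATE, whereas the same contrast on the confounded toy data does not target the true effect.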

If this is right

Hybrid synthesis supports targeted augmentation to mitigate practical positivity violations in observational studies.
The resulting synthetic tables can function as simulation engines for comparing causal estimators such as OR, IPW, AIPW, and TMLE before real-data analysis.
LLM-based versions of the hybrid approach often achieve higher ATE fidelity than CTGAN-based versions.
The structural mismatch between prediction loss and ATE preservation holds for both GAN and LLM generators.
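The "simulation engine" use case above amounts to drawing repeated samples from a generator with a known ATE and scoring each estimator. The sketch below does this for OR, IPW, and AIPW with NumPy only; TMLE is omitted for brevity. The `simulate` function is a hypothetical toy data-generating process standing in for a hybrid synthetic generator, and all function names are assumptions for illustration.

```python
import numpy as np

def fit_logistic(X, T, n_iter=25):
    """Propensity scores from logistic regression fitted by Newton's method."""
    D = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-D @ beta))
        W = p * (1.0 - p)
        grad = D.T @ (T - p)
        hess = (D * W[:, None]).T @ D + 1e-8 * np.eye(D.shape[1])
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-D @ beta))

def fit_arm_means(X, T, Y):
    """Separate linear outcome regressions per arm; returns mu1(X), mu0(X)."""
    D = np.column_stack([np.ones(len(X)), X])
    b1, *_ = np.linalg.lstsq(D[T == 1], Y[T == 1], rcond=None)
    b0, *_ = np.linalg.lstsq(D[T == 0], Y[T == 0], rcond=None)
    return D @ b1, D @ b0

def estimate_ate(X, T, Y, clip=0.01):
    e = np.clip(fit_logistic(X, T), clip, 1 - clip)  # truncate extreme propensities
    mu1, mu0 = fit_arm_means(X, T, Y)
    or_est = np.mean(mu1 - mu0)                                # outcome regression
    ipw = np.mean(T * Y / e) - np.mean((1 - T) * Y / (1 - e))  # Horvitz-Thompson IPW
    aipw = np.mean(mu1 - mu0 + T * (Y - mu1) / e - (1 - T) * (Y - mu0) / (1 - e))
    return {"OR": or_est, "IPW": ipw, "AIPW": aipw}

def simulate(n, rng, true_ate=2.0):
    # ASSUMPTION: toy generator with a known ATE, not the ACTG simulator.
    X = rng.standard_normal((n, 2))
    e = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
    T = rng.binomial(1, e)
    Y = 1.0 + X[:, 0] + 0.5 * X[:, 1] + true_ate * T + rng.standard_normal(n)
    return X, T, Y

def benchmark(n=500, reps=100, true_ate=2.0, seed=0):
    rng = np.random.default_rng(seed)
    draws = {"OR": [], "IPW": [], "AIPW": []}
    for _ in range(reps):
        for name, value in estimate_ate(*simulate(n, rng, true_ate)).items():
            draws[name].append(value)
    return {name: {"bias": float(np.mean(v) - true_ate),
                   "rmse": float(np.sqrt(np.mean((np.array(v) - true_ate) ** 2)))}
            for name, v in draws.items()}

results = benchmark()
for name, stats in results.items():
    print(name, stats)
```

Swapping `simulate` for draws from a fitted synthetic generator would turn the same loop into a pre-analysis comparison of estimators at the sample size of interest.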

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Causal analyses that rely on synthetic data should explicitly verify treatment-effect contrast preservation rather than depending only on distributional similarity or predictive metrics.
The hybrid separation could be adapted to preserve other quantities such as conditional average treatment effects by adjusting the separate outcome model step.
In data-scarce settings the method may allow preliminary estimator validation and sensitivity checks without exhausting limited real observations.

Load-bearing premise

Separating covariate generation from treatment and outcome modeling can be performed without introducing new biases or distortions that would make the synthetic data unrealistic for causal inference.

What would settle it

A controlled simulation in which the true ATE is known shows that hybrid synthetic data recovers the ATE within sampling error while fully generative synthetic data produces estimates that deviate by more than two standard errors.
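The two-standard-error criterion is simple to operationalize. The sketch below is an assumed illustration: `Y_faithful` comes from a toy randomized design with a known ATE, and `Y_distorted` is an artificial sample whose treatment contrast is deliberately halved to mimic a distorted synthesizer. It is not output from any actual generative model.

```python
import numpy as np

def diff_in_means(T, Y):
    """Difference in arm means with its unpooled standard error."""
    y1, y0 = Y[T == 1], Y[T == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return est, se

def within_two_se(est, se, truth):
    """True when the estimate lies within two standard errors of the truth."""
    return abs(est - truth) <= 2.0 * se

rng = np.random.default_rng(0)
n, true_ate = 20000, 2.0
X = rng.standard_normal(n)
T = rng.binomial(1, 0.5, size=n)  # randomized assignment
noise = rng.standard_normal(n)

Y_faithful = 1.0 + X + true_ate * T + noise
# ASSUMPTION: contrast shrunk by half to imitate a distorted synthetic table.
Y_distorted = 1.0 + X + 0.5 * true_ate * T + noise

est_f, se_f = diff_in_means(T, Y_faithful)
est_d, se_d = diff_in_means(T, Y_distorted)
print("faithful within 2 SE:", within_two_se(est_f, se_f, true_ate))
print("distorted within 2 SE:", within_two_se(est_d, se_d, true_ate))
```

A single draw passes the faithful check only with roughly 95% probability, so in practice the comparison would be repeated over many replications.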

Figures

Figures reproduced from arXiv: 2604.23904 by Yichen Xu.

**Figure 1.** Privacy and causal-fidelity diagnostics for synthetic data. Left: predictive utility (TSTR AUC) and ...

**Figure 2.** Finite-sample benchmarking results under the LLM-based hybrid ACTG simulator. Panels report ...

Original abstract

Synthetic tabular data are often evaluated by distributional similarity, privacy distance, or train-on-synthetic-test-on-real predictive performance, but these criteria do not ensure validity for causal inference. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can preserve predictive utility while distorting average treatment effect (ATE) estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results, and identify an analogous decomposition in block-level next-token prediction under log loss. Motivated by the tabular causal analysis, we propose a hybrid synthetic-data framework that generates covariates while modeling treatment and outcome mechanisms separately, allowing causal-purpose treatment assignment such as randomized synthetic assignment. We evaluate this framework in three settings: ATE preservation under fully generative versus hybrid synthesis, targeted augmentation for practical positivity problems, and synthetic simulation engines for comparing OR, IPW, AIPW, and TMLE before real-data analysis. Across synthetic and ACTG experiments, hybrid synthesis improves causal fidelity relative to fully generative baselines; LLM-based hybrid synthesis is often more faithful than CTGAN for ATE preservation and finite-sample estimator benchmarking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags that standard generative synthesizers preserve prediction but distort ATE due to overlap-weighted losses, and offers a hybrid covariate-then-treatment/outcome framework that improves fidelity in the reported experiments.

The main thing to know is that fully generative models can keep overall predictive performance while still shifting ATE estimates, and the authors trace this to a mismatch between standard losses and the requirements for accurate treatment contrasts. They formalize it with a loss decomposition and sensitivity argument, then test a hybrid synthesis that generates covariates separately and models treatment and outcome mechanisms on their own, including options like randomized synthetic assignment. This setup is evaluated on synthetic cases, positivity augmentation, and estimator benchmarking with ACTG data, where it shows gains over CTGAN and LLM baselines for ATE preservation and finite-sample checks on OR, IPW, AIPW, and TMLE.

Referee Report

3 major / 2 minor

Summary. The paper claims that fully generative tabular synthesizers (GAN- and LLM-based) can preserve predictive utility while distorting ATE estimates due to a structural mismatch: ATE requires realistic p(X) and accurate treatment-effect contrast, but prediction loss penalizes treatment-effect error only via an overlap-weighted term. It formalizes this via sensitivity and loss-decomposition results (including an analogous decomposition for block-level next-token prediction), proposes a hybrid framework that generates covariates separately while modeling treatment/outcome mechanisms (allowing randomized synthetic assignment), and reports improved ATE preservation, targeted augmentation for positivity issues, and utility for estimator benchmarking across synthetic and ACTG experiments, with LLM-hybrid often outperforming CTGAN.

Significance. If the hybrid construction can be shown to avoid new biases while delivering the claimed ATE fidelity, the work would be significant for causal inference practice: it supplies a concrete remedy for using synthetic data in privacy-sensitive or positivity-challenged settings and for pre-analysis simulation of estimators (OR, IPW, AIPW, TMLE). The loss-decomposition insight usefully separates predictive from causal objectives and is grounded in standard loss functions rather than ad-hoc quantities.

major comments (3)

[hybrid framework] Hybrid framework description: No theorem or proposition establishes that the separated modeling of p(X), p(T|X), and p(Y|X,T) (or randomized assignment) preserves the original ATE up to o_p(1) error. The loss-decomposition result shows that even separate fits can under-penalize tau errors in low-overlap regions; without a formal bound or convergence argument, the claim that hybrid synthesis remedies the structural mismatch remains unproven.
[experiments] Evaluation on ACTG data and synthetic settings: The reported improvements in causal fidelity for hybrid vs. fully generative baselines do not include sensitivity checks to outcome-model misspecification or to the finite-sample behavior of the separate logistic/ML/LLM fits. Such checks are needed because the same overlap-weighted penalization identified in the decomposition can reappear in the hybrid components.
[hybrid framework] Randomized synthetic assignment option: When the empirical propensity is replaced by randomized assignment, the synthetic joint no longer matches the original data-generating process; any ATE computed on the synthetic data then reflects an interventional contrast rather than the associational quantity in the real data. The manuscript does not delineate the conditions under which this substitution is valid for the intended causal-inference use cases.
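The dependence on both ingredients raised in these comments can be made explicit with a short identity, offered as a sketch in assumed notation rather than a result from the paper. Let $p$ and $\tau$ denote the real covariate law and conditional treatment effect, and $\hat p$ and $\hat\tau$ the ones implied by the synthetic generator, so that a difference in means under randomized synthetic assignment targets $\mathrm{ATE}_{\mathrm{syn}} = \mathbb{E}_{\hat p}[\hat\tau(X)]$.

```latex
\mathrm{ATE}_{\mathrm{syn}} - \mathrm{ATE}
 = \mathbb{E}_{\hat p}[\hat\tau(X)] - \mathbb{E}_{p}[\tau(X)]
 = \underbrace{\mathbb{E}_{\hat p}\big[\hat\tau(X)-\tau(X)\big]}_{\text{contrast error}}
 + \underbrace{\mathbb{E}_{\hat p}[\tau(X)] - \mathbb{E}_{p}[\tau(X)]}_{\text{covariate-law error}}
```

The gap closes only when both the fitted contrast and the generated covariate law are accurate. Neither term involves the synthetic propensity, which suggests why assignment can be randomized in the synthetic sample without moving this target; whether that target equals the causal effect in the real data still rests on the usual identification assumptions.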

minor comments (2)

[formalization] Clarify whether the loss-decomposition result is stated for the population or for finite samples, and whether it applies directly to the LLM next-token objective or only by analogy.
[experiments] In the tables or figures reporting ATE estimates, include standard errors or confidence intervals so that the magnitude of improvement can be assessed against sampling variability.
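One common way to attach uncertainty to an ATE estimate is a nonparametric bootstrap. The sketch below is an assumed illustration, not the paper's procedure: `regression_ate` is a hypothetical regression-adjusted estimator, and the data come from a toy confounded design with a true ATE of 2.0.

```python
import numpy as np

def regression_ate(X, T, Y):
    """Coefficient on T in a linear regression of Y on [1, X, T]."""
    D = np.column_stack([np.ones(len(Y)), X, T])
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return coef[-1]

def bootstrap_ci(estimator, X, T, Y, n_boot=300, alpha=0.05, seed=0):
    """Bootstrap standard error and percentile interval for an ATE estimator."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        draws[b] = estimator(X[idx], T[idx], Y[idx])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return draws.std(ddof=1), (lo, hi)

# Toy confounded data (illustrative only).
rng = np.random.default_rng(1)
n = 1000
X = rng.standard_normal((n, 2))
e = 1.0 / (1.0 + np.exp(-X[:, 0]))
T = rng.binomial(1, e).astype(float)
Y = 1.0 + X[:, 0] + 0.5 * X[:, 1] + 2.0 * T + rng.standard_normal(n)

point = regression_ate(X, T, Y)
se, (lo, hi) = bootstrap_ci(regression_ate, X, T, Y)
print(f"ATE estimate {point:.3f}, bootstrap SE {se:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

Reporting the interval alongside each point estimate lets a reader judge whether a gap between hybrid and fully generative synthesis exceeds sampling variability.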

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which identify key areas where the theoretical justification and empirical robustness of the hybrid framework can be strengthened. We address each major comment below, indicating where revisions will be incorporated.

Referee: [hybrid framework] Hybrid framework description: No theorem or proposition establishes that the separated modeling of p(X), p(T|X), and p(Y|X,T) (or randomized assignment) preserves the original ATE up to o_p(1) error. The loss-decomposition result shows that even separate fits can under-penalize tau errors in low-overlap regions; without a formal bound or convergence argument, the claim that hybrid synthesis remedies the structural mismatch remains unproven.

Authors: We acknowledge that the manuscript does not contain a formal theorem establishing o_p(1) preservation of the ATE under separated modeling of p(X), p(T|X), and p(Y|X,T). The loss-decomposition result indeed applies componentwise and does not automatically guarantee that separate fits eliminate under-penalization of tau errors in low-overlap regions. Our position is that the hybrid construction remedies the structural mismatch by allowing direct optimization of the treatment-effect contrast (rather than an indirect overlap-weighted term), which is supported by the reported experiments. In the revision we will add a dedicated paragraph in the discussion section explicitly noting the absence of a general convergence bound, stating the additional assumptions (consistent estimation of the conditional models and correct specification of the covariate marginal) under which the hybrid ATE would match the original, and clarifying that the current claims rest on the decomposition insight plus empirical evidence rather than a formal guarantee.

Revision: partial

Referee: [experiments] Evaluation on ACTG data and synthetic settings: The reported improvements in causal fidelity for hybrid vs. fully generative baselines do not include sensitivity checks to outcome-model misspecification or to the finite-sample behavior of the separate logistic/ML/LLM fits. Such checks are needed because the same overlap-weighted penalization identified in the decomposition can reappear in the hybrid components.

Authors: We agree that the current experiments lack systematic sensitivity checks for outcome-model misspecification and finite-sample behavior of the separate fits. The reported gains are based on standard logistic, ML, and LLM implementations, but the overlap-weighted penalization identified in the decomposition could indeed reappear if the conditional models are poorly specified or estimated from small samples. In the revised manuscript we will add a new subsection containing (i) controlled misspecification experiments (e.g., omitting interactions or using linear models when the true conditional expectation is nonlinear) and (ii) finite-sample curves showing ATE fidelity as a function of training size for each hybrid component. These additions will directly address the referee's concern.

Revision: yes

Referee: [hybrid framework] Randomized synthetic assignment option: When the empirical propensity is replaced by randomized assignment, the synthetic joint no longer matches the original data-generating process; any ATE computed on the synthetic data then reflects an interventional contrast rather than the associational quantity in the real data. The manuscript does not delineate the conditions under which this substitution is valid for the intended causal-inference use cases.

Authors: The referee correctly notes that randomized synthetic assignment produces an interventional rather than associational joint distribution. This option is intended only for specific use cases: (a) targeted augmentation to alleviate practical positivity violations while preserving marginal covariate and outcome distributions, and (b) creation of synthetic simulation engines with a known interventional ATE for pre-analysis estimator benchmarking. When the goal is to preserve the original associational ATE, the empirical propensity is retained. We will revise the manuscript by adding a short subsection that explicitly distinguishes the two regimes, states the conditions under which each is appropriate, and includes a summary table of intended use cases. This will remove ambiguity about when the interventional contrast is substituted.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained from standard losses

The paper's core loss-decomposition and sensitivity results are explicitly derived from standard causal inference quantities (e.g., overlap-weighted prediction loss) and machine-learning objectives rather than from any fitted parameters or self-defined quantities internal to the paper. The hybrid synthesis proposal is motivated by the identified mismatch and evaluated via experiments on synthetic and ACTG data; it does not rely on a theorem that reduces to the paper's own inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described chain. The central structural claim about ATE distortion versus predictive utility is therefore independent and falsifiable against external benchmarks such as IPW, AIPW, and TMLE estimators.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard domain assumptions from causal inference and generative modeling; the hybrid separation is presented as a modeling choice rather than a new postulated entity.

axioms (1)

Domain assumption: Prediction loss penalizes treatment-effect error only through an overlap-weighted term.
This is the key structural mismatch stated in the abstract.

pith-pipeline@v0.9.0 · 5525 in / 1386 out tokens · 62493 ms · 2026-05-12T02:09:22.968844+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] Borisov, V., Sessler, K., Leemann, T., Pawelczyk, M., and Kasneci, G. (2023). Language models are realistic tabular data generators. In The Eleventh International Conference on Learning Representations. Cheng, C., Li, F., Thomas, L. E., and Li, F. F. (2022). Addressing extreme propensity scores in estimating counterfactual survival functions via the overla...

[2] Chipman, H., George, E., and McCulloch, R. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics,

[3] Cole, S. R. and Hernán, M. A. (2008). Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology, 168(6):656–664. Epub 2008 Aug

[4] De Bartolomeis, P., Abad, J., Wang, G., Donhauser, K., Duch, R. M., Yang, F., and Dahabreh, I. J. (2024). Efficient randomized experiments using foundation models. arXiv preprint arXiv:2502.04262. Freedman, D. A. and Berk, R. A. (2008). Weighting regressions by propensity scores. Evaluation Review, 32(4):392–409. Gruber, S., Phillips, R. V., Lee, H., and va...

[5] Li, Z., Zhu, H., Lu, Z., and Yin, M. (2023). Synthetic data generation with large language models for text classification: Potential and limitations. In Bouamor, H., Pino, J., and Bali, K., editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461, Singapore. Association for Computational Linguistics...

[6] Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866. Tran, L., Ye, T., Ding, P., and Han, F. (2026). Generative modeling for the bootstrap. van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal I...

[7] Curran Associates, Inc. Xu, Y., Gruber, S., and van der Laan, M. J. (2026a). Investigating targeting strategies and truncation in TMLE for the average treatment effect under practical positivity violations. Xu, Y., Nakada, R., Zhang, L., and Li, L. (2026b). Residual feature integration is sufficient to prevent negative transfer. In The Fourteenth Internati...

[8] For the positivity experiments, we additionally generate an observational dataset of size 200, where treatment assignment follows the propensity model above. For LLM-based tabular synthesis, we used the GReaT framework (Borisov et al., 2023) with GPT-2 as the underlying language model. Each row of the training table was serialized into a textual template t...