Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference
Pith reviewed 2026-05-18 19:04 UTC · model grok-4.3
The pith
Augmenting conformal calibration with synthetic counterfactual labels from a pre-trained model produces narrower prediction intervals while preserving marginal coverage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that synthetic-data-powered conformal counterfactual inference (SP-CCI) yields strictly tighter intervals than standard conformal counterfactual inference while retaining exact marginal coverage. It does so by augmenting the calibration set with labels from a pre-trained model and applying a debiasing step before running risk-controlling prediction sets. The supporting guarantees are stated for both exact and approximate importance weighting.
What carries the argument
SP-CCI, the procedure that inserts synthetic counterfactual labels into the calibration set, applies a prediction-powered-inference debiasing correction, and then runs risk-controlling prediction sets.
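The three steps named above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: it assumes symmetric intervals [mu(x) - lam, mu(x) + lam], a 0/1 miscoverage loss, importance weights that are all equal to one, and a CLT-based upper bound in place of whatever finite-sample bound the paper uses. All function and variable names (`miscoverage`, `ppi_risk_ucb`, `select_lambda`, the toy data) are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def miscoverage(abs_residual, lam):
    """0/1 loss: 1.0 where the interval [mu - lam, mu + lam] misses the label."""
    return (np.asarray(abs_residual) > lam).astype(float)


def ppi_risk_ucb(res_real, res_synth_paired, res_synth_only, lam, delta):
    """PPI-style debiased risk estimate plus an asymptotic (CLT) upper bound."""
    synth_loss = miscoverage(res_synth_only, lam)
    # Rectifier: real-minus-synthetic loss on the samples that carry both labels.
    rectifier = miscoverage(res_real, lam) - miscoverage(res_synth_paired, lam)
    estimate = synth_loss.mean() + rectifier.mean()
    std_err = np.sqrt(synth_loss.var(ddof=1) / synth_loss.size
                      + rectifier.var(ddof=1) / rectifier.size)
    return estimate + NormalDist().inv_cdf(1.0 - delta) * std_err


def select_lambda(res_real, res_synth_paired, res_synth_only, alpha, delta, grid):
    """RCPS-style scan: shrink lam from the top until the bound exceeds alpha."""
    chosen = None
    for lam in sorted(grid, reverse=True):
        if ppi_risk_ucb(res_real, res_synth_paired, res_synth_only, lam, delta) > alpha:
            break
        chosen = lam
    return chosen


# Toy data (assumption): absolute residuals |y - mu(x)| with unit-normal noise,
# and a fairly accurate synthetic labeler that adds a little extra noise.
rng = np.random.default_rng(0)
n, N = 300, 5000
eps_real = rng.normal(size=n)
res_real = np.abs(eps_real)
res_synth_paired = np.abs(eps_real + 0.2 * rng.normal(size=n))
res_synth_only = np.abs(rng.normal(size=N) + 0.2 * rng.normal(size=N))

lam_hat = select_lambda(res_real, res_synth_paired, res_synth_only,
                        alpha=0.1, delta=0.1, grid=np.linspace(0.0, 4.0, 81))
print("selected half-width:", lam_hat)
```

The standard error combines a large-sample synthetic term with a rectifier term whose variance shrinks as the synthetic labels agree more often with the real ones, which is the mechanism by which an accurate labeler could tighten the selected half-width.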
If this is right
- Prediction intervals for individual counterfactual outcomes become narrower under treatment imbalance.
- Marginal coverage is retained for both exact and approximate importance weighting.
- The same coverage target is achieved with fewer real counterfactual samples.
- Interval widths decrease consistently across multiple datasets relative to standard conformal counterfactual inference.
Where Pith is reading between the lines
- The method could be applied to sequential decision problems where new synthetic labels are generated on the fly from an updating model.
- It raises the question of how the quality of the pre-trained model affects the degree of tightening in finite samples.
- One could test whether replacing the pre-trained model with a simpler baseline still yields net gains after the debiasing step.
Load-bearing premise
The synthetic labels, once corrected by the debiasing step, must preserve the marginal coverage property of the risk-controlling procedure without extra conditions on how accurate the pre-trained model is or how much the data overlap.
What would settle it
An experiment in which the empirical coverage of the resulting intervals falls below the nominal target level on a dataset where the pre-trained counterfactual model is deliberately made inaccurate.
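A toy version of such an experiment can be set up as follows. It is only a sketch under strong assumptions: residuals are unit normal so true coverage has a closed form, the inaccurate labeler (hypothetical) halves every residual, importance weights are ignored, and thresholds are chosen from point estimates of the risk without a confidence bound. It contrasts naive pooling of synthetic labels with a debiased estimate; it says nothing about how the paper's procedure behaves on real data.

```python
import math
import numpy as np


def miss_rate(abs_residual, lam):
    """Fraction of labels falling outside [mu - lam, mu + lam]."""
    return float(np.mean(abs_residual > lam))


def smallest_valid_lambda(risk_fn, alpha, grid):
    """Scan from wide to narrow; stop at the first half-width whose risk exceeds alpha."""
    chosen = None
    for lam in sorted(grid, reverse=True):
        if risk_fn(lam) > alpha:
            break
        chosen = lam
    return chosen


rng = np.random.default_rng(0)
n, N, alpha = 2000, 20000, 0.1
res_real = np.abs(rng.normal(size=n))      # true |y - mu(x)|, unit-normal noise
# Deliberately inaccurate labeler (assumption): it halves every residual,
# so synthetic labels look far easier to cover than the real ones.
res_synth_paired = 0.5 * res_real
res_synth_only = 0.5 * np.abs(rng.normal(size=N))


def naive_risk(lam):
    # Pool real and synthetic labels as if both were real.
    return miss_rate(np.concatenate([res_real, res_synth_only]), lam)


def debiased_risk(lam):
    # Synthetic risk plus a real-minus-synthetic correction on the paired samples.
    return (miss_rate(res_synth_only, lam)
            + miss_rate(res_real, lam) - miss_rate(res_synth_paired, lam))


def true_coverage(lam):
    # Exact coverage of [mu - lam, mu + lam] when residuals are N(0, 1).
    return math.erf(lam / math.sqrt(2.0))


grid = np.linspace(0.0, 4.0, 401)
cov_naive = true_coverage(smallest_valid_lambda(naive_risk, alpha, grid))
cov_debiased = true_coverage(smallest_valid_lambda(debiased_risk, alpha, grid))
print(f"naive pooling: {cov_naive:.3f}, debiased: {cov_debiased:.3f}, target: {1 - alpha:.2f}")
```

In this toy setting the correction term restores coverage near the target, but with a labeler this poor the correction carries almost all of the variance of the real-only estimate, so no tightening should be expected.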
Original abstract
This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SP-CCI, a framework for conformal counterfactual inference that augments the calibration set with synthetic counterfactual labels from a pre-trained model. It applies a PPI-based debiasing correction within an RCPS procedure to obtain tighter prediction intervals while preserving marginal coverage, with claimed theoretical guarantees under both exact and approximate importance weighting. Empirical results across datasets are reported to show consistent reductions in interval width relative to standard CCI.
Significance. If the central claims hold, SP-CCI would offer a practical way to mitigate conservatism in CCI under treatment imbalance by leveraging synthetic data without sacrificing validity. The combination of synthetic labels, PPI debiasing, and RCPS is a coherent technical contribution that could generalize to other conformal settings with scarce counterfactual samples. The manuscript supplies both proofs and experiments, which is a strength.
major comments (2)
- [Theoretical guarantees for approximate importance weighting] The high-probability coverage statement for the RCPS procedure after PPI correction does not include an explicit bound on the residual bias induced by approximation error in the importance weights (whether from finite-sample estimation or model misspecification). Without such a bound relative to the RCPS risk tolerance, the marginal coverage guarantee may not hold even if the exact-weighting case is correct.
- [§3 (method)] The analysis assumes that the pre-trained counterfactual model produces synthetic labels whose distribution, after PPI debiasing, satisfies the conditions for RCPS to retain marginal coverage. No quantitative conditions on model accuracy or on overlap between observed and synthetic distributions are stated, leaving the validity claim dependent on unverified properties of the pre-trained model.
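The first major comment can be made concrete with one generic way such a bound might be stated. This is a sketch under assumptions not taken from the paper: the loss $\ell_\lambda$ takes values in $[0,1]$, the approximate weights $\hat{w}$ are fixed independently of the calibration data, and $\varepsilon_w$ is a hypothetical symbol for their mean absolute error. It is not the authors' lemma.

```latex
% Weighted risk under exact weights w and approximate weights \hat{w}
R(\lambda) = \mathbb{E}\big[w(X)\,\ell_\lambda(X,Y)\big], \qquad
\tilde{R}(\lambda) = \mathbb{E}\big[\hat{w}(X)\,\ell_\lambda(X,Y)\big].

% Since 0 \le \ell_\lambda \le 1, the gap is bounded uniformly in \lambda:
\sup_{\lambda}\,\big|\tilde{R}(\lambda) - R(\lambda)\big|
  \;\le\; \mathbb{E}\big[\,|\hat{w}(X) - w(X)|\,\big] \;=:\; \varepsilon_w .

% Controlling the approximate risk at the reduced level \alpha - \varepsilon_w
% therefore controls the exact risk at level \alpha:
\Pr\big(\tilde{R}(\hat{\lambda}) \le \alpha - \varepsilon_w\big) \ge 1 - \delta
\;\Longrightarrow\;
\Pr\big(R(\hat{\lambda}) \le \alpha\big) \ge 1 - \delta .
```

Because the bound is uniform in $\lambda$, it applies to the data-dependent choice $\hat{\lambda}$. The statement is vacuous unless $\varepsilon_w < \alpha$, which is the kind of explicit condition the comment asks for.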
minor comments (2)
- [Abstract] The abstract states results on 'different datasets' without naming them or reporting the precise metrics (e.g., average width reduction, coverage rates) used for comparison.
- [Notation and definitions] Notation for the importance weights and the PPI correction term could be introduced earlier and used consistently in both the theoretical and experimental sections to improve readability.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and describe the revisions we will make to strengthen the theoretical analysis.
- Referee: [Theoretical guarantees for approximate importance weighting] The high-probability coverage statement for the RCPS procedure after PPI correction does not include an explicit bound on the residual bias induced by approximation error in the importance weights (whether from finite-sample estimation or model misspecification). Without such a bound relative to the RCPS risk tolerance, the marginal coverage guarantee may not hold even if the exact-weighting case is correct.
  Authors: We agree that an explicit bound on the residual bias would make the guarantee more complete. In the revised version, we will add a lemma bounding the approximation error in the importance weights (from both finite samples and potential misspecification) and show how this error propagates to the RCPS risk control. Specifically, we will derive a term that can be absorbed into the risk tolerance α, ensuring that marginal coverage holds with high probability under a mild condition on the approximation quality.
  Revision: yes
- Referee: [§3 (method)] The analysis assumes that the pre-trained counterfactual model produces synthetic labels whose distribution, after PPI debiasing, satisfies the conditions for RCPS to retain marginal coverage. No quantitative conditions on model accuracy or on overlap between observed and synthetic distributions are stated, leaving the validity claim dependent on unverified properties of the pre-trained model.
  Authors: This is a valid point. The current presentation relies on the debiasing step to correct for discrepancies, but we did not quantify the required model quality. In the revision, we will introduce an assumption on the accuracy of the pre-trained model, such as a bound on the expected absolute difference between synthetic and true counterfactuals or on the variance of the importance weights. We will also discuss how overlap can be ensured via the importance weighting scheme and add a remark on practical verification of these conditions.
  Revision: yes
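The two quantities the authors mention can at least be estimated on held-out data. The sketch below is one possible diagnostic, not something the paper specifies: it computes the mean absolute gap between synthetic and observed labels (only possible on samples whose true outcome was observed) and summarizes weight dispersion by the variance of mean-normalized weights and the Kish effective sample size. Function names and the numbers are hypothetical.

```python
import numpy as np


def mean_abs_label_error(y_true, y_synth):
    """Average |synthetic - true| on samples where the true label was observed."""
    return float(np.mean(np.abs(np.asarray(y_synth) - np.asarray(y_true))))


def weight_diagnostics(weights):
    """Variance of mean-normalized weights and Kish effective sample size."""
    w = np.asarray(weights, dtype=float)
    w_norm = w / w.mean()                      # rescale so the mean weight is 1
    ess = w.sum() ** 2 / np.sum(w ** 2)        # Kish: (sum w)^2 / sum w^2
    return float(w_norm.var(ddof=1)), float(ess)


# Hypothetical numbers purely for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_synth = np.array([1.5, 1.5, 3.0, 5.0])
weights = np.array([1.0, 1.0, 1.0, 5.0])

mae = mean_abs_label_error(y_true, y_synth)
w_var, ess = weight_diagnostics(weights)
print(f"MAE={mae:.3f}, weight variance={w_var:.3f}, ESS={ess:.2f} of {len(weights)}")
```

An effective sample size far below the nominal count signals that a few heavily weighted samples dominate the calibration, which is where poor overlap would show up in practice.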
Circularity Check
No circularity: derivation relies on independent RCPS and PPI steps
The abstract and claims describe SP-CCI as augmenting calibration with synthetic labels from a pre-trained model, then applying RCPS with PPI debiasing to obtain tighter intervals while preserving marginal coverage. No quoted equations or steps reduce any 'prediction' to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation or ansatz imported from the authors' prior work. The theoretical guarantees under exact and approximate weighting are presented as separate proofs, leaving the central result self-contained against external conformal baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Marginal coverage of RCPS is preserved after augmentation with synthetic labels and PPI debiasing under exact or approximate importance weighting.
invented entities (1)
- Synthetic counterfactual labels: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Paper authors: Amirmohammad Farzaneh, Matteo Zecchin, Osvaldo Simeone