CausalGuard: Conformal Inference under Graph Uncertainty
Pith reviewed 2026-05-22 07:48 UTC · model grok-4.3
The pith
CausalGuard aggregates pseudo-outcomes from candidate causal graphs to provide finite-sample coverage guarantees for treatment effect estimates despite graph uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect.
What carries the argument
A structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes using a composite nonconformity score on the posterior-weighted pseudo-outcome.
If this is right
- Attains mean coverage above the nominal 90% level across five benchmarks for the target.
- Reduces prediction interval width compared to graph-agnostic conformal methods that require large padding.
- Suppresses invalid collider adjustment in stress tests.
- Remains stable under misspecified priors provided the retained candidates are data-supported.
Where Pith is reading between the lines
- If the approach scales, it could enable safer use of observational data in policy decisions where causal graphs are debated.
- Combining this with expert knowledge to refine LLM-proposed graphs might further tighten the intervals.
- Testing on time-series or dynamic graphs could reveal whether the method extends beyond static DAGs.
Load-bearing premise
The set of candidate DAGs must contain target-aligned valid adjustment strategies and be supported by the observed data after pruning and reweighting.
What would settle it
Observe whether coverage falls below 90% on a benchmark dataset where the true causal graph is known and the method's assumptions hold but graph uncertainty is high.
Figures
read the original abstract
Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CausalGuard, a conformal framework for treatment effect estimation under causal graph uncertainty. Candidate DAGs are proposed via an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by BIC. Graph-conditional doubly robust pseudo-outcomes are aggregated using posterior weights, and a composite nonconformity score is applied to the posterior-weighted pseudo-outcome. The paper claims distribution-free finite-sample marginal coverage for this aggregated quantity; under causal identification, overlap, nuisance stability, and concentration on target-aligned valid adjustment strategies, the conditional mean is asserted to converge to the true CATE. Experiments across five benchmarks report coverage above the nominal 90% level and narrower intervals than graph-agnostic baselines, with stress tests indicating suppression of invalid collider adjustment.
Significance. If the central claims hold, the work provides a principled way to obtain valid conformal intervals for causal effects when the adjustment set is uncertain, potentially reducing the conservatism of purely graph-agnostic wrappers. The distribution-free finite-sample coverage for the aggregated pseudo-outcome is a clear technical strength, and the empirical demonstration that the method remains stable under misspecified priors when the retained set is data-supported is useful. The combination of LLM-assisted discovery, BIC reweighting, and conformal calibration is novel and addresses a practical gap in observational causal inference.
major comments (2)
- [Abstract] Abstract and theoretical claims: the finite-sample marginal coverage guarantee applies to the aggregated pseudo-outcome via the composite nonconformity score, but the claim that the conditional mean converges to the CATE rests on the unverified assumption that BIC-reweighted posteriors concentrate on graphs containing target-aligned valid adjustment strategies. No formal argument is supplied showing that BIC down-weights invalid yet data-compatible candidates (e.g., those inducing collider bias after CI pruning); the cited stress tests are empirical only and do not bound posterior mass on such graphs.
- [Theoretical results] Theoretical development: the manuscript should explicitly derive or cite the conditions under which posterior weighting of doubly robust pseudo-outcomes preserves the convergence property when the support includes both valid and invalid adjustment sets; the listed assumptions (identification, overlap, nuisance stability, concentration) are necessary but the interaction with the composite score and BIC reweighting requires a precise statement to support the CATE claim.
minor comments (2)
- The abstract would benefit from a clearer separation between the distribution-free coverage result (which holds regardless of graph correctness) and the convergence result (which requires the concentration assumption).
- Notation for the posterior-weighted pseudo-outcome and the composite nonconformity score should be introduced with explicit definitions early in the methods section to aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the major comments below and outline the revisions we plan to make to strengthen the theoretical presentation.
read point-by-point responses
-
Referee: [Abstract] Abstract and theoretical claims: the finite-sample marginal coverage guarantee applies to the aggregated pseudo-outcome via the composite nonconformity score, but the claim that the conditional mean converges to the CATE rests on the unverified assumption that BIC-reweighted posteriors concentrate on graphs containing target-aligned valid adjustment strategies. No formal argument is supplied showing that BIC down-weights invalid yet data-compatible candidates (e.g., those inducing collider bias after CI pruning); the cited stress tests are empirical only and do not bound posterior mass on such graphs.
Authors: The referee is correct that the CATE convergence relies on the concentration assumption stated in the paper. We do not supply a formal argument that BIC universally down-weights invalid graphs, since this depends on the data and may not hold in all cases. The stress tests are indeed empirical. We will revise the abstract to better distinguish the coverage guarantee from the convergence claim and add a discussion of the assumption's role and limitations. revision: partial
-
Referee: [Theoretical results] Theoretical development: the manuscript should explicitly derive or cite the conditions under which posterior weighting of doubly robust pseudo-outcomes preserves the convergence property when the support includes both valid and invalid adjustment sets; the listed assumptions (identification, overlap, nuisance stability, concentration) are necessary but the interaction with the composite score and BIC reweighting requires a precise statement to support the CATE claim.
Authors: We will revise the theoretical section to provide a more precise statement of the assumptions and their interaction with the posterior weighting and composite nonconformity score. The coverage guarantee is distribution-free for the aggregated quantity. Convergence to CATE requires the concentration assumption to ensure the weighted pseudo-outcome targets the correct quantity. We will cite relevant literature on causal model averaging to support this. revision: yes
- A general formal proof or bound that BIC reweighting concentrates posterior mass exclusively on valid adjustment sets for arbitrary invalid but data-compatible graphs.
Circularity Check
No significant circularity; coverage and convergence claims are self-contained under explicit assumptions
full rationale
The paper defines an aggregated posterior-weighted pseudo-outcome and applies a composite nonconformity score to obtain distribution-free finite-sample marginal coverage for that quantity; this is a direct application of standard conformal prediction to a constructed score and does not reduce by construction to a fitted parameter or self-referential definition. Convergence of the conditional mean to the true CATE is stated only under a list of assumptions that explicitly includes 'concentration on target-aligned valid adjustment strategies,' which is not derived or proven inside the paper but treated as a prerequisite. No load-bearing step relies on self-citation chains, ansatz smuggling, or renaming of known results; the BIC reweighting and LLM prior are procedural choices whose validity is assumed rather than tautologically enforced by the coverage guarantee. The derivation therefore remains independent of its inputs.
Axiom & Free-Parameter Ledger
axioms (4)
- domain assumption causal identification
- domain assumption overlap
- domain assumption conditional-mean nuisance stability
- domain assumption concentration on target-aligned valid adjustment strategies
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CAUSALGUARD provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Cambridge University Press, USA, 2nd edition, 2009
Judea Pearl.Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 052189560X
work page 2009
-
[2]
Double/debiased machine learning for treatment and structural parameters
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal, 21(1):C1–C68, 02 2018. ISSN 1368-4221. doi: 10.1111/ectj.12097. URLhttps://doi.org/10.1111/ectj.12097
-
[3]
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests, 2017. URLhttps://arxiv.org/abs/1510.04342
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[4]
Peter Spirtes, Clark Glymour, and Richard Scheines.Causation, Prediction, and Search. MIT Press, 2nd edition, 2000
work page 2000
-
[5]
Lihua Lei and Emmanuel J. Candès. Conformal inference of counterfactuals and individual treatment effects.Journal of the Royal Statistical Society: Series B, 83(5):911–938, 2021
work page 2021
-
[6]
Yaniv Romano, Evan Patterson, and Emmanuel J. Candès. Conformalized quantile regression. InAdvances in Neural Information Processing Systems (NeurIPS), 2019
work page 2019
-
[7]
Estimating the dimension of a model.The Annals of Statistics, 6(2):461–464, 1978
Gideon Schwarz. Estimating the dimension of a model.The Annals of Statistics, 6(2):461–464, 1978
work page 1978
-
[8]
Vladimir V ovk, Alexander Gammerman, and Glenn Shafer.Algorithmic Learning in a Random World. Springer, 2005
work page 2005
-
[9]
Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression.Journal of the American Statistical Associ- ation, 113(523):1094–1111, 2018
work page 2018
-
[10]
Alaa, Zaid Ahmad, and Mark van der Laan
Ahmed M. Alaa, Zaid Ahmad, and Mark van der Laan. Conformal meta-learners for predictive inference of individual treatment effects. InAdvances in Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[11]
Tibshirani, Rina Foygel Barber, Emmanuel J
Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. InAdvances in Neural Information Processing Systems (NeurIPS), 2019
work page 2019
-
[12]
Ying Jin, Zhimei Ren, and Emmanuel J. Candès. Sensitivity analysis of individual treatment effects: A robust conformal inference approach.Proceedings of the National Academy of Sciences, 120(6):e2214889120, 2023
work page 2023
-
[13]
Data-driven covariate selection for nonparametric estimation of causal effects
Doris Entner, Patrik Hoyer, and Peter Spirtes. Data-driven covariate selection for nonparametric estimation of causal effects. InAISTATS, 2013
work page 2013
- [14]
-
[15]
David Maxwell Chickering. Optimal structure identification with greedy search.Journal of Machine Learning Research, 3:507–554, 2002
work page 2002
-
[16]
Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2018
work page 2018
-
[17]
Causal reasoning and large language models: Opening a new frontier for causality
Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality.arXiv preprint arXiv:2305.00050, 2023
-
[18]
Causal discovery with language models as imperfect experts
Stephanie Long, Alexandre Piché, Valentina Zantedeschi, Tibor Schuster, and Alexandre Drouin. Causal discovery with language models as imperfect experts. InICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023. URLhttps://openreview. net/forum?id=RXlvYZAE49. 10
work page 2023
-
[19]
Taiyu Ban, Lyvzhou Chen, Xiangyu Wang, and Huanhuan Chen. From query tools to causal architects: Harnessing large language models for advanced causal discovery from data.arXiv preprint arXiv:2306.16902, 2023
-
[20]
Efficient Causal Graph Discovery Using Large Language Models
Thomas Jiralerspong, Xiaoyin Chen, Yash More, Vedant Shah, and Yoshua Bengio. Efficient causal graph discovery using large language models.arXiv preprint arXiv:2402.01207, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Eleni Sgouritsa, Virginia Aglietti, Yee Whye Teh, Arnaud Doucet, Arthur Gretton, and Silvia Chiappa. Prompting strategies for enabling large language models to infer causation from correlation.arXiv preprint arXiv:2412.13952, 2024
- [22]
-
[23]
VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning
Vikash Singh, D. Cassel, N. Weir, N. Feng, and S. Bayless. VERGE: Formal refinement and guidance engine for verifiable LLM reasoning.arXiv preprint arXiv:2601.20055, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
David Madigan and Adrian E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam’s window.Journal of the American Statistical Association, 89(428):1535–1546, 1994
work page 1994
-
[25]
Hoeting, David Madigan, Adrian E
Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. V olinsky. Bayesian model averaging: A tutorial.Statistical Science, 14(4):382–401, 1999
work page 1999
-
[26]
Jennifer Hill. Bayesian nonparametric modeling for causal inference.Journal of Computational and Graphical Statistics, 20:217–240, 03 2011. doi: 10.1198/jcgs.2010.08162
-
[27]
Maathuis, Markus Kalisch, and Peter Bühlmann
Marloes H. Maathuis, Markus Kalisch, and Peter Bühlmann. Estimating high-dimensional intervention effects from observational data.The Annals of Statistics, 37(6A):3133–3164, 2009
work page 2009
-
[28]
On the application of probability theory to agricultural experiments
Jerzy Neyman. On the application of probability theory to agricultural experiments. essay on principles. section 9.Statistical Science, 5(4):465–472, 1990
work page 1990
-
[29]
Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies.Journal of Educational Psychology, 66(5):688, 1974
work page 1974
-
[30]
Donald B Rubin. Randomization analysis of experimental data: The fisher randomization test comment.Journal of the American Statistical Association, 75(371):591–593, 1980
work page 1980
-
[31]
Being bayesian about network structure
Nir Friedman and Daphne Koller. Being bayesian about network structure. a bayesian approach to structure discovery in bayesian networks.Machine learning, 50:95–125, 2003
work page 2003
-
[32]
Ronald A Fisher. On the" probable error" of a coefficient of correlation deduced from a small sample.Metron, 1:3–32, 1921
work page 1921
-
[33]
James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed.Journal of the American statistical Association, 89(427):846–866, 1994
work page 1994
-
[34]
Karen Sachs, Omar Perez, Dana Pe’er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data.Science, 308(5721): 523–529, 2005. doi: 10.1126/science.1105809. URL https://www.science.org/doi/ abs/10.1126/science.1105809
-
[35]
Atlantic Causal Inference Conference (ACIC) Data Analysis Challenge 2017
P. Richard Hahn, Vincent Dorie, and Jared S. Murray. Atlantic causal inference conference (acic) data analysis challenge 2017, 2019. URLhttps://arxiv.org/abs/1905.09515
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [36]
-
[37]
Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models.Advances in neural information processing systems, 30, 2017. 11
work page 2017
- [38]
-
[39]
Edward H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects,
- [40]
- [41]
-
[42]
Forte: Finding outliers with representation typicality estimation
Debargha Ganguly, Warren Morningstar, Andrew Yu, and Vipin Chaudhary. Forte: Finding outliers with representation typicality estimation. In13th International Conference on Learning Representations (ICLR), Singapore, April 2025
work page 2025
- [43]
-
[44]
$K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning
Weicong Chen, Vikash Singh, Z. Rahmani, Debargha Ganguly, M. Hariri, and Vipin Chaud- hary. K4: Online log anomaly detection via unsupervised typicality learning.arXiv preprint arXiv:2507.20051, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [45]
-
[46]
W. Yang, Debargha Ganguly, X. Li, C. Song, S. Wang, Vikash Singh, Vipin Chaudhary, and X. Han. Mid-Think: Training-free intermediate-budget reasoning via token-level triggers.arXiv preprint arXiv:2601.07036, 2026
-
[47]
Labeling copilot: A deep research agent for automated data curation in computer vision,
Debargha Ganguly, Sumit Kumar, Ishwar Balappanawar, Weicong Chen, Shashank Kambhatla, Srinivasan Iyengar, Shivkumar Kalyanaraman, Ponnurangam Kumaraguru, and Vipin Chaud- hary. Labeling copilot: A deep research agent for automated data curation in computer vision,
- [48]
-
[49]
B. Zhang, M. Zheng, Debargha Ganguly, X. Zhang, Vikash Singh, Vipin Chaudhary, and Z. Zhang. Efficient fine-grained GPU performance modeling for distributed deep learning of LLM.arXiv preprint arXiv:2509.22832, 2025. A Assumptions We collect the regularity conditions required for the theoretical guarantees. Assumption 1 is the only condition needed for th...
-
[50]
BIC scores and BMA weights(Equations 3–4): Gaussian likelihoods evaluated on Dtrain, followed by duplicate adjustment-strategy collapse before normalization
-
[51]
LLM is pattern-matching variable names
Nuisance models: propensity score ˆe(k), outcome models ˆµ(k) t , and heuristic bounds [ˆq(k) low,ˆq(k) high]all fitted onD train. Stage 4 (conformal calibration) uses Dcal only for evaluating out-of-sample pseudo-outcomes, computing nonconformity scores, and determining ˆQ. This separation ensures that, conditioned on Dtrain, the composite scores are sym...
-
[52]
Initialise a NumPyRandomStatewith the run seed
-
[53]
Draw a uniform random permutationπof all variables (covariates,T,Y)
-
[54]
This guarantees T precedes Y in topological order so that the protected edgeT→Yis order-consistent
If π(T)> π(Y) , swap the positions of T and Y in π. This guarantees T precedes Y in topological order so that the protected edgeT→Yis order-consistent
-
[55]
For each ordered pair (u, v) with π(u)< π(v) , include the edge u→v with probability pu→v via an independent Bernoulli draw
-
[56]
Force-add the edgeT→Yto the resulting graph
-
[57]
Reject duplicates by hashing frozenset(G.edges()); up to 100K resamples are at- tempted before stopping. The firstKunique non-empty DAGs are kept. This procedure cannot produce cycles because every sampled edge respects π, and T→Y is consistent with the swap in step 3. CI-pruning rule (Fisher partial correlation).Given a sampled DAG G and Xtr (the train-o...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.