arxiv: 2605.12924 · v1 · pith:E7FQG7PCnew · submitted 2026-05-13 · 💻 cs.LG

IV-ICL: Bounding Causal Effects with Instrumental Variables via In-Context Learning

Pith reviewed 2026-05-14 19:45 UTC · model grok-4.3

classification 💻 cs.LG
keywords instrumental variables · causal bounds · in-context learning · partial identification · amortized inference · inclusive KL divergence · Bayesian posterior

The pith

An amortized in-context learner recovers the full identified set of causal effects from instrumental variable data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops IV-ICL to bound causal effects in instrumental variable settings where unobserved confounding makes point identification impossible. It trains an in-context model to learn the marginal posterior over the causal effect directly, using an inclusive KL objective that spreads mass over the whole identified set rather than concentrating on one mode. Bounds are then read off as quantiles of this posterior, avoiding the need for bespoke closed-form estimators. On synthetic and semi-synthetic benchmarks, the resulting intervals are reported to be more reliably valid and more informative than those of semi-parametric, Bayesian, and plug-in baselines, at 20-500x lower inference time.

Core claim

By training an in-context learner to minimize the expected inclusive KL divergence on instrumental-variable data, IV-ICL obtains the marginal posterior distribution of the causal effect; its quantiles then supply bounds that empirically cover the full identified set for a range of data-generating processes.
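
To make "quantiles supply bounds" concrete, here is a minimal sketch, assuming the learned marginal posterior is available as a set of draws and that the bounds are equal-tailed quantiles at level alpha. The function name `quantile_bounds` and the uniform stand-in for posterior draws are illustrative assumptions, not the authors' code.

```python
import numpy as np

def quantile_bounds(posterior_samples, alpha=0.1):
    """Return the (alpha/2, 1 - alpha/2) quantiles of posterior draws as an interval."""
    samples = np.asarray(posterior_samples, dtype=float)
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Hypothetical stand-in for draws from the learned marginal posterior: under
# partial identification the posterior can stay spread over a range of effects.
rng = np.random.default_rng(0)
samples = rng.uniform(-0.2, 0.5, size=20_000)

lower, upper = quantile_bounds(samples, alpha=0.1)
print(f"90% interval: [{lower:.3f}, {upper:.3f}]")
```

A smaller alpha widens the interval, which is the knob the calibration curves in Figure 4 vary.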

What carries the argument

An amortized in-context learner trained to minimize the expected inclusive KL divergence, so that it outputs the marginal posterior over the causal effect.

Load-bearing premise

That the inclusive-KL objective will recover the full identified set for arbitrary data-generating processes rather than only the synthetic ones seen in training.

What would settle it

A counterexample data-generating process where the quantiles of the learned posterior do not contain all values in the true identified set for the causal effect.
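
The falsification test above can be sketched as a simple containment check, assuming that "valid" means the reported interval contains the entire true identified set [l, u]. The helper names and the three toy intervals are made up for illustration and are not results from the paper.

```python
import numpy as np

def covers_identified_set(interval, identified_set):
    """True if the reported interval contains the whole identified set."""
    lo, hi = interval
    l, u = identified_set
    return lo <= l and u <= hi

def empirical_coverage(intervals, identified_sets):
    """Fraction of datasets whose interval contains its identified set."""
    hits = [covers_identified_set(i, s) for i, s in zip(intervals, identified_sets)]
    return float(np.mean(hits))

# Made-up numbers: the second interval misses the lower end of its identified set.
intervals = [(-0.3, 0.6), (-0.1, 0.4), (0.0, 0.2)]
identified_sets = [(-0.2, 0.5), (-0.2, 0.3), (0.05, 0.15)]

coverage = empirical_coverage(intervals, identified_sets)
print(f"empirical coverage: {coverage:.2f}")
```

A single data-generating process on which this check fails systematically would be the counterexample described above.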

Figures

Figures reproduced from arXiv: 2605.12924 by Hamidreza Kamkari, Medha Barath, Rahul G. Krishnan, Ricardo Silva, Vahid Balazadeh.

Figure 1: IV-ICL pipeline. During pre-training, we generate a diverse library of synthetic datasets with IV structure and known ground-truth causal effects, and train a transformer to map each dataset to the marginal posterior of its causal effect. At inference, the trained model takes any new IV dataset as context and returns the causal effect marginal posterior in a single forward pass; bounds are read off as marg…
Figure 2: The effect of post-processing on SATE distribution. d′ refers to the covariate dimension.
… random function fψ : R^d′ → R, using synthetic function generators similar to Balazadeh et al. [7]. The instrument propensity is then defined as Pψ(Z = 1 | X) = sigmoid(fψ(X)). 3. Potential outcome and treatment generation Pψ …
Figure 3: Effect of instrument strength on interval width (Jobs benchmark). Width decreases as ρ(Z, T) increases; validity remains perfect.
Figure 4: Calibration curves on three synthetic IV DGPs. Empirical coverage versus nominal credibility level 1 − α over K = 100 datasets per DGP at n = 1024, d = 5. Bands show ±1 standard error. IV-ICL over-covers in the regime that matters for inference and approaches the diagonal smoothly at very narrow nominal levels, the expected signature of an inclusive KL-trained marginal posterior on partial identification.…
Figure 5: Sensitivity to sample size and dimensionality at α = 0.1 (90% credible intervals). Top row: empirical coverage; bottom row: mean normalized interval width. Bands show ±1 standard error over K = 30 seeds per cell. Coverage saturates at 1.00 across the entire grid; widths shrink monotonically with n and grow with d, as expected.
Original abstract

The instrumental-variables (IV) setting is standard for partial identification of causal effects when unobserved confounding makes point identification impossible. Existing approaches face methodological bottlenecks: closed-form bound estimands are required -- e.g., Balke-Pearl equations in binary IV -- and even when available, designing accurate estimators requires manual effort tailored to each estimand. While direct Bayesian inference of the causal effects, instead of the bounds, circumvents these challenges, it is often computationally intensive and suffers from high prior sensitivity or under-dispersed posteriors. As a remedy, we introduce IV-ICL, an amortized Bayesian in-context learning method that learns the marginal posterior distribution of the causal effects directly and derives bounds as its quantiles. Unlike standard variational inference that optimizes exclusive KL divergence, amortized Bayesian inference minimizes the expected inclusive KL, a mass-covering objective. We empirically observe that optimizing inclusive KL can recover the entire identified set across diverse data-generating processes, while exclusive-KL (e.g. with variational inference) of the same Bayesian formulation collapses onto a single mode and fails to cover the identified set. We evaluate IV-ICL on synthetic and semi-synthetic IV benchmarks and show it produces intervals that are more reliably valid and more informative compared to efficient semi-parametric, Bayesian, and plug-in baselines, at 20-500x lower inference time. Beyond methodology, we propose a procedure to convert randomized controlled trials into IV benchmarks with provably preserved ground-truth causal effects that enables a more realistic evaluation of partial-identification methods.
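
The contrast between the two KL directions in the abstract can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: the "posterior" is a hand-picked bimodal mixture, the approximating family is a single Gaussian, the inclusive-KL fit uses the standard moment-matching solution for Gaussians, and the exclusive-KL fit is found by brute-force grid search.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

# Toy bimodal target p: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
log_p = np.logaddexp(log_normal_pdf(x, -2.0, 0.5),
                     log_normal_pdf(x, 2.0, 0.5)) + np.log(0.5)
p = np.exp(log_p)

# Inclusive KL(p || q): for a Gaussian q the minimizer matches the mean and
# variance of p, so q spreads over both modes (mass-covering).
mu_incl = np.sum(x * p) * dx
sigma_incl = np.sqrt(np.sum((x - mu_incl) ** 2 * p) * dx)

def exclusive_kl(mu, sigma):
    """KL(q || p) for q = N(mu, sigma^2), by numerical integration on the grid."""
    log_q = log_normal_pdf(x, mu, sigma)
    q = np.exp(log_q)
    return np.sum(q * (log_q - log_p)) * dx

# Exclusive KL(q || p): grid search over (mu, sigma); the minimizer sits on one mode.
grid = [(m, s) for m in np.linspace(-3.0, 3.0, 61) for s in np.linspace(0.2, 3.0, 57)]
mu_excl, sigma_excl = min(grid, key=lambda ms: exclusive_kl(*ms))

print(f"inclusive fit: mu={mu_incl:.2f}, sigma={sigma_incl:.2f}")
print(f"exclusive fit: mu={mu_excl:.2f}, sigma={sigma_excl:.2f}")
```

In this toy setting the central 90% interval of the inclusive fit spans both modes, while that of the exclusive fit spans only one, which mirrors the collapse the abstract attributes to exclusive-KL variational inference.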

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IV-ICL, an amortized Bayesian in-context learning method for partial identification of causal effects in instrumental variable (IV) settings. By directly learning the marginal posterior distribution of the causal effects through minimization of the inclusive KL divergence and extracting bounds as posterior quantiles, the approach aims to circumvent the need for closed-form bound estimands and manual estimator design. The key empirical observation is that inclusive KL optimization recovers the full identified set across diverse data-generating processes, in contrast to exclusive KL, which collapses to a single mode. The paper evaluates this on synthetic and semi-synthetic benchmarks, reporting more valid and informative intervals than baselines at significantly lower inference time, and proposes a method to convert RCTs into IV benchmarks that preserves ground-truth effects.

Significance. If the empirical observation that inclusive KL recovers the identified set generalizes, the method offers a scalable alternative to existing IV bounding techniques that avoids closed-form requirements and reduces computational burden while improving coverage over under-dispersed Bayesian or plug-in estimators. The RCT-to-IV benchmark conversion procedure is a concrete methodological contribution that could improve evaluation standards in partial identification research.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the central claim that 'optimizing inclusive KL can recover the entire identified set across diverse data-generating processes' is supported solely by empirical behavior on the synthetic families used to train the amortizer. No derivation is given showing that the mass-covering property of the inclusive KL objective on the marginal posterior necessarily yields the full identified set for arbitrary IV models (e.g., continuous instruments, non-linear response surfaces, or DGPs outside the training support). Without this, the quantile-derived bounds lack a general validity guarantee.
  2. [§4] §4 (experiments): the reported favorable results on synthetic and semi-synthetic benchmarks use data-generating processes drawn from the same distributional families employed during amortizer training. This leaves open whether the learned posterior covers the identified set on out-of-distribution DGPs; additional stress tests on held-out model classes would be required to substantiate the generalization claim.
minor comments (2)
  1. [Abstract] The abstract states '20-500x lower inference time' without specifying the exact baseline methods, hardware, or batch sizes used for the timing comparison; this detail should be added for reproducibility.
  2. [§3] Notation for the in-context learning model hyperparameters and the precise form of the inclusive KL objective should be introduced earlier and used consistently throughout the method section.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We agree that the central claims rest on empirical observations rather than general theory and that the experiments would benefit from explicit out-of-distribution tests. We outline targeted revisions below.

Point-by-point responses
  1. Referee: [Abstract and §3] the central claim that 'optimizing inclusive KL can recover the entire identified set across diverse data-generating processes' is supported solely by empirical behavior on the synthetic families used to train the amortizer. No derivation is given showing that the mass-covering property of the inclusive KL objective on the marginal posterior necessarily yields the full identified set for arbitrary IV models.

    Authors: We agree the claim is empirical. The manuscript presents the recovery of the identified set as an observed property of inclusive-KL optimization on the families studied, in contrast to exclusive KL. We will revise the abstract and §3 to state explicitly that this is an empirical finding without a general validity guarantee for arbitrary IV models (e.g., continuous instruments or DGPs outside the training support). A brief limitations paragraph will be added noting the absence of a theoretical derivation and the need for future work on conditions under which inclusive KL covers the identified set. revision: yes

  2. Referee: [§4] the reported favorable results on synthetic and semi-synthetic benchmarks use data-generating processes drawn from the same distributional families employed during amortizer training. This leaves open whether the learned posterior covers the identified set on out-of-distribution DGPs.

    Authors: We acknowledge the overlap between training and test DGPs. While the semi-synthetic benchmarks introduce realistic variation, we will add new experiments using held-out model classes (different response surfaces, continuous instruments, and functional forms not seen in training) to test whether the learned posterior continues to cover the identified set on OOD DGPs. These results will be reported with the same coverage and interval-width metrics. revision: yes

standing simulated objections not resolved
  • No general theoretical derivation is available showing that inclusive KL necessarily recovers the full identified set for arbitrary IV models.

Circularity Check

0 steps flagged

No circularity; central claim is empirical observation on synthetic DGPs, not reduction by construction

full rationale

The paper defines IV-ICL as an amortized Bayesian method that directly learns the marginal posterior of causal effects via inclusive KL minimization and takes quantiles as bounds. The key statement is an empirical observation ('We empirically observe that optimizing inclusive KL can recover the entire identified set across diverse data-generating processes') rather than a derivation that equates the output bounds to fitted inputs or self-cited uniqueness theorems. No equations, self-citations, or ansatzes are shown that would force the identified-set coverage by definition. Training on synthetics followed by application to new data is standard amortized inference and does not create circularity. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes that the posterior quantiles under the learned model coincide with the sharp identified set for causal effects and that the training distribution of simulated IV problems is representative enough for generalization.

free parameters (1)
  • in-context learning model hyperparameters
    The neural network or transformer used for amortization has trainable parameters whose values are fitted during the inclusive-KL training phase.
axioms (1)
  • domain assumption: The data-generating processes used for training cover the relevant class of IV problems so that the learned posterior generalizes to real data.
    Invoked when claiming recovery of the identified set across diverse processes.

pith-pipeline@v0.9.0 · 5591 in / 1360 out tokens · 58743 ms · 2026-05-14T19:45:13.789703+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 1 internal anchor

  1. Atila Abdulkadiroğlu, Joshua D Angrist, Susan M Dynarski, Thomas J Kane, and Parag A Pathak. Accountability and flexibility in public schools: Evidence from Boston’s charters and pilots. The Quarterly Journal of Economics, 126(2):699–748, 2011.

  2. Joshua D Angrist and Guido W. Imbens. Identification and estimation of local average treatment effects. Econometrica, 62:467–475, 1994.

  3. Joshua D Angrist and Jörn-Steffen Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton University Press, 2009.

  4. Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.

  5. Stuart G Baker and Karen S Lindeman. The paired availability design: a proposal for evaluating epidural analgesia during labor. Statistics in Medicine, 13(21):2269–2278, 1994.

  6. Vahid Balazadeh, Vasilis Syrgkanis, and Rahul G Krishnan. Partial identification of treatment effects with implicit generative models. Advances in Neural Information Processing Systems, 35:22816–22829, 2022.

  7. Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, and Rahul G. Krishnan. CausalPFN: Amortized causal effect estimation via in-context learning. In Advances in Neural Information Processing Systems, volume 38, 2025.

  8. Alexander Balke and Judea Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty in Artificial Intelligence, pages 46–54. Elsevier, 1994.

  9. Alexander Balke and Judea Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176, 1997.

  10. Lucius EJ Bynum, Aahlad Manas Puli, Diego Herrero-Quevedo, Nhi Nguyen, Carlos Fernandez-Granda, Kyunghyun Cho, and Rajesh Ranganath. Black box causal inference: Effect estimation via meta prediction. arXiv:2503.05985, 2025.

  11. Zhihong Cai, Manabu Kuroki, and Tosiya Sato. Non-parametric bounds on treatment effects with non-compliance by covariate adjustment. Statistics in Medicine, 26(16):3188–3204, 2007.

  12. David Maxwell Chickering and Judea Pearl. A clinician’s tool for analyzing non-compliance. In Proceedings of the National Conference on Artificial Intelligence, pages 1269–1276, 1996.

  13. Carlos Cinelli, Avi Feller, Guido Imbens, Edward Kennedy, Sara Magliacane, and Jose Zubizarreta. Challenges in statistics: A dozen challenges in causality and causal inference. arXiv preprint arXiv:2508.17099, 2025.

  14. A Philip Dawid. Causal inference using influence diagrams: the problem of partial compliance. Oxford Statistical Science Series, pages 45–65, 2003.

  15. Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and Ilya Shpitser. An automated approach to causal inference in discrete settings. Journal of the American Statistical Association, 119(547):1778–1793, 2024.

  16. Constantine E Frangakis and Donald B Rubin. Principal stratification in causal inference. Biometrics, 58(1):21–29, 2002.

  17. Alexander M Franks, Alexander D’Amour, and Avi Feller. Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association, 2020.

  18. Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.

  19. Sander Greenland. Relaxation penalties and priors for plausible modeling of nonidentified bias sources. Statistical Science, 24:195–210, 2009.

  20. Paul Gustafson. On Model Expansion, Model Contraction, Identifiability and Prior Information: Two Illustrative Scenarios Involving Mismeasured Variables. Statistical Science, 20(2):111–140, 2005.

  21. Paul Gustafson. Bayesian inference for partially identified models. The International Journal of Biostatistics, 6(2), 2010.

  22. Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 6-11 August 2017, pages 1–9, 2017.

  23. Keisuke Hirano, Guido W Imbens, Donald B Rubin, and Xiao-Hua Zhou. Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88, 2000.

  24. Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In The Eleventh International Conference on Learning Representations, 2023.

  25. Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326, 2025.

  26. Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

  27. Andrew Jesson, Sören Mindermann, Uri Shalit, and Yarin Gal. Identifying causal-effect inference failure with uncertainty-aware models. Advances in Neural Information Processing Systems, 33:11637–11649, 2020.

  28. Qu Jingang, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. In Forty-second International Conference on Machine Learning, 2025.

  29. Garrett A Johnson, Randall A Lewis, and Elmar I Nubbemeyer. Ghost ads: Improving the economics of measuring online ad effectiveness. Journal of Marketing Research, 54(6):867–884, 2017.

  30. Niki Kilbertus, Matt J Kusner, and Ricardo Silva. A class of algorithms for general instrumental variable models. Advances in Neural Information Processing Systems, 33:20108–20119, 2020.

  31. Christian Kleiber and Achim Zeileis. Applied Econometrics with R. Springer-Verlag, New York,

  32. doi: 10.1007/978-0-387-77318-6

  33. Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.

  34. Alexander W. Levis, Matteo Bonvini, Zhenghao Zeng, Luke Keele, and Edward H. Kennedy. Covariate-assisted bounds on causal effects with instrumental variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2025.

  35. Dustin M Long and Michael G Hudgens. Sharpening bounds on principal effects with covariates. Biometrics, 69(4):812–819, 2013.

  36. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.

  37. Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L Caterini, and Maksims Volkovs. TabDPT: Scaling tabular foundation models on real data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  38. Yuchen Ma, Dennis Frauen, Emil Javurek, and Stefan Feuerriegel. Foundation models for causal inference via prior-data fitted networks. In The Fourteenth International Conference on Learning Representations, 2026.

  39. Divyat Mahajan, Jannes Gladrow, Agrin Hilmkil, Cheng Zhang, and Meyer Scetbon. Amortized inference of causal models via conditional fixed-point iterations. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. J2C Certification.

  40. Charles F Manski. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319–323, 1990.

  41. Charles F Manski. Partial identification of probability distributions. Springer, 2003.

  42. Valentyn Melnychuk, Dennis Frauen, Maresa Schröder, and Stefan Feuerriegel. Frequentist consistency of prior-data fitted networks for causal inference. arXiv preprint arXiv:2603.12037, 2026.

  43. Frederick Mosteller. The Tennessee study of class size in the early school grades. The Future of Children, pages 113–127, 1995.

  44. Francisco Mourao, David Hajage, Daria Bystrova, Bertrand Bouvarel, Nathanaël Lapidus, Fabrice Carrat, and Benjamin Glemain. Prior-data fitted networks for causal inference: a simulation study with real-world scenarios. arXiv preprint arXiv:2603.15928, 2026.

  45. Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers Can Do Bayesian Inference. In International Conference on Learning Representations, 2022.

  46. Thomas Nagler. Statistical foundations of prior-data fitted networks. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 25660–25676, 2023.

  47. Kirtan Padh, Jakob Zeitler, David Watson, Matt Kusner, Ricardo Silva, and Niki Kilbertus. Stochastic causal programming for bounding treatment effects. In Conference on Causal Learning and Reasoning, pages 142–176. PMLR, 2023.

  48. Judea Pearl. Causality. Cambridge University Press, 2009.

  49. Olav Reiersøl. Confluence analysis by means of instrumental sets of variables. PhD thesis, Almqvist & Wiksell, 1945.

  50. Amy Richardson, Michael G Hudgens, Peter B Gilbert, and Jason P Fine. Nonparametric bounds and sensitivity analysis of treatment effects. Statistical science: a review journal of the Institute of Mathematical Statistics, 29(4):596, 2015.

  51. Thomas S Richardson, Robin J Evans, and James M Robins. Transparent parameterizations of models for potential outcomes. Bayesian Statistics, 9:569–610, 2011.

  52. Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, and Bernhard Schölkopf. Do-PFN: In-context learning for causal effect estimation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  53. Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

  54. Donald B Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, pages 34–58, 1978.

  55. Donald B Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.

  56. Michael C Sachs, Gustav Jonzon, Arvid Sjölander, and Erin E Gabriel. A general method for deriving tight symbolic bounds on causal effects. Journal of Computational and Graphical Statistics, 32(2):567–576, 2023.

  57. Jonas Schweisthal, Dennis Frauen, Maresa Schröder, Konstantin Hess, Niki Kilbertus, and Stefan Feuerriegel. Learning representations of instruments for partial identification of treatment effects. In ICLR 2025 Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2025.

  58. Ricardo Silva and Robin Evans. Causal inference through a witness protection program. Journal of Machine Learning Research, 17(56):1–53, 2016.

  59. Sonja A Swanson, Miguel A Hernán, Matthew Miller, James M Robins, and Thomas S Richardson. Partial identification of the average treatment effect using instrumental variables: review of methods for binary instruments, treatments, and outcomes. Journal of the American Statistical Association, 113(522):933–947, 2018.

  60. Jin Tian and Judea Pearl. Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1):287–313, 2000.

  61. Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574. PMLR, 2009.

  62. Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Xiaomin Lu. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine, 27(23):4658–4677, 2008.

  63. Tyler J. VanderWeele and Ilya Shpitser. On the definition of a confounder. Annals of Statistics, 41(1):196–220, 2013.

  64. Justin Whitehouse, Morgane Austern, and Vasilis Syrgkanis. Inference on optimal policy values and other irregular functionals via smoothing. arXiv preprint arXiv:2507.11780, 2025.

  65. Philip Green Wright. The tariff on animal and vegetable oils. Macmillan, 1928.

  66. Kevin Xia, Yushu Pan, and Elias Bareinboim. Neural causal models for counterfactual identification and estimation. arXiv preprint arXiv:2210.00035, 2022.

  67. Jiaqi Zhang, Joel Jennings, Agrin Hilmkil, Nick Pawlowski, Cheng Zhang, and Chao Ma. Towards causal foundation model: on duality between causal inference and attention. arXiv preprint arXiv:2310.00809, 2023.

  68. Junzhe Zhang and Elias Bareinboim. Bounding causal effects on continuous outcome. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12207–12215, 2021.

  69. Junzhe Zhang and Elias Bareinboim. Non-parametric methods for partial identification of causal effects. Columbia CausalAI Laboratory Technical Report, 2021.

    Appendix Contents: A Balke-Pearl Equations · B Inclusive KL Equivalence · C Training Details and Inference · D Proof of Proposition 1 · E Details of the Benchmark · E.1 Details of the Synthetic ...

  70. Covariates: Sample X ∈ R^{n×d} with d ∼ Unif{5,6,7,8,9,10}, where each entry is drawn from either N(5,1) or Unif(−10,5), chosen randomly

  71. Instrument generation: Generate weights w_Z ∈ R^d from either N(1,2) or Unif(−2,2). Compute logits as ℓ_Z = X w_Z + ε_Z, where ε_Z is noise from either N(0,1) or Laplace(0,1). Standardize: ℓ̃_Z = (ℓ_Z − ℓ̄_Z)/std(ℓ_Z). Sample Z ∼ Bernoulli(σ(ℓ̃_Z)).

  72. Potential treatment/outcome: Generate weights W ∈ R^{d×16} and compute logits L = XW + E where E ∈ R^{n×16} is noise. Apply row-wise softmax to obtain strata probabilities P ∈ R^{n×16}, where each row sums to 1. 4. Treatment and outcome strata: The 16 columns correspond to combinations of: • Treatment strata: Always-Takers (AT), Never-Takers (NT), Defiers (DE), Compliers (C...

  73. Observable generation: For each unit i, sample the stratum from the categorical distribution defined by P_i, then determine (T_i, Y_i) based on Z_i and the sampled stratum

  74. Ground-truth bounds: Compute the observational probabilities p_{yt.z}(x_i) analytically from the strata probabilities, then apply the Balke-Pearl equations to obtain ℓ(x_i) and u(x_i). E.2 Details of the Jobs Benchmark: The original National Supported Work (NSW) Demonstration is an RCT evaluating job training effects on earnings. It includes the following covariat...

  75. The treatment is a binary indicator of assignment to job training program. Finally, the outcome variable is the amount of earnings in 1978 (re78). We apply log transforms to the outputs to get less skewed outcome distribution: re74 ← log(re74 + 1), re75 ← log(re75 + 1), Y ← log(re78 + 1). Covariate Split. We split the features into observed covariates (O), which...
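
The covariate and instrument steps of the synthetic benchmark (entries 70 and 71 above) can be sketched as follows. This is a reading of the extracted text under stated assumptions, not the authors' generator: the choice between distributions is assumed to be made once per dataset with equal probability, and N(1,2) is assumed to mean mean 1 and variance 2. The function name is hypothetical.

```python
import numpy as np

def sample_covariates_and_instrument(n, rng):
    """Sketch of the covariate and instrument steps of the synthetic benchmark."""
    # Covariates: d ~ Unif{5,...,10}; entries from N(5,1) or Unif(-10,5).
    # Assumption: one distribution is picked per dataset with probability 1/2.
    d = int(rng.integers(5, 11))
    if rng.random() < 0.5:
        X = rng.normal(5.0, 1.0, size=(n, d))
    else:
        X = rng.uniform(-10.0, 5.0, size=(n, d))

    # Instrument weights from N(1,2) or Unif(-2,2).
    # Assumption: the second argument of N(1,2) is a variance.
    if rng.random() < 0.5:
        w_z = rng.normal(1.0, np.sqrt(2.0), size=d)
    else:
        w_z = rng.uniform(-2.0, 2.0, size=d)

    # Logit noise from N(0,1) or Laplace(0,1).
    if rng.random() < 0.5:
        eps = rng.normal(0.0, 1.0, size=n)
    else:
        eps = rng.laplace(0.0, 1.0, size=n)

    # Standardize the logits, then sample Z ~ Bernoulli(sigmoid(standardized logits)).
    logits = X @ w_z + eps
    logits_std = (logits - logits.mean()) / logits.std()
    propensity = 1.0 / (1.0 + np.exp(-logits_std))
    Z = rng.binomial(1, propensity)
    return X, Z, propensity

X, Z, propensity = sample_covariates_and_instrument(n=1024, rng=np.random.default_rng(0))
print(X.shape, Z.mean())
```

Standardizing the logits before the sigmoid keeps the instrument propensity away from 0 and 1 regardless of the scale of X and w_Z, which is presumably why the step is there.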