IV-ICL: Bounding Causal Effects with Instrumental Variables via In-Context Learning

Hamidreza Kamkari; Medha Barath; Rahul G. Krishnan; Ricardo Silva; Vahid Balazadeh

arxiv: 2605.12924 · v1 · pith:E7FQG7PCnew · submitted 2026-05-13 · 💻 cs.LG

Vahid Balazadeh, Hamidreza Kamkari, Medha Barath, Ricardo Silva, Rahul G. Krishnan

Pith reviewed 2026-05-14 19:45 UTC · model grok-4.3

classification 💻 cs.LG

keywords instrumental variables · causal bounds · in-context learning · partial identification · amortized inference · inclusive KL divergence · Bayesian posterior

The pith

An amortized in-context learner recovers the full identified set of causal effects from instrumental variable data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops IV-ICL to bound causal effects in instrumental variable settings where point identification is impossible due to confounding. It trains an in-context model to learn the marginal posterior over the causal effect directly, using an inclusive KL objective that covers the whole identified set rather than concentrating on one mode. Bounds are then read off as quantiles of this posterior, avoiding the need for bespoke closed-form estimators. On benchmarks the resulting intervals prove more reliable and informative than those from existing methods while running orders of magnitude faster at test time.
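As a minimal sketch of the "bounds as quantiles" step, the snippet below turns posterior samples of a causal effect into an interval. The function name `quantile_bounds`, the equal-tailed choice of quantiles, and the uniform stand-in samples are assumptions made for illustration; the page does not say which quantiles the paper uses or how samples are drawn from the model.

```python
import numpy as np

def quantile_bounds(posterior_samples, alpha=0.1):
    """Return (lower, upper) as the alpha/2 and 1 - alpha/2 sample quantiles.

    Equal-tailed quantiles are an assumption of this sketch.
    """
    samples = np.asarray(posterior_samples, dtype=float)
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Hypothetical stand-in for model output: samples spread over an
# identified set [-0.2, 0.5] rather than piled on a single value.
rng = np.random.default_rng(0)
samples = rng.uniform(-0.2, 0.5, size=10_000)

lower, upper = quantile_bounds(samples, alpha=0.1)
print(f"90% interval: [{lower:.3f}, {upper:.3f}]")
```

With a posterior that spreads mass over the whole identified set, the interval endpoints sit close to the edges of that set; a posterior collapsed onto one mode would give a much narrower interval.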

Core claim

By training an in-context learner to minimize the expected inclusive KL divergence on instrumental-variable data, IV-ICL obtains the marginal posterior distribution of the causal effect; its quantiles then supply bounds that empirically cover the full identified set for a range of data-generating processes.
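For reference, the two divergences being contrasted can be written as below. The notation is assumed for illustration and is not taken from the paper: $\tau$ is the causal effect, $D$ an IV dataset, $p(\tau, D)$ the joint distribution over synthetic effects and datasets, and $q_\theta$ the learned approximation.

```latex
\begin{align*}
\text{inclusive (amortized):}\quad
  & \mathbb{E}_{D \sim p(D)}\,
    \mathrm{KL}\big(p(\tau \mid D)\,\|\,q_\theta(\tau \mid D)\big)
    = -\,\mathbb{E}_{(\tau, D) \sim p(\tau, D)}
      \big[\log q_\theta(\tau \mid D)\big] + \text{const},\\
\text{exclusive (variational):}\quad
  & \mathrm{KL}\big(q_\theta(\tau)\,\|\,p(\tau \mid D)\big)
    = \mathbb{E}_{\tau \sim q_\theta}
      \big[\log q_\theta(\tau) - \log p(\tau \mid D)\big].
\end{align*}
```

The constant in the first line does not depend on $\theta$, so the inclusive objective reduces to a negative log-likelihood on simulated $(\tau, D)$ pairs. The expectation there is taken under $p$, so $q_\theta$ is penalized wherever it assigns low density to effects that $p$ supports, which is the mass-covering behavior the claim relies on.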

What carries the argument

Amortized in-context learner minimizing inclusive KL to output the posterior over causal effects.

Load-bearing premise

That the inclusive-KL objective will recover the full identified set for arbitrary data-generating processes rather than only the synthetic ones seen in training.

What would settle it

A counterexample data-generating process where the quantiles of the learned posterior do not contain all values in the true identified set for the causal effect.

Figures

Figures reproduced from arXiv: 2605.12924 by Hamidreza Kamkari, Medha Barath, Rahul G. Krishnan, Ricardo Silva, Vahid Balazadeh.

**Figure 1.** IV-ICL pipeline. During pre-training, we generate a diverse library of synthetic datasets with IV structure and known ground-truth causal effects, and train a transformer to map each dataset to the marginal posterior of its causal effect. At inference, the trained model takes any new IV dataset as context and returns the causal effect marginal posterior in a single forward pass; bounds are read off as marg…

**Figure 2.** The effect of post-processing on SATE distribution. d′ refers to the covariate dimension.

Body text captured alongside this caption: "… random function f_ψ : ℝ^{d′} → ℝ, using synthetic function generators similar to Balazadeh et al. [7]. The instrument propensity is then defined as P_ψ(Z = 1 | X) = sigmoid(f_ψ(X)). 3. Potential outcome and treatment generation P_ψ …"

**Figure 3.** Effect of instrument strength on interval width (Jobs benchmark). Width decreases as ρ(Z, T) increases; validity remains perfect.

**Figure 4.** Calibration curves on three synthetic IV DGPs. Empirical coverage versus nominal credibility level 1 − α over K = 100 datasets per DGP at n = 1024, d = 5. Bands show ±1 standard error. IV-ICL over-covers in the regime that matters for inference and approaches the diagonal smoothly at very narrow nominal levels, the expected signature of an inclusive KL-trained marginal posterior on partial identification.…

**Figure 5.** Sensitivity to sample size and dimensionality at α = 0.1 (90% credible intervals). Top row: empirical coverage; bottom row: mean normalized interval width. Bands show ±1 standard error over K = 30 seeds per cell. Coverage saturates at 1.00 across the entire grid; widths shrink monotonically with n and grow with d, as expected.
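The coverage and width summaries reported in Figures 4 and 5 can be sketched as follows. This is an illustrative computation under stated assumptions: coverage is taken to mean "the interval contains the ground-truth effect", the standard error of coverage is the binomial one, and widths are left unnormalized because the paper's normalization is not given on this page. The function name `coverage_and_width` and the toy numbers are hypothetical.

```python
import numpy as np

def coverage_and_width(lower, upper, truth):
    """Summarize K intervals: empirical coverage, mean width, and standard errors."""
    lower, upper, truth = (np.asarray(a, dtype=float) for a in (lower, upper, truth))
    covered = (lower <= truth) & (truth <= upper)
    k = covered.size
    coverage = float(covered.mean())
    width = upper - lower
    return {
        "coverage": coverage,
        # Binomial standard error of a proportion over K datasets (assumed).
        "coverage_se": float(np.sqrt(coverage * (1.0 - coverage) / k)),
        "mean_width": float(width.mean()),
        "width_se": float(width.std(ddof=1) / np.sqrt(k)),
    }

# Toy example with K = 4 datasets (made-up numbers, not results from the paper).
stats = coverage_and_width(
    lower=[-0.3, -0.1, 0.0, 0.2],
    upper=[0.4, 0.3, 0.5, 0.6],
    truth=[0.1, 0.35, 0.2, 0.3],
)
print(stats)
```

Repeating this at several nominal levels 1 − α and plotting coverage against 1 − α would give a calibration curve of the kind Figure 4 describes.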

Original abstract

The instrumental-variables (IV) setting is standard for partial identification of causal effects when unobserved confounding makes point identification impossible. Existing approaches face methodological bottlenecks: closed-form bound estimands are required -- e.g., Balke-Pearl equations in binary IV -- and even when available, designing accurate estimators requires manual effort tailored to each estimand. While direct Bayesian inference of the causal effects, instead of the bounds, circumvents these challenges, it is often computationally intensive and suffers from high prior sensitivity or under-dispersed posteriors. As a remedy, we introduce IV-ICL, an amortized Bayesian in-context learning method that learns the marginal posterior distribution of the causal effects directly and derives bounds as its quantiles. Unlike standard variational inference that optimizes exclusive KL divergence, amortized Bayesian inference minimizes the expected inclusive KL, a mass-covering objective. We empirically observe that optimizing inclusive KL can recover the entire identified set across diverse data-generating processes, while exclusive-KL (e.g. with variational inference) of the same Bayesian formulation collapses onto a single mode and fails to cover the identified set. We evaluate IV-ICL on synthetic and semi-synthetic IV benchmarks and show it produces intervals that are more reliably valid and more informative compared to efficient semi-parametric, Bayesian, and plug-in baselines, at 20-500x lower inference time. Beyond methodology, we propose a procedure to convert randomized controlled trials into IV benchmarks with provably preserved ground-truth causal effects that enables a more realistic evaluation of partial-identification methods.
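The mass-covering versus mode-seeking contrast in the abstract can be illustrated with a toy example that is not the paper's model: fit a single Gaussian to a bimodal target under each direction of the KL divergence. The bimodal target, the Gaussian family, and the grid search are all assumptions chosen to keep the sketch short.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy bimodal target on a grid (a stand-in for a posterior with separated mass).
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)

def kl(a, b):
    """Riemann-sum approximation of KL(a || b) for densities tabulated on the grid."""
    m = a > 0
    return float(np.sum(a[m] * (np.log(a[m]) - np.log(b[m]))) * dx)

def best_gaussian(objective):
    """Grid search over (mu, sigma) for the Gaussian q that minimizes objective(q)."""
    best = (np.inf, None, None)
    for mu in np.linspace(-3.0, 3.0, 61):
        for sigma in np.linspace(0.3, 3.0, 28):
            value = objective(normal_pdf(x, mu, sigma))
            if value < best[0]:
                best = (value, float(mu), float(sigma))
    return best[1], best[2]

mu_inc, sigma_inc = best_gaussian(lambda q: kl(p, q))  # inclusive: KL(p || q)
mu_exc, sigma_exc = best_gaussian(lambda q: kl(q, p))  # exclusive: KL(q || p)

print(f"inclusive fit: mu={mu_inc:.1f}, sigma={sigma_inc:.1f}")
print(f"exclusive fit: mu={mu_exc:.1f}, sigma={sigma_exc:.1f}")
```

In this toy setting the inclusive fit is a wide Gaussian centered between the modes, so its quantiles span both, while the exclusive fit locks onto one mode with a narrow spread. That is the qualitative behavior the abstract attributes to the two objectives; it says nothing about how the paper's transformer behaves on real IV data.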

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IV-ICL amortizes posterior sampling over causal effects in IV settings via in-context learning and inclusive KL, delivering faster and often tighter bounds on benchmarks but without a general proof that coverage holds outside the training distributions.

The paper's main contribution is an amortized in-context learner that takes IV data as context and directly produces samples from the marginal posterior on the causal effect, with bounds read off as quantiles. Training minimizes expected inclusive KL rather than the usual exclusive KL, which the authors show empirically spreads mass across the identified set instead of collapsing to one mode. They also introduce a clean way to repurpose RCTs as IV benchmarks while preserving the true effect for evaluation. On the synthetic and semi-synthetic cases they test, the resulting intervals are valid more often and more informative than the semi-parametric and plug-in baselines, at a large speed advantage. That practical payoff is the clearest strength.

The soft spot is the coverage claim. The argument that inclusive KL recovers the full identified set rests on observed behavior across the families used for training; there is no derivation showing the property survives for arbitrary DGPs, continuous instruments, or response surfaces outside the training support. If the amortizer under-covers on new data, the quantiles are no longer guaranteed bounds even though the underlying Bayesian model is fine. The circularity issue is minor because the posterior itself is well-defined.

This work is aimed at causal-inference researchers who want scalable partial-identification tools without hand-crafting estimators for each new estimand. Readers working on amortized inference or applied IV problems will get the most out of the benchmarks and the runtime numbers.

I would send it for peer review. The empirical results are concrete enough to justify referee time, and the method is a useful engineering step even if the generalization argument needs more support.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IV-ICL, an amortized Bayesian in-context learning method for partial identification of causal effects in instrumental variable (IV) settings. By directly learning the marginal posterior distribution of the causal effects through minimization of the inclusive KL divergence and extracting bounds as posterior quantiles, the approach aims to circumvent the need for closed-form bound estimands and manual estimator design. The key empirical observation is that inclusive KL optimization recovers the full identified set across diverse data-generating processes, in contrast to exclusive KL which collapses to a single mode. The paper evaluates this on synthetic and semi-synthetic benchmarks, reporting more valid and informative intervals than baselines at significantly lower inference time, and proposes a method to convert RCTs into IV benchmarks preserving ground-truth effects.

Significance. If the empirical observation that inclusive KL recovers the identified set generalizes, the method offers a scalable alternative to existing IV bounding techniques that avoids closed-form requirements and reduces computational burden while improving coverage over under-dispersed Bayesian or plug-in estimators. The RCT-to-IV benchmark conversion procedure is a concrete methodological contribution that could improve evaluation standards in partial identification research.
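To make "closed-form bound estimands" concrete, the sketch below computes the simpler natural IV bounds on the average treatment effect for binary instrument, treatment, and outcome. This is a swapped-in illustration, not the Balke-Pearl equations named in the abstract: the natural bounds are valid when the instrument is independent of the potential outcomes, but they can be wider than the sharp bounds. The function name and the example probabilities are hypothetical.

```python
import numpy as np

def natural_iv_bounds(p):
    """Natural bounds on the ATE for binary Z, T, Y.

    p[z, t, y] = P(T=t, Y=y | Z=z). Assumes Z is independent of the
    potential outcomes; not the sharp Balke-Pearl bounds.
    """
    p = np.asarray(p, dtype=float)
    assert p.shape == (2, 2, 2) and np.allclose(p.sum(axis=(1, 2)), 1.0)
    p_t = p.sum(axis=2)  # p_t[z, t] = P(T=t | Z=z)

    # Bounds on P(Y(1)=1): observed treated successes, plus at most all untreated.
    lo1 = np.max(p[:, 1, 1])
    hi1 = np.min(p[:, 1, 1] + p_t[:, 0])
    # Bounds on P(Y(0)=1): observed untreated successes, plus at most all treated.
    lo0 = np.max(p[:, 0, 1])
    hi0 = np.min(p[:, 0, 1] + p_t[:, 1])
    return float(lo1 - hi0), float(hi1 - lo0)

# Made-up conditional distributions with partial compliance.
p_example = [
    [[0.5, 0.2], [0.1, 0.2]],  # Z = 0: rows are T = 0, 1; columns are Y = 0, 1
    [[0.2, 0.1], [0.2, 0.5]],  # Z = 1
]
ate_lower, ate_upper = natural_iv_bounds(p_example)
print(f"ATE in [{ate_lower:.2f}, {ate_upper:.2f}]")
```

Even this simple case needs a hand-derived formula and a plug-in estimate of each conditional probability, which is the per-estimand effort the manuscript says it avoids.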

major comments (2)

[Abstract and §3 (method description)] The central claim that 'optimizing inclusive KL can recover the entire identified set across diverse data-generating processes' is supported solely by empirical behavior on the synthetic families used to train the amortizer. No derivation is given showing that the mass-covering property of the inclusive KL objective on the marginal posterior necessarily yields the full identified set for arbitrary IV models (e.g., continuous instruments, non-linear response surfaces, or DGPs outside the training support). Without this, the quantile-derived bounds lack a general validity guarantee.
[§4 (experiments)] The reported favorable results on synthetic and semi-synthetic benchmarks use data-generating processes drawn from the same distributional families employed during amortizer training. This leaves open whether the learned posterior covers the identified set on out-of-distribution DGPs; additional stress tests on held-out model classes would be required to substantiate the generalization claim.

minor comments (2)

[Abstract] The abstract states '20-500x lower inference time' without specifying the exact baseline methods, hardware, or batch sizes used for the timing comparison; this detail should be added for reproducibility.
[§3] Notation for the in-context learning model hyperparameters and the precise form of the inclusive KL objective should be introduced earlier and used consistently throughout the method section.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We agree that the central claims rest on empirical observations rather than general theory and that the experiments would benefit from explicit out-of-distribution tests. We outline targeted revisions below.

Referee: [Abstract and §3] the central claim that 'optimizing inclusive KL can recover the entire identified set across diverse data-generating processes' is supported solely by empirical behavior on the synthetic families used to train the amortizer. No derivation is given showing that the mass-covering property of the inclusive KL objective on the marginal posterior necessarily yields the full identified set for arbitrary IV models.

Authors: We agree the claim is empirical. The manuscript presents the recovery of the identified set as an observed property of inclusive-KL optimization on the families studied, in contrast to exclusive KL. We will revise the abstract and §3 to state explicitly that this is an empirical finding without a general validity guarantee for arbitrary IV models (e.g., continuous instruments or DGPs outside the training support). A brief limitations paragraph will be added noting the absence of a theoretical derivation and the need for future work on conditions under which inclusive KL covers the identified set.

Revision: yes

Referee: [§4] the reported favorable results on synthetic and semi-synthetic benchmarks use data-generating processes drawn from the same distributional families employed during amortizer training. This leaves open whether the learned posterior covers the identified set on out-of-distribution DGPs.

Authors: We acknowledge the overlap between training and test DGPs. While the semi-synthetic benchmarks introduce realistic variation, we will add new experiments using held-out model classes (different response surfaces, continuous instruments, and functional forms not seen in training) to test whether the learned posterior continues to cover the identified set on OOD DGPs. These results will be reported with the same coverage and interval-width metrics.

Revision: yes

Standing simulated objections (not resolved)

No general theoretical derivation is available showing that inclusive KL necessarily recovers the full identified set for arbitrary IV models.

Circularity Check

0 steps flagged

No circularity; central claim is empirical observation on synthetic DGPs, not reduction by construction

full rationale

The paper defines IV-ICL as an amortized Bayesian method that directly learns the marginal posterior of causal effects via inclusive KL minimization and takes quantiles as bounds. The key statement is an empirical observation ('We empirically observe that optimizing inclusive KL can recover the entire identified set across diverse data-generating processes') rather than a derivation that equates the output bounds to fitted inputs or self-cited uniqueness theorems. No equations, self-citations, or ansatzes are shown that would force the identified-set coverage by definition. Training on synthetics followed by application to new data is standard amortized inference and does not create circularity. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes that the posterior quantiles under the learned model coincide with the sharp identified set for causal effects and that the training distribution of simulated IV problems is representative enough for generalization.

free parameters (1)

in-context learning model hyperparameters
The neural network or transformer used for amortization has trainable parameters whose values are fitted during the inclusive-KL training phase.

axioms (1)

Domain assumption: The data-generating processes used for training cover the relevant class of IV problems so that the learned posterior generalizes to real data.
Invoked when claiming recovery of the identified set across diverse processes.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 1 internal anchor

[1] Atila Abdulkadiroğlu, Joshua D Angrist, Susan M Dynarski, Thomas J Kane, and Parag A Pathak. Accountability and flexibility in public schools: Evidence from Boston’s charters and pilots. The Quarterly Journal of Economics, 126(2):699–748, 2011.
[2] Joshua D Angrist and Guido W. Imbens. Identification and estimation of local average treatment effects. Econometrica, 62:467–475, 1994.
[3] Joshua D Angrist and Jörn-Steffen Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton University Press, 2009.
[4] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
[5] Stuart G Baker and Karen S Lindeman. The paired availability design: a proposal for evaluating epidural analgesia during labor. Statistics in Medicine, 13(21):2269–2278, 1994.
[6] Vahid Balazadeh, Vasilis Syrgkanis, and Rahul G Krishnan. Partial identification of treatment effects with implicit generative models. Advances in Neural Information Processing Systems, 35:22816–22829, 2022.
[7] Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, and Rahul G. Krishnan. CausalPFN: Amortized causal effect estimation via in-context learning. In Advances in Neural Information Processing Systems, volume 38, 2025.
[8] Alexander Balke and Judea Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty in Artificial Intelligence, pages 46–54. Elsevier, 1994.
[9] Alexander Balke and Judea Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176, 1997.
[10] Lucius EJ Bynum, Aahlad Manas Puli, Diego Herrero-Quevedo, Nhi Nguyen, Carlos Fernandez-Granda, Kyunghyun Cho, and Rajesh Ranganath. Black box causal inference: Effect estimation via meta prediction. arXiv:2503.05985, 2025.
[11] Zhihong Cai, Manabu Kuroki, and Tosiya Sato. Non-parametric bounds on treatment effects with non-compliance by covariate adjustment. Statistics in Medicine, 26(16):3188–3204, 2007.
[12] David Maxwell Chickering and Judea Pearl. A clinician’s tool for analyzing non-compliance. In Proceedings of the National Conference on Artificial Intelligence, pages 1269–1276, 1996.
[13] Carlos Cinelli, Avi Feller, Guido Imbens, Edward Kennedy, Sara Magliacane, and Jose Zubizarreta. Challenges in statistics: A dozen challenges in causality and causal inference. arXiv preprint arXiv:2508.17099, 2025.
[14] A Philip Dawid. Causal inference using influence diagrams: the problem of partial compliance. Oxford Statistical Science Series, pages 45–65, 2003.
[15] Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and Ilya Shpitser. An automated approach to causal inference in discrete settings. Journal of the American Statistical Association, 119(547):1778–1793, 2024.
[16] Constantine E Frangakis and Donald B Rubin. Principal stratification in causal inference. Biometrics, 58(1):21–29, 2002.
[17] Alexander M Franks, Alexander D’Amour, and Avi Feller. Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association, 2020.
[18] Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.
[19] Sander Greenland. Relaxation penalties and priors for plausible modeling of nonidentified bias sources. Statistical Science, 24:195–210, 2009.
[20] Paul Gustafson. On Model Expansion, Model Contraction, Identifiability and Prior Information: Two Illustrative Scenarios Involving Mismeasured Variables. Statistical Science, 20(2):111–140, 2005.
[21] Paul Gustafson. Bayesian inference for partially identified models. The International Journal of Biostatistics, 6(2), 2010.
[22] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 6-11 August 2017, pages 1–9, 2017.
[23] Keisuke Hirano, Guido W Imbens, Donald B Rubin, and Xiao-Hua Zhou. Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88, 2000.
[24] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In The Eleventh International Conference on Learning Representations, 2023.
[25] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326, 2025.
[26] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
[27] Andrew Jesson, Sören Mindermann, Uri Shalit, and Yarin Gal. Identifying causal-effect inference failure with uncertainty-aware models. Advances in Neural Information Processing Systems, 33:11637–11649, 2020.
[28] Qu Jingang, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. In Forty-second International Conference on Machine Learning, 2025.
[29] Garrett A Johnson, Randall A Lewis, and Elmar I Nubbemeyer. Ghost ads: Improving the economics of measuring online ad effectiveness. Journal of Marketing Research, 54(6):867–884, 2017.
[30] Niki Kilbertus, Matt J Kusner, and Ricardo Silva. A class of algorithms for general instrumental variable models. Advances in Neural Information Processing Systems, 33:20108–20119, 2020.
[31] Christian Kleiber and Achim Zeileis. Applied Econometrics with R. Springer-Verlag, New York. doi: 10.1007/978-0-387-77318-6.
[33] Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.
[34] Alexander W. Levis, Matteo Bonvini, Zhenghao Zeng, Luke Keele, and Edward H. Kennedy. Covariate-assisted bounds on causal effects with instrumental variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2025.
[35] Dustin M Long and Michael G Hudgens. Sharpening bounds on principal effects with covariates. Biometrics, 69(4):812–819, 2013.
[36] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.
[37] Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L Caterini, and Maksims Volkovs. TabDPT: Scaling tabular foundation models on real data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[38] Yuchen Ma, Dennis Frauen, Emil Javurek, and Stefan Feuerriegel. Foundation models for causal inference via prior-data fitted networks. In The Fourteenth International Conference on Learning Representations, 2026.
[39] Divyat Mahajan, Jannes Gladrow, Agrin Hilmkil, Cheng Zhang, and Meyer Scetbon. Amortized inference of causal models via conditional fixed-point iterations. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. J2C Certification.
[40] Charles F Manski. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319–323, 1990.
[41] Charles F Manski. Partial identification of probability distributions. Springer, 2003.
[42] Valentyn Melnychuk, Dennis Frauen, Maresa Schröder, and Stefan Feuerriegel. Frequentist consistency of prior-data fitted networks for causal inference. arXiv preprint arXiv:2603.12037, 2026.
[43] Frederick Mosteller. The Tennessee study of class size in the early school grades. The Future of Children, pages 113–127, 1995.
[44] Francisco Mourao, David Hajage, Daria Bystrova, Bertrand Bouvarel, Nathanaël Lapidus, Fabrice Carrat, and Benjamin Glemain. Prior-data fitted networks for causal inference: a simulation study with real-world scenarios. arXiv preprint arXiv:2603.15928, 2026.
[45] Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers Can Do Bayesian Inference. In International Conference on Learning Representations, 2022.
[46] Thomas Nagler. Statistical foundations of prior-data fitted networks. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 25660–25676, 2023.
[47] Kirtan Padh, Jakob Zeitler, David Watson, Matt Kusner, Ricardo Silva, and Niki Kilbertus. Stochastic causal programming for bounding treatment effects. In Conference on Causal Learning and Reasoning, pages 142–176. PMLR, 2023.
[48] Judea Pearl. Causality. Cambridge University Press, 2009.
[49] Olav Reiersøl. Confluence analysis by means of instrumental sets of variables. PhD thesis, Almqvist & Wiksell, 1945.
[50] Amy Richardson, Michael G Hudgens, Peter B Gilbert, and Jason P Fine. Nonparametric bounds and sensitivity analysis of treatment effects. Statistical science: a review journal of the Institute of Mathematical Statistics, 29(4):596, 2015.
[51] Thomas S Richardson, Robin J Evans, and James M Robins. Transparent parameterizations of models for potential outcomes. Bayesian Statistics, 9:569–610, 2011.
[52] Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, and Bernhard Schölkopf. Do-PFN: In-context learning for causal effect estimation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[53] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
[54] Donald B Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, pages 34–58, 1978.
[55] Donald B Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
[56] Michael C Sachs, Gustav Jonzon, Arvid Sjölander, and Erin E Gabriel. A general method for deriving tight symbolic bounds on causal effects. Journal of Computational and Graphical Statistics, 32(2):567–576, 2023.
[57] Jonas Schweisthal, Dennis Frauen, Maresa Schröder, Konstantin Hess, Niki Kilbertus, and Stefan Feuerriegel. Learning representations of instruments for partial identification of treatment effects. In ICLR 2025 Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2025
[58] Ricardo Silva and Robin Evans. Causal inference through a witness protection program. Journal of Machine Learning Research, 17(56):1–53, 2016
[59] Sonja A Swanson, Miguel A Hernán, Matthew Miller, James M Robins, and Thomas S Richardson. Partial identification of the average treatment effect using instrumental variables: review of methods for binary instruments, treatments, and outcomes. Journal of the American Statistical Association, 113(522):933–947, 2018
[60] Jin Tian and Judea Pearl. Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1):287–313, 2000
[61] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics, pages 567–574. PMLR, 2009
[62] Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Xiaomin Lu. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in medicine, 27(23):4658–4677, 2008
[63] Tyler J. VanderWeele and Ilya Shpitser. On the definition of a confounder. Annals of statistics, 41(1):196–220, 2013
[64] Justin Whitehouse, Morgane Austern, and Vasilis Syrgkanis. Inference on optimal policy values and other irregular functionals via smoothing. arXiv preprint arXiv:2507.11780, 2025
[65] Philip Green Wright. The tariff on animal and vegetable oils. Macmillan, 1928
[66] Kevin Xia, Yushu Pan, and Elias Bareinboim. Neural causal models for counterfactual identification and estimation. arXiv preprint arXiv:2210.00035, 2022
[67] Jiaqi Zhang, Joel Jennings, Agrin Hilmkil, Nick Pawlowski, Cheng Zhang, and Chao Ma. Towards causal foundation model: on duality between causal inference and attention. arXiv preprint arXiv:2310.00809, 2023
[68] Junzhe Zhang and Elias Bareinboim. Bounding causal effects on continuous outcome. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12207–12215, 2021
[69] Junzhe Zhang and Elias Bareinboim. Non-parametric methods for partial identification of causal effects. Columbia CausalAI Laboratory Technical Report, 2021

Appendix Contents

A Balke-Pearl Equations
B Inclusive KL Equivalence
C Training Details and Inference
D Proof of Proposition 1
E Details of the Benchmark
E.1 Details of the Synthetic ...

1. Covariates: Sample $X \in \mathbb{R}^{n \times d}$ with $d \sim \mathrm{Unif}\{5, 6, 7, 8, 9, 10\}$, where each entry is drawn from either $\mathcal{N}(5, 1)$ or $\mathrm{Unif}(-10, 5)$, chosen randomly.

2. Instrument generation: Generate weights $w_Z \in \mathbb{R}^{d}$ from either $\mathcal{N}(1, 2)$ or $\mathrm{Unif}(-2, 2)$. Compute logits as $\ell_Z = X w_Z + \varepsilon_Z$, where $\varepsilon_Z$ is noise from either $\mathcal{N}(0, 1)$ or $\mathrm{Laplace}(0, 1)$. Standardize: $\tilde{\ell}_Z = (\ell_Z - \bar{\ell}_Z) / \mathrm{std}(\ell_Z)$. Sample $Z \sim \mathrm{Bernoulli}(\sigma(\tilde{\ell}_Z))$.

3. Potential treatment/outcome: Generate weights $W \in \mathbb{R}^{d \times 16}$ and compute logits $L = XW + E$, where $E \in \mathbb{R}^{n \times 16}$ is noise. Apply row-wise softmax to obtain strata probabilities $P \in \mathbb{R}^{n \times 16}$, where each row sums to 1.

4. Treatment and outcome strata: The 16 columns correspond to combinations of:
• Treatment strata: Always-Takers (AT), Never-Takers (NT), Defiers (DE), Compliers (C...

Observable generation: For each unit $i$, sample the stratum from the categorical distribution defined by $P_i$, then determine $(T_i, Y_i)$ based on $Z_i$ and the sampled stratum.
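
The generation steps described above (covariates, instrument, strata probabilities, observables) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function name `sample_synthetic_iv` and the tables `T_RESPONSE` and `Y_RESPONSE` are hypothetical. The following details are not specified in the visible text and are assumed here: each random choice between two distributions is made once per dataset; the second parameter of $\mathcal{N}(1, 2)$ is a standard deviation; $W$ and $E$ are standard normal; column $k$ of $P$ encodes treatment stratum `k // 4` and outcome stratum `k % 4`; and the four outcome strata are the four binary response functions of $T$.

```python
import numpy as np

# Rows: AT, NT, DE, CO; columns: Z = 0, Z = 1. Entry is the treatment T(Z).
T_RESPONSE = np.array([[1, 1], [0, 0], [1, 0], [0, 1]])
# ASSUMPTION: outcome strata are the four response functions of T.
# Rows: always 0, always 1, Y = T, Y = 1 - T; columns: T = 0, T = 1.
Y_RESPONSE = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def sample_synthetic_iv(n, rng):
    """Draw one synthetic IV dataset (hypothetical helper)."""
    # Step 1: covariates, d ~ Unif{5, ..., 10}.
    d = int(rng.integers(5, 11))
    if rng.random() < 0.5:  # ASSUMPTION: one choice per dataset
        X = rng.normal(5.0, 1.0, size=(n, d))
    else:
        X = rng.uniform(-10.0, 5.0, size=(n, d))

    # Step 2: instrument from standardized noisy logits.
    if rng.random() < 0.5:
        w_z = rng.normal(1.0, 2.0, size=d)  # ASSUMPTION: 2 is a std
    else:
        w_z = rng.uniform(-2.0, 2.0, size=d)
    if rng.random() < 0.5:
        eps = rng.normal(0.0, 1.0, size=n)
    else:
        eps = rng.laplace(0.0, 1.0, size=n)
    logit_z = X @ w_z + eps
    logit_z = (logit_z - logit_z.mean()) / logit_z.std()
    Z = rng.binomial(1, sigmoid(logit_z))

    # Step 3: strata probabilities via a row-wise softmax.
    W = rng.normal(size=(d, 16))            # ASSUMPTION: standard normal
    L = X @ W + rng.normal(size=(n, 16))    # ASSUMPTION: Gaussian noise E
    L -= L.max(axis=1, keepdims=True)       # numerical stability
    P = np.exp(L)
    P /= P.sum(axis=1, keepdims=True)

    # Observable generation: inverse-CDF draw of one stratum per unit.
    u = rng.random(n)
    strata = np.minimum((P.cumsum(axis=1) < u[:, None]).sum(axis=1), 15)
    t_type, y_type = strata // 4, strata % 4  # ASSUMPTION: column layout
    T = T_RESPONSE[t_type, Z]
    Y = Y_RESPONSE[y_type, T]
    return X, Z, T, Y, P
```

Because the strata probabilities `P` are returned alongside the observables, the observational probabilities needed for the ground-truth bounds can be computed from `P` directly rather than estimated from the sampled `(Z, T, Y)`.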

Ground-truth bounds: Compute the observational probabilities $p_{yt.z}(x_i)$ analytically from the strata probabilities, then apply the Balke-Pearl equations to obtain $\ell(x_i)$ and $u(x_i)$.

E.2 Details of the Jobs Benchmark

The original National Supported Work (NSW) Demonstration is an RCT evaluating job training effects on earnings. It includes the following covariat...

The treatment is a binary indicator of assignment to the job training program. Finally, the outcome variable is the amount of earnings in 1978 (re78). We apply log transforms to the outputs to get a less skewed outcome distribution: $\mathrm{re74} \leftarrow \log(\mathrm{re74} + 1)$, $\mathrm{re75} \leftarrow \log(\mathrm{re75} + 1)$, $Y \leftarrow \log(\mathrm{re78} + 1)$.

Covariate Split. We split the features into observed covariates (O), which...

