Estimating Joint Interventional Distributions from Marginal Interventional Data
Pith reviewed 2026-05-23 20:56 UTC · model grok-4.3
The pith
Marginal interventional distributions over variable subsets suffice to recover the joint interventional distribution over all variables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extending the Causal Maximum Entropy objective to include interventional constraints yields, by Lagrange duality, a solution in the exponential family. When marginal interventional distributions are provided for any subset of the variables, the same objective recovers the joint interventional distribution over the full set and also enables causal feature selection from mixed observational and single-variable interventional data.
What carries the argument
The extended Causal Maximum Entropy objective with interventional constraints, solved via its Lagrange dual to produce an exponential-family distribution.
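For orientation, the textbook maximum-entropy result that this argument builds on can be written out in a few lines. This is the standard MaxEnt derivation on a finite domain, not the paper's exact Causal MaxEnt objective; the feature functions $f_k$, targets $c_k$, and multipliers $\lambda_k$, $\mu$ are notation assumed here for illustration.

```latex
\begin{aligned}
&\max_{p}\; H(p) = -\sum_{x} p(x)\log p(x)
\quad\text{s.t.}\quad \sum_{x} p(x)\, f_k(x) = c_k \;\; (k = 1,\dots,K),
\qquad \sum_{x} p(x) = 1, \\[4pt]
&\mathcal{L}(p,\lambda,\mu) = -\sum_{x} p(x)\log p(x)
 + \sum_{k} \lambda_k \Big( \sum_{x} p(x)\, f_k(x) - c_k \Big)
 + \mu \Big( \sum_{x} p(x) - 1 \Big), \\[4pt]
&\frac{\partial \mathcal{L}}{\partial p(x)}
 = -\log p(x) - 1 + \sum_{k} \lambda_k f_k(x) + \mu = 0
\;\Longrightarrow\;
p_\lambda(x) = \frac{1}{Z(\lambda)} \exp\Big( \sum_{k} \lambda_k f_k(x) \Big), \\[4pt]
&Z(\lambda) = \sum_{x} \exp\Big( \sum_{k} \lambda_k f_k(x) \Big).
\end{aligned}
```

A marginal constraint on discrete variables is the special case in which each $f_k$ is an indicator of one cell of the marginal table. How the paper encodes interventional marginals as such constraints is not stated in the abstract.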
If this is right
- Causal feature selection can be performed from a mixture of observational data and single-variable interventional data, outperforming prior merging methods on synthetic examples.
- The recovered joint interventional distributions match the performance of tests that require full joint observations.
- The exponential-family form supplies an explicit parametric representation for any collection of marginal interventional constraints.
Where Pith is reading between the lines
- If the causal graph is known, the same dual construction could be used to propagate constraints across unobserved interventions.
- The method suggests a data-collection strategy in which separate experiments each intervene on only a few variables, with the joint recovered afterward.
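The idea of recovering a joint from separately collected marginals can be illustrated with iterative proportional fitting (IPF), a classical procedure that is swapped in here for illustration and is not the paper's solver. The sketch below assumes three binary variables, a hypothetical chain-structured table `truth`, and pairwise marginals supplied only for the pairs (0, 1) and (1, 2); the function name `ipf_pairwise` is likewise an assumption. Plain probability tables stand in for interventional marginals.

```python
import numpy as np

def ipf_pairwise(targets, shape, n_iter=50):
    """Iterative proportional fitting started from the uniform table.

    targets maps an axis pair (i, j) with i < j to the target 2-D marginal.
    From a uniform start, IPF converges to the maximum-entropy joint among
    those matching all targets, provided the targets are consistent.
    """
    joint = np.full(shape, 1.0 / np.prod(shape))
    axes = range(len(shape))
    for _ in range(n_iter):
        for (i, j), target in targets.items():
            summed_out = tuple(a for a in axes if a not in (i, j))
            current = joint.sum(axis=summed_out, keepdims=True)
            # Reshape the target so it broadcasts along the summed-out axes.
            bshape = [1] * len(shape)
            bshape[i], bshape[j] = shape[i], shape[j]
            ratio = np.divide(target.reshape(bshape), current,
                              out=np.zeros_like(current), where=current > 0)
            joint = joint * ratio
    return joint

# Hypothetical chain X0 -> X1 -> X2 used only to generate consistent marginals.
p0 = np.array([0.6, 0.4])
p1_given_0 = np.array([[0.8, 0.2], [0.3, 0.7]])
p2_given_1 = np.array([[0.9, 0.1], [0.25, 0.75]])
truth = np.einsum("a,ab,bc->abc", p0, p1_given_0, p2_given_1)

targets = {(0, 1): truth.sum(axis=2), (1, 2): truth.sum(axis=0)}
est = ipf_pairwise(targets, truth.shape)
print(np.allclose(est, truth))
```

Recovery is exact in this toy case only because the chain satisfies the conditional independence of X0 and X2 given X1 that the maximum-entropy solution imposes. For a table without that structure, the same two marginals would yield a joint that matches the constraints but not the truth.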
Load-bearing premise
Marginal interventional distributions supplied for arbitrary subsets of variables are together sufficient to uniquely determine the joint interventional distribution over all variables.
What would settle it
A concrete data-generating process in which two different joint interventional distributions produce identical marginal interventional distributions on every proper subset, yet differ on the full joint.
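At the level of plain probability tables, such a pair is easy to construct. The sketch below uses the classical parity example on three binary variables: a uniform joint and a joint supported on even-parity states agree on every marginal over a proper subset yet differ as joints. Treating these tables as interventional distributions is an assumption made for illustration; whether a causal model satisfying the paper's conditions produces them is not checked here.

```python
import itertools
import numpy as np

states = list(itertools.product([0, 1], repeat=3))

# Joint A: uniform over all eight states.
joint_a = np.full((2, 2, 2), 1 / 8)

# Joint B: uniform over the four even-parity states (x2 = x0 XOR x1).
joint_b = np.zeros((2, 2, 2))
for x in states:
    if sum(x) % 2 == 0:
        joint_b[x] = 1 / 4

def marginal(joint, keep):
    """Sum out every axis that is not listed in keep."""
    drop = tuple(a for a in range(joint.ndim) if a not in keep)
    return joint.sum(axis=drop)

# All non-empty proper subsets of the three variables.
proper_subsets = [s for r in (1, 2) for s in itertools.combinations(range(3), r)]

same_marginals = all(
    np.allclose(marginal(joint_a, s), marginal(joint_b, s)) for s in proper_subsets
)
joints_differ = not np.allclose(joint_a, joint_b)
print(same_marginals, joints_differ)
```

Given these marginals, a maximum-entropy procedure would return joint A, since the uniform table has the larger entropy. That answer is correct only if A is in fact the data-generating joint, which is the gap the premise has to close.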
Original abstract
In this paper we show how to exploit interventional data to acquire the joint conditional distribution of all the variables using the Maximum Entropy principle. To this end, we extend the Causal Maximum Entropy method to make use of interventional data in addition to observational data. Using Lagrange duality, we prove that the solution to the Causal Maximum Entropy problem with interventional constraints lies in the exponential family, as in the Maximum Entropy solution. Our method allows us to perform two tasks of interest when marginal interventional distributions are provided for any subset of the variables. First, we show how to perform causal feature selection from a mixture of observational and single-variable interventional data, and, second, how to infer joint interventional distributions. For the former task, we show on synthetically generated data, that our proposed method outperforms the state-of-the-art method on merging datasets, and yields comparable results to the KCI-test which requires access to joint observations of all variables.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the Causal Maximum Entropy framework to incorporate interventional data alongside observational data. Using Lagrange duality, it claims to prove that the optimizer under interventional constraints remains in the exponential family. The method is applied to two tasks: causal feature selection from a mixture of observational and single-variable interventional data, and recovery of joint interventional distributions from marginal interventional distributions supplied for arbitrary subsets of variables. Synthetic experiments indicate that the approach outperforms dataset-merging baselines for feature selection and performs comparably to the KCI test (which requires joint observations).
Significance. If the duality argument is rigorous and the supplied marginal interventional constraints suffice for unique recovery of the joint, the work supplies a principled maximum-entropy route for fusing observational and interventional data without requiring full joint observations. The exponential-family preservation result would be a clean theoretical contribution, and the feature-selection experiments on synthetic data provide concrete empirical grounding.
major comments (2)
- [Abstract / identifiability section] Abstract and the section presenting the identifiability claim: the statement that the method recovers the joint interventional distribution “when marginal interventional distributions are provided for any subset of the variables” is load-bearing for both claimed tasks. No explicit identifiability theorem or graph-dependent conditions are supplied showing that the marginal interventional constraints uniquely determine the joint; multiple joints can agree on the same do-marginals when intervened subsets leave paths or components unconstrained.
- [Theoretical development / duality argument] The Lagrange-duality proof (referenced in the abstract and presumably in the main theoretical section): the claim that the solution remains in the exponential family under interventional constraints is central, yet the manuscript provides neither the explicit dual derivation nor the encoding of the marginal interventional expectations as constraints. Without these details the preservation result cannot be verified.
minor comments (2)
- [Experiments] The synthetic-data section should report the precise data-generating process, the number of variables, the fraction of interventional samples, and the exact performance metrics (beyond “outperforms”) so that the feature-selection comparison can be reproduced.
- [Notation / method section] Notation for the interventional constraints (e.g., how P(V_S | do(V_T)) is written inside the extended Causal MaxEnt objective) should be introduced once and used consistently.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas where the manuscript requires greater rigor and explicit detail. We address each major comment below and will incorporate the necessary revisions.
Referee: [Abstract / identifiability section] Abstract and the section presenting the identifiability claim: the statement that the method recovers the joint interventional distribution “when marginal interventional distributions are provided for any subset of the variables” is load-bearing for both claimed tasks. No explicit identifiability theorem or graph-dependent conditions are supplied showing that the marginal interventional constraints uniquely determine the joint; multiple joints can agree on the same do-marginals when intervened subsets leave paths or components unconstrained.
Authors: We acknowledge that the current version does not supply an explicit identifiability theorem with graph-dependent conditions. The claim in the abstract and introduction is intended to hold under the maximum-entropy principle when the supplied marginal interventional constraints are sufficient to pin down the joint, but we agree that uniqueness is not automatic for arbitrary subsets. In the revision we will add a dedicated identifiability subsection that states the precise conditions (e.g., when the collection of intervened variable sets covers all relevant causal paths or satisfies a covering criterion on the underlying DAG) under which the joint interventional distribution is uniquely recoverable from the given marginals. revision: yes
Referee: [Theoretical development / duality argument] The Lagrange-duality proof (referenced in the abstract and presumably in the main theoretical section): the claim that the solution remains in the exponential family under interventional constraints is central, yet the manuscript provides neither the explicit dual derivation nor the encoding of the marginal interventional expectations as constraints. Without these details the preservation result cannot be verified.
Authors: The manuscript sketches the Lagrange-duality argument but does not expand the full derivation or the precise encoding of interventional marginals. We will revise the theoretical section to include the complete steps: (i) formulation of the constrained optimization problem that augments the observational entropy objective with both observational and interventional moment-matching constraints, (ii) construction of the Lagrangian that incorporates the do-marginal expectations as linear constraints on the interventional distributions, (iii) derivation of the dual problem, and (iv) explicit verification that the resulting primal optimizer belongs to the exponential family with parameters that absorb the interventional Lagrange multipliers. This will make the preservation result directly verifiable. revision: yes
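Steps (i) to (iv) can be checked numerically on a toy problem. The sketch below is the generic MaxEnt dual on a four-state space, not the paper's Causal MaxEnt objective: the feature functions, the target moments in `targets`, the step size, and the iteration count are all assumptions chosen for illustration. It minimizes the dual $\log Z(\lambda) - \lambda^\top c$ by gradient descent, using the fact that the dual gradient is the gap between model moments and target moments.

```python
import numpy as np

# States of two binary variables (x1, x2) and three hypothetical features.
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
features = np.column_stack(
    [states[:, 0], states[:, 1], states[:, 0] * states[:, 1]]
)
# Assumed target moments E[x1], E[x2], E[x1 * x2].
targets = np.array([0.6, 0.3, 0.2])

def model(lam):
    """Exponential-family distribution p_lam(x) proportional to exp(lam . f(x))."""
    logits = features @ lam
    weights = np.exp(logits - logits.max())  # subtract max for stability
    return weights / weights.sum()

# Gradient descent on the convex dual g(lam) = log Z(lam) - lam . targets.
lam = np.zeros(3)
for _ in range(5000):
    grad = model(lam) @ features - targets  # model moments minus targets
    lam -= 1.0 * grad

p_hat = model(lam)
fitted_moments = p_hat @ features
print(np.round(fitted_moments, 6))
```

With three independent moment constraints plus normalization on four states, the feasible set here is a single distribution, so the fit coincides with it. In the setting the referee describes, the constraints would be moments of do-marginals and would generally leave more freedom.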
Circularity Check
No circularity: derivation follows from standard Lagrange duality on extended MaxEnt
full rationale
The abstract describes extending Causal MaxEnt with interventional constraints and applying Lagrange duality to obtain an exponential-family solution. This is a direct consequence of the optimization problem definition and does not reduce any target quantity (joint interventional distribution) to a fitted parameter or self-citation by construction. No load-bearing steps match the enumerated circularity patterns; the method is presented as building on the established MaxEnt principle with an independent duality argument. The identifiability of joints from marginals is an assumption whose validity is external to the derivation chain itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- Lagrange multipliers for interventional constraints
axioms (2)
- domain assumption: The maximum entropy principle remains valid when observational and interventional marginal constraints are combined in a causal setting.
- standard math: Lagrange duality applies directly to the Causal MaxEnt objective with the added interventional constraints.