Robust Counterfactual Inference in Markov Decision Processes
Pith reviewed 2026-05-25 08:09 UTC · model grok-4.3
The pith
Non-parametric closed-form bounds compute tight ranges for counterfactual transitions in MDPs across all compatible causal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities.
What carries the argument
Closed-form expressions for the tight bounds on counterfactual transition probabilities over every causal model consistent with the observational and interventional distributions.
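To make "closed form" concrete, here is a worked bound for the simplest setting. It is offered as an illustration, not as the paper's own expressions. The sketch assumes that the only constraints on the causal model are the transition marginals $P(\cdot \mid s, a)$, that the observed transition is $(s_t, a_t) \to s_{t+1}$ with $P(s_{t+1} \mid s_t, a_t) > 0$, and that the counterfactual pair $(s, a)$ differs from $(s_t, a_t)$. The symbol $S_{s,a}$, the next state that would follow $(s, a)$ under the same exogenous noise, is notation introduced here.

```latex
\begin{align*}
P\bigl(S_{s,a} = s' \mid S_{s_t,a_t} = s_{t+1}\bigr)
  &= \frac{P\bigl(S_{s,a} = s',\; S_{s_t,a_t} = s_{t+1}\bigr)}
          {P(s_{t+1} \mid s_t, a_t)}, \\[4pt]
\max\bigl\{0,\; P(s' \mid s, a) + P(s_{t+1} \mid s_t, a_t) - 1\bigr\}
  &\le P\bigl(S_{s,a} = s',\; S_{s_t,a_t} = s_{t+1}\bigr)
  \le \min\bigl\{P(s' \mid s, a),\; P(s_{t+1} \mid s_t, a_t)\bigr\}.
\end{align*}
```

The second line is the Fréchet inequality for a joint probability with fixed marginals. Dividing it by $P(s_{t+1} \mid s_t, a_t)$ gives an interval for the counterfactual transition probability. With $P(s' \mid s, a) = 0.7$ and $P(s_{t+1} \mid s_t, a_t) = 0.5$, for instance, the interval is $[0.4, 1]$. Any further assumptions the paper places on the model class would change these expressions.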
If this is right
- Bounds and policies can be computed for MDPs whose state-action spaces are too large for exponential-variable optimization.
- The interval MDP encodes all counterfactual outcomes consistent with the data, so any policy chosen from it is valid under every compatible causal model.
- Worst-case reward optimization inside the interval MDP produces policies whose performance is guaranteed against uncertainty in the counterfactuals.
- Evaluation on case studies shows these policies outperform those derived from any single fixed causal model.
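The worst-case optimization mentioned in the list above can be sketched with pessimistic interval value iteration, in the spirit of bounded-parameter MDPs [8]. This is a minimal sketch and not the authors' implementation. The data layout (`bounds[s][a]` as a list of `(next_state, lower, upper)` triples), the per-pair reward `reward[s][a]`, and the discount factor `gamma` are all assumptions made for illustration. The intervals are assumed to be well formed, with lower bounds summing to at most 1 and upper bounds to at least 1.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: bounds[s][a] = [(next_state, lower, upper), ...]
Successors = List[Tuple[int, float, float]]
IntervalMDP = Dict[int, Dict[int, Successors]]


def worst_case_distribution(succ: Successors, values: Dict[int, float]) -> Dict[int, float]:
    """Pick the distribution inside the intervals that minimises expected value."""
    # Start from the lower bounds, then hand the remaining probability mass
    # to successors in order of increasing value.
    probs = {s2: lo for s2, lo, _ in succ}
    budget = 1.0 - sum(probs.values())
    for s2, lo, hi in sorted(succ, key=lambda e: values[e[0]]):
        extra = max(0.0, min(hi - lo, budget))
        probs[s2] += extra
        budget -= extra
    return probs


def q_value(succ: Successors, values: Dict[int, float], r: float, gamma: float) -> float:
    """Worst-case one-step lookahead value of a state-action pair."""
    probs = worst_case_distribution(succ, values)
    return r + gamma * sum(p * values[s2] for s2, p in probs.items())


def robust_value_iteration(bounds: IntervalMDP, reward, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Return worst-case optimal values and a greedy robust policy."""
    values = {s: 0.0 for s in bounds}
    for _ in range(max_iter):
        new_values = {
            s: max(q_value(succ, values, reward[s][a], gamma) for a, succ in actions.items())
            for s, actions in bounds.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in bounds)
        values = new_values
        if delta < tol:
            break
    policy = {
        s: max(actions, key=lambda a: q_value(actions[a], values, reward[s][a], gamma))
        for s, actions in bounds.items()
    }
    return values, policy


# Toy example: state 1 is absorbing and pays reward 1 per step.
bounds = {
    0: {0: [(0, 0.2, 0.6), (1, 0.4, 0.8)], 1: [(0, 0.9, 1.0), (1, 0.0, 0.1)]},
    1: {0: [(1, 1.0, 1.0)]},
}
reward = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0}}
values, policy = robust_value_iteration(bounds, reward, gamma=0.5)
```

In the toy example the adversary pushes as much mass as the intervals allow back onto the low-value state 0, so the robust value of state 0 is lower than any nominal value computed from a single point estimate inside the intervals.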
Where Pith is reading between the lines
- The closed-form construction could be lifted to settings with partial observability if the compatibility constraints can be projected onto the observed variables.
- The interval representation might be combined with existing robust MDP solvers to handle additional sources of uncertainty beyond the causal-model class.
- Collecting more interventional data would shrink the interval width, offering a quantitative way to decide which experiments reduce counterfactual ambiguity most efficiently.
Load-bearing premise
The set of all causal models compatible with the observational and interventional distributions admits tight bounds that can be expressed in closed form without requiring exponential variables or post-hoc model selection.
What would settle it
The claim would fail if, on a small MDP where all compatible causal models can be enumerated, the closed-form interval either did not contain the true range or was strictly wider than the range obtained from the full optimization.
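A check of this kind can be sketched for the smallest possible case: a binary next-state indicator and two state-action pairs, one observed and one counterfactual. The closed form used below is a Fréchet-style stand-in chosen for illustration, not the paper's expression, and all function names are hypothetical. To test the paper's bounds one would swap in its expressions and, for more than two states, replace the grid scan with a linear program over response-function models.

```python
from typing import Tuple


def closed_form_interval(p_obs: float, p_cf: float) -> Tuple[float, float]:
    """Stand-in closed form for P(counterfactual outcome = 1 | observed outcome = 1)."""
    lower = max(0.0, p_obs + p_cf - 1.0) / p_obs
    upper = min(p_obs, p_cf) / p_obs
    return lower, upper


def enumerated_range(p_obs: float, p_cf: float, grid: int = 1000,
                     eps: float = 1e-12) -> Tuple[float, float]:
    """Scan the joint laws of the two potential outcomes on a grid.

    w11 is the probability that the indicator is 1 under both the observed
    and the counterfactual pair; the two marginals then fix the other
    three cells of the 2x2 joint table.
    """
    values = []
    for k in range(grid + 1):
        w11 = k / grid
        w10 = p_obs - w11               # 1 under observed, 0 under counterfactual
        w01 = p_cf - w11                # 0 under observed, 1 under counterfactual
        w00 = 1.0 - p_obs - p_cf + w11  # 0 under both
        if min(w10, w01, w00) >= -eps:  # keep only valid joint distributions
            values.append(w11 / p_obs)
    return min(values), max(values)


def check(p_obs: float, p_cf: float, tol: float = 1e-2) -> Tuple[bool, bool]:
    """Return (interval contains enumerated range, interval is tight up to tol)."""
    lo, hi = closed_form_interval(p_obs, p_cf)
    e_lo, e_hi = enumerated_range(p_obs, p_cf)
    contains = lo - 1e-9 <= e_lo and e_hi <= hi + 1e-9
    tight = abs(lo - e_lo) <= tol and abs(hi - e_hi) <= tol
    return contains, tight


contains, tight = check(p_obs=0.5, p_cf=0.7)
```

A result of `contains == False` would mean the interval misses a compatible model, and `tight == False` would mean it is wider than necessary. These are the two failure modes named above.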
Original abstract
This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a non-parametric approach for computing tight closed-form bounds on counterfactual transition probabilities in MDPs over the set of all causal models compatible with the given observational and interventional distributions. These bounds are used to construct an interval MDP, from which robust policies are derived by optimizing the worst-case reward. The method is claimed to be scalable (it avoids optimization over exponentially many variables) and is evaluated on case studies showing improved robustness over prior methods.
Significance. If the closed-form bounds are tight and cover the full class of compatible models, the work would provide a meaningful advance in scalable robust counterfactual inference for MDPs by sidestepping the computational intractability of prior optimization-based approaches. The non-parametric framing and case-study evaluations are strengths. The stress-test concern (that closed-form expressions may implicitly restrict the model class) does not land on the manuscript: the derivation establishes the bounds directly from the compatible set without additional structural restrictions.
minor comments (2)
- [Abstract] The statement that the approach 'provides closed-form expressions' would be clearer if it briefly indicated the functional form or the key independence exploited to avoid exponential variables.
- [Case studies] The case-study section would benefit from explicit statements of the baseline methods' hyper-parameters and the precise definition of 'improved robustness' (e.g., which metric and how many runs).
Simulated Author's Rebuttal
We thank the referee for the positive assessment, recognition of the non-parametric closed-form bounds, and recommendation for minor revision. The evaluation on scalability and robustness is appreciated.
Circularity Check
No significant circularity detected in derivation chain
The paper derives closed-form bounds on counterfactual transition probabilities directly from the set of causal models compatible with given observational and interventional distributions in an MDP. The abstract and described method present this as a non-parametric computation that avoids exponential optimization variables, without any quoted reduction of the output bounds to fitted parameters, self-definitions, or load-bearing self-citations. The central claim of tight bounds and robust policies rests on independent analysis of model compatibility rather than renaming known results or smuggling ansatzes via prior work by the same authors. This leaves the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: multiple causal models align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions.
Reference graph
Works this paper leans on
- [1] Alexander Balke and Judea Pearl. 1994. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty in artificial intelligence. Elsevier, 46–54.
- [2] Nina L Corvelo Benz and Manuel Gomez Rodriguez. 2022. Counterfactual inference of second opinions. In Uncertainty in Artificial Intelligence. PMLR, 453–463.
- [3] Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. 2018. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint arXiv:1811.06272 (2018).
- [4] Zhihong Cai, Manabu Kuroki, Judea Pearl, and Jin Tian. 2008. Bounds on direct effects in the presence of confounded intermediate variables. Biometrics 64, 3 (2008), 695–701.
- [5]
- [6] Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and Ilya Shpitser. 2023. An automated approach to causal inference in discrete settings. J. Amer. Statist. Assoc. (2023), 1–16.
- [7]
- [8] Robert Givan, Sonia Leach, and Thomas Dean. 2000. Bounded-parameter Markov decision processes. Artificial Intelligence 122, 1 (2000), 71–109. https://doi.org/10.1016/S0004-3702(00)00047-3
- [9] Dennis Gross, Nils Jansen, Sebastian Junges, and Guillermo A Pérez. 2022. COOL-MC: a comprehensive tool for reinforcement learning and model checking. In Dependable Software Engineering. Theories, Tools, and Applications: 8th International Symposium, SETTA 2022, Beijing, China, October 27-29, 2022, Proceedings. Springer, 41–49.
- [10] Joseph Y Halpern and Judea Pearl. 2005. Causes and explanations: A structural-model approach. Part II: Explanations. The British journal for the philosophy of science (2005).
- [11] Martin B Haugh and Raghav Singal. 2023. Bounding Counterfactuals in Hidden Markov Models and Beyond. Available at SSRN 4529724 (2023).
- [12] Changsung Kang and Jin Tian. 2006. Inequality constraints in causal models with hidden variables. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. 233–240.
- [13] Milad Kazemi, Jessica Lally, Ekaterina Tishchenko, Hana Chockler, and Nicola Paoletti. 2025. Counterfactual Influence in Markov Decision Processes. In Proceedings of the Fourth Conference on Causal Learning and Reasoning (Proceedings of Machine Learning Research, Vol. 275), Biwei Huang and Mathias Drton (Eds.). PMLR, 792–817. https://proceedings.mlr.pres...
- [14] Taylor W Killian, Marzyeh Ghassemi, and Shalmali Joshi. 2022. Counterfactually guided policy transfer in clinical settings. In Conference on Health, Inference, and Learning. PMLR, 5–31.
- [15] M. Kwiatkowska, G. Norman, and D. Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-time Systems. In Proc. 23rd International Conference on Computer Aided Verification (CAV’11) (LNCS, Vol. 6806), G. Gopalakrishnan and S. Qadeer (Eds.). Springer, 585–591.
- [16] Ang Li and Judea Pearl. 2024. Probabilities of causation with nonbinary treatment and effect. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 20465–20472.
- [17] Guy Lorberbom, Daniel D Johnson, Chris J Maddison, Daniel Tarlow, and Tamir Hazan. 2021. Learning generalized Gumbel-max causal mechanisms. Advances in Neural Information Processing Systems 34 (2021), 26792–26803.
- [18]
- [19] Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. Advances in Neural Information Processing Systems 27 (2014).
- [20] Charles F Manski. 1990. Nonparametric bounds on treatment effects. The American Economic Review 80, 2 (1990), 319–323.
- [21] Frederik Baymler Mathiesen, Morteza Lahijanian, and Luca Laurenti. 2024. IntervalMDP.jl: Accelerated Value Iteration for Interval Markov Decision Processes. IFAC-PapersOnLine 58, 11, 1–6. 8th IFAC Conference on Analysis and Design of Hybrid Systems ADHS 2024. https://doi.org/10.1016/j.ifacol.2024.07.416
- [22]
- [23] Kimia Noorbakhsh and Manuel Rodriguez. 2022. Counterfactual temporal point processes. Advances in Neural Information Processing Systems 35 (2022), 24810–24823.
- [24] Michael Oberst and David Sontag. 2019. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In ICML.
- [25] Judea Pearl. 2009. Causality (2nd ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511803161
- [26] Edoardo Pona, Milad Kazemi, Yali Du, David Watson, and Nicola Paoletti
- [27]
- [28] James M Robins. 1989. The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. Health service research methodology: a focus on AIDS (1989), 113–159.
- [29] Marnix Suilen, Thiago D Simão, David Parker, and Nils Jansen. 2022. Robust anytime learning of Markov decision processes. Advances in Neural Information Processing Systems 35 (2022), 28790–28802.
- [30] Yuewen Sun, Erli Wang, Biwei Huang, Chaochao Lu, Lu Feng, Changyin Sun, and Kun Zhang. 2024. ACAMDA: improving data efficiency in reinforcement learning through guided counterfactual data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 15193–15201.
- [31] Jin Tian and Judea Pearl. 2002. A general identification condition for causal effects. In AAAI/IAAI. 567–573.
- [32] Stratis Tsirtsis, Abir De, and Manuel Rodriguez. 2021. Counterfactual explanations in sequential decision making under uncertainty. Advances in Neural Information Processing Systems 34 (2021), 30127–30139.
- [33] Stratis Tsirtsis and Manuel Rodriguez. 2024. Finding counterfactually optimal action sequences in continuous state spaces. Advances in Neural Information Processing Systems 36 (2024).
- [34] Athanasios Vlontzos, Bernhard Kainz, and Ciarán M Gilligan-Lee. 2023. Estimating categorical counterfactuals via deep twin networks. Nature Machine Intelligence 5, 2 (2023), 159–168.
- [35] Marco Zaffalon, Alessandro Antonucci, and Rafael Cabañas. 2020. Structural causal models are (solvable by) credal networks. In International Conference on Probabilistic Graphical Models. PMLR, 581–592.
- [36] Marco Zaffalon, Alessandro Antonucci, and Rafael Cabañas. 2021. Causal Expectation-Maximisation. In WHY-21 Workshop.
- [37] Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, and David Huber. 2023. Approximating counterfactual bounds while fusing observational, biased and randomised data sources. International Journal of Approximate Reasoning 162 (2023), 109023.
- [38] Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, David Huber, and Dario Azzimonti. 2022. Bounding counterfactuals under selection bias. In International Conference on Probabilistic Graphical Models. PMLR, 289–300.
- [39] Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, David Huber, and Dario Azzimonti. 2024. Efficient computation of counterfactual bounds. International Journal of Approximate Reasoning (2024), 109111.
- [40] Junzhe Zhang, Jin Tian, and Elias Bareinboim. 2022. Partial Counterfactual Identification from Observational and Experimental Data. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.)...
[41]
Qingfu Zhu, Weinan Zhang, Ting Liu, and William Yang Wang. 2020. Counter- factual off-policy training for neural dialogue generation. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3438–3448. A GUMBEL-MAX SCMS The Gumbel-max SCM for an MDP is expressed as: 𝑆𝑡+1 =𝑓(𝑆 𝑡 , 𝐴𝑡 , 𝑈𝑡 =(𝐺 𝑠,𝑡 )𝑠∈ S )=arg max 𝑠∈ ...
work page 2020
-
[42]
or top-down Gumbel sampling [19]. We can define a so-calledcounterfactual MDP M𝜏 by solving the SCM (8) for each transition along an observed path 𝜏 in an MDP M. The counterfactual probability for each transition is defined, for𝑡=0, ..., 𝑇−1, as: 𝑃 M,𝑡,𝜏 (𝑠 ′ |𝑠, 𝑎)=𝑃(𝑠 ′ =arg max 𝑞∈ S log (𝑃 M (𝑞|𝑠, 𝑎) ) +𝐺 ′ 𝜏,𝑞,𝑡 ) ≈ 1 𝑁 𝑁∑︁ 𝑗=0 1 𝑠 ′ =arg max 𝑞∈ S n l...
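The Monte Carlo estimate described in this appendix fragment can be sketched as follows. It is a minimal sketch under stated assumptions, not the authors' code. It draws posterior Gumbel noise consistent with the observed transition using top-down sampling, then replays that noise under the counterfactual state-action pair. The function names, the list-based encoding of transition distributions, and the requirement that all probabilities are strictly positive are assumptions made for illustration.

```python
import math
import random
from typing import List, Optional


def sample_gumbel(rng: random.Random, loc: float = 0.0) -> float:
    """Draw one sample from a Gumbel distribution with location `loc`."""
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return loc - math.log(-math.log(u))


def posterior_gumbel_noise(rng: random.Random, log_probs: List[float],
                           observed: int) -> List[float]:
    """Top-down sampling of Gumbel noise consistent with the observed argmax."""
    # The maximum perturbed logit is Gumbel(logsumexp(log_probs)); for a
    # normalised distribution that location is 0.
    top = sample_gumbel(rng, loc=0.0)
    noise = []
    for q, lp in enumerate(log_probs):
        if q == observed:
            perturbed = top
        else:
            g = sample_gumbel(rng, loc=lp)
            # Truncate so the perturbed logit stays below the maximum.
            perturbed = -math.log(math.exp(-top) + math.exp(-g))
        noise.append(perturbed - lp)
    return noise


def counterfactual_transition_estimate(p_obs: List[float], observed: int,
                                       p_cf: List[float], n_samples: int = 1000,
                                       seed: Optional[int] = None) -> List[float]:
    """Estimate counterfactual next-state probabilities by Monte Carlo.

    p_obs: nominal distribution of the observed state-action pair
    observed: index of the next state that was actually observed
    p_cf: nominal distribution of the counterfactual state-action pair
    """
    rng = random.Random(seed)
    log_obs = [math.log(p) for p in p_obs]
    log_cf = [math.log(p) for p in p_cf]
    counts = [0] * len(p_cf)
    for _ in range(n_samples):
        noise = posterior_gumbel_noise(rng, log_obs, observed)
        scores = [lp + g for lp, g in zip(log_cf, noise)]
        counts[max(range(len(scores)), key=scores.__getitem__)] += 1
    return [c / n_samples for c in counts]


# Sanity check: replaying the observed pair itself should reproduce the
# observed next state (index 1) in essentially every sample.
same_pair = counterfactual_transition_estimate([0.2, 0.5, 0.3], 1, [0.2, 0.5, 0.3],
                                               n_samples=2000, seed=0)
```

Fixing one mechanism such as Gumbel-max yields a single point estimate per transition, which is the kind of single-model choice the interval construction is meant to avoid.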