arxiv: 2502.13731 · v5 · pith:TYTJWBSInew · submitted 2025-02-19 · 💻 cs.AI

Robust Counterfactual Inference in Markov Decision Processes

Pith reviewed 2026-05-25 08:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords counterfactual inference · Markov decision processes · causal models · robust policies · interval probabilities · non-parametric bounds · worst-case optimization

The pith

Non-parametric closed-form bounds compute tight ranges for counterfactual transitions in MDPs across all compatible causal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to bound counterfactual transition probabilities in MDPs without committing to one causal model among the many that fit the observed data. It derives closed-form expressions for the tightest possible bounds over the full set of compatible models. This replaces prior approaches that formulate optimization problems whose number of variables grows exponentially in the size of the MDP. A sympathetic reader would care because the resulting interval MDP supports policies that remain effective even under the worst-case probabilities within those bounds.

Core claim

We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities.
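To make the last step of the claim concrete, here is a minimal sketch of worst-case (robust) value iteration over an interval MDP, in the style of bounded-parameter MDP solvers. It is not the paper's implementation: the `bounds` layout, the state-based `reward`, the discount factor `gamma`, and the stationary (time-independent) intervals are all assumptions made for illustration.

```python
def worst_case_expectation(succ, values):
    """Minimise sum p(s') * values[s'] over lower <= p <= upper with sum p = 1.

    succ is a list of (next_state, lower, upper) triples (assumed layout).
    """
    # Start every successor at its lower bound.
    probs = {s2: lo for s2, lo, _ in succ}
    budget = 1.0 - sum(probs.values())
    # Push the remaining mass onto the lowest-value successors first.
    for s2, lo, hi in sorted(succ, key=lambda t: values[t[0]]):
        extra = min(hi - lo, budget)
        probs[s2] += extra
        budget -= extra
    return sum(p * values[s2] for s2, p in probs.items())


def robust_value_iteration(bounds, reward, gamma=0.9, iters=200):
    """bounds[s][a] -> list of (next_state, lower, upper); reward[s] -> float."""
    values = {s: 0.0 for s in bounds}
    for _ in range(iters):
        values = {
            s: max(reward[s] + gamma * worst_case_expectation(succ, values)
                   for succ in bounds[s].values())
            for s in bounds
        }
    policy = {
        s: max(bounds[s], key=lambda a: worst_case_expectation(bounds[s][a], values))
        for s in bounds
    }
    return values, policy


# Toy example with hypothetical numbers: states 1 and 2 are absorbing,
# and only state 1 pays a reward.
bounds = {
    0: {0: [(1, 0.4, 0.6), (2, 0.4, 0.6)],   # narrow intervals
        1: [(1, 0.0, 1.0), (2, 0.0, 1.0)]},  # uninformative intervals
    1: {0: [(1, 1.0, 1.0)]},
    2: {0: [(2, 1.0, 1.0)]},
}
reward = {0: 0.0, 1: 1.0, 2: 0.0}
values, policy = robust_value_iteration(bounds, reward)
print(policy[0])  # 0: the narrow-interval action has the better worst case
```

The inner minimisation needs no LP solver here because, for interval constraints, greedily assigning spare probability mass to the lowest-value successors is optimal.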

What carries the argument

Closed-form expressions for the tight bounds on counterfactual transition probabilities over every causal model consistent with the observational and interventional distributions.

If this is right

  • Bounds and policies can be computed for MDPs whose state-action spaces are too large for exponential-variable optimization.
  • The interval MDP encodes all counterfactual outcomes consistent with the data, so any policy chosen from it is valid under every compatible causal model.
  • Worst-case reward optimization inside the interval MDP produces policies whose performance is guaranteed against uncertainty in the counterfactuals.
  • Evaluation on case studies shows these policies outperform those derived from any single fixed causal model.
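One way to read the second bullet operationally: the counterfactual transition distribution produced by any single compatible causal model should fall inside the corresponding interval row. The sketch below checks that membership for one (state, action) pair. The function name, the dictionary layout, and the numbers are hypothetical.

```python
def lies_within(dist, intervals, tol=1e-9):
    """True if `dist` is a probability distribution inside `intervals`.

    dist:      {next_state: probability} for one (state, action) pair
    intervals: {next_state: (lower, upper)} for the same pair
    """
    # Reject mass placed on a successor the interval row does not list.
    if any(p > tol for s2, p in dist.items() if s2 not in intervals):
        return False
    # Reject anything that is not a probability distribution.
    if abs(sum(dist.values()) - 1.0) > tol:
        return False
    # Every listed successor must respect its bounds (missing means 0).
    return all(lo - tol <= dist.get(s2, 0.0) <= hi + tol
               for s2, (lo, hi) in intervals.items())


intervals = {0: (0.0, 1.0), 1: (0.0, 0.0), 2: (0.0, 1.0)}  # hypothetical
print(lies_within({0: 0.35, 2: 0.65}, intervals))  # True
print(lies_within({0: 0.5, 1: 0.5}, intervals))    # False
```

Membership is a necessary condition only: intervals constrain each successor separately, so a distribution can lie inside them without being produced by any compatible causal model.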

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The closed-form construction could be lifted to settings with partial observability if the compatibility constraints can be projected onto the observed variables.
  • The interval representation might be combined with existing robust MDP solvers to handle additional sources of uncertainty beyond the causal-model class.
  • Collecting more interventional data would shrink the interval width, offering a quantitative way to decide which experiments reduce counterfactual ambiguity most efficiently.
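The third extension can be made tangible with a small sketch that ranks state-action pairs by their widest counterfactual interval, as a crude proxy for where ambiguity is concentrated. The scoring rule (maximum width per pair) and the data layout are assumptions of this sketch, not something the paper proposes.

```python
from collections import defaultdict


def ambiguity_by_state_action(bounds):
    """bounds: {(s, a, s2): (lower, upper)} -> [((s, a), widest width)], widest first."""
    widest = defaultdict(float)
    for (s, a, _s2), (lo, hi) in bounds.items():
        widest[(s, a)] = max(widest[(s, a)], hi - lo)
    return sorted(widest.items(), key=lambda kv: kv[1], reverse=True)


bounds = {  # hypothetical interval counterfactual MDP entries
    (0, 0, 0): (0.0, 0.0), (0, 0, 1): (1.0, 1.0),
    (1, 0, 0): (0.2, 0.7), (1, 0, 2): (0.3, 0.8),
    (2, 1, 1): (0.4, 0.5), (2, 1, 2): (0.5, 0.6),
}
ranking = ambiguity_by_state_action(bounds)
for pair, width in ranking:
    print(pair, round(width, 3))
```

In this toy input the pair `(1, 0)` comes first, which would make it the first candidate for additional data under this heuristic.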

Load-bearing premise

The set of all causal models compatible with the observational and interventional distributions admits tight bounds that can be expressed in closed form without requiring exponential variables or post-hoc model selection.

What would settle it

On a small MDP where all compatible causal models can be enumerated, the closed-form interval either fails to contain the true range or is strictly wider than the range obtained from the full optimization.
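A sketch of the comparison this test calls for, assuming the counterfactual probability of one transition has already been computed under each enumerated model. The function name and inputs are hypothetical. If the compatible set is a continuum, the enumerated values only give an inner approximation of the true range, so a "not tight" verdict is then inconclusive.

```python
def check_interval(closed_form, enumerated, tol=1e-9):
    """Compare a closed-form interval with values from enumerated models.

    closed_form: (lower, upper) for one counterfactual transition probability
    enumerated:  that probability under each enumerated compatible model
    Returns (sound, tight): sound means the interval contains the whole
    enumerated range, tight means it also coincides with that range.
    """
    lo, hi = closed_form
    true_lo, true_hi = min(enumerated), max(enumerated)
    sound = lo <= true_lo + tol and true_hi <= hi + tol
    tight = sound and abs(lo - true_lo) <= tol and abs(hi - true_hi) <= tol
    return sound, tight


print(check_interval((0.0, 1.0), [0.0, 0.35, 1.0]))   # (True, True)
print(check_interval((0.0, 1.0), [0.3, 0.35, 0.4]))   # (True, False)
print(check_interval((0.36, 0.4), [0.35, 0.4]))       # (False, False)
```

The claim would be refuted by any transition returning `sound == False`, or `tight == False` when the enumeration is known to reach the extremes.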

Figures

Figures reproduced from arXiv: 2502.13731 by Jessica Lally, Milad Kazemi, Nicola Paoletti.

Figure 1: MDP causal graph. White nodes represent en…
Figure 2: Example MDP where Gumbel-max produces unin…
Figure 3: CF inference approaches for off-policy evaluation
Figure 4: Average instant reward of CF paths induced by policies on GridWorld
Figure 5: Average instant reward of CF paths induced by policies on GridWorld
Figure 6: Average instant reward of CF paths induced by policies on Sepsis.
Figure 7: Example MDP where Gumbel-Max produces unintuitive CF probabilities. The observed path is s0 → s1.

  s  a  s'  P(s'|s,a)  Optimisation (3)  Gumbel-Max (9)  Optimisation (3-6)
                       LB     UB                         LB     UB
  0  0  0   0.3        0.0    0.0        0.0             0.0    0.0
  0  0  1   0.4        1.0    1.0        1.0             1.0    1.0
  0  0  2   0.3        0.0    0.0        0.0             0.0    0.0
  1  0  0   0.4        0.0    1.0        0.35            0.4    0.4
  1  0  1   0.0        0.0    0.0        0.0             0.0    0.0
  1  0  2   0.6        0.0    1.0        0.65            0.6    0.6
  2  0  0   0.0        0.0    0.0        0.0             0.0    0.0
  2  0  1   0.0        0.0…

Figure 8: Average instant reward of CF paths induced by policies on Frozen Lake. Error bars denote the standard deviation in…
Figure 9: Average instant reward of CF paths induced by policies on Aircraft. Error bars denote the standard deviation in reward…
Original abstract

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
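The worst-case objective described in the abstract's penultimate sentence can be written out as follows. This is a sketch in assumed notation, not the paper's own formulation: a finite horizon T, a reward function R, and lower and upper counterfactual bounds written with under- and overlines are choices made here for illustration.

```latex
\pi^{*} \in \arg\max_{\pi} \; \min_{P \in \mathcal{P}} \;
  \mathbb{E}_{\pi, P}\!\left[ \sum_{t=0}^{T-1} R(s_t, a_t) \right],
\qquad
\mathcal{P} = \Bigl\{ P \;:\;
  \underline{P}_{t}(s' \mid s, a) \le P_{t}(s' \mid s, a) \le \overline{P}_{t}(s' \mid s, a),
  \;\; \sum_{s'} P_{t}(s' \mid s, a) = 1
  \;\; \text{for all } t, s, a \Bigr\}
```

The inner minimum ranges over every transition function inside the intervals, so the policy's guarantee holds for whichever of those transition functions is the true one.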

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes a non-parametric approach for computing tight closed-form bounds on counterfactual transition probabilities in MDPs over the set of all causal models compatible with the observational and interventional distributions. These bounds are used to construct an interval MDP, from which robust policies are derived by optimizing the worst-case reward. The method is claimed to be scalable (avoiding optimization problems with exponentially many variables) and is evaluated on case studies showing improved robustness over prior methods.

Significance. If the closed-form bounds are tight and cover the full class of compatible models, the work would provide a meaningful advance in scalable robust counterfactual inference for MDPs by sidestepping the computational intractability of prior optimization-based approaches. The non-parametric framing and case-study evaluations are strengths. The stress-test concern (that closed-form expressions may implicitly restrict the model class) does not apply to the manuscript: the derivation establishes the bounds directly from the compatible set without additional structural restrictions.

minor comments (2)
  1. [Abstract] The statement that the approach 'provides closed-form expressions' would be clearer if it briefly indicated the functional form or the key independence exploited to avoid exponential variables.
  2. [Case studies] The case-study section would benefit from explicit statements of the baseline methods' hyper-parameters and the precise definition of 'improved robustness' (e.g., which metric and how many runs).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the non-parametric closed-form bounds, and recommendation for minor revision. The evaluation on scalability and robustness is appreciated.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives closed-form bounds on counterfactual transition probabilities directly from the set of causal models compatible with given observational and interventional distributions in an MDP. The abstract and described method present this as a non-parametric computation that avoids exponential optimization variables, without any quoted reduction of the output bounds to fitted parameters, self-definitions, or load-bearing self-citations. The central claim of tight bounds and robust policies rests on independent analysis of model compatibility rather than renaming known results or smuggling ansatzes via prior work by the same authors. This leaves the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that multiple causal models are compatible with the same observational and interventional data and that their counterfactuals can be bounded tightly in closed form.

axioms (1)
  • domain assumption: Multiple causal models align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions.
    Explicitly stated in the abstract as the key limitation of existing methods.
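A toy illustration of why this assumption is plausible, using a single transition with a binary action and a binary next state. The two structural models below are hypothetical and are not taken from the paper. They induce the same interventional distribution but give opposite counterfactual answers.

```python
# Both models share one exogenous variable U, uniform on {0, 1}.
def model_a(a, u):
    return u        # next state ignores the action


def model_b(a, u):
    return a ^ u    # next state flips with the action


def interventional(model, a):
    """P(next state = 1 | do(a)) with U uniform on {0, 1}."""
    return sum(model(a, u) for u in (0, 1)) / 2


def counterfactual(model, a_obs, s_obs, a_cf):
    """P(next state = 1 under a_cf | action a_obs was observed to lead to s_obs)."""
    # Abduction: keep only the values of U consistent with the observation.
    consistent = [u for u in (0, 1) if model(a_obs, u) == s_obs]
    # Prediction: replay those values of U under the counterfactual action.
    return sum(model(a_cf, u) for u in consistent) / len(consistent)


for name, m in (("A", model_a), ("B", model_b)):
    print(name,
          interventional(m, 0), interventional(m, 1),
          counterfactual(m, a_obs=0, s_obs=1, a_cf=1))
# A 0.5 0.5 1.0
# B 0.5 0.5 0.0
```

No amount of interventional data separates the two models, yet the counterfactual probability is 1 under one and 0 under the other, which is the gap that interval bounds are meant to capture.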
