Robust Counterfactual Inference in Markov Decision Processes

Jessica Lally; Milad Kazemi; Nicola Paoletti

arxiv: 2502.13731 · v5 · pith:TYTJWBSInew · submitted 2025-02-19 · 💻 cs.AI

Pith reviewed 2026-05-25 08:09 UTC · model grok-4.3

classification 💻 cs.AI

keywords: counterfactual inference, Markov decision processes, causal models, robust policies, interval probabilities, non-parametric bounds, worst-case optimization

The pith

Non-parametric closed-form bounds compute tight ranges for counterfactual transitions in MDPs across all compatible causal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to bound counterfactual transition probabilities in MDPs without selecting one causal model among the many that fit the observed data. It derives closed-form expressions for the tightest possible bounds over the full set of compatible models. This replaces prior approaches that formulate large optimization problems whose size grows exponentially with the MDP. A sympathetic reader would care because the resulting interval MDP supports policies that remain effective even under the worst-case probabilities within those bounds.

Core claim

We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities.

What carries the argument

Closed-form expressions for the tight bounds on counterfactual transition probabilities over every causal model consistent with the observational and interventional distributions.
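One way to write down what "tight" means here, in notation that is assumed for illustration and not taken from the paper: let $\mathcal{C}$ denote the set of causal models compatible with the MDP's observational and interventional distributions, and let $\tilde{P}_{C}(s' \mid s, a)$ be the counterfactual transition probability that model $C$ assigns given the observed path. The bounds are then the extremes over $\mathcal{C}$:

```latex
\underline{P}(s' \mid s, a) \;=\; \inf_{C \in \mathcal{C}} \tilde{P}_{C}(s' \mid s, a),
\qquad
\overline{P}(s' \mid s, a) \;=\; \sup_{C \in \mathcal{C}} \tilde{P}_{C}(s' \mid s, a).
```

Under this reading, the paper's claim is that both quantities admit closed-form expressions, so neither extreme has to be found by a numerical optimisation over $\mathcal{C}$.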

If this is right

Bounds and policies can be computed for MDPs whose state-action spaces are too large for exponential-variable optimization.
The interval MDP encodes all counterfactual outcomes consistent with the data, so any policy chosen from it is valid under every compatible causal model.
Worst-case reward optimization inside the interval MDP produces policies whose performance is guaranteed against uncertainty in the counterfactuals.
Evaluation on case studies shows these policies outperform those derived from any single fixed causal model.
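The worst-case optimisation step can be illustrated with a generic robust value iteration over an interval MDP. This is a minimal sketch of the standard technique for interval MDPs, not the authors' implementation; the function names, the data layout (`P[s][a]` as a list of `(lower, upper)` intervals, `R[s][a]` as a scalar reward), and the toy two-state example are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound on one transition probability


def worst_case_distribution(intervals: Sequence[Interval],
                            values: Sequence[float]) -> List[float]:
    """Return the distribution inside the intervals that minimises expected value.

    Assumes the intervals are consistent: sum(lower) <= 1 <= sum(upper).
    """
    probs = [lo for lo, _ in intervals]
    budget = 1.0 - sum(probs)  # probability mass still to be assigned
    # Hand the spare mass to the lowest-value successors first.
    for i in sorted(range(len(values)), key=lambda j: values[j]):
        extra = min(intervals[i][1] - probs[i], budget)
        probs[i] += extra
        budget -= extra
    return probs


def robust_q(intervals, reward, values, gamma):
    """Worst-case one-step return of a state-action pair."""
    probs = worst_case_distribution(intervals, values)
    return reward + gamma * sum(p * v for p, v in zip(probs, values))


def robust_value_iteration(P, R, gamma=0.9, iters=500):
    """Maximise over actions while an adversary minimises over the intervals."""
    n = len(P)
    values = [0.0] * n
    for _ in range(iters):
        values = [max(robust_q(P[s][a], R[s][a], values, gamma)
                      for a in range(len(P[s])))
                  for s in range(n)]
    policy = [max(range(len(P[s])),
                  key=lambda a, s=s: robust_q(P[s][a], R[s][a], values, gamma))
              for s in range(n)]
    return values, policy


# Hypothetical toy model: state 1 is absorbing and pays reward 1 per step.
P = [
    [[(0.2, 0.8), (0.2, 0.8)], [(1.0, 1.0), (0.0, 0.0)]],  # state 0, actions 0 and 1
    [[(0.0, 0.0), (1.0, 1.0)]],                            # state 1, single action
]
R = [[0.0, 0.0], [1.0]]
values, policy = robust_value_iteration(P, R)
```

The inner minimisation is solved greedily because the uncertainty set is a product of intervals intersected with the probability simplex; a dedicated interval-MDP solver would be the natural choice for larger models.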

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The closed-form construction could be lifted to settings with partial observability if the compatibility constraints can be projected onto the observed variables.
The interval representation might be combined with existing robust MDP solvers to handle additional sources of uncertainty beyond the causal-model class.
Collecting more interventional data would shrink the interval width, offering a quantitative way to decide which experiments reduce counterfactual ambiguity most efficiently.

Load-bearing premise

The set of all causal models compatible with the observational and interventional distributions admits tight bounds that can be expressed in closed form without requiring exponential variables or post-hoc model selection.

What would settle it

On a small MDP where all compatible causal models can be enumerated, the closed-form interval either fails to contain the true range or is strictly wider than the range obtained from the full optimization.
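The test described above reduces to a per-transition comparison between the closed-form interval and the range obtained by enumeration. The helper below is a hypothetical sketch of that comparison; the function name, the labels it returns, and the tolerance are assumptions, and the numbers in the usage lines are made up for illustration.

```python
def check_interval(closed_form, enumerated, tol=1e-9):
    """Compare a closed-form interval with counterfactual probabilities
    enumerated over every compatible causal model.

    Returns "unsound" if some enumerated value falls outside the interval,
    "loose" if the interval is strictly wider than the enumerated range,
    and "tight" otherwise.
    """
    lo, hi = closed_form
    true_lo, true_hi = min(enumerated), max(enumerated)
    if true_lo < lo - tol or true_hi > hi + tol:
        return "unsound"
    if lo < true_lo - tol or hi > true_hi + tol:
        return "loose"
    return "tight"


# Illustrative values only, not results from the paper.
print(check_interval((0.2, 0.7), [0.2, 0.45, 0.7]))  # interval matches the range
print(check_interval((0.0, 1.0), [0.2, 0.45, 0.7]))  # interval wider than the range
print(check_interval((0.3, 0.7), [0.2, 0.45, 0.7]))  # interval misses a value
```

Either of the last two outcomes on any transition would count against the tightness claim; "unsound" would additionally undermine the robustness guarantee.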

Figures

Figures reproduced from arXiv: 2502.13731 by Jessica Lally, Milad Kazemi, Nicola Paoletti.

**Figure 2.** Example MDP where Gumbel-max produces unin

**Figure 3.** CF inference approaches for off-policy evaluation

**Figure 4.** Average instant reward of CF paths induced by policies on GridWorld

**Figure 5.** Average instant reward of CF paths induced by policies on GridWorld

**Figure 6.** Average instant reward of CF paths induced by policies on Sepsis.

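**Figure 7.** Example MDP where Gumbel-Max produces unintuitive CF probabilities. The observed path is 𝑠0 → 𝑠1.

| 𝑠 | 𝑎 | 𝑠′ | 𝑃(𝑠′ \| 𝑠, 𝑎) | Optimisation (3) LB | Optimisation (3) UB | GumbelMax (9) | Optimisation (3-6) LB | Optimisation (3-6) UB |
|---|---|----|----------------|---------------------|---------------------|---------------|-----------------------|-----------------------|
| 0 | 0 | 0 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0 | 0 | 1 | 0.4 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 0 | 0 | 2 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 0 | 0.4 | 0.0 | 1.0 | 0.35 | 0.4 | 0.4 |
| 1 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 2 | 0.6 | 0.0 | 1.0 | 0.65 | 0.6 | 0.6 |
| 2 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 1 | 0.0 | 0.0 | … | | | |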

**Figure 8.** Average instant reward of CF paths induced by policies on Frozen Lake. Error bars denote the standard deviation in

**Figure 9.** Average instant reward of CF paths induced by policies on Aircraft. Error bars denote the standard deviation in reward

Original abstract

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives closed-form bounds on counterfactual transitions across compatible models in MDPs, which is new if the bounds really are tight without extra restrictions.

The main takeaway is that they have a non-parametric way to bound counterfactual transition probabilities over the full set of causal models consistent with the observational and interventional data, then use those interval probabilities to find worst-case robust policies. The closed-form expressions are the part that sets this apart from prior work that either fixes one model or runs large optimizations whose size grows with the MDP. That efficiency angle is the practical contribution if it holds. They handle the identifiability problem head-on and show how to move from the bounds to a policy optimization step, which is a clean framing. The case studies are presented as evidence of better robustness than baselines.

The soft spot is the stress-test concern: closed forms usually require some structural assumptions on how mechanisms can vary while matching the observed distributions. If the derivation relies on any such restrictions, the interval MDP will not contain every possible counterfactual behavior, and the robustness guarantee is weaker than stated. The abstract gives no derivation or tightness check, so it is not possible to tell from the given material whether the bounds are truly over all compatible models or only a subset.

This paper is for researchers working on causal reinforcement learning and robust MDPs who already care about identifiability. A reader who needs scalable ways to handle causal uncertainty would get something usable from it, provided the model-class question is resolved.

It deserves a serious referee because the problem is real and the proposed direction differs from existing approaches. Send it to peer review and ask reviewers to verify whether the closed forms cover the entire set of compatible models without unstated limits.

Referee Report

0 major / 2 minor

Summary. The paper proposes a non-parametric approach for computing tight closed-form bounds on counterfactual transition probabilities in MDPs over the set of all causal models compatible with observed observational and interventional distributions. These bounds are used to construct an interval MDP, from which robust policies are derived by optimizing the worst-case reward. The method is claimed to be scalable (avoiding exponential variables in optimization) and is evaluated on case studies showing improved robustness over prior methods.

Significance. If the closed-form bounds are tight and cover the full class of compatible models, the work would provide a meaningful advance in scalable robust counterfactual inference for MDPs by sidestepping the computational intractability of prior optimization-based approaches. The non-parametric framing and case-study evaluations are strengths. The stress-test concern (that closed-form expressions may implicitly restrict the model class) does not land on the manuscript: the derivation establishes the bounds directly from the compatible set without additional structural restrictions.

minor comments (2)

[Abstract] The statement that the approach 'provides closed-form expressions' would be clearer if it briefly indicated the functional form or the key independence exploited to avoid exponential variables.
[Case studies] The case-study section would benefit from explicit statements of the baseline methods' hyper-parameters and a precise definition of 'improved robustness' (e.g., which metric and how many runs).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the non-parametric closed-form bounds, and recommendation for minor revision. The evaluation on scalability and robustness is appreciated.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper derives closed-form bounds on counterfactual transition probabilities directly from the set of causal models compatible with given observational and interventional distributions in an MDP. The abstract and described method present this as a non-parametric computation that avoids exponential optimization variables, without any quoted reduction of the output bounds to fitted parameters, self-definitions, or load-bearing self-citations. The central claim of tight bounds and robust policies rests on independent analysis of model compatibility rather than renaming known results or smuggling ansatzes via prior work by the same authors. This leaves the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that multiple causal models are compatible with the same observational and interventional data and that their counterfactuals can be bounded tightly in closed form.

axioms (1)

Domain assumption: Multiple causal models align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions.
Explicitly stated in the abstract as the key limitation of existing methods.

pith-pipeline@v0.9.0 · 5689 in / 1165 out tokens · 45033 ms · 2026-05-25T08:09:00.957436+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

[1] Alexander Balke and Judea Pearl. 1994. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty in Artificial Intelligence. Elsevier, 46–54.

[2] Nina L Corvelo Benz and Manuel Gomez Rodriguez. 2022. Counterfactual inference of second opinions. In Uncertainty in Artificial Intelligence. PMLR, 453–463.

[3] Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. 2018. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint arXiv:1811.06272 (2018).

[4] Zhihong Cai, Manabu Kuroki, Judea Pearl, and Jin Tian. 2008. Bounds on direct effects in the presence of confounded intermediate variables. Biometrics 64, 3 (2008), 695–701.

[5] Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, and Manuel Gomez-Rodriguez. 2024. Counterfactual token generation in large language models. arXiv preprint arXiv:2409.17027 (2024).

[6] Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and Ilya Shpitser. 2023. An automated approach to causal inference in discrete settings. J. Amer. Statist. Assoc. (2023), 1–16.

[7] Jasmina Gajcin and Ivana Dusparic. 2024. ACTER: Diverse and Actionable Counterfactual Sequences for Explaining and Diagnosing RL Policies. arXiv preprint arXiv:2402.06503 (2024).

[8] Robert Givan, Sonia Leach, and Thomas Dean. 2000. Bounded-parameter Markov decision processes. Artificial Intelligence 122, 1 (2000), 71–109. https://doi.org/10.1016/S0004-3702(00)00047-3

[9] Dennis Gross, Nils Jansen, Sebastian Junges, and Guillermo A Pérez. 2022. COOL-MC: a comprehensive tool for reinforcement learning and model checking. In Dependable Software Engineering. Theories, Tools, and Applications: 8th International Symposium, SETTA 2022, Beijing, China, October 27-29, 2022, Proceedings. Springer, 41–49.

[10] Joseph Y Halpern and Judea Pearl. 2005. Causes and explanations: A structural-model approach. Part II: Explanations. The British journal for the philosophy of science (2005).

[11] Martin B Haugh and Raghav Singal. 2023. Bounding Counterfactuals in Hidden Markov Models and Beyond. Available at SSRN 4529724 (2023).

[12] Changsung Kang and Jin Tian. 2006. Inequality constraints in causal models with hidden variables. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. 233–240.

[13] Milad Kazemi, Jessica Lally, Ekaterina Tishchenko, Hana Chockler, and Nicola Paoletti. 2025. Counterfactual Influence in Markov Decision Processes. In Proceedings of the Fourth Conference on Causal Learning and Reasoning (Proceedings of Machine Learning Research, Vol. 275), Biwei Huang and Mathias Drton (Eds.). PMLR, 792–817. https://proceedings.mlr.pres...

[14] Taylor W Killian, Marzyeh Ghassemi, and Shalmali Joshi. 2022. Counterfactually guided policy transfer in clinical settings. In Conference on Health, Inference, and Learning. PMLR, 5–31.

[15] M. Kwiatkowska, G. Norman, and D. Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-time Systems. In Proc. 23rd International Conference on Computer Aided Verification (CAV'11) (LNCS, Vol. 6806), G. Gopalakrishnan and S. Qadeer (Eds.). Springer, 585–591.

[16] Ang Li and Judea Pearl. 2024. Probabilities of causation with nonbinary treatment and effect. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 20465–20472.

[17] Guy Lorberbom, Daniel D Johnson, Chris J Maddison, Daniel Tarlow, and Tamir Hazan. 2021. Learning generalized Gumbel-max causal mechanisms. Advances in Neural Information Processing Systems 34 (2021), 26792–26803.

[18] Chaochao Lu, Biwei Huang, Ke Wang, José Miguel Hernández-Lobato, Kun Zhang, and Bernhard Schölkopf. 2020. Sample-efficient reinforcement learning via counterfactual-based data augmentation. arXiv preprint arXiv:2012.09092 (2020).

[19] Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. Advances in Neural Information Processing Systems 27 (2014).

[20] Charles F Manski. 1990. Nonparametric bounds on treatment effects. The American Economic Review 80, 2 (1990), 319–323.

[21] Frederik Baymler Mathiesen, Morteza Lahijanian, and Luca Laurenti. 2024. IntervalMDP.jl: Accelerated Value Iteration for Interval Markov Decision Processes. IFAC-PapersOnLine 58, 11, 1–6. https://doi.org/10.1016/j.ifacol.2024.07.416 8th IFAC Conference on Analysis and Design of Hybrid Systems ADHS 2024.

[22] Arash Nasr-Esfahany and Emre Kiciman. 2023. Counterfactual (non-)identifiability of learned structural causal models. arXiv preprint arXiv:2301.09031 (2023).

[23] Kimia Noorbakhsh and Manuel Rodriguez. 2022. Counterfactual temporal point processes. Advances in Neural Information Processing Systems 35 (2022), 24810–24823.

[24] Michael Oberst and David Sontag. 2019. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In ICML.

[25] Judea Pearl. 2009. Causality (2nd ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511803161

[26, 27] Edoardo Pona, Milad Kazemi, Yali Du, David Watson, and Nicola Paoletti. Abstract Counterfactuals for Language Model Agents. arXiv preprint arXiv:2506.02946 (2025).

[28] James M Robins. 1989. The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. Health service research methodology: a focus on AIDS (1989), 113–159.

[29] Marnix Suilen, Thiago D Simão, David Parker, and Nils Jansen. 2022. Robust anytime learning of Markov decision processes. Advances in Neural Information Processing Systems 35 (2022), 28790–28802.

[30] Yuewen Sun, Erli Wang, Biwei Huang, Chaochao Lu, Lu Feng, Changyin Sun, and Kun Zhang. 2024. ACAMDA: improving data efficiency in reinforcement learning through guided counterfactual data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 15193–15201.

[31] Jin Tian and Judea Pearl. 2002. A general identification condition for causal effects. In AAAI/IAAI. 567–573.

[32] Stratis Tsirtsis, Abir De, and Manuel Rodriguez. 2021. Counterfactual explanations in sequential decision making under uncertainty. Advances in Neural Information Processing Systems 34 (2021), 30127–30139.

[33] Stratis Tsirtsis and Manuel Rodriguez. 2024. Finding counterfactually optimal action sequences in continuous state spaces. Advances in Neural Information Processing Systems 36 (2024).

[34] Athanasios Vlontzos, Bernhard Kainz, and Ciarán M Gilligan-Lee. 2023. Estimating categorical counterfactuals via deep twin networks. Nature Machine Intelligence 5, 2 (2023), 159–168.

[35] Marco Zaffalon, Alessandro Antonucci, and Rafael Cabañas. 2020. Structural causal models are (solvable by) credal networks. In International Conference on Probabilistic Graphical Models. PMLR, 581–592.

[36] Marco Zaffalon, Alessandro Antonucci, and Rafael Cabañas. 2021. Causal Expectation-Maximisation. In WHY-21 Workshop.

[37] Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, and David Huber. 2023. Approximating counterfactual bounds while fusing observational, biased and randomised data sources. International Journal of Approximate Reasoning 162 (2023), 109023.

[38] Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, David Huber, and Dario Azzimonti. 2022. Bounding counterfactuals under selection bias. In International Conference on Probabilistic Graphical Models. PMLR, 289–300.

[39] Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, David Huber, and Dario Azzimonti. 2024. Efficient computation of counterfactual bounds. International Journal of Approximate Reasoning (2024), 109111.

[40] Junzhe Zhang, Jin Tian, and Elias Bareinboim. 2022. Partial Counterfactual Identification from Observational and Experimental Data. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.)...

[41] Qingfu Zhu, Weinan Zhang, Ting Liu, and William Yang Wang. 2020. Counterfactual off-policy training for neural dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3438–3448.

A GUMBEL-MAX SCMS. The Gumbel-max SCM for an MDP is expressed as: $S_{t+1} = f(S_t, A_t, U_t = (G_{s,t})_{s \in \mathcal{S}}) = \arg\max_{s \in \mathcal{S}} \ldots$

[42] … or top-down Gumbel sampling [19]. We can define a so-called counterfactual MDP $\mathcal{M}_\tau$ by solving the SCM (8) for each transition along an observed path $\tau$ in an MDP $\mathcal{M}$. The counterfactual probability for each transition is defined, for $t = 0, \ldots, T-1$, as: $P_{\mathcal{M},t,\tau}(s' \mid s, a) = P\big(s' = \arg\max_{q \in \mathcal{S}} \{\log(P_{\mathcal{M}}(q \mid s, a)) + G'_{\tau,q,t}\}\big) \approx \frac{1}{N} \sum_{j=0}^{N} \mathbb{1}_{s' = \arg\max_{q \in \mathcal{S}} \ldots}$
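The appendix fragment above estimates Gumbel-max counterfactual probabilities by Monte Carlo over Gumbel noise sampled consistently with the observed transition. The sketch below illustrates that idea with the standard top-down sampling scheme, assuming unit-scale Gumbel noise; it is not the authors' code, and the function names, argument layout, sample count, and seed are assumptions made for illustration.

```python
import math
import random


def sample_gumbel(loc, rng):
    """Draw from Gumbel(loc, 1) by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc - math.log(-math.log(u))


def posterior_noise(p_obs, observed, rng):
    """Sample Gumbel noise under which `observed` wins the arg max (top-down scheme)."""
    assert p_obs[observed] > 0.0, "the observed transition must have positive probability"
    top = sample_gumbel(0.0, rng)  # the maximum is Gumbel(log sum(p_obs)) = Gumbel(0)
    noise = []
    for i, p in enumerate(p_obs):
        if p == 0.0:
            # Zero-probability state: the observation does not constrain its noise.
            noise.append(sample_gumbel(0.0, rng))
            continue
        if i == observed:
            shifted = top
        else:
            g = sample_gumbel(math.log(p), rng)
            # Truncate at `top`: -log(exp(-top) + exp(-g)), computed stably.
            m = max(-top, -g)
            shifted = -(m + math.log(math.exp(-top - m) + math.exp(-g - m)))
        noise.append(shifted - math.log(p))
    return noise


def gumbel_max_counterfactual(p_obs, observed, q_cf, n_samples=2000, seed=0):
    """Monte Carlo estimate of the counterfactual next-state distribution.

    p_obs:    nominal distribution of the observed state-action pair
    observed: index of the next state that was actually observed
    q_cf:     nominal distribution of the counterfactual state-action pair
    """
    rng = random.Random(seed)
    counts = [0] * len(q_cf)
    for _ in range(n_samples):
        noise = posterior_noise(p_obs, observed, rng)
        scores = [math.log(q) + g if q > 0.0 else float("-inf")
                  for q, g in zip(q_cf, noise)]
        counts[scores.index(max(scores))] += 1
    return [c / n_samples for c in counts]


# Hypothetical usage: same state-action pair as the observation.
estimate = gumbel_max_counterfactual([0.3, 0.4, 0.3], 1, [0.3, 0.4, 0.3])
```

When the counterfactual state-action pair coincides with the observed one, every sample reproduces the observed next state, so the estimate collapses onto it. For a different pair the result is a single point estimate, which is exactly the model-specific answer that the interval bounds are meant to replace.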
