Computational Hardness of Static Distributionally Robust Markov Decision Processes

Yan Li

arxiv: 2511.02224 · v5 · submitted 2025-11-04 · 🧮 math.OC

Computational Hardness of Static Distributionally Robust Markov Decision Processes

Yan Li This is my paper

Pith reviewed 2026-05-18 01:54 UTC · model grok-4.3

classification 🧮 math.OC

keywords distributionally robust MDPsstatic formulationcomputational hardnessNP-hardnessMarkov policiesambiguity setstransition kernelspolicy optimization

0 comments

The pith

Finding optimal policies for static distributionally robust MDPs is NP-hard even when the ambiguity set contains only two transition kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that solving for the optimal policy in the static formulation of distributionally robust Markov decision processes is computationally intractable. The authors construct specific instances where the ambiguity set consists of exactly two transition kernels. For Markovian non-randomized policies, the problem of selecting the best policy reduces to an NP-hard task. For Markovian randomized policies, the robust value function contains strict local minimizers that are suboptimal, and locating the global optimum remains NP-hard. These findings indicate that exact policy optimization cannot be performed efficiently in the worst case for this class of robust sequential decision problems.

Core claim

The paper constructs specific problem instances of static DRMDPs with ambiguity sets of exactly two transition kernels. It proves that optimizing over Markovian non-randomized policies is NP-hard. Additionally, for Markovian randomized policies, the robust value function has sub-optimal strict local minimizers, rendering the optimization problem NP-hard as well.

What carries the argument

Reductions from known NP-hard problems to specially constructed MDPs whose policy choices encode the combinatorial decisions, using an ambiguity set limited to two transition kernels.

If this is right

Exact optimization over Markov policies cannot be performed efficiently in the worst case for these robust MDPs.
Optimization procedures for randomized policies may converge to strict local minima that are not globally optimal.
The computational barrier persists even when the uncertainty is represented by the smallest nontrivial ambiguity set.
Practical approaches must rely on approximation methods or reformulations that avoid direct policy search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar hardness may appear in related sequential decision problems under distributional robustness when ambiguity sets remain small but fixed.
Researchers could investigate whether relaxing the Markov policy restriction or considering average-case instances yields tractable subclasses.
The results point toward hybrid methods that combine robust value estimation with combinatorial search heuristics for small ambiguity sets.

Load-bearing premise

The NP-hardness reductions rely on specially constructed MDPs and ambiguity sets with exactly two transition kernels that may not capture the structure of static DRMDPs arising in practice.

What would settle it

A polynomial-time algorithm that computes the optimal policy for every static DRMDP whose ambiguity set contains exactly two transition kernels would falsify the hardness claim.

read the original abstract

We present some hardness results on finding the optimal policy for the static formulation of distributionally robust Markov decision processes. We construct problem instances such that when the considered policy class is Markovian and non-randomized, finding the optimal policy is NP-hard. When the considered policy class is Markovian and randomized, the robust value function possesses sub-optimal strict local minimizers, and finding the optimal policy is also NP-hard. The considered instances involve an ambiguity set with only two transition kernels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents hardness results for the static formulation of distributionally robust Markov decision processes. It constructs specific instances with an ambiguity set of two transition kernels to prove that finding the optimal Markovian non-randomized policy is NP-hard. For Markovian randomized policies, it shows that the robust value function has sub-optimal strict local minimizers and that finding the optimal policy is NP-hard.

Significance. If the reductions hold, this result is significant because it establishes NP-hardness for static DRMDPs even with the smallest non-trivial ambiguity set size. The explicit constructions provide concrete evidence and can serve as benchmarks for testing algorithms. The local minima observation for randomized policies points to potential issues with local search methods in this setting. These contributions advance the understanding of computational complexity in robust MDPs.

major comments (2)

[§3] §3: The NP-hardness reduction for non-randomized policies must explicitly verify that the mapping from the source NP-complete problem (such as 3-SAT) to the DRMDP instance is polynomial-time and that the worst-case selection over the two kernels preserves exact optimality equivalence without introducing extraneous structure that could make the problem easier.
[§4] §4: The demonstration that the robust value function possesses sub-optimal strict local minimizers for randomized policies requires an explicit policy construction (or derivative condition) showing a stationary point whose value is strictly below the global optimum; this is load-bearing for the claim that local optimization methods may fail.

minor comments (2)

[Preliminaries] The preliminaries section should include a short remark distinguishing the static DRMDP formulation from its dynamic counterpart to aid readers unfamiliar with the distinction.
[§3 and §4] A table summarizing the constructed MDP parameters (states, actions, two kernels) for the hardness instances would improve readability and allow easier verification of the reductions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive assessment of the significance of our results. We address the major comments point by point below, providing clarifications and indicating revisions where appropriate.

read point-by-point responses

Referee: [§3] The NP-hardness reduction for non-randomized policies must explicitly verify that the mapping from the source NP-complete problem (such as 3-SAT) to the DRMDP instance is polynomial-time and that the worst-case selection over the two kernels preserves exact optimality equivalence without introducing extraneous structure that could make the problem easier.

Authors: We appreciate this request for additional explicit verification. In the proof of Theorem 1 in Section 3, the reduction from 3-SAT is constructed with a number of states and actions that is linear in the number of variables and clauses, which is clearly polynomial-time. The two transition kernels are defined such that one corresponds to the 'satisfying' transitions and the other to 'unsatisfying', ensuring that the min over the two kernels exactly encodes the satisfiability without extraneous factors. To make this clearer, we will add a dedicated paragraph after the construction detailing the time complexity and the equivalence of optimality. revision: yes
Referee: [§4] The demonstration that the robust value function possesses sub-optimal strict local minimizers for randomized policies requires an explicit policy construction (or derivative condition) showing a stationary point whose value is strictly below the global optimum; this is load-bearing for the claim that local optimization methods may fail.

Authors: We agree that an explicit construction strengthens the claim. In Section 4, we present a specific DRMDP instance with two kernels and a randomized policy π that is shown to be a strict local minimizer by verifying that small perturbations increase the robust value. We compute the robust value for this policy and compare it to the optimal one, demonstrating it is strictly sub-optimal. To address the referee's point directly, we will include the explicit policy probabilities and the condition on the derivative (or subgradient) at that point to confirm it is stationary. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper establishes its NP-hardness results for static DRMDPs through explicit constructions of problem instances (MDPs with ambiguity sets containing exactly two transition kernels) that satisfy the problem definition by construction. These reductions directly produce valid instances on which optimal policy search is hard for both non-randomized and randomized Markovian policies, with the randomized case also exhibiting suboptimal strict local minima. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the hardness follows from the properties of the constructed instances without circular reference to the target claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard definitions of NP-hardness and the static DRMDP formulation; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (1)

standard math NP-hardness is defined via polynomial-time reduction from a known NP-hard problem
Invoked implicitly when stating that finding the optimal policy is NP-hard

pith-pipeline@v0.9.0 · 5588 in / 1183 out tokens · 27946 ms · 2026-05-18T01:54:53.064083+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

work page 2021
[2]

Gabow, Shachindra N Maheshwari, and Leon J

Harold N. Gabow, Shachindra N Maheshwari, and Leon J. Osterweil. On two problems in the generation of program test paths.IEEE Transactions on Software Engineering, (3):227–231, 1976

work page 1976
[3]

G.N. Iyengar. Robust Dynamic Programming.Mathematics of Operations Research, 30:257–280, 2005

work page 2005
[4]

PhD thesis, Massachusetts Institute of Technology, 2007

Yann Le Tallec.Robust, risk-sensitive, and data-driven control of Markov decision processes. PhD thesis, Massachusetts Institute of Technology, 2007

work page 2007
[5]

Rectangularity and duality of distributionally robust markov decision processes.arXiv preprint arXiv:2308.11139, 2023

Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust markov decision processes.arXiv preprint arXiv:2308.11139, 2023

work page arXiv 2023
[6]

Nilim and L

A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition probabilities.Operations Research, 53:780–798, 2005

work page 2005
[7]

Multi-model markov decision processes

Lauren N Steimle, David L Kaufman, and Brian T Denton. Multi-model markov decision processes. IISE Transactions, 53(10):1124–1139, 2021. 4

work page 2021
[8]

Le Tallec

Y. Le Tallec. Robust, Risk-Sensitive, and Data-Driven Control of Markov Decision Processes. Ph.D. thesis, Massachusetts Institute of Technology. Cambridge, MA, 2007

work page 2007
[9]

Mesures dans les espaces produits.Atti Accad

CT Ionescu Tulcea. Mesures dans les espaces produits.Atti Accad. Naz. Lincei Rend, 7:208–211, 1949

work page 1949
[10]

Robust markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013

Wolfram Wiesemann, Daniel Kuhn, and Ber¸ c Rustem. Robust markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013. 5

work page 2013

[1] [1]

On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021

work page 2021

[2] [2]

Gabow, Shachindra N Maheshwari, and Leon J

Harold N. Gabow, Shachindra N Maheshwari, and Leon J. Osterweil. On two problems in the generation of program test paths.IEEE Transactions on Software Engineering, (3):227–231, 1976

work page 1976

[3] [3]

G.N. Iyengar. Robust Dynamic Programming.Mathematics of Operations Research, 30:257–280, 2005

work page 2005

[4] [4]

PhD thesis, Massachusetts Institute of Technology, 2007

Yann Le Tallec.Robust, risk-sensitive, and data-driven control of Markov decision processes. PhD thesis, Massachusetts Institute of Technology, 2007

work page 2007

[5] [5]

Rectangularity and duality of distributionally robust markov decision processes.arXiv preprint arXiv:2308.11139, 2023

Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust markov decision processes.arXiv preprint arXiv:2308.11139, 2023

work page arXiv 2023

[6] [6]

Nilim and L

A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition probabilities.Operations Research, 53:780–798, 2005

work page 2005

[7] [7]

Multi-model markov decision processes

Lauren N Steimle, David L Kaufman, and Brian T Denton. Multi-model markov decision processes. IISE Transactions, 53(10):1124–1139, 2021. 4

work page 2021

[8] [8]

Le Tallec

Y. Le Tallec. Robust, Risk-Sensitive, and Data-Driven Control of Markov Decision Processes. Ph.D. thesis, Massachusetts Institute of Technology. Cambridge, MA, 2007

work page 2007

[9] [9]

Mesures dans les espaces produits.Atti Accad

CT Ionescu Tulcea. Mesures dans les espaces produits.Atti Accad. Naz. Lincei Rend, 7:208–211, 1949

work page 1949

[10] [10]

Robust markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013

Wolfram Wiesemann, Daniel Kuhn, and Ber¸ c Rustem. Robust markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013. 5

work page 2013