Computational Hardness of Static Distributionally Robust Markov Decision Processes
Pith reviewed 2026-05-18 01:54 UTC · model grok-4.3
The pith
Finding optimal policies for static distributionally robust MDPs is NP-hard even when the ambiguity set contains only two transition kernels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper constructs specific problem instances of static DRMDPs with ambiguity sets of exactly two transition kernels. It proves that optimizing over Markovian non-randomized policies is NP-hard. Additionally, for Markovian randomized policies, the robust value function has sub-optimal strict local minimizers, rendering the optimization problem NP-hard as well.
What carries the argument
Reductions from known NP-hard problems to specially constructed MDPs whose policy choices encode the combinatorial decisions, using an ambiguity set limited to two transition kernels.
If this is right
- Exact optimization over Markov policies cannot be performed efficiently in the worst case for these robust MDPs.
- Optimization procedures for randomized policies may converge to strict local minima that are not globally optimal.
- The computational barrier persists even when the uncertainty is represented by the smallest nontrivial ambiguity set.
- Practical approaches must rely on approximation methods or reformulations that avoid direct policy search.
Where Pith is reading between the lines
- Similar hardness may appear in related sequential decision problems under distributional robustness when ambiguity sets remain small but fixed.
- Researchers could investigate whether relaxing the Markov policy restriction or considering average-case instances yields tractable subclasses.
- The results point toward hybrid methods that combine robust value estimation with combinatorial search heuristics for small ambiguity sets.
Load-bearing premise
The NP-hardness reductions rely on specially constructed MDPs and ambiguity sets with exactly two transition kernels that may not capture the structure of static DRMDPs arising in practice.
What would settle it
A polynomial-time algorithm that computes the optimal policy for every static DRMDP whose ambiguity set contains exactly two transition kernels would falsify the hardness claim.
read the original abstract
We present some hardness results on finding the optimal policy for the static formulation of distributionally robust Markov decision processes. We construct problem instances such that when the considered policy class is Markovian and non-randomized, finding the optimal policy is NP-hard. When the considered policy class is Markovian and randomized, the robust value function possesses sub-optimal strict local minimizers, and finding the optimal policy is also NP-hard. The considered instances involve an ambiguity set with only two transition kernels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents hardness results for the static formulation of distributionally robust Markov decision processes. It constructs specific instances with an ambiguity set of two transition kernels to prove that finding the optimal Markovian non-randomized policy is NP-hard. For Markovian randomized policies, it shows that the robust value function has sub-optimal strict local minimizers and that finding the optimal policy is NP-hard.
Significance. If the reductions hold, this result is significant because it establishes NP-hardness for static DRMDPs even with the smallest non-trivial ambiguity set size. The explicit constructions provide concrete evidence and can serve as benchmarks for testing algorithms. The local minima observation for randomized policies points to potential issues with local search methods in this setting. These contributions advance the understanding of computational complexity in robust MDPs.
major comments (2)
- [§3] §3: The NP-hardness reduction for non-randomized policies must explicitly verify that the mapping from the source NP-complete problem (such as 3-SAT) to the DRMDP instance is polynomial-time and that the worst-case selection over the two kernels preserves exact optimality equivalence without introducing extraneous structure that could make the problem easier.
- [§4] §4: The demonstration that the robust value function possesses sub-optimal strict local minimizers for randomized policies requires an explicit policy construction (or derivative condition) showing a stationary point whose value is strictly below the global optimum; this is load-bearing for the claim that local optimization methods may fail.
minor comments (2)
- [Preliminaries] The preliminaries section should include a short remark distinguishing the static DRMDP formulation from its dynamic counterpart to aid readers unfamiliar with the distinction.
- [§3 and §4] A table summarizing the constructed MDP parameters (states, actions, two kernels) for the hardness instances would improve readability and allow easier verification of the reductions.
Simulated Author's Rebuttal
We thank the referee for their thorough review and positive assessment of the significance of our results. We address the major comments point by point below, providing clarifications and indicating revisions where appropriate.
read point-by-point responses
-
Referee: [§3] The NP-hardness reduction for non-randomized policies must explicitly verify that the mapping from the source NP-complete problem (such as 3-SAT) to the DRMDP instance is polynomial-time and that the worst-case selection over the two kernels preserves exact optimality equivalence without introducing extraneous structure that could make the problem easier.
Authors: We appreciate this request for additional explicit verification. In the proof of Theorem 1 in Section 3, the reduction from 3-SAT is constructed with a number of states and actions that is linear in the number of variables and clauses, which is clearly polynomial-time. The two transition kernels are defined such that one corresponds to the 'satisfying' transitions and the other to 'unsatisfying', ensuring that the min over the two kernels exactly encodes the satisfiability without extraneous factors. To make this clearer, we will add a dedicated paragraph after the construction detailing the time complexity and the equivalence of optimality. revision: yes
-
Referee: [§4] The demonstration that the robust value function possesses sub-optimal strict local minimizers for randomized policies requires an explicit policy construction (or derivative condition) showing a stationary point whose value is strictly below the global optimum; this is load-bearing for the claim that local optimization methods may fail.
Authors: We agree that an explicit construction strengthens the claim. In Section 4, we present a specific DRMDP instance with two kernels and a randomized policy π that is shown to be a strict local minimizer by verifying that small perturbations increase the robust value. We compute the robust value for this policy and compare it to the optimal one, demonstrating it is strictly sub-optimal. To address the referee's point directly, we will include the explicit policy probabilities and the condition on the derivative (or subgradient) at that point to confirm it is stationary. revision: yes
Circularity Check
No significant circularity
full rationale
The paper establishes its NP-hardness results for static DRMDPs through explicit constructions of problem instances (MDPs with ambiguity sets containing exactly two transition kernels) that satisfy the problem definition by construction. These reductions directly produce valid instances on which optimal policy search is hard for both non-randomized and randomized Markovian policies, with the randomized case also exhibiting suboptimal strict local minima. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the hardness follows from the properties of the constructed instances without circular reference to the target claim.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math NP-hardness is defined via polynomial-time reduction from a known NP-hard problem
Reference graph
Works this paper leans on
-
[1]
Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift.Journal of Machine Learning Research, 22(98):1–76, 2021
work page 2021
-
[2]
Gabow, Shachindra N Maheshwari, and Leon J
Harold N. Gabow, Shachindra N Maheshwari, and Leon J. Osterweil. On two problems in the generation of program test paths.IEEE Transactions on Software Engineering, (3):227–231, 1976
work page 1976
-
[3]
G.N. Iyengar. Robust Dynamic Programming.Mathematics of Operations Research, 30:257–280, 2005
work page 2005
-
[4]
PhD thesis, Massachusetts Institute of Technology, 2007
Yann Le Tallec.Robust, risk-sensitive, and data-driven control of Markov decision processes. PhD thesis, Massachusetts Institute of Technology, 2007
work page 2007
-
[5]
Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust markov decision processes.arXiv preprint arXiv:2308.11139, 2023
-
[6]
A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition probabilities.Operations Research, 53:780–798, 2005
work page 2005
-
[7]
Multi-model markov decision processes
Lauren N Steimle, David L Kaufman, and Brian T Denton. Multi-model markov decision processes. IISE Transactions, 53(10):1124–1139, 2021. 4
work page 2021
- [8]
-
[9]
Mesures dans les espaces produits.Atti Accad
CT Ionescu Tulcea. Mesures dans les espaces produits.Atti Accad. Naz. Lincei Rend, 7:208–211, 1949
work page 1949
-
[10]
Robust markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013
Wolfram Wiesemann, Daniel Kuhn, and Ber¸ c Rustem. Robust markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013. 5
work page 2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.