pith. machine review for the scientific record.

arxiv: 2605.07537 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 Lean theorem links

Multi-Environment POMDPs with Finite-Horizon Objectives

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:48 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-environment POMDP · MEPOMDP · PSPACE-complete · finite horizon · optimal policy · adversarial initial state · partially observable Markov decision process

The pith

Optimal finite-horizon planning remains PSPACE-complete when the initial state of a POMDP is chosen adversarially from a finite set of environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that computing the optimal value and policy for finite-horizon objectives in multi-environment POMDPs is PSPACE-complete. These models extend ordinary POMDPs by letting an adversary pick the starting state from a set of possible environments while keeping transitions and observations fixed. The authors also supply a practical algorithm and show on classical benchmarks that it runs substantially faster than the single prior method. A reader would care because many planning tasks under partial observability involve uncertainty or opposition over where the process begins.

Core claim

The decision problem of finding an optimal policy and its value in a multi-environment POMDP with a finite-horizon objective is PSPACE-complete. Membership in PSPACE follows from standard dynamic-programming techniques that track belief states over the finite horizon; hardness is shown by a polynomial-time reduction from the corresponding POMDP problem that embeds any POMDP instance into an MEPOMDP by adding a single adversarial choice over the initial environment.
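
As a reading aid, one plausible formalization of the quantity at stake, assuming the standard robust ordering in which the agent commits to a policy before the adversary picks the initial state (the paper's own notation may differ):

    \[
      \mathrm{val}(\mathcal{M}) \;=\; \max_{\sigma}\; \min_{s_0 \in I}\;
      \mathbb{E}^{\sigma}_{s_0}\!\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right]
    \]

Here I is the finite set of candidate initial states (one per environment), σ ranges over observation-based policies, and H is the horizon. The singleton case |I| = 1 is exactly an ordinary finite-horizon POMDP, which is the seed of the hardness reduction.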

What carries the argument

A polynomial-time reduction from finite-horizon POMDP value computation to MEPOMDP value computation that preserves optimality, combined with a belief-state dynamic program that enumerates reachable beliefs over the horizon length.
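
To make that machinery concrete, below is a minimal Python sketch of finite-horizon belief-state dynamic programming with a deliberately naive adversarial wrapper. All names (belief_update, value, mepomdp_value_upper, T, O, R) are illustrative, not the paper's implementation, and the wrapper takes the minimum over per-state optimal values, which only upper-bounds the robust maximin value because a single policy must serve every initial state; closing that gap is where the paper's algorithm does its work.

    # Minimal finite-horizon belief-state DP for a POMDP, plus a naive
    # adversarial wrapper. Illustrative sketch: dynamics T[s][a][s2],
    # observations O[s2][o], and rewards R[s][a] are nested dicts.

    def belief_update(belief, action, obs, T, O, states):
        """Bayes update of a belief (state -> prob) after (action, obs).
        Returns (new belief, probability of obs), or None when obs is
        impossible under this belief and action."""
        unnorm = {}
        for s2 in states:
            p = O[s2][obs] * sum(belief.get(s1, 0.0) * T[s1][action][s2]
                                 for s1 in states)
            if p > 0:
                unnorm[s2] = p
        z = sum(unnorm.values())
        if z == 0:
            return None
        return {s: p / z for s, p in unnorm.items()}, z

    def value(belief, horizon, T, O, R, states, actions, observations):
        """Optimal expected total reward from `belief` with `horizon`
        steps remaining (exact, exponential-time enumeration)."""
        if horizon == 0:
            return 0.0
        best = float("-inf")
        for a in actions:
            # Expected immediate reward, then recurse per observation.
            v = sum(belief.get(s, 0.0) * R[s][a] for s in states)
            for o in observations:
                upd = belief_update(belief, a, o, T, O, states)
                if upd is None:
                    continue
                b2, pr_o = upd
                v += pr_o * value(b2, horizon - 1, T, O, R,
                                  states, actions, observations)
            best = max(best, v)
        return best

    def mepomdp_value_upper(initial_states, horizon, T, O, R,
                            states, actions, observations):
        """min over initial states of the per-state optimal value: an
        UPPER BOUND on the robust value, since it lets the agent replan
        per initial state instead of fixing one policy up front."""
        return min(value({s0: 1.0}, horizon, T, O, R,
                         states, actions, observations)
                   for s0 in initial_states)

On a singleton initial-state set the wrapper is exact and collapses to ordinary POMDP planning, which is precisely the embedding the hardness reduction exploits.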

If this is right

  • Any finite-horizon planning instance with an adversarially chosen initial state can be solved within the same complexity class as ordinary POMDPs.
  • Existing POMDP benchmarks can be reused directly to evaluate MEPOMDP solvers by adding an adversarial initial-state choice.
  • The supplied algorithm yields exact optimal policies and values for finite horizons and scales better than prior methods on standard test cases.
  • The modeling choice keeps the problem inside PSPACE, so practical approximations remain feasible even though the problem is hard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reduction is tight, then adding more possible starting environments does not push the finite-horizon problem into a higher complexity class.
  • The same reduction technique might adapt to other objectives such as reachability or safety, though the paper does not carry out those extensions.
  • Domains that already use POMDP planners (robot navigation, dialogue systems) could incorporate adversarial initial-state uncertainty without changing the underlying solver architecture.

Load-bearing premise

The adversary is allowed to select only the initial state from a finite set of environments while every environment shares identical transition probabilities and observation functions.
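
A hedged structural sketch of that premise (field names are hypothetical, not the paper's notation): every environment shares one transition, observation, and reward structure, and the set of initial states is the adversary's only degree of freedom.

    from dataclasses import dataclass
    from typing import Dict, Set

    # Hypothetical container making the shared-dynamics premise explicit.
    # Every environment sees the same T, O, and R; the adversary's only
    # choice is which state in `initial_states` the run begins in.
    @dataclass
    class MEPOMDP:
        states: Set[str]
        actions: Set[str]
        observations: Set[str]
        T: Dict[str, Dict[str, Dict[str, float]]]  # T[s][a][s2] -> prob (shared)
        O: Dict[str, Dict[str, float]]             # O[s2][o] -> prob (shared)
        R: Dict[str, Dict[str, float]]             # R[s][a] -> reward (shared)
        initial_states: Set[str]                   # adversary's sole choice

    def embed_pomdp(states, actions, observations, T, O, R, s0):
        """An ordinary POMDP started in s0 is the singleton case: this
        one-line embedding is the shape of the hardness reduction."""
        return MEPOMDP(states, actions, observations, T, O, R, {s0})

If the reduction ever needed environment-specific T or O, it would step outside this type, which is exactly the check the referee report below asks for.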

What would settle it

An explicit polynomial-time algorithm that returns the exact optimal finite-horizon value for every MEPOMDP instance would falsify the PSPACE-completeness claim.

Figures

Figures reproduced from arXiv:2605.07537 by Filip Cano, Krishnendu Chatterjee, Léonard Brice, Stefanie Muroya, Thomas A. Henzinger.

Figure 1. An example of a POMDP. Dashed boxes indicate that all states share observation o1, except t3 and t4, which carry observation o2; double circles mark states with reward 1, all others reward 0. view at source ↗
Figure 2. An illustration of Algorithm 1. view at source ↗
Figure 3. Horizon vs. computation time on the RockSample benchmark, for different numbers of initial states. view at source ↗
Figure 4. RockSample problem. view at source ↗
Figure 5. Robot navigation maps, labeled NAME (number of states); maps from [2, Figures 6.1–6.4]. view at source ↗
Figure 6. Scalability on the RockSample benchmark: computation time (seconds, log scale) against horizon and against number of states, for varying numbers of initial states. view at source ↗
Figure 7. Scalability of the robot navigation and identification benchmarks. view at source ↗
Original abstract

Partially Observable Markov Decision Processes (POMDPs) are systems in which one agent interacts with a stochastic environment, and receives only partial information about the current state. In a multi-environment POMDP (MEPOMDP), the initial state is unknown, and assumed to be adversarially chosen. In this work we focus on computing the optimal value and policy in MEPOMDPs with finite-horizon objectives. That problem is known to be PSPACE-complete in POMDPs. Our main results are as follows: (1) we establish that it is also PSPACE-complete in the more general setting of MEPOMDPs; (2) we present a practical algorithm and evaluate it on classical benchmarks, significantly outperforming the only previously known algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that computing the optimal value and policy for finite-horizon objectives in multi-environment POMDPs (MEPOMDPs), where the initial state is adversarially chosen from a set of environments, is PSPACE-complete (extending the known result for standard POMDPs). It also presents a practical algorithm for this problem and evaluates it on classical benchmarks, where it significantly outperforms the only previously known algorithm.

Significance. If the PSPACE-completeness result holds under the stated MEPOMDP definition, the work is significant for establishing that adversarial choice over initial states does not increase complexity beyond PSPACE for finite-horizon objectives. The algorithmic contribution, supported by empirical evaluation on benchmarks showing clear outperformance, provides a concrete practical advance over prior methods for robust planning under partial observability.

major comments (1)
  1. [proof of PSPACE-hardness] The PSPACE-hardness reduction (in the section establishing complexity results) must be verified against the MEPOMDP definition, under which environments share identical transition and observation functions and differ only in the adversarially chosen initial state. If the reduction constructs instances requiring environment-specific dynamics to embed the hardness, the result would not apply to the model as defined and membership in PSPACE would also require re-examination.
minor comments (1)
  1. The abstract and introduction could more explicitly state whether MEPOMDP environments are permitted to have distinct transition/observation functions or are restricted to differing only in initial state.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive feedback on our manuscript. We address the major comment below.

Point-by-point responses
  1. Referee: [proof of PSPACE-hardness] The PSPACE-hardness reduction (in the section establishing complexity results) must be verified against the MEPOMDP definition, under which environments share identical transition and observation functions and differ only in the adversarially chosen initial state. If the reduction constructs instances requiring environment-specific dynamics to embed the hardness, the result would not apply to the model as defined and membership in PSPACE would also require re-examination.

    Authors: We thank the referee for this observation. Our PSPACE-hardness result is obtained by a direct reduction from the known PSPACE-complete finite-horizon POMDP planning problem. Any standard POMDP instance is an MEPOMDP with a singleton environment set. In this embedding there is only a single environment, so the transition and observation functions are necessarily identical (there are no other environments from which they could differ). The adversarial initial-state choice reduces to the standard POMDP initial-state distribution. Consequently the reduction never constructs instances that violate the shared-dynamics requirement of the MEPOMDP model. Membership in PSPACE is established independently by a dynamic-programming procedure that operates over time-indexed belief states; the same procedure applies verbatim when the initial belief is replaced by an adversarial choice over a finite set of initial states, without leaving PSPACE. revision: no

Circularity Check

0 steps flagged

No circularity; PSPACE-completeness via external reduction and independent algorithm

Full rationale

The paper's core claims rest on a new reduction establishing PSPACE-hardness for MEPOMDPs from the known PSPACE-completeness of POMDPs (an external result), plus a separate membership argument in PSPACE. The practical algorithm comes with its own description and a benchmark evaluation that outperforms prior work, without self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that would collapse the argument. The derivation chain rests on external results and benchmarks and does not reduce any key step to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work extends standard POMDP definitions and complexity theory without introducing new free parameters, invented entities, or ad-hoc axioms beyond domain assumptions.

axioms (2)
  • domain assumption POMDPs are defined with partial observability, stochastic transitions, and finite-horizon objectives
    Core modeling assumption used to define the MEPOMDP setting and the decision problem.
  • domain assumption PSPACE-completeness reductions from known POMDP problems apply directly when extending to adversarial initial states
    Invoked to establish the hardness result for the generalized model.

pith-pipeline@v0.9.0 · 5439 in / 1328 out tokens · 60709 ms · 2026-05-11T01:48:44.469611+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. Eline Bovy, Caleb Probine, Marnix Suilen, Ufuk Topcu, and Nils Jansen. Multi-environment POMDPs: Discrete model uncertainty under partial observability. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  2. Anthony Rocco Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University, USA, 1998.

  3. Krishnendu Chatterjee, Martin Chmelík, and Mathieu Tracol. What is decidable about partially observable Markov decision processes with ω-regular objectives. Journal of Computer and System Sciences, 82(5):878–911, 2016. doi: 10.1016/j.jcss.2016.02.009.

  4. Krishnendu Chatterjee, Martin Chmelík, Deep Karkhanis, Petr Novotný, and Amélie Royer. Multiple-environment Markov decision processes: Efficient analysis and applications. In Proceedings of the 30th International Conference on Automated Planning and Scheduling (ICAPS), pages 48–56. AAAI Press, 2020.

  5. Krishnendu Chatterjee, Laurent Doyen, Jean-François Raskin, and Ocan Sankur. The value problem for multiple-environment MDPs with parity objective. In 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025), Aarhus, Denmark, July 8–11, 2025, LIPIcs.

  6. Milos Hauskrecht and Hamish S. F. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18(3):221–244, 2000. doi: 10.1016/S0933-3657(99)00042-1.

  7. Merlijn Krale, Eline M. Bovy, Maris F. L. Galesloot, Thiago D. Simão, and Nils Jansen. On evaluating policies for robust POMDPs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  8. I. E. Leonard and J. E. Lewis. Geometry of Convex Sets. Wiley-Blackwell, Hoboken, New Jersey, United States, 1st edition, 2016.

  9. Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1):5–34, 2003. doi: 10.1016/S0004-3702(02)00378-8.

  10. Takayuki Osogami. Robust partially observable Markov decision process. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 106–115, Lille, France, 2015. PMLR. https://proceedings.mlr.press/v37/osogami15.html

  11. Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987. http://www.jstor.org/stable/3689975

  12. Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, 1994. doi: 10.1002/9780470316887.

  13. Jean-François Raskin and Ocan Sankur. Multiple-environment Markov decision processes. In 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014), New Delhi, India, December 15–17, 2014, LIPIcs, pages 531–543. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.

  14. Nicholas Roy, Geoffrey J. Gordon, and Sebastian Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1–40, 2005. doi: 10.1613/jair.1496.

  15. Walter J. Savitch. Relationships between nondeterministic and deterministic tape complexities. Journal of Computer and System Sciences, 4(2):177–192, 1970. doi: 10.1016/S0022-0000(70)80006-X.

  16. Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 520–527, 2004.

  17. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, 2nd edition, 2018. http://incompleteideas.net/book/the-book-2nd.html

  18. Marck van der Vegt, Nils Jansen, and Sebastian Junges. Robust almost-sure reachability in multi-environment MDPs. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 508–526, Cham, 2023. Springer Nature Switzerland.