pith. sign in

arxiv: 2605.01339 · v1 · submitted 2026-05-02 · 💻 cs.LG

Robust Parameter Learning for Uncertain MDPs

Pith reviewed 2026-05-09 14:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords uncertain MDPsparametric MDPsPAC learningrobust synthesisMarkov decision processespolytopic approximationsparameter learning
0
0 comments X

The pith

Projecting empirical uncertainty onto shared parameters in parametric MDPs produces PAC confidence sets that respect transition dependencies and are tighter than independent intervals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes learning uncertain models for unknown Markov decision processes by representing transition probabilities as expressions over a shared set of parameters. Statistical uncertainty from observed frequencies is then projected onto this parameter space rather than treated separately for each probability. The result is a probably approximately correct uncertainty model that preserves algebraic dependencies between transitions. Because the resulting models are difficult to solve directly, the authors develop a hierarchy of sound polytopic outer approximations of the induced confidence set. This yields substantially tighter uncertainty estimates than classical interval-based techniques while still supporting robust policy synthesis.

Core claim

The paper establishes that uncertainty from empirical transition counts can be projected onto the parameter space of a pMDP to form a PAC-bounded uncertainty set for the underlying MDP that respects algebraic dependencies between transitions, and that this set admits a hierarchy of polytopic outer approximations that remain sound for robust verification and synthesis.

What carries the argument

The projection of statistical uncertainty from empirical transition frequencies onto the parameter space of a parametric MDP (pMDP), together with a hierarchy of polytopic outer approximations of the resulting confidence set.

If this is right

  • Robust policies synthesized on the approximated models remain valid for the true underlying MDP with the stated PAC guarantees.
  • The resulting uncertainty sets are strictly contained in those obtained from independent confidence intervals on each transition probability.
  • The algebraic dependencies between transitions are automatically respected in the confidence set.
  • Computation of robust policies stays feasible through the hierarchy of outer approximations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection technique could be applied to other models such as POMDPs whenever transitions share latent parameters.
  • If parametric forms can be inferred from data instead of assumed, the method might scale to settings where the structure is not known beforehand.
  • Empirical comparisons on standard benchmarks would quantify how much less conservative the resulting policies are relative to interval-based uncertain MDPs.

Load-bearing premise

A suitable parametric structure expressing the transition probabilities as functions of a shared parameter set must be known or provided in advance.

What would settle it

Sample trajectories from a known ground-truth MDP, apply the method to obtain an approximated uncertainty set, synthesize a robust policy, and check whether that policy fails to satisfy its specification for some transition function inside the true PAC set with probability higher than claimed.

Figures

Figures reproduced from arXiv: 2605.01339 by Alessandro Abate, David Parker, Yannik Schnitzer.

Figure 1
Figure 1. Figure 1: Construction of the parameter confidence region view at source ↗
Figure 2
Figure 2. Figure 2: Inclusion hierarchy of the considered uncertainty sets. For each set, we view at source ↗
Figure 3
Figure 3. Figure 3: Robust policy learning progress in the online setting. Solid lines show the view at source ↗
Figure 4
Figure 4. Figure 4: Extended results for robust policy learning. For each benchmark, the view at source ↗
read the original abstract

Learning-based approaches to verifying unknown Markov decision processes (MDPs) often employ uncertain MDPs. These models use, for example, confidence intervals to capture transition uncertainty and allow synthesis of policies that are robust to this uncertainty. However, this approach typically quantifies uncertainty independently for individual transition probabilities, ignoring dependencies due to shared latent quantities. We propose to learn such models using parametric MDPs (pMDPs), where transition probabilities are expressions over a set of parameters. We project statistical uncertainty from empirical transition frequencies onto the pMDP's parameter space, yielding a probably approximately correct (PAC) uncertainty model for the underlying MDP that respects the algebraic dependencies between transitions. The resulting models are algorithmically challenging to solve, so we propose a hierarchy of sound polytopic outer approximations of the induced confidence set. We implement and evaluate our approach, demonstrating substantially tighter uncertainty estimates than classical interval-based uncertain MDP learning techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes learning uncertain MDPs via parametric MDPs (pMDPs), where transition probabilities are algebraic expressions over a shared parameter vector. Empirical transition frequencies are projected onto the parameter space to obtain a PAC-bounded confidence set that respects algebraic dependencies among transitions. Because the resulting pMDPs are hard to solve, the authors develop a hierarchy of sound polytopic outer approximations of the induced confidence set. Experiments are claimed to produce substantially tighter uncertainty estimates than classical interval-based uncertain-MDP methods.

Significance. If the parametric structure is correctly specified in advance and the projection preserves the PAC guarantee, the method would yield less conservative robust policies by exploiting known algebraic dependencies. The polytopic outer-approximation hierarchy is a practical algorithmic contribution that makes the models tractable. The approach is limited by its reliance on a user-provided parametric form; without a mechanism to discover or validate that form, the PAC claim holds only conditionally on correct specification.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (projection step): the claim that projecting empirical frequencies onto the pMDP parameter space yields a PAC uncertainty model is load-bearing, yet no explicit theorem or derivation shows how the original PAC failure probability is preserved under the (generally nonlinear) projection; without this reduction the central guarantee does not follow from standard PAC-MDP results.
  2. [Evaluation section] Evaluation section: the reported tighter uncertainty estimates presuppose that a suitable pMDP structure expressing all transitions as functions of shared parameters is already known; the manuscript gives no description of how this structure was obtained or validated for the benchmark domains, so it is impossible to assess whether the improvement is due to the projection technique or to favorable a-priori structure selection.
minor comments (1)
  1. [Preliminaries] The notation distinguishing the parameter vector, the mapping to transition probabilities, and the induced confidence set could be introduced with a small concrete example in the preliminaries to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. Below we address each major comment point by point and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (projection step): the claim that projecting empirical frequencies onto the pMDP parameter space yields a PAC uncertainty model is load-bearing, yet no explicit theorem or derivation shows how the original PAC failure probability is preserved under the (generally nonlinear) projection; without this reduction the central guarantee does not follow from standard PAC-MDP results.

    Authors: We agree that an explicit derivation is needed for rigor. The confidence set in parameter space is constructed as the preimage, under the parametric mapping P(θ), of the standard PAC confidence set over transition probabilities. Because the true parameter θ∗ induces the true transition probabilities, and the empirical frequencies lie inside the PAC set around those true probabilities with probability at least 1−δ, the true θ∗ necessarily belongs to the preimage set with the same probability. The PAC guarantee is therefore preserved by construction, independent of the nonlinearity of the mapping. We will add a formal theorem in §3 that states this reduction explicitly and derives the PAC bound for the parameter-space set directly from classical PAC-MDP results. revision: yes

  2. Referee: [Evaluation section] Evaluation section: the reported tighter uncertainty estimates presuppose that a suitable pMDP structure expressing all transitions as functions of shared parameters is already known; the manuscript gives no description of how this structure was obtained or validated for the benchmark domains, so it is impossible to assess whether the improvement is due to the projection technique or to favorable a-priori structure selection.

    Authors: The referee is correct that the method presupposes a user-supplied parametric structure; the PAC claim holds conditionally on correct specification of the algebraic dependencies. For the reported benchmarks the structures were obtained from standard domain knowledge (shared parameters for slip/wind effects in grid-worlds, factored action effects in other environments). We will revise the evaluation section to include an explicit description of each parametric form together with a short justification based on the underlying MDP semantics. We will also add a clarifying sentence that structure discovery and validation are orthogonal to the present contribution and are left for future work, consistent with the limitations already noted in the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: PAC guarantee follows from standard concentration inequalities applied to given parametric structure

full rationale

The derivation projects empirical transition frequencies onto a pre-specified pMDP parameter space using algebraic expressions provided as input. The resulting PAC uncertainty set is obtained by applying standard statistical bounds (e.g., Hoeffding or similar) to the observed frequencies and mapping them through the known functions; this does not define the guarantee in terms of the output model itself. No equation or step reduces the claimed result to a fitted quantity, self-citation chain, or tautological renaming. The parametric structure is an explicit modeling assumption, not derived or validated within the method. The polytopic approximations are sound outer bounds on the induced set, preserving the non-circular statistical foundation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the modeling assumption that transitions admit a parametric representation; no free parameters are introduced or fitted beyond the standard statistical projection, and no new entities are postulated.

axioms (1)
  • domain assumption Transition probabilities of the underlying MDP can be expressed as algebraic expressions over a finite set of parameters.
    This is the foundational definition of a parametric MDP required for the uncertainty projection step.

pith-pipeline@v0.9.0 · 5447 in / 1251 out tokens · 22342 ms · 2026-05-09T14:25:26.574766+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 2 canonical work pages

  1. [1]

    Branching bisimulation learning

    Alessandro Abate, Mirco Giacobbe, Christian Micheletti, and Yannik Schnitzer. Branching bisimulation learning. InCAV (4), Lecture Notes in Computer Science, pages 161–184. Springer, 2025

  2. [2]

    Model checking and strategy synthesis with abstractions and certificates

    Alessandro Abate, Mirco Giacobbe, Diptarko Roy, and Yannik Schnitzer. Model checking and strategy synthesis with abstractions and certificates. InPrinciples of Verification (2), Lecture Notes in Computer Science, pages 360–391. Springer, 2024

  3. [3]

    Bisimulation learning

    Alessandro Abate, Mirco Giacobbe, and Yannik Schnitzer. Bisimulation learning. InCAV (3), Lecture Notes in Computer Science, pages 161–183. Springer, 2024

  4. [4]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InNIPS, pages 2312–2320, 2011

  5. [5]

    PAC statistical model checking for Markov decision processes and stochastic games

    Pranav Ashok, Jan Kretínský, and Maximilian Weininger. PAC statistical model checking for Markov decision processes and stochastic games. InCAV (1), volume 11561 ofLecture Notes in Computer Science, pages 497–519. Springer, 2019

  6. [6]

    Near-optimal regret bounds for reinforcement learning

    Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. InNIPS, pages 89–96. Curran Associates, Inc., 2008

  7. [7]

    Model-based reinforcement learning with value-targeted regression

    Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. InProceedings of the 37th International Conference on Machine Learning, volume 119, pages 463–474. PMLR, 2020

  8. [8]

    Efficient sensitivity analysis for parametric robust Markov chains

    Thom Badings, Sebastian Junges, Ahmadreza Marandi, Ufuk Topcu, and Nils Jansen. Efficient sensitivity analysis for parametric robust Markov chains. InCAV (3), Lecture Notes in Computer Science, pages 62–85. Springer, 2023

  9. [9]

    Badings, Alessandro Abate, Nils Jansen, David Parker, Hasan A

    Thom S. Badings, Alessandro Abate, Nils Jansen, David Parker, Hasan A. Poon- awala, and Mariëlle Stoelinga. Sampling-based robust control of autonomous systems with non-Gaussian noise. InAAAI, pages 9669–9678. AAAI Press, 2022

  10. [10]

    Ezio Bartocci, Radu Grosu, Panagiotis Katsaros, C. R. Ramakrishnan, and Scott A. Smolka. Model repair for probabilistic systems. InTACAS, Lecture Notes in Computer Science, pages 326–340. Springer, 2011

  11. [11]

    Markov decision processes with average-value-at- risk criteria.Math

    Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at- risk criteria.Math. Methods Oper. Res., 74(3):361–379, 2011

  12. [12]

    Tsitsiklis.Introduction to linear optimization, volume 6 ofAthena scientific optimization and computation series

    Dimitris Bertsimas and John N. Tsitsiklis.Introduction to linear optimization, volume 6 ofAthena scientific optimization and computation series. Athena Scientific, 1997. 18 Y. Schnitzer et al

  13. [13]

    Bianco and L

    A. Bianco and L. de Alfaro. Model checking of probabilistic and nondeterministic systems. In P. Thiagarajan, editor,Proc. 15th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS’95), volume 1026 ofLNCS, pages 499–513. Springer, 1995

  14. [14]

    Oxford University Press, 2013

    Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013

  15. [15]

    Formal verification with confidence intervals to establish quality of service properties of software systems.IEEE Trans

    Radu Calinescu, Carlo Ghezzi, Kenneth Johnson, Mauro Pezzè, Yasmin Rafiq, and Giordano Tamburrelli. Formal verification with confidence intervals to establish quality of service properties of software systems.IEEE Trans. Reliab., 65(1):107–125, 2016

  16. [16]

    Precise parameter synthesis for stochastic biochemical systems.Acta Informatica, 54(6):589–623, 2017

    Milan Ceska, Frits Dannenberg, Nicola Paoletti, Marta Kwiatkowska, and Lu- bos Brim. Precise parameter synthesis for stochastic biochemical systems.Acta Informatica, 54(6):589–623, 2017

  17. [17]

    C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial.Biometrika, 26(4):404–413, 1934

  18. [18]

    Planning with hidden parameter polynomial MDPs

    Clarissa Costen, Marc Rigter, Bruno Lacerda, and Nick Hawes. Planning with hidden parameter polynomial MDPs. InAAAI, pages 11963–11971. AAAI Press, 2023

  19. [19]

    Poonawala, and Ufuk Topcu

    Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, Ivan Pa- pusha, Hasan A. Poonawala, and Ufuk Topcu. Sequential convex programming for the efficient verification of parametric mdps. InTACAS (2), Lecture Notes in Computer Science, pages 133–150, 2017

  20. [20]

    Convex optimization for parameter synthesis in MDPs.IEEE Trans

    Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, and Ufuk Topcu. Convex optimization for parameter synthesis in MDPs.IEEE Trans. Autom. Control., 67(12):6333–6348, 2022

  21. [21]

    Pérez, and Marnix Suilen

    Kasper Engelen, Guillermo A. Pérez, and Marnix Suilen. Data-efficient safe policy improvement using parametric structure.CoRR, abs/2507.15532, 2025

  22. [22]

    Regret minimization in MDPs with options without prior knowledge

    Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. InNIPS, pages 3166–3176, 2017

  23. [23]

    Gleixner, Timo Berthold, Benjamin Müller, and Stefan Weltge

    Ambros M. Gleixner, Timo Berthold, Benjamin Müller, and Stefan Weltge. Three enhancements for optimization-based bound tightening.J. Glob. Optim., 67(4):731– 757, 2017

  24. [24]

    Synthesis for PCTL in parametric Markov decision processes

    Ernst Moritz Hahn, Tingting Han, and Lijun Zhang. Synthesis for PCTL in parametric Markov decision processes. InNASA Formal Methods, Lecture Notes in Computer Science, pages 146–161. Springer, 2011

  25. [25]

    Generalized parameter lifting: Finer abstractions for parametric markov chains

    Linus Heck, Tim Quatmann, Jip Spel, Joost-Pieter Katoen, and Sebastian Junges. Generalized parameter lifting: Finer abstractions for parametric markov chains. In ATVA, Lecture Notes in Computer Science, pages 207–230. Springer, 2025

  26. [26]

    Garud N. Iyengar. Robust dynamic programming.Math. Oper. Res., 30(2):257–280, 2005

  27. [27]

    Parameter synthesis for Markov models: covering the parameter space.Formal Methods Syst

    Sebastian Junges, Erika Ábrahám, Christian Hensel, Nils Jansen, Joost-Pieter Katoen, Tim Quatmann, and Matthias Volk. Parameter synthesis for Markov models: covering the parameter space.Formal Methods Syst. Des., 62(1):181–259, 2024

  28. [28]

    Safety-constrained reinforcement learning for MDPs

    Sebastian Junges, Nils Jansen, Christian Dehnert, Ufuk Topcu, and Joost-Pieter Katoen. Safety-constrained reinforcement learning for MDPs. InTACAS, volume 9636 ofLecture Notes in Computer Science, pages 130–146. Springer, 2016

  29. [29]

    Mykel Kochenderfer.Decision Making Under Uncertainty: Theory and Application. 2015. Robust Parameter Learning for Uncertain MDPs 19

  30. [30]

    Kwiatkowska, Gethin Norman, and David Parker

    Marta Z. Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. InCAV, volume 6806 ofLecture Notes in Computer Science, pages 585–591. Springer, 2011

  31. [31]

    Parametric probabilistic transition systems for system design and analysis.Formal Aspects Comput., 19(1):93–109, 2007

    Ruggero Lanotte, Andrea Maggiolo-Schettini, and Angelo Troina. Parametric probabilistic transition systems for system design and analysis.Formal Aspects Comput., 19(1):93–109, 2007

  32. [32]

    Bisimulation through probabilistic testing

    Kim Guldstrand Larsen and Arne Skou. Bisimulation through probabilistic testing. InPOPL, pages 344–352. ACM Press, 1989

  33. [33]

    Appli- cations of second-order cone programming.Linear Algebra Appl., 284(1–3):193–228, 1998

    Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. Appli- cations of second-order cone programming.Linear Algebra Appl., 284(1–3):193–228, 1998

  34. [34]

    Fan Long and Martin C. Rinard. Automatic patch generation by learning correct code. InPOPL, pages 298–312. ACM, 2016

  35. [35]

    McCormick

    Garth P. McCormick. Computability of global solutions to factorable nonconvex programs: Part I - convex underestimating problems.Math. Program., 10(1):147–175, 1976

  36. [36]

    What are the odds? improving statistical model checking of Markov decision processes

    Tobias Meggendorfer, Maximilian Weininger, and Patrick Wienhöft. What are the odds? improving statistical model checking of Markov decision processes. InProc. 2nd International Joint Conference on Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems (QEST+FORMATS’25), LNCS, pages 195–218. Springer, 2025

  37. [37]

    Robust control of Markov decision processes with uncertain transition matrices.Oper

    Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Oper. Res., 53(5):780–798, 2005

  38. [38]

    Wijesuriya, Sofie Haesaert, and Alessandro Abate

    Elizabeth Polgreen, Viraj B. Wijesuriya, Sofie Haesaert, and Alessandro Abate. Data-efficient Bayesian verification of parametric Markov chains. InQEST, volume 9826 ofLecture Notes in Computer Science, pages 35–51. Springer, 2016

  39. [39]

    Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming

    Martin L. Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994

  40. [40]

    Parameter synthesis for Markov models: Faster than ever

    Tim Quatmann, Christian Dehnert, Nils Jansen, Sebastian Junges, and Joost-Pieter Katoen. Parameter synthesis for Markov models: Faster than ever. InATVA, volume 9938 ofLecture Notes in Computer Science, pages 50–67, 2016

  41. [41]

    Certifiably robust policies for uncertain parametric environments

    Yannik Schnitzer, Alessandro Abate, and David Parker. Certifiably robust policies for uncertain parametric environments. InTACAS (3), volume 15698 ofLecture Notes in Computer Science, pages 63–83. Springer, 2025

  42. [42]

    Efficient solution and learning of robust factored mdps.AAAI, 2026

    Yannik Schnitzer, Alessandro Abate, and David Parker. Efficient solution and learning of robust factored mdps.AAAI, 2026

  43. [43]

    Scott, Matthew D

    Joseph K. Scott, Matthew D. Stuber, and Paul I. Barton. Generalized mccormick relaxations.J. Glob. Optim., 51(4):569–606, 2011

  44. [44]

    Probabilistic noninterference through weak probabilistic bisimula- tion

    Geoffrey Smith. Probabilistic noninterference through weak probabilistic bisimula- tion. InCSFW, pages 3–13. IEEE Computer Society, 2003

  45. [45]

    Strehl and Michael L

    Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model- based interval estimation. InICML, volume 119 ofACM International Conference Proceeding Series, pages 856–863. ACM, 2005

  46. [46]

    Nimbro explorer: Semiautonomous exploration and mobile manipulation in rough terrain.J

    Jörg Stückler, Max Schwarz, Mark Schadler, Angeliki Topalidou-Kyniazopoulou, and Sven Behnke. Nimbro explorer: Semiautonomous exploration and mobile manipulation in rough terrain.J. Field Robotics, 33(4):411–430, 2016

  47. [47]

    Badings, Eline M

    Marnix Suilen, Thom S. Badings, Eline M. Bovy, David Parker, and Nils Jansen. Robust markov decision processes: A place where AI and formal methods meet. In Principles of Verification (3), volume 15262 ofLecture Notes in Computer Science, pages 126–154. Springer, 2024. 20 Y. Schnitzer et al

  48. [48]

    Simão, David Parker, and Nils Jansen

    Marnix Suilen, Thiago D. Simão, David Parker, and Nils Jansen. Robust anytime learning of Markov decision processes. InNeurIPS, 2022

  49. [49]

    Leslie G. Valiant. A theory of the learnable. InSTOC, pages 436–445. ACM, 1984

  50. [50]

    Vanderbei.Linear programming - foundations and extensions, volume 4 of Kluwer international series in operations research and management service

    Robert J. Vanderbei.Linear programming - foundations and extensions, volume 4 of Kluwer international series in operations research and management service. Kluwer, 1998

  51. [51]

    Weinberger

    Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for thel1 deviation of the empirical distribution. Technical Report HPL-2003-97(R.1), Hewlett-Packard Laboratories, 2003

  52. [52]

    Robust Markov decision processes.Math

    Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes.Math. Oper. Res., 38(1):153–183, 2013

  53. [53]

    Wolff, Ufuk Topcu, and Richard M

    Eric M. Wolff, Ufuk Topcu, and Richard M. Murray. Robust control of uncertain markov decision processes with temporal logic specifications. InCDC. IEEE, 2012

  54. [54]

    Nearly minimax optimal reinforcement learning for linear mixture markov decision processes

    Dongruo Zhou, Quanquan Gu, and Csaba Szepesvári. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. InProceedings of the 34th Annual Conference on Learning Theory, volume 134, pages 4532–4576. PMLR, 2021. A Proofs We provide the formal proofs for the Theorems 1 and 2. A.1 Proof of Theorem 1 Theorem 1(Soundness of U...

  55. [55]

    Then, by definition of the local projections, we have P [v](s, a) ∈ P R(U) (s, a)for every( s, a) ∈S×A

    PR(U) ⊇ P U.Let P [v] ∈ P U for somev ∈ U. Then, by definition of the local projections, we have P [v](s, a) ∈ P R(U) (s, a)for every( s, a) ∈S×A . Hence P[v]∈ P R(U), which showsPU ⊆ P R(U)

  56. [56]

    By definition, there existsv ∈ U such that µ = P [v](s, a)

    PΛ(U) ⊇ P R(U).Fix a state–action pair( s, a)and let µ∈ P R(U) (s, a). By definition, there existsv ∈ U such that µ = P [v](s, a). For each successors′ ∈S , let f = PΘ(s, a, s′) ∈Λ . Since lΛ f and uΛ f are defined as the minimum and maximum of f [v′]over allv ′ ∈ U , we have lΛ f ≤f [v] = µ(s′) ≤u Λ f. Thus µ∈ P Λ(U) (s, a). Since(s, a)was arbitrary, it ...

  57. [57]

    Therefore, for every expressionf∈Λ , minimizing or maximizing over the larger setB(U )can only decrease the lower bound and increase the upper bound

    PΘ(U) ⊇ P Λ(U).Since B(U )is the smallest axis-aligned hyperrectangle con- taining U, we haveU ⊆B (U ). Therefore, for every expressionf∈Λ , minimizing or maximizing over the larger setB(U )can only decrease the lower bound and increase the upper bound. In other words,lΘ f ≤l Λ f and uΘ f ≥u Λ f. Thus the interval induced by parameter-wise projection cont...

  58. [58]

    PΘ(U) ̸⊆ P I.Consider two parameters θ1, θ2 ∈ [0, 1]and three transition expres- sions f1(θ) =θ 1, f 2(θ) =θ 2, f 3(θ) = 1 2 (θ1 +θ 2), arising from distinct state–action pairs

    Incomparability ofPΘ(U) and PI.We give two counterexamples showing that neither inclusion holds in general. PΘ(U) ̸⊆ P I.Consider two parameters θ1, θ2 ∈ [0, 1]and three transition expres- sions f1(θ) =θ 1, f 2(θ) =θ 2, f 3(θ) = 1 2 (θ1 +θ 2), arising from distinct state–action pairs. Suppose the learned intervals aref1 ∈ [0,1],f 2 ∈[0,1], andf 3 ∈[ 1 2 ,...

  59. [59]

    The ellipsoidal baseline induces, for each(s, a), the uncertainty set PR(E) (s, a) :={µ∈∆(S)| ∃v∈ E δ :µ(s ′) =P[v](s, a, s ′)for alls ′ ∈S}

    The constantS is any a priori bound on∥u∥2, for D⊆ [0, 1]ℓ, the choiceS = √ ℓ is always sound, andλ > 0is an arbitrary ridge parameter, which we choose asλ= 10−2 to stabilise the regression. The ellipsoidal baseline induces, for each(s, a), the uncertainty set PR(E) (s, a) :={µ∈∆(S)| ∃v∈ E δ :µ(s ′) =P[v](s, a, s ′)for alls ′ ∈S}. The inner optimisation i...