Robust Parameter Learning for Uncertain MDPs

Alessandro Abate; David Parker; Yannik Schnitzer

arxiv: 2605.01339 · v1 · submitted 2026-05-02 · 💻 cs.LG

Robust Parameter Learning for Uncertain MDPs

Yannik Schnitzer , Alessandro Abate , David Parker This is my paper

Pith reviewed 2026-05-09 14:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords uncertain MDPsparametric MDPsPAC learningrobust synthesisMarkov decision processespolytopic approximationsparameter learning

0 comments

The pith

Projecting empirical uncertainty onto shared parameters in parametric MDPs produces PAC confidence sets that respect transition dependencies and are tighter than independent intervals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes learning uncertain models for unknown Markov decision processes by representing transition probabilities as expressions over a shared set of parameters. Statistical uncertainty from observed frequencies is then projected onto this parameter space rather than treated separately for each probability. The result is a probably approximately correct uncertainty model that preserves algebraic dependencies between transitions. Because the resulting models are difficult to solve directly, the authors develop a hierarchy of sound polytopic outer approximations of the induced confidence set. This yields substantially tighter uncertainty estimates than classical interval-based techniques while still supporting robust policy synthesis.

Core claim

The paper establishes that uncertainty from empirical transition counts can be projected onto the parameter space of a pMDP to form a PAC-bounded uncertainty set for the underlying MDP that respects algebraic dependencies between transitions, and that this set admits a hierarchy of polytopic outer approximations that remain sound for robust verification and synthesis.

What carries the argument

The projection of statistical uncertainty from empirical transition frequencies onto the parameter space of a parametric MDP (pMDP), together with a hierarchy of polytopic outer approximations of the resulting confidence set.

If this is right

Robust policies synthesized on the approximated models remain valid for the true underlying MDP with the stated PAC guarantees.
The resulting uncertainty sets are strictly contained in those obtained from independent confidence intervals on each transition probability.
The algebraic dependencies between transitions are automatically respected in the confidence set.
Computation of robust policies stays feasible through the hierarchy of outer approximations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection technique could be applied to other models such as POMDPs whenever transitions share latent parameters.
If parametric forms can be inferred from data instead of assumed, the method might scale to settings where the structure is not known beforehand.
Empirical comparisons on standard benchmarks would quantify how much less conservative the resulting policies are relative to interval-based uncertain MDPs.

Load-bearing premise

A suitable parametric structure expressing the transition probabilities as functions of a shared parameter set must be known or provided in advance.

What would settle it

Sample trajectories from a known ground-truth MDP, apply the method to obtain an approximated uncertainty set, synthesize a robust policy, and check whether that policy fails to satisfy its specification for some transition function inside the true PAC set with probability higher than claimed.

Figures

Figures reproduced from arXiv: 2605.01339 by Alessandro Abate, David Parker, Yannik Schnitzer.

**Figure 1.** Figure 1: Construction of the parameter confidence region view at source ↗

**Figure 2.** Figure 2: Inclusion hierarchy of the considered uncertainty sets. For each set, we view at source ↗

**Figure 3.** Figure 3: Robust policy learning progress in the online setting. Solid lines show the view at source ↗

**Figure 4.** Figure 4: Extended results for robust policy learning. For each benchmark, the view at source ↗

read the original abstract

Learning-based approaches to verifying unknown Markov decision processes (MDPs) often employ uncertain MDPs. These models use, for example, confidence intervals to capture transition uncertainty and allow synthesis of policies that are robust to this uncertainty. However, this approach typically quantifies uncertainty independently for individual transition probabilities, ignoring dependencies due to shared latent quantities. We propose to learn such models using parametric MDPs (pMDPs), where transition probabilities are expressions over a set of parameters. We project statistical uncertainty from empirical transition frequencies onto the pMDP's parameter space, yielding a probably approximately correct (PAC) uncertainty model for the underlying MDP that respects the algebraic dependencies between transitions. The resulting models are algorithmically challenging to solve, so we propose a hierarchy of sound polytopic outer approximations of the induced confidence set. We implement and evaluate our approach, demonstrating substantially tighter uncertainty estimates than classical interval-based uncertain MDP learning techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Projecting empirical uncertainty onto a pre-specified parametric MDP structure gives tighter PAC sets than independent intervals, but the whole thing rests on having the right parametrization upfront.

read the letter

The paper shows how to get tighter uncertainty sets for MDPs learned from data by projecting empirical frequencies onto a parametric model that encodes dependencies between transitions. This gives PAC guarantees that respect the algebraic structure, unlike standard interval methods that bound each probability separately. What stands out is the use of pMDPs where transitions are functions of parameters, and then mapping the confidence intervals from data to a set in parameter space. They follow this with a series of polytopic outer approximations to make robust policy synthesis feasible, since the exact set is hard to handle. The experiments back this up by showing substantially tighter bounds than classical approaches. That's the practical payoff for applications like safety-critical systems where loose bounds hurt performance. The main limitation is the need for a suitable parametric structure upfront. The method works only if the transition probabilities can be expressed as the right functions of shared parameters, and the paper doesn't provide ways to discover or check that structure from the data. If the parametrization doesn't match the true MDP, the PAC property could fail. The approximations are conservative by design as outer bounds, which is fine but means some tightness is lost. Overall, this is aimed at people in MDP verification and robust control who already have some model structure in mind. It builds on existing PAC learning for MDPs but adds the dependency respect in a clean way. The math looks solid on the projection and approximation parts, assuming the structure is correct. I'd say send it to peer review. The core claim holds up and the evaluation supports it.

Referee Report

2 major / 1 minor

Summary. The paper proposes learning uncertain MDPs via parametric MDPs (pMDPs), where transition probabilities are algebraic expressions over a shared parameter vector. Empirical transition frequencies are projected onto the parameter space to obtain a PAC-bounded confidence set that respects algebraic dependencies among transitions. Because the resulting pMDPs are hard to solve, the authors develop a hierarchy of sound polytopic outer approximations of the induced confidence set. Experiments are claimed to produce substantially tighter uncertainty estimates than classical interval-based uncertain-MDP methods.

Significance. If the parametric structure is correctly specified in advance and the projection preserves the PAC guarantee, the method would yield less conservative robust policies by exploiting known algebraic dependencies. The polytopic outer-approximation hierarchy is a practical algorithmic contribution that makes the models tractable. The approach is limited by its reliance on a user-provided parametric form; without a mechanism to discover or validate that form, the PAC claim holds only conditionally on correct specification.

major comments (2)

[Abstract and §3] Abstract and §3 (projection step): the claim that projecting empirical frequencies onto the pMDP parameter space yields a PAC uncertainty model is load-bearing, yet no explicit theorem or derivation shows how the original PAC failure probability is preserved under the (generally nonlinear) projection; without this reduction the central guarantee does not follow from standard PAC-MDP results.
[Evaluation section] Evaluation section: the reported tighter uncertainty estimates presuppose that a suitable pMDP structure expressing all transitions as functions of shared parameters is already known; the manuscript gives no description of how this structure was obtained or validated for the benchmark domains, so it is impossible to assess whether the improvement is due to the projection technique or to favorable a-priori structure selection.

minor comments (1)

[Preliminaries] The notation distinguishing the parameter vector, the mapping to transition probabilities, and the induced confidence set could be introduced with a small concrete example in the preliminaries to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. Below we address each major comment point by point and indicate the revisions we will incorporate.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (projection step): the claim that projecting empirical frequencies onto the pMDP parameter space yields a PAC uncertainty model is load-bearing, yet no explicit theorem or derivation shows how the original PAC failure probability is preserved under the (generally nonlinear) projection; without this reduction the central guarantee does not follow from standard PAC-MDP results.

Authors: We agree that an explicit derivation is needed for rigor. The confidence set in parameter space is constructed as the preimage, under the parametric mapping P(θ), of the standard PAC confidence set over transition probabilities. Because the true parameter θ∗ induces the true transition probabilities, and the empirical frequencies lie inside the PAC set around those true probabilities with probability at least 1−δ, the true θ∗ necessarily belongs to the preimage set with the same probability. The PAC guarantee is therefore preserved by construction, independent of the nonlinearity of the mapping. We will add a formal theorem in §3 that states this reduction explicitly and derives the PAC bound for the parameter-space set directly from classical PAC-MDP results. revision: yes
Referee: [Evaluation section] Evaluation section: the reported tighter uncertainty estimates presuppose that a suitable pMDP structure expressing all transitions as functions of shared parameters is already known; the manuscript gives no description of how this structure was obtained or validated for the benchmark domains, so it is impossible to assess whether the improvement is due to the projection technique or to favorable a-priori structure selection.

Authors: The referee is correct that the method presupposes a user-supplied parametric structure; the PAC claim holds conditionally on correct specification of the algebraic dependencies. For the reported benchmarks the structures were obtained from standard domain knowledge (shared parameters for slip/wind effects in grid-worlds, factored action effects in other environments). We will revise the evaluation section to include an explicit description of each parametric form together with a short justification based on the underlying MDP semantics. We will also add a clarifying sentence that structure discovery and validation are orthogonal to the present contribution and are left for future work, consistent with the limitations already noted in the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: PAC guarantee follows from standard concentration inequalities applied to given parametric structure

full rationale

The derivation projects empirical transition frequencies onto a pre-specified pMDP parameter space using algebraic expressions provided as input. The resulting PAC uncertainty set is obtained by applying standard statistical bounds (e.g., Hoeffding or similar) to the observed frequencies and mapping them through the known functions; this does not define the guarantee in terms of the output model itself. No equation or step reduces the claimed result to a fitted quantity, self-citation chain, or tautological renaming. The parametric structure is an explicit modeling assumption, not derived or validated within the method. The polytopic approximations are sound outer bounds on the induced set, preserving the non-circular statistical foundation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the modeling assumption that transitions admit a parametric representation; no free parameters are introduced or fitted beyond the standard statistical projection, and no new entities are postulated.

axioms (1)

domain assumption Transition probabilities of the underlying MDP can be expressed as algebraic expressions over a finite set of parameters.
This is the foundational definition of a parametric MDP required for the uncertainty projection step.

pith-pipeline@v0.9.0 · 5447 in / 1251 out tokens · 22342 ms · 2026-05-09T14:25:26.574766+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 2 canonical work pages

[1]

Branching bisimulation learning

Alessandro Abate, Mirco Giacobbe, Christian Micheletti, and Yannik Schnitzer. Branching bisimulation learning. InCAV (4), Lecture Notes in Computer Science, pages 161–184. Springer, 2025

2025
[2]

Model checking and strategy synthesis with abstractions and certificates

Alessandro Abate, Mirco Giacobbe, Diptarko Roy, and Yannik Schnitzer. Model checking and strategy synthesis with abstractions and certificates. InPrinciples of Verification (2), Lecture Notes in Computer Science, pages 360–391. Springer, 2024

2024
[3]

Bisimulation learning

Alessandro Abate, Mirco Giacobbe, and Yannik Schnitzer. Bisimulation learning. InCAV (3), Lecture Notes in Computer Science, pages 161–183. Springer, 2024

2024
[4]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InNIPS, pages 2312–2320, 2011

2011
[5]

PAC statistical model checking for Markov decision processes and stochastic games

Pranav Ashok, Jan Kretínský, and Maximilian Weininger. PAC statistical model checking for Markov decision processes and stochastic games. InCAV (1), volume 11561 ofLecture Notes in Computer Science, pages 497–519. Springer, 2019

2019
[6]

Near-optimal regret bounds for reinforcement learning

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. InNIPS, pages 89–96. Curran Associates, Inc., 2008

2008
[7]

Model-based reinforcement learning with value-targeted regression

Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. InProceedings of the 37th International Conference on Machine Learning, volume 119, pages 463–474. PMLR, 2020

2020
[8]

Efficient sensitivity analysis for parametric robust Markov chains

Thom Badings, Sebastian Junges, Ahmadreza Marandi, Ufuk Topcu, and Nils Jansen. Efficient sensitivity analysis for parametric robust Markov chains. InCAV (3), Lecture Notes in Computer Science, pages 62–85. Springer, 2023

2023
[9]

Badings, Alessandro Abate, Nils Jansen, David Parker, Hasan A

Thom S. Badings, Alessandro Abate, Nils Jansen, David Parker, Hasan A. Poon- awala, and Mariëlle Stoelinga. Sampling-based robust control of autonomous systems with non-Gaussian noise. InAAAI, pages 9669–9678. AAAI Press, 2022

2022
[10]

Ezio Bartocci, Radu Grosu, Panagiotis Katsaros, C. R. Ramakrishnan, and Scott A. Smolka. Model repair for probabilistic systems. InTACAS, Lecture Notes in Computer Science, pages 326–340. Springer, 2011

2011
[11]

Markov decision processes with average-value-at- risk criteria.Math

Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at- risk criteria.Math. Methods Oper. Res., 74(3):361–379, 2011

2011
[12]

Tsitsiklis.Introduction to linear optimization, volume 6 ofAthena scientific optimization and computation series

Dimitris Bertsimas and John N. Tsitsiklis.Introduction to linear optimization, volume 6 ofAthena scientific optimization and computation series. Athena Scientific, 1997. 18 Y. Schnitzer et al

1997
[13]

Bianco and L

A. Bianco and L. de Alfaro. Model checking of probabilistic and nondeterministic systems. In P. Thiagarajan, editor,Proc. 15th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS’95), volume 1026 ofLNCS, pages 499–513. Springer, 1995

1995
[14]

Oxford University Press, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013

2013
[15]

Formal verification with confidence intervals to establish quality of service properties of software systems.IEEE Trans

Radu Calinescu, Carlo Ghezzi, Kenneth Johnson, Mauro Pezzè, Yasmin Rafiq, and Giordano Tamburrelli. Formal verification with confidence intervals to establish quality of service properties of software systems.IEEE Trans. Reliab., 65(1):107–125, 2016

2016
[16]

Precise parameter synthesis for stochastic biochemical systems.Acta Informatica, 54(6):589–623, 2017

Milan Ceska, Frits Dannenberg, Nicola Paoletti, Marta Kwiatkowska, and Lu- bos Brim. Precise parameter synthesis for stochastic biochemical systems.Acta Informatica, 54(6):589–623, 2017

2017
[17]

C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial.Biometrika, 26(4):404–413, 1934

1934
[18]

Planning with hidden parameter polynomial MDPs

Clarissa Costen, Marc Rigter, Bruno Lacerda, and Nick Hawes. Planning with hidden parameter polynomial MDPs. InAAAI, pages 11963–11971. AAAI Press, 2023

2023
[19]

Poonawala, and Ufuk Topcu

Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, Ivan Pa- pusha, Hasan A. Poonawala, and Ufuk Topcu. Sequential convex programming for the efficient verification of parametric mdps. InTACAS (2), Lecture Notes in Computer Science, pages 133–150, 2017

2017
[20]

Convex optimization for parameter synthesis in MDPs.IEEE Trans

Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, and Ufuk Topcu. Convex optimization for parameter synthesis in MDPs.IEEE Trans. Autom. Control., 67(12):6333–6348, 2022

2022
[21]

Pérez, and Marnix Suilen

Kasper Engelen, Guillermo A. Pérez, and Marnix Suilen. Data-efficient safe policy improvement using parametric structure.CoRR, abs/2507.15532, 2025

work page arXiv 2025
[22]

Regret minimization in MDPs with options without prior knowledge

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. InNIPS, pages 3166–3176, 2017

2017
[23]

Gleixner, Timo Berthold, Benjamin Müller, and Stefan Weltge

Ambros M. Gleixner, Timo Berthold, Benjamin Müller, and Stefan Weltge. Three enhancements for optimization-based bound tightening.J. Glob. Optim., 67(4):731– 757, 2017

2017
[24]

Synthesis for PCTL in parametric Markov decision processes

Ernst Moritz Hahn, Tingting Han, and Lijun Zhang. Synthesis for PCTL in parametric Markov decision processes. InNASA Formal Methods, Lecture Notes in Computer Science, pages 146–161. Springer, 2011

2011
[25]

Generalized parameter lifting: Finer abstractions for parametric markov chains

Linus Heck, Tim Quatmann, Jip Spel, Joost-Pieter Katoen, and Sebastian Junges. Generalized parameter lifting: Finer abstractions for parametric markov chains. In ATVA, Lecture Notes in Computer Science, pages 207–230. Springer, 2025

2025
[26]

Garud N. Iyengar. Robust dynamic programming.Math. Oper. Res., 30(2):257–280, 2005

2005
[27]

Parameter synthesis for Markov models: covering the parameter space.Formal Methods Syst

Sebastian Junges, Erika Ábrahám, Christian Hensel, Nils Jansen, Joost-Pieter Katoen, Tim Quatmann, and Matthias Volk. Parameter synthesis for Markov models: covering the parameter space.Formal Methods Syst. Des., 62(1):181–259, 2024

2024
[28]

Safety-constrained reinforcement learning for MDPs

Sebastian Junges, Nils Jansen, Christian Dehnert, Ufuk Topcu, and Joost-Pieter Katoen. Safety-constrained reinforcement learning for MDPs. InTACAS, volume 9636 ofLecture Notes in Computer Science, pages 130–146. Springer, 2016

2016
[29]

Mykel Kochenderfer.Decision Making Under Uncertainty: Theory and Application. 2015. Robust Parameter Learning for Uncertain MDPs 19

2015
[30]

Kwiatkowska, Gethin Norman, and David Parker

Marta Z. Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. InCAV, volume 6806 ofLecture Notes in Computer Science, pages 585–591. Springer, 2011

2011
[31]

Parametric probabilistic transition systems for system design and analysis.Formal Aspects Comput., 19(1):93–109, 2007

Ruggero Lanotte, Andrea Maggiolo-Schettini, and Angelo Troina. Parametric probabilistic transition systems for system design and analysis.Formal Aspects Comput., 19(1):93–109, 2007

2007
[32]

Bisimulation through probabilistic testing

Kim Guldstrand Larsen and Arne Skou. Bisimulation through probabilistic testing. InPOPL, pages 344–352. ACM Press, 1989

1989
[33]

Appli- cations of second-order cone programming.Linear Algebra Appl., 284(1–3):193–228, 1998

Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. Appli- cations of second-order cone programming.Linear Algebra Appl., 284(1–3):193–228, 1998

1998
[34]

Fan Long and Martin C. Rinard. Automatic patch generation by learning correct code. InPOPL, pages 298–312. ACM, 2016

2016
[35]

McCormick

Garth P. McCormick. Computability of global solutions to factorable nonconvex programs: Part I - convex underestimating problems.Math. Program., 10(1):147–175, 1976

1976
[36]

What are the odds? improving statistical model checking of Markov decision processes

Tobias Meggendorfer, Maximilian Weininger, and Patrick Wienhöft. What are the odds? improving statistical model checking of Markov decision processes. InProc. 2nd International Joint Conference on Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems (QEST+FORMATS’25), LNCS, pages 195–218. Springer, 2025

2025
[37]

Robust control of Markov decision processes with uncertain transition matrices.Oper

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Oper. Res., 53(5):780–798, 2005

2005
[38]

Wijesuriya, Sofie Haesaert, and Alessandro Abate

Elizabeth Polgreen, Viraj B. Wijesuriya, Sofie Haesaert, and Alessandro Abate. Data-efficient Bayesian verification of parametric Markov chains. InQEST, volume 9826 ofLecture Notes in Computer Science, pages 35–51. Springer, 2016

2016
[39]

Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming

Martin L. Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994

1994
[40]

Parameter synthesis for Markov models: Faster than ever

Tim Quatmann, Christian Dehnert, Nils Jansen, Sebastian Junges, and Joost-Pieter Katoen. Parameter synthesis for Markov models: Faster than ever. InATVA, volume 9938 ofLecture Notes in Computer Science, pages 50–67, 2016

2016
[41]

Certifiably robust policies for uncertain parametric environments

Yannik Schnitzer, Alessandro Abate, and David Parker. Certifiably robust policies for uncertain parametric environments. InTACAS (3), volume 15698 ofLecture Notes in Computer Science, pages 63–83. Springer, 2025

2025
[42]

Efficient solution and learning of robust factored mdps.AAAI, 2026

Yannik Schnitzer, Alessandro Abate, and David Parker. Efficient solution and learning of robust factored mdps.AAAI, 2026

2026
[43]

Scott, Matthew D

Joseph K. Scott, Matthew D. Stuber, and Paul I. Barton. Generalized mccormick relaxations.J. Glob. Optim., 51(4):569–606, 2011

2011
[44]

Probabilistic noninterference through weak probabilistic bisimula- tion

Geoffrey Smith. Probabilistic noninterference through weak probabilistic bisimula- tion. InCSFW, pages 3–13. IEEE Computer Society, 2003

2003
[45]

Strehl and Michael L

Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model- based interval estimation. InICML, volume 119 ofACM International Conference Proceeding Series, pages 856–863. ACM, 2005

2005
[46]

Nimbro explorer: Semiautonomous exploration and mobile manipulation in rough terrain.J

Jörg Stückler, Max Schwarz, Mark Schadler, Angeliki Topalidou-Kyniazopoulou, and Sven Behnke. Nimbro explorer: Semiautonomous exploration and mobile manipulation in rough terrain.J. Field Robotics, 33(4):411–430, 2016

2016
[47]

Badings, Eline M

Marnix Suilen, Thom S. Badings, Eline M. Bovy, David Parker, and Nils Jansen. Robust markov decision processes: A place where AI and formal methods meet. In Principles of Verification (3), volume 15262 ofLecture Notes in Computer Science, pages 126–154. Springer, 2024. 20 Y. Schnitzer et al

2024
[48]

Simão, David Parker, and Nils Jansen

Marnix Suilen, Thiago D. Simão, David Parker, and Nils Jansen. Robust anytime learning of Markov decision processes. InNeurIPS, 2022

2022
[49]

Leslie G. Valiant. A theory of the learnable. InSTOC, pages 436–445. ACM, 1984

1984
[50]

Vanderbei.Linear programming - foundations and extensions, volume 4 of Kluwer international series in operations research and management service

Robert J. Vanderbei.Linear programming - foundations and extensions, volume 4 of Kluwer international series in operations research and management service. Kluwer, 1998

1998
[51]

Weinberger

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for thel1 deviation of the empirical distribution. Technical Report HPL-2003-97(R.1), Hewlett-Packard Laboratories, 2003

2003
[52]

Robust Markov decision processes.Math

Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes.Math. Oper. Res., 38(1):153–183, 2013

2013
[53]

Wolff, Ufuk Topcu, and Richard M

Eric M. Wolff, Ufuk Topcu, and Richard M. Murray. Robust control of uncertain markov decision processes with temporal logic specifications. InCDC. IEEE, 2012

2012
[54]

Nearly minimax optimal reinforcement learning for linear mixture markov decision processes

Dongruo Zhou, Quanquan Gu, and Csaba Szepesvári. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. InProceedings of the 34th Annual Conference on Learning Theory, volume 134, pages 4532–4576. PMLR, 2021. A Proofs We provide the formal proofs for the Theorems 1 and 2. A.1 Proof of Theorem 1 Theorem 1(Soundness of U...

2021
[55]

Then, by definition of the local projections, we have P [v](s, a) ∈ P R(U) (s, a)for every( s, a) ∈S×A

PR(U) ⊇ P U.Let P [v] ∈ P U for somev ∈ U. Then, by definition of the local projections, we have P [v](s, a) ∈ P R(U) (s, a)for every( s, a) ∈S×A . Hence P[v]∈ P R(U), which showsPU ⊆ P R(U)
[56]

By definition, there existsv ∈ U such that µ = P [v](s, a)

PΛ(U) ⊇ P R(U).Fix a state–action pair( s, a)and let µ∈ P R(U) (s, a). By definition, there existsv ∈ U such that µ = P [v](s, a). For each successors′ ∈S , let f = PΘ(s, a, s′) ∈Λ . Since lΛ f and uΛ f are defined as the minimum and maximum of f [v′]over allv ′ ∈ U , we have lΛ f ≤f [v] = µ(s′) ≤u Λ f. Thus µ∈ P Λ(U) (s, a). Since(s, a)was arbitrary, it ...
[57]

Therefore, for every expressionf∈Λ , minimizing or maximizing over the larger setB(U )can only decrease the lower bound and increase the upper bound

PΘ(U) ⊇ P Λ(U).Since B(U )is the smallest axis-aligned hyperrectangle con- taining U, we haveU ⊆B (U ). Therefore, for every expressionf∈Λ , minimizing or maximizing over the larger setB(U )can only decrease the lower bound and increase the upper bound. In other words,lΘ f ≤l Λ f and uΘ f ≥u Λ f. Thus the interval induced by parameter-wise projection cont...
[58]

PΘ(U) ̸⊆ P I.Consider two parameters θ1, θ2 ∈ [0, 1]and three transition expres- sions f1(θ) =θ 1, f 2(θ) =θ 2, f 3(θ) = 1 2 (θ1 +θ 2), arising from distinct state–action pairs

Incomparability ofPΘ(U) and PI.We give two counterexamples showing that neither inclusion holds in general. PΘ(U) ̸⊆ P I.Consider two parameters θ1, θ2 ∈ [0, 1]and three transition expres- sions f1(θ) =θ 1, f 2(θ) =θ 2, f 3(θ) = 1 2 (θ1 +θ 2), arising from distinct state–action pairs. Suppose the learned intervals aref1 ∈ [0,1],f 2 ∈[0,1], andf 3 ∈[ 1 2 ,...
[59]

The ellipsoidal baseline induces, for each(s, a), the uncertainty set PR(E) (s, a) :={µ∈∆(S)| ∃v∈ E δ :µ(s ′) =P[v](s, a, s ′)for alls ′ ∈S}

The constantS is any a priori bound on∥u∥2, for D⊆ [0, 1]ℓ, the choiceS = √ ℓ is always sound, andλ > 0is an arbitrary ridge parameter, which we choose asλ= 10−2 to stabilise the regression. The ellipsoidal baseline induces, for each(s, a), the uncertainty set PR(E) (s, a) :={µ∈∆(S)| ∃v∈ E δ :µ(s ′) =P[v](s, a, s ′)for alls ′ ∈S}. The inner optimisation i...

work page arXiv

[1] [1]

Branching bisimulation learning

Alessandro Abate, Mirco Giacobbe, Christian Micheletti, and Yannik Schnitzer. Branching bisimulation learning. InCAV (4), Lecture Notes in Computer Science, pages 161–184. Springer, 2025

2025

[2] [2]

Model checking and strategy synthesis with abstractions and certificates

Alessandro Abate, Mirco Giacobbe, Diptarko Roy, and Yannik Schnitzer. Model checking and strategy synthesis with abstractions and certificates. InPrinciples of Verification (2), Lecture Notes in Computer Science, pages 360–391. Springer, 2024

2024

[3] [3]

Bisimulation learning

Alessandro Abate, Mirco Giacobbe, and Yannik Schnitzer. Bisimulation learning. InCAV (3), Lecture Notes in Computer Science, pages 161–183. Springer, 2024

2024

[4] [4]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InNIPS, pages 2312–2320, 2011

2011

[5] [5]

PAC statistical model checking for Markov decision processes and stochastic games

Pranav Ashok, Jan Kretínský, and Maximilian Weininger. PAC statistical model checking for Markov decision processes and stochastic games. InCAV (1), volume 11561 ofLecture Notes in Computer Science, pages 497–519. Springer, 2019

2019

[6] [6]

Near-optimal regret bounds for reinforcement learning

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. InNIPS, pages 89–96. Curran Associates, Inc., 2008

2008

[7] [7]

Model-based reinforcement learning with value-targeted regression

Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. InProceedings of the 37th International Conference on Machine Learning, volume 119, pages 463–474. PMLR, 2020

2020

[8] [8]

Efficient sensitivity analysis for parametric robust Markov chains

Thom Badings, Sebastian Junges, Ahmadreza Marandi, Ufuk Topcu, and Nils Jansen. Efficient sensitivity analysis for parametric robust Markov chains. InCAV (3), Lecture Notes in Computer Science, pages 62–85. Springer, 2023

2023

[9] [9]

Badings, Alessandro Abate, Nils Jansen, David Parker, Hasan A

Thom S. Badings, Alessandro Abate, Nils Jansen, David Parker, Hasan A. Poon- awala, and Mariëlle Stoelinga. Sampling-based robust control of autonomous systems with non-Gaussian noise. InAAAI, pages 9669–9678. AAAI Press, 2022

2022

[10] [10]

Ezio Bartocci, Radu Grosu, Panagiotis Katsaros, C. R. Ramakrishnan, and Scott A. Smolka. Model repair for probabilistic systems. InTACAS, Lecture Notes in Computer Science, pages 326–340. Springer, 2011

2011

[11] [11]

Markov decision processes with average-value-at- risk criteria.Math

Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at- risk criteria.Math. Methods Oper. Res., 74(3):361–379, 2011

2011

[12] [12]

Tsitsiklis.Introduction to linear optimization, volume 6 ofAthena scientific optimization and computation series

Dimitris Bertsimas and John N. Tsitsiklis.Introduction to linear optimization, volume 6 ofAthena scientific optimization and computation series. Athena Scientific, 1997. 18 Y. Schnitzer et al

1997

[13] [13]

Bianco and L

A. Bianco and L. de Alfaro. Model checking of probabilistic and nondeterministic systems. In P. Thiagarajan, editor,Proc. 15th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS’95), volume 1026 ofLNCS, pages 499–513. Springer, 1995

1995

[14] [14]

Oxford University Press, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013

2013

[15] [15]

Formal verification with confidence intervals to establish quality of service properties of software systems.IEEE Trans

Radu Calinescu, Carlo Ghezzi, Kenneth Johnson, Mauro Pezzè, Yasmin Rafiq, and Giordano Tamburrelli. Formal verification with confidence intervals to establish quality of service properties of software systems.IEEE Trans. Reliab., 65(1):107–125, 2016

2016

[16] [16]

Precise parameter synthesis for stochastic biochemical systems.Acta Informatica, 54(6):589–623, 2017

Milan Ceska, Frits Dannenberg, Nicola Paoletti, Marta Kwiatkowska, and Lu- bos Brim. Precise parameter synthesis for stochastic biochemical systems.Acta Informatica, 54(6):589–623, 2017

2017

[17] [17]

C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial.Biometrika, 26(4):404–413, 1934

1934

[18] [18]

Planning with hidden parameter polynomial MDPs

Clarissa Costen, Marc Rigter, Bruno Lacerda, and Nick Hawes. Planning with hidden parameter polynomial MDPs. InAAAI, pages 11963–11971. AAAI Press, 2023

2023

[19] [19]

Poonawala, and Ufuk Topcu

Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, Ivan Pa- pusha, Hasan A. Poonawala, and Ufuk Topcu. Sequential convex programming for the efficient verification of parametric mdps. InTACAS (2), Lecture Notes in Computer Science, pages 133–150, 2017

2017

[20] [20]

Convex optimization for parameter synthesis in MDPs.IEEE Trans

Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, and Ufuk Topcu. Convex optimization for parameter synthesis in MDPs.IEEE Trans. Autom. Control., 67(12):6333–6348, 2022

2022

[21] [21]

Pérez, and Marnix Suilen

Kasper Engelen, Guillermo A. Pérez, and Marnix Suilen. Data-efficient safe policy improvement using parametric structure.CoRR, abs/2507.15532, 2025

work page arXiv 2025

[22] [22]

Regret minimization in MDPs with options without prior knowledge

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. InNIPS, pages 3166–3176, 2017

2017

[23] [23]

Gleixner, Timo Berthold, Benjamin Müller, and Stefan Weltge

Ambros M. Gleixner, Timo Berthold, Benjamin Müller, and Stefan Weltge. Three enhancements for optimization-based bound tightening.J. Glob. Optim., 67(4):731– 757, 2017

2017

[24] [24]

Synthesis for PCTL in parametric Markov decision processes

Ernst Moritz Hahn, Tingting Han, and Lijun Zhang. Synthesis for PCTL in parametric Markov decision processes. InNASA Formal Methods, Lecture Notes in Computer Science, pages 146–161. Springer, 2011

2011

[25] [25]

Generalized parameter lifting: Finer abstractions for parametric markov chains

Linus Heck, Tim Quatmann, Jip Spel, Joost-Pieter Katoen, and Sebastian Junges. Generalized parameter lifting: Finer abstractions for parametric markov chains. In ATVA, Lecture Notes in Computer Science, pages 207–230. Springer, 2025

2025

[26] [26]

Garud N. Iyengar. Robust dynamic programming.Math. Oper. Res., 30(2):257–280, 2005

2005

[27] [27]

Parameter synthesis for Markov models: covering the parameter space.Formal Methods Syst

Sebastian Junges, Erika Ábrahám, Christian Hensel, Nils Jansen, Joost-Pieter Katoen, Tim Quatmann, and Matthias Volk. Parameter synthesis for Markov models: covering the parameter space.Formal Methods Syst. Des., 62(1):181–259, 2024

2024

[28] [28]

Safety-constrained reinforcement learning for MDPs

Sebastian Junges, Nils Jansen, Christian Dehnert, Ufuk Topcu, and Joost-Pieter Katoen. Safety-constrained reinforcement learning for MDPs. InTACAS, volume 9636 ofLecture Notes in Computer Science, pages 130–146. Springer, 2016

2016

[29] [29]

Mykel Kochenderfer.Decision Making Under Uncertainty: Theory and Application. 2015. Robust Parameter Learning for Uncertain MDPs 19

2015

[30] [30]

Kwiatkowska, Gethin Norman, and David Parker

Marta Z. Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. InCAV, volume 6806 ofLecture Notes in Computer Science, pages 585–591. Springer, 2011

2011

[31] [31]

Parametric probabilistic transition systems for system design and analysis.Formal Aspects Comput., 19(1):93–109, 2007

Ruggero Lanotte, Andrea Maggiolo-Schettini, and Angelo Troina. Parametric probabilistic transition systems for system design and analysis.Formal Aspects Comput., 19(1):93–109, 2007

2007

[32] [32]

Bisimulation through probabilistic testing

Kim Guldstrand Larsen and Arne Skou. Bisimulation through probabilistic testing. InPOPL, pages 344–352. ACM Press, 1989

1989

[33] [33]

Appli- cations of second-order cone programming.Linear Algebra Appl., 284(1–3):193–228, 1998

Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. Appli- cations of second-order cone programming.Linear Algebra Appl., 284(1–3):193–228, 1998

1998

[34] [34]

Fan Long and Martin C. Rinard. Automatic patch generation by learning correct code. InPOPL, pages 298–312. ACM, 2016

2016

[35] [35]

McCormick

Garth P. McCormick. Computability of global solutions to factorable nonconvex programs: Part I - convex underestimating problems.Math. Program., 10(1):147–175, 1976

1976

[36] [36]

What are the odds? improving statistical model checking of Markov decision processes

Tobias Meggendorfer, Maximilian Weininger, and Patrick Wienhöft. What are the odds? improving statistical model checking of Markov decision processes. InProc. 2nd International Joint Conference on Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems (QEST+FORMATS’25), LNCS, pages 195–218. Springer, 2025

2025

[37] [37]

Robust control of Markov decision processes with uncertain transition matrices.Oper

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Oper. Res., 53(5):780–798, 2005

2005

[38] [38]

Wijesuriya, Sofie Haesaert, and Alessandro Abate

Elizabeth Polgreen, Viraj B. Wijesuriya, Sofie Haesaert, and Alessandro Abate. Data-efficient Bayesian verification of parametric Markov chains. InQEST, volume 9826 ofLecture Notes in Computer Science, pages 35–51. Springer, 2016

2016

[39] [39]

Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming

Martin L. Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994

1994

[40] [40]

Parameter synthesis for Markov models: Faster than ever

Tim Quatmann, Christian Dehnert, Nils Jansen, Sebastian Junges, and Joost-Pieter Katoen. Parameter synthesis for Markov models: Faster than ever. InATVA, volume 9938 ofLecture Notes in Computer Science, pages 50–67, 2016

2016

[41] [41]

Certifiably robust policies for uncertain parametric environments

Yannik Schnitzer, Alessandro Abate, and David Parker. Certifiably robust policies for uncertain parametric environments. InTACAS (3), volume 15698 ofLecture Notes in Computer Science, pages 63–83. Springer, 2025

2025

[42] [42]

Efficient solution and learning of robust factored mdps.AAAI, 2026

Yannik Schnitzer, Alessandro Abate, and David Parker. Efficient solution and learning of robust factored mdps.AAAI, 2026

2026

[43] [43]

Scott, Matthew D

Joseph K. Scott, Matthew D. Stuber, and Paul I. Barton. Generalized mccormick relaxations.J. Glob. Optim., 51(4):569–606, 2011

2011

[44] [44]

Probabilistic noninterference through weak probabilistic bisimula- tion

Geoffrey Smith. Probabilistic noninterference through weak probabilistic bisimula- tion. InCSFW, pages 3–13. IEEE Computer Society, 2003

2003

[45] [45]

Strehl and Michael L

Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model- based interval estimation. InICML, volume 119 ofACM International Conference Proceeding Series, pages 856–863. ACM, 2005

2005

[46] [46]

Nimbro explorer: Semiautonomous exploration and mobile manipulation in rough terrain.J

Jörg Stückler, Max Schwarz, Mark Schadler, Angeliki Topalidou-Kyniazopoulou, and Sven Behnke. Nimbro explorer: Semiautonomous exploration and mobile manipulation in rough terrain.J. Field Robotics, 33(4):411–430, 2016

2016

[47] [47]

Badings, Eline M

Marnix Suilen, Thom S. Badings, Eline M. Bovy, David Parker, and Nils Jansen. Robust markov decision processes: A place where AI and formal methods meet. In Principles of Verification (3), volume 15262 ofLecture Notes in Computer Science, pages 126–154. Springer, 2024. 20 Y. Schnitzer et al

2024

[48] [48]

Simão, David Parker, and Nils Jansen

Marnix Suilen, Thiago D. Simão, David Parker, and Nils Jansen. Robust anytime learning of Markov decision processes. InNeurIPS, 2022

2022

[49] [49]

Leslie G. Valiant. A theory of the learnable. InSTOC, pages 436–445. ACM, 1984

1984

[50] [50]

Vanderbei.Linear programming - foundations and extensions, volume 4 of Kluwer international series in operations research and management service

Robert J. Vanderbei.Linear programming - foundations and extensions, volume 4 of Kluwer international series in operations research and management service. Kluwer, 1998

1998

[51] [51]

Weinberger

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for thel1 deviation of the empirical distribution. Technical Report HPL-2003-97(R.1), Hewlett-Packard Laboratories, 2003

2003

[52] [52]

Robust Markov decision processes.Math

Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes.Math. Oper. Res., 38(1):153–183, 2013

2013

[53] [53]

Wolff, Ufuk Topcu, and Richard M

Eric M. Wolff, Ufuk Topcu, and Richard M. Murray. Robust control of uncertain markov decision processes with temporal logic specifications. InCDC. IEEE, 2012

2012

[54] [54]

Nearly minimax optimal reinforcement learning for linear mixture markov decision processes

Dongruo Zhou, Quanquan Gu, and Csaba Szepesvári. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. InProceedings of the 34th Annual Conference on Learning Theory, volume 134, pages 4532–4576. PMLR, 2021. A Proofs We provide the formal proofs for the Theorems 1 and 2. A.1 Proof of Theorem 1 Theorem 1(Soundness of U...

2021

[55] [55]

Then, by definition of the local projections, we have P [v](s, a) ∈ P R(U) (s, a)for every( s, a) ∈S×A

PR(U) ⊇ P U.Let P [v] ∈ P U for somev ∈ U. Then, by definition of the local projections, we have P [v](s, a) ∈ P R(U) (s, a)for every( s, a) ∈S×A . Hence P[v]∈ P R(U), which showsPU ⊆ P R(U)

[56] [56]

By definition, there existsv ∈ U such that µ = P [v](s, a)

PΛ(U) ⊇ P R(U).Fix a state–action pair( s, a)and let µ∈ P R(U) (s, a). By definition, there existsv ∈ U such that µ = P [v](s, a). For each successors′ ∈S , let f = PΘ(s, a, s′) ∈Λ . Since lΛ f and uΛ f are defined as the minimum and maximum of f [v′]over allv ′ ∈ U , we have lΛ f ≤f [v] = µ(s′) ≤u Λ f. Thus µ∈ P Λ(U) (s, a). Since(s, a)was arbitrary, it ...

[57] [57]

Therefore, for every expressionf∈Λ , minimizing or maximizing over the larger setB(U )can only decrease the lower bound and increase the upper bound

PΘ(U) ⊇ P Λ(U).Since B(U )is the smallest axis-aligned hyperrectangle con- taining U, we haveU ⊆B (U ). Therefore, for every expressionf∈Λ , minimizing or maximizing over the larger setB(U )can only decrease the lower bound and increase the upper bound. In other words,lΘ f ≤l Λ f and uΘ f ≥u Λ f. Thus the interval induced by parameter-wise projection cont...

[58] [58]

PΘ(U) ̸⊆ P I.Consider two parameters θ1, θ2 ∈ [0, 1]and three transition expres- sions f1(θ) =θ 1, f 2(θ) =θ 2, f 3(θ) = 1 2 (θ1 +θ 2), arising from distinct state–action pairs

Incomparability ofPΘ(U) and PI.We give two counterexamples showing that neither inclusion holds in general. PΘ(U) ̸⊆ P I.Consider two parameters θ1, θ2 ∈ [0, 1]and three transition expres- sions f1(θ) =θ 1, f 2(θ) =θ 2, f 3(θ) = 1 2 (θ1 +θ 2), arising from distinct state–action pairs. Suppose the learned intervals aref1 ∈ [0,1],f 2 ∈[0,1], andf 3 ∈[ 1 2 ,...

[59] [59]

The ellipsoidal baseline induces, for each(s, a), the uncertainty set PR(E) (s, a) :={µ∈∆(S)| ∃v∈ E δ :µ(s ′) =P[v](s, a, s ′)for alls ′ ∈S}

The constantS is any a priori bound on∥u∥2, for D⊆ [0, 1]ℓ, the choiceS = √ ℓ is always sound, andλ > 0is an arbitrary ridge parameter, which we choose asλ= 10−2 to stabilise the regression. The ellipsoidal baseline induces, for each(s, a), the uncertainty set PR(E) (s, a) :={µ∈∆(S)| ∃v∈ E δ :µ(s ′) =P[v](s, a, s ′)for alls ′ ∈S}. The inner optimisation i...

work page arXiv