pith. sign in

arxiv: 2606.25797 · v1 · pith:UCSQL4BHnew · submitted 2026-06-24 · 💻 cs.AI

Confidence Sequences for Online Statistical Model Checking of Markov Decision Processes

Pith reviewed 2026-06-25 20:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords confidence sequencesstatistical model checkingMarkov decision processesonline samplingsample efficiencystatistical verificationcyber-physical systems
0
0 comments X

The pith

Confidence sequences designed for online sampling yield valid statistical guarantees for Markov decision processes while using 50 times fewer samples than prior methods on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops several confidence sequences that support sequential sampling from MDPs where decisions to stop or continue depend on the data collected so far. These sequences are constructed to handle the combination of nondeterministic choices and probabilistic transitions without relying on fixed sample sizes. The authors implement the sequences in an efficient tool and compare performance against classical union-bound approaches. A sympathetic reader would care because MDPs model systems like cyber-physical devices and biological processes where exact transition probabilities are unavailable and must be estimated from samples.

Core claim

The central claim is that the presented confidence sequences remain valid in the online setting for MDPs, deliver tighter bounds than union-bound methods, and enable an implementation that requires 50 times fewer samples on average than the previous state of the art while still producing meaningful statistical guarantees about the underlying value.

What carries the argument

Confidence sequences for online statistical model checking, which supply time-uniform concentration bounds that stay valid under adaptive stopping rules.

If this is right

  • Statistical conclusions about unknown transition probabilities in MDPs can be obtained without assuming fixed sample sizes in advance.
  • Bounds on the true value of an MDP can be tightened enough to decide properties that union-bound methods leave inconclusive.
  • Verification of cyber-physical and biological models becomes feasible under realistic sampling budgets.
  • Existing tools that rely on union bounds or risk invalid online conclusions can be replaced by the new sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sequences could be tested on related models such as stochastic games that add an adversarial player.
  • Integration into existing probabilistic model checkers might reduce the computational cost of handling parameter uncertainty.
  • The sample reduction suggests the approach could scale verification to MDPs with larger state spaces than previously practical.
  • Real-time monitoring applications might use the online property to update guarantees as new observations arrive.

Load-bearing premise

The sequences stay valid and non-vacuous when applied to MDPs that mix nondeterministic choices with probabilistic transitions and when sampling stops based on observed data.

What would settle it

Execute the new implementation on a standard MDP benchmark suite and record either an incorrect statistical bound on at least one instance or an average sample count no lower than the previous best method.

Figures

Figures reproduced from arXiv: 2606.25797 by Konstantin Kueffner, Maximilian Weininger, Patrick Wienh\"oft, Tobias Meggendorfer.

Figure 1
Figure 1. Figure 1: Comparison of our presented methods on (nearly) uniform distributions [PITH_FULL_IMAGE:figures/full_fig_p016_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example MDP to show the problem with re-using confidence budget. [PITH_FULL_IMAGE:figures/full_fig_p026_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Bounds on uniform distributions [PITH_FULL_IMAGE:figures/full_fig_p046_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Bounds on linear distributions [PITH_FULL_IMAGE:figures/full_fig_p047_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Bounds on squared distributions [PITH_FULL_IMAGE:figures/full_fig_p048_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Bounds on one-vs-all distributions [PITH_FULL_IMAGE:figures/full_fig_p049_6.png] view at source ↗
read the original abstract

Markov decision processes (MDPs) are a classic model of decision making under uncertainty, exhibiting both non-deterministic choice as well as probabilistic uncertainty. Traditionally, exact knowledge of the underlying probabilities is assumed. However, this often is unrealistic, e.g.\ when modelling cyber-physical systems or biological processes. Here, statistical methods provide a way towards obtaining meaningful guarantees. The classical approach is to gather samples in the MDP, use these to draw statistical conclusions about the transition probabilities, and from there obtain bounds on the true value; then, if these bounds are too broad, repeat. However, existing implementations of this approach are either subtly incorrect or sub-optimal, and quite often both. We present several \emph{confidence sequences}, which are specifically designed for such \enquote{online} settings, implement all of them in an efficient tool, and show their practical applicability. In particular, we show that they outperform classical \enquote{union-bound} style approaches, and overall our implementation requires 50x less samples on average than previous state of the art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes several confidence sequences specifically designed for online statistical model checking of MDPs that incorporate both nondeterministic action choices and probabilistic transitions. It implements them in an efficient tool, shows they outperform classical union-bound approaches, and claims the implementation requires 50x fewer samples on average than prior state-of-the-art methods.

Significance. If the sequences remain valid and non-vacuous under adaptive sampling and stopping, the work would offer a practical advance in sample-efficient statistical verification of MDPs for cyber-physical and biological systems. The reported 50x reduction would represent a substantial efficiency gain if supported by rigorous guarantees and reproducible experiments.

major comments (2)
  1. [theoretical construction / proof of validity] The central theoretical claim that the sequences are valid for online MDP settings is load-bearing for both correctness and the 50x sample-reduction result, yet the provided text supplies no explicit argument addressing the dependence between martingale increments and the stopping time when the stopping rule and sampling policy are functions of running value estimates that include a max over actions.
  2. [empirical evaluation] § on empirical results (or equivalent): the 50x average sample reduction is presented without visible error bars, benchmark descriptions, or controls for the specific MDP structure (nondeterminism + adaptivity), making it impossible to assess whether the bounds remain tight or whether hidden looseness from the online stopping rule is present.
minor comments (1)
  1. The abstract would be clearer if it briefly named the specific families of confidence sequences (e.g., via Ville, betting, or other constructions) rather than referring only to “several confidence sequences.”

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments on the theoretical validity and empirical presentation. We address each major comment below.

read point-by-point responses
  1. Referee: [theoretical construction / proof of validity] The central theoretical claim that the sequences are valid for online MDP settings is load-bearing for both correctness and the 50x sample-reduction result, yet the provided text supplies no explicit argument addressing the dependence between martingale increments and the stopping time when the stopping rule and sampling policy are functions of running value estimates that include a max over actions.

    Authors: We agree that the manuscript would be strengthened by an explicit argument addressing the dependence between martingale increments and the stopping time under adaptive policies that depend on running value estimates (including the max over actions). The confidence sequences are constructed via martingale methods intended to remain valid under optional stopping and adaptive sampling, but we acknowledge the need for a dedicated clarification. In the revision we will add a subsection providing a proof outline that the relevant processes remain martingales (or supermartingales) despite the dependence induced by the policy and the max operator, thereby confirming validity of the sequences and the sample-efficiency claims. revision: yes

  2. Referee: [empirical evaluation] § on empirical results (or equivalent): the 50x average sample reduction is presented without visible error bars, benchmark descriptions, or controls for the specific MDP structure (nondeterminism + adaptivity), making it impossible to assess whether the bounds remain tight or whether hidden looseness from the online stopping rule is present.

    Authors: We agree that the empirical section would benefit from additional detail to allow assessment of tightness and the effect of the online stopping rule. In the revision we will add error bars to the reported sample-reduction figures, expand the benchmark descriptions to explicitly characterize the nondeterminism and adaptivity of each MDP, and include controls or ablations that isolate the contribution of the online stopping rule versus fixed-sample baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on independent implementation and testing

full rationale

The paper introduces confidence sequences for online model checking in MDPs and reports an empirical result (50x fewer samples than prior art) obtained by implementing and testing the sequences. No equations, fitted parameters, or self-citations appear in the abstract that would reduce any claimed prediction or uniqueness result to a definition or input by construction. The derivation chain is therefore self-contained against external benchmarks; the performance metric is measured rather than defined in terms of itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. Standard assumptions of statistical model checking (i.i.d. samples, bounded rewards) are implicitly relied upon but not enumerated.

pith-pipeline@v0.9.1-grok · 5724 in / 965 out tokens · 15395 ms · 2026-06-25T20:05:02.950937+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 13 canonical work pages

  1. [1]

    In: COLT

    Agarwal, A., Kakade, S.M., Yang, L.F.: Model-based reinforcement learning with a generative model is minimax optimal. In: COLT. Proceedings of Machine Learning Research, vol. 125, pp. 67–83. PMLR (2020)

  2. [2]

    Agarwal, C., Guha, S., Křetínsk` y, J., Muruganandham, P.: PAC statistical model checking of mean payoff in discrete-and continuous-time MDP. In: CAV. pp. 3–25. Springer (2022)

  3. [3]

    In: AAAI

    Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: AAAI. pp. 2669–2678. AAAI Press (2018)

  4. [4]

    In: CAV, Part I

    Ashok, P., Kretínský, J., Weininger, M.: PAC statistical model checking for Markov decision processes and stochastic games. In: CAV, Part I. LNCS, vol. 11561, pp. 497–519. Springer (2019)

  5. [5]

    In: International Conference on Quantitative Evaluation of Systems

    Bacci, G., Ingólfsdóttir, A., Larsen, K.G., Reynouard, R.: An mm algorithm to estimate parameters in continuous-time markov chains. In: International Conference on Quantitative Evaluation of Systems. pp. 82–100. Springer (2023)

  6. [6]

    In: AAAI

    Badings, T.S., Romao, L., Abate, A., Jansen, N.: Probabilities are not enough: For- mal controller synthesis for stochastic dynamical models with epistemic uncertainty. In: AAAI. pp. 14701–14710. AAAI Press (2023)

  7. [7]

    Badings, T.S., Romao, L., Abate, A., Parker, D., Poonawala, H.A., Stoelinga, M., Jansen, N.: Robust control for dynamical systems with non-gaussian noise via formal abstractions. J. Artif. Intell. Res.76, 341–391 (2023)

  8. [8]

    Badings, T.S., Simão, T.D., Suilen, M., Jansen, N.: Decision-making under uncer- tainty: beyond probabilities. Int. J. Softw. Tools Technol. Transf.25(3), 375–391 (2023)

  9. [9]

    In: NFM, Lecture Notes in Computer Science, vol

    Baier, C., Dubslaff, C., Wienhöft, P., Kiebel, S.J.: Strategy synthesis in Markov decision processes under limited sampling access. In: NFM, Lecture Notes in Computer Science, vol. 13903, pp. 86–103. Springer (2023)

  10. [10]

    In: NASA Formal Methods Symposium

    Baier, C., Dubslaff, C., Wienhöft, P., Kiebel, S.J.: Strategy synthesis in markov decision processes under limited sampling access. In: NASA Formal Methods Symposium. pp. 86–103. Springer (2023)

  11. [11]

    MIT Press (2008),https: //mitpress.mit.edu/books/principles-model-checking

    Baier, C., Katoen, J.: Principles of model checking. MIT Press (2008),https: //mitpress.mit.edu/books/principles-model-checking

  12. [12]

    In: CAV (2)

    Bazille,H.,Genest,B.,Jégourel,C.,Sun,J.:GlobalPACboundsforlearningdiscrete time Markov chains. In: CAV (2). Lecture Notes in Computer Science, vol. 12225, pp. 304–326. Springer (2020).https://doi.org/10.1007/978-3-030-53291-8_17

  13. [13]

    John Wiley & Sons (2017)

    Billingsley, P.: Probability and measure. John Wiley & Sons (2017)

  14. [14]

    Cambridge university press (2004)

    Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge university press (2004)

  15. [15]

    Artificial Intelligence121(1-2), 31–47 (2000)

    Brafman, R.I., Tennenholtz, M.: A near-optimal polynomial time algorithm for learning in certain classes of stochastic games. Artificial Intelligence121(1-2), 31–47 (2000)

  16. [16]

    TheoretiCS4(2025)

    Brázdil, T., Chatterjee, K., Chmelik, M., Forejt, V., Kretínský, J., Kwiatkowska, M., Meggendorfer, T., Parker, D., Ujma, M.: Learning algorithms for verification of markov decision processes. TheoretiCS4(2025)

  17. [17]

    In: ATVA

    Brázdil, T., Chatterjee, K., Chmelik, M., Forejt, V., Křetínský, J., Kwiatkowska, M.Z., Parker, D., Ujma, M.: Verification of Markov decision processes using learning algorithms. In: ATVA. pp. 98–114. Springer (2014).https://doi.org/10.1007/ 978-3-319-11936-6_8

  18. [18]

    Statistical Science16(2), 101 – 133 (2001).https://doi.org/10.1214/ss/ 1009213286 20 K

    Brown, L.D., Cai, T.T., DasGupta, A.: Interval Estimation for a Binomial Propor- tion. Statistical Science16(2), 101 – 133 (2001).https://doi.org/10.1214/ss/ 1009213286 20 K. Kueffner et al

  19. [19]

    In: TACAS (to appear, preprint available athttps://arxiv.org/pdf/2411.00559)

    Budde, C.E., Hartmanns, A., Meggendorfer, T., Weininger, M., Wienhöft, P.: Sound statistical model checking for probabilities and expected rewards. In: TACAS (to appear, preprint available athttps://arxiv.org/pdf/2411.00559). Lecture Notes in Computer Science, Springer (2025)

  20. [20]

    In: 25 Years of Model Check- ing

    Chatterjee, K., Henzinger, T.A.: Value iteration. In: 25 Years of Model Check- ing. Lecture Notes in Computer Science, vol. 5000, pp. 107–138. Springer (2008). https://doi.org/10.1007/978-3-540-69850-0_7, https://doi.org/10. 1007/978-3-540-69850-0_7

  21. [21]

    Biometrika26(4), 404–413 (1934)

    Clopper, C.J., Pearson, E.S.: The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika26(4), 404–413 (1934)

  22. [22]

    In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems

    Daca, P., Henzinger, T.A., Křetínsk` y, J., Petrov, T.: Faster statistical model checking for unbounded temporal properties. In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems. pp. 112–129. Springer (2016)

  23. [23]

    Proceedings of the National Academy of Sciences58(1), 66–68 (1967)

    Darling, D.A., Robbins, H.: Confidence sequences for mean, variance, and median. Proceedings of the National Academy of Sciences58(1), 66–68 (1967)

  24. [24]

    In: International Symposium on Automated Technology for Verification and Analysis

    El Rabih, D., Pekergin, N.: Statistical model checking using perfect simulation. In: International Symposium on Automated Technology for Verification and Analysis. pp. 120–134. Springer (2009)

  25. [25]

    The American Statistician64(3), 242–249 (2010).https://doi.org/10.1198/tast.2010.09140

    Frey, J.: Fixed-width sequential confidence intervals for a proportion. The American Statistician64(3), 242–249 (2010).https://doi.org/10.1198/tast.2010.09140

  26. [26]

    In: Robotics: Science and Systems (2014).https: //doi.org/10.15607/RSS.2014.X.039

    Fu, J., Topcu, U.: Probably approximately correct MDP learning and control with temporal logic constraints. In: Robotics: Science and Systems (2014).https: //doi.org/10.15607/RSS.2014.X.039

  27. [27]

    Artificial Intelligence122(1-2), 71–109 (2000)

    Givan, R., Leach, S., Dean, T.: Bounded-parameter Markov decision processes. Artificial Intelligence122(1-2), 71–109 (2000)

  28. [28]

    Oxford university press (2020)

    Grimmett, G., Stirzaker, D.: Probability and random processes. Oxford university press (2020)

  29. [29]

    Haddad, S., Monmege, B.: Interval iteration algorithm for MDPs and IMDPs. Theor. Comput. Sci.735, 111–131 (2018).https://doi.org/10.1016/j.tcs.2016. 12.003

  30. [30]

    In: TACAS (1)

    Hartmanns, A., Junges, S., Quatmann, T., Weininger, M.: A practitioner’s guide to MDP model checking algorithms. In: TACAS (1). LNCS, vol. 13993, pp. 469–488. Springer (2023).https://doi.org/10.1007/978-3-031-30823-9_24

  31. [31]

    STTT (to appear), Preprint: https://sjunges.github.io/files/revised_practitioners_guide.pdf (2026)

    Hartmanns, A., Junges, S., Quatmann, T., Weininger, M.: The revised practi- tioner’s guide to MDP model checking algorithms. STTT (to appear), Preprint: https://sjunges.github.io/files/revised_practitioners_guide.pdf (2026)

  32. [32]

    In: AAAI

    HasanzadeZonuzy, A., Bura, A., Kalathil, D.M., Shakkottai, S.: Learning with safety constraints: Sample complexity of reinforcement learning for constrained MDPs. In: AAAI. pp. 7667–7674. AAAI Press (2021)

  33. [33]

    He, R., Jennings, P., Basu, S., Ghosh, A.P., Wu, H.: A bounded statistical approach for model checking of unbounded until properties. In: ASE. pp. 225–234. ACM (2010)

  34. [34]

    In: Inter- national Conference on Runtime Verification

    Henzinger, T.A., Kueffner, K., Singh, V., Sun, I.: Alignment monitoring. In: Inter- national Conference on Runtime Verification. pp. 140–159. Springer (2025)

  35. [35]

    Journal of the American Statistical Association58(301), 13–30 (1963)

    Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association58(301), 13–30 (1963)

  36. [36]

    Probability Surveys17, 257–317 (2020)

    Howard, S.R., Ramdas, A., McAuliffe, J., Sekhon, J.: Time-uniform chernoff bounds via nonnegative supermartingales. Probability Surveys17, 257–317 (2020)

  37. [37]

    The Annals of Statistics49(2), 1055–1080 (2021) Confidence Sequences for Online MDP-SMC 21

    Howard, S.R., Ramdas, A., McAuliffe, J., Sekhon, J.: Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics49(2), 1055–1080 (2021) Confidence Sequences for Online MDP-SMC 21

  38. [38]

    In: Conference on Learning Theory

    Jamieson, K., Malloy, M., Nowak, R., Bubeck, S.: lil’ucb: An optimal exploration algorithm for multi-armed bandits. In: Conference on Learning Theory. pp. 423–439. PMLR (2014)

  39. [39]

    In: ICML

    Jin, C., Krishnamurthy, A., Simchowitz, M., Yu, T.: Reward-free exploration for reinforcement learning. In: ICML. Proceedings of Machine Learning Research, vol. 119, pp. 4870–4879. PMLR (2020)

  40. [40]

    Fundamenta Mathematicae6(1), 9–20 (1924)

    Khintchine, A.: über einen satz der wahrscheinlichkeitsrechnung. Fundamenta Mathematicae6(1), 9–20 (1924)

  41. [41]

    Structure-aware lower bounds and broadening the horizon of tractability for QBF

    Kretínský, J., Meggendorfer, T., Weininger, M.: Stopping criteria for value itera- tion on stochastic games with quantitative objectives. In: LICS. pp. 1–14. IEEE (2023). https://doi.org/10.1109/LICS56636.2023.10175771, https://doi.org/ 10.1109/LICS56636.2023.10175771

  42. [42]

    Kretínský, J., Michel, F., Michel, L., Pérez, G.A.: Finite-memory near-optimal learning for Markov decision processes with long-run average reward. In: UAI. Proceedings of Machine Learning Research, vol. 124, pp. 1149–1158. AUAI Press (2020)

  43. [43]

    In: CONCUR

    Křetínský, J., Pérez, G.A., Raskin, J.F.: Learning-based mean-payoff optimization in an unknown MDP under omega-regular constraints. In: CONCUR. pp. 8:1–8:18. Dagstuhl (2018)

  44. [44]

    In: International conference on computer aided verification

    Kwiatkowska, M., Norman, G., Parker, D.: Prism 4.0: Verification of probabilistic real-time systems. In: International conference on computer aided verification. pp. 585–591. Springer (2011)

  45. [45]

    In: QEST

    Kwiatkowska, M.Z., Norman, G., Parker, D.: The PRISM benchmark suite. In: QEST. pp. 203–204. IEEE Computer Society (2012).https://doi.org/10.1109/ QEST.2012.14

  46. [46]

    In: Rein- forcement Learning, Adaptation, Learning, and Optimization, vol

    Lange, S., Gabel, T., Riedmiller, M.A.: Batch reinforcement learning. In: Rein- forcement Learning, Adaptation, Learning, and Optimization, vol. 12, pp. 45–73. Springer (2012)

  47. [47]

    Cambridge University Press (2020)

    Lattimore, T., Szepesvári, C.: Bandit algorithms. Cambridge University Press (2020)

  48. [48]

    Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. pp. 541–548. AAAI Press (1999)

  49. [49]

    In: COLT (2009),http://www.cs.mcgill.ca/%7Ecolt2009/papers/012

    Maurer, A., Pontil, M.: Empirical bernstein bounds and sample-variance penal- ization. In: COLT (2009),http://www.cs.mcgill.ca/%7Ecolt2009/papers/012. pdf#page=1

  50. [50]

    In: ATVA

    Meggendorfer, T.: PET - A partial exploration tool for probabilistic verification. In: ATVA. Lecture Notes in Computer Science, vol. 13505, pp. 320–326. Springer (2022)

  51. [51]

    In: CAV (3)

    Meggendorfer, T., Weininger, M.: Playing games with your PET: extending the partial exploration tool to stochastic games. In: CAV (3). Lecture Notes in Computer Science, vol. 14683, pp. 359–372. Springer (2024)

  52. [52]

    In: AAAI

    Meggendorfer, T., Weininger, M., Wienhöft, P.: Solving robust markov deci- sion processes: Generic, reliable, efficient. In: AAAI. pp. 26631–26641. AAAI Press (2025). https://doi.org/10.1609/AAAI.V39I25.34865, https://doi.org/ 10.1609/aaai.v39i25.34865

  53. [53]

    In: QEST+FORMATS

    Meggendorfer, T., Weininger, M., Wienhöft, P.: What are the odds? improving statistical model checking of markov decision processes. In: QEST+FORMATS. Lecture Notes in Computer Science, vol. 16143, pp. 195–218. Springer (2025). https://doi.org/10.1007/978-3-032-05792-1_11, https://doi.org/10.1007/ 978-3-032-05792-1_11 22 K. Kueffner et al

  54. [54]

    In: AAAI

    Perez, M., Somenzi, F., Trivedi, A.: A PAC learning algorithm for LTL and omega- regular objectives in mdps. In: AAAI. pp. 21510–21517. AAAI Press (2024).https: //doi.org/10.1609/AAAI.V38I19.30148

  55. [55]

    John Wiley and Sons (1994)

    Puterman, M.L.: Markov decision processes: Discrete stochastic dynamic program- ming. John Wiley and Sons (1994)

  56. [56]

    Foundations and Trends® in Statistics1(1-2), 1–390 (2025)

    Ramdas, A., Wang, R.: Hypothesis testing with e-values. Foundations and Trends® in Statistics1(1-2), 1–390 (2025)

  57. [57]

    In: International Conference on Computer Aided Verification

    Sen, K., Viswanathan, M., Agha, G.: Statistical model checking of black-box probabilistic systems. In: International Conference on Computer Aided Verification. pp. 202–215. Springer (2004)

  58. [58]

    In: ICML

    Shi, L., Li, G., Wei, Y., Chen, Y., Chi, Y.: Pessimistic q-learning for offline rein- forcement learning: Towards optimal sample complexity. In: ICML. Proceedings of Machine Learning Research, vol. 162, pp. 19967–20025. PMLR (2022)

  59. [59]

    In: NeurIPS (2022)

    Suilen, M., Simão, T.D., Parker, D., Jansen, N.: Robust anytime learning of Markov decision processes. In: NeurIPS (2022)

  60. [60]

    Tsybakov.Introduction to Nonparametric Estimation

    Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer Series in Statistics, Springer, New York, NY (2009).https://doi.org/10.1007/b13794

  61. [61]

    In: COLT

    Wagenmaker, A.J., Simchowitz, M., Jamieson, K.: Beyond no regret: Instance- dependent PAC reinforcement learning. In: COLT. Proceedings of Machine Learning Research, vol. 178, pp. 358–418. PMLR (2022)

  62. [62]

    Wainwright, M.J.: High-dimensional statistics: A non-asymptotic viewpoint, vol. 48. Cambridge university press (2019)

  63. [63]

    In: Breakthroughs in statistics: Foundations and basic theory, pp

    Wald, A.: Sequential tests of statistical hypotheses. In: Breakthroughs in statistics: Foundations and basic theory, pp. 256–298. Springer (1992)

  64. [64]

    Journal of the Royal Statistical Society Series B: Statistical Methodology 86(1), 1–27 (2024)

    Waudby-Smith, I., Ramdas, A.: Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology 86(1), 1–27 (2024)

  65. [65]

    Weininger, M., Grover, K., Misra, S., Kretínský, J.: Guaranteed trade-offs in dynamic information flow tracking games. In: CDC. pp. 3786–3793. IEEE (2021). https://doi.org/10.1109/CDC45484.2021.9683447

  66. [66]

    IEEE Trans

    Wen, M., Topcu, U.: Probably approximately correct learning in adversarial envi- ronments with temporal logic specifications. IEEE Trans. Autom. Control.67(10), 5055–5070 (2022)

  67. [67]

    Wienhöft, P.: Statistical model checking with robust markov decision processes (2025),https://nbn-resolving.org/urn:nbn:de:bsz:14-qucosa2-1000969

  68. [68]

    In: 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023

    Wienhöft, P., Suilen, M., Simão, T.D., Dubslaff, C., Baier, C., Jansen, N.: More for less: Safe policy improvement with stronger performance guarantees. In: 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023. pp. 4406–4415. International Joint Conferences on Artificial Intelligence (IJCAI) (2023)

  69. [69]

    In: SBMF

    Younes, H.L.S., Clarke, E.M., Zuliani, P.: Statistical verification of probabilistic properties with unbounded until. In: SBMF. Lecture Notes in Computer Science, vol. 6527, pp. 144–160. Springer (2010)

  70. [70]

    seen” the entire system, can apply graph-based analysis, and switch to the grey box approach, using the remaining “unspent

    Younes, H.L.S., Simmons, R.G.: Probabilistic verification of discrete event systems using acceptance sampling. In: CAV. pp. 223–235. Springer (2002) Confidence Sequences for Online MDP-SMC 23 A Assumptions on Sampling Access and Knowledge In this appendix, we describe the different settings for the MDP-SMC problem that have appeared in the literature. The...

  71. [71]

    failure event

    We show this using a standard argument from information theory, i.e., we derive a Le Cam-style lower bound, which reduces this problem to binary hypothesis testing. Towards that end let p− := 1 2 −w and p+ := 1 2 + w. Notice that sincep+ −p − = 2w > w, no interval of width at mostw can contain bothp− and p+ simultaneously. We construct a hypothesis test t...