Convergence Rates of Posterior Distributions in Markov Decision Process

Eric Laber; Zhen Li

arxiv: 1907.09083 · v1 · pith:A4GKWIDZnew · submitted 2019-07-22 · 🧮 math.ST · cs.LG· math.OC· stat.TH

Convergence Rates of Posterior Distributions in Markov Decision Process

Zhen Li , Eric Laber This is my paper

Pith reviewed 2026-05-24 18:18 UTC · model grok-4.3

classification 🧮 math.ST cs.LGmath.OCstat.TH

keywords posterior convergence ratesMarkov decision processesThompson samplingregret boundsMarkov gamesBayesian inference

0 comments

The pith

Posterior distributions over MDP dynamics converge at explicit rates even when the parameter space is infinite-dimensional.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives convergence rates for the posterior distributions on the transition dynamics of Markov decision processes in both episodic and continuous settings. These rates apply to arbitrary state and action spaces, including cases where the parameter space is infinite-dimensional. The work also establishes rates for the posterior on the expected cumulative reward under fixed and optimal policies, as well as for regret bounds. It introduces a variant of Thompson sampling that attains these posterior rates together with a regret guarantee and extends the results to Markov games.

Core claim

We show the convergence rates of posterior distributions of the model dynamics in a MDP for both episodic and continuous tasks. The theoretical results hold for general state and action space and the parameter space of the dynamics can be infinite dimensional. Moreover, we show the convergence rates of posterior distributions of the mean accumulative reward under a fixed or the optimal policy and of the regret bound. A variant of Thompson sampling algorithm is proposed which provides both posterior convergence rates for the dynamics and the regret-type bound. Then the previous results are extended to Markov games.

What carries the argument

Posterior convergence rates for the transition dynamics in MDPs under Bayesian updating, applicable to general state-action spaces and infinite-dimensional parameters.

If this is right

Convergence rates apply to both episodic and continuous MDPs with general state and action spaces.
Rates extend to the posterior on mean cumulative reward under fixed or optimal policies.
The Thompson sampling variant achieves both posterior convergence rates and a regret-type bound.
The results extend to Markov games.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the rates hold, Bayesian approaches to reinforcement learning would gain explicit uncertainty quantification guarantees in nonparametric settings.
The Thompson sampling variant could be tested for empirical performance in high-dimensional control tasks where standard methods lack rate guarantees.
Similar posterior rate derivations might apply to other sequential decision models such as partially observable MDPs.

Load-bearing premise

Suitable priors exist and MDP regularity conditions hold to ensure posterior convergence even in infinite-dimensional parameter spaces.

What would settle it

An MDP example with infinite-dimensional dynamics where the posterior fails to converge at the stated rate under any prior satisfying the regularity conditions would falsify the claim.

read the original abstract

In this paper, we show the convergence rates of posterior distributions of the model dynamics in a MDP for both episodic and continuous tasks. The theoretical results hold for general state and action space and the parameter space of the dynamics can be infinite dimensional. Moreover, we show the convergence rates of posterior distributions of the mean accumulative reward under a fixed or the optimal policy and of the regret bound. A variant of Thompson sampling algorithm is proposed which provides both posterior convergence rates for the dynamics and the regret-type bound. Then the previous results are extended to Markov games. Finally, we show numerical results with three simulation scenarios and conclude with discussions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims posterior convergence rates for MDP dynamics, rewards, and regret in fully general state/action spaces including infinite-dimensional parameters, plus a Thompson sampling variant and extension to games, but the abstract supplies no assumptions or proof outlines.

read the letter

The central claim is that posterior distributions over transition kernels contract at explicit rates even when the parameter space is infinite-dimensional, and that these rates carry over to value functions under fixed or optimal policies, to regret, and to a Thompson sampling procedure. The work also extends the results to Markov games and includes three simulation examples. That scope is the main thing to know: it aims to give Bayesian nonparametric support for RL in settings far beyond the usual finite or parametric cases. If the derivations hold, it would be useful for people building theory around posterior sampling in sequential decisions. The numerical results are a small positive, as they at least show the claims are not purely formal. The soft spot is exactly the one flagged in the stress-test note. Standard results on posterior contraction in infinite dimensions require the prior to put positive mass on Kullback-Leibler neighborhoods of the true kernel plus suitable continuity or ergodicity conditions on the MDP so that the posterior can be controlled. The abstract gives no indication these are stated or checked for arbitrary state and action spaces. Without them the rates do not follow from existing theory. Because the full text was not supplied here I cannot check whether the paper actually lists and verifies the conditions, but the abstract alone leaves the load-bearing step unaddressed. This is the kind of paper that belongs in a reading group only if someone is already working on Bayesian RL theory and is willing to read the proofs carefully. It is not ready for citation until the regularity conditions are confirmed. A serious editor should send it to referees rather than desk-reject, because the topic is relevant and the claimed generality is worth checking, even if heavy revision is likely.

Referee Report

3 major / 2 minor

Summary. The manuscript claims to establish convergence rates for the posterior distributions of MDP transition dynamics (model parameters) in both episodic and continuous-time settings. These rates are asserted to hold for arbitrary state and action spaces, including cases where the parameter space is infinite-dimensional. The paper further claims posterior convergence rates for the mean cumulative reward under a fixed policy or the optimal policy, as well as for regret bounds. A variant of Thompson sampling is proposed that simultaneously achieves posterior convergence and a regret-type bound. Results are extended to Markov games, and the claims are supported by three simulation scenarios.

Significance. If the claimed rates can be rigorously established under explicit conditions, the work would provide nontrivial Bayesian nonparametric guarantees for posterior contraction in general MDPs and for derived quantities such as value functions and regret. Such results would be of interest to both the theoretical RL and Bayesian nonparametrics communities, particularly for settings beyond finite or parametric models.

major comments (3)

[Abstract / §1] The central claims (abstract and §1) assert posterior contraction rates for infinite-dimensional parameter spaces of the dynamics without stating the required regularity conditions on the prior (e.g., positive mass on Kullback-Leibler neighborhoods of the true kernel) or on the MDP (identifiability, continuity/ergodicity of the transition kernel, and suitable test functions). These conditions are load-bearing; without them the rates do not follow from standard Bayesian nonparametric theory.
[§4–5] Theorems on convergence rates for the mean accumulative reward and regret (likely §4–5) derive these quantities from the dynamics posterior but do not verify that the policy-induced measures remain absolutely continuous or that the value-function map is continuous in the topology used for the dynamics posterior. This step is necessary for the rates to transfer and is not addressed.
[§6] The Thompson-sampling variant (proposed in §6) is claimed to inherit both posterior convergence and a regret bound, yet the proof sketch does not quantify the interaction between the posterior sampling step and the exploration schedule under infinite-dimensional parameters.

minor comments (2)

[§2] Notation for the transition kernel and the parameter space is introduced without a dedicated preliminary section; readers must infer definitions from the theorem statements.
[§7] The numerical experiments (three scenarios) report empirical posterior contraction but do not include diagnostics for the infinite-dimensional case or comparison against the theoretical rates.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and valuable suggestions. We address each of the major comments below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract / §1] The central claims (abstract and §1) assert posterior contraction rates for infinite-dimensional parameter spaces of the dynamics without stating the required regularity conditions on the prior (e.g., positive mass on Kullback-Leibler neighborhoods of the true kernel) or on the MDP (identifiability, continuity/ergodicity of the transition kernel, and suitable test functions). These conditions are load-bearing; without them the rates do not follow from standard Bayesian nonparametric theory.

Authors: We agree that explicit statement of these conditions is necessary for clarity. The theorems in Section 3 of the manuscript do impose such conditions, including the prior placing positive mass on Kullback-Leibler neighborhoods and assumptions ensuring identifiability and ergodicity of the kernels. However, these are not highlighted in the abstract and introduction. We will revise the abstract and §1 to include a concise statement of the key regularity conditions required for the rates. revision: yes
Referee: [§4–5] Theorems on convergence rates for the mean accumulative reward and regret (likely §4–5) derive these quantities from the dynamics posterior but do not verify that the policy-induced measures remain absolutely continuous or that the value-function map is continuous in the topology used for the dynamics posterior. This step is necessary for the rates to transfer and is not addressed.

Authors: This is a valid point regarding the transfer of rates. The proofs assume continuity of the value function in the relevant topology and absolute continuity of the induced measures, but these are not explicitly verified or stated as lemmas. We will add a new proposition in Section 4 establishing the continuity of the value-function map under our assumptions and confirm absolute continuity for the policy-induced measures. This constitutes a partial revision since the core arguments exist but require explicit presentation. revision: partial
Referee: [§6] The Thompson-sampling variant (proposed in §6) is claimed to inherit both posterior convergence and a regret bound, yet the proof sketch does not quantify the interaction between the posterior sampling step and the exploration schedule under infinite-dimensional parameters.

Authors: We partially agree. The proof in §6 combines the posterior convergence rate with a regret analysis, but the interaction term under infinite-dimensional parameters is not quantified in detail. We will expand the proof sketch to include explicit bounds on this interaction, showing how the sampling step interacts with the exploration schedule. This will be incorporated in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: derivations rely on standard Bayesian nonparametric theory and MDP regularity conditions

full rationale

The paper establishes posterior convergence rates for MDP transition kernels (and derived quantities such as value functions and regret) under general state/action spaces and possibly infinite-dimensional parameter spaces. These results invoke the existence of suitable priors that place positive mass on Kullback-Leibler neighborhoods together with standard identifiability, continuity, and ergodicity conditions on the MDP; the abstract and claimed extensions (Thompson sampling variant, Markov games) do not reduce any claimed rate to a fitted parameter, a self-citation chain, or a definitional tautology. No load-bearing step is shown to be equivalent to its own inputs by construction, and the central claims remain independent of the paper's own fitted values or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; no details on priors, regularity conditions, or modeling choices are given.

pith-pipeline@v0.9.0 · 5627 in / 1154 out tokens · 28767 ms · 2026-05-24T18:18:46.064783+00:00 · methodology

Convergence Rates of Posterior Distributions in Markov Decision Process

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)