arxiv: 2602.17375 · v2 · submitted 2026-02-19 · 💻 cs.LG

MDP Planning as Policy Inference

David Tolpin

Pith reviewed 2026-05-15 21:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords MDP planning · Bayesian inference · policy optimization · variational sequential Monte Carlo · Thompson sampling · reinforcement learning · posterior over policies · stochastic control

The pith

Episodic MDP planning reduces to Bayesian inference over policies whose posterior modes are the return-maximizing solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats a policy as a latent variable in an episodic Markov decision process and assigns it an unnormalized probability of optimality that rises monotonically with expected return. The resulting posterior therefore places its highest probability mass on policies that maximize return, while the spread of the posterior captures uncertainty about which behavior is optimal. To compute this posterior in discrete domains the authors adapt variational sequential Monte Carlo, adding a sweep that keeps the policy consistent across revisited states and coupling the transition randomness across particles so simulator noise does not confound the estimate. Acting is performed by drawing a policy from the posterior and executing it, which induces a stochastic controller equivalent to Thompson sampling rather than entropy-regularized search. Experiments on grid worlds, Blackjack, Triangle Tireworld, and Academic Advising illustrate the structure of the inferred policy distributions and contrast the resulting behavior with discrete Soft Actor-Critic.

Core claim

By casting episodic MDP planning as Bayesian inference over policies, where each policy receives an unnormalized probability of optimality that is monotone in its expected return, the posterior distribution over policies has modes that coincide exactly with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. Approximation of the posterior for deterministic policies under stochastic dynamics is performed by an adapted variational sequential Monte Carlo procedure that enforces policy consistency across revisited states and couples transition randomness across particles to eliminate confounding from simulator noise. Control is obtained by drawing

What carries the argument

The posterior over policies induced by an unnormalized probability of optimality that is monotone in expected return, approximated via adapted variational sequential Monte Carlo with a policy-consistency sweep and coupled particle transitions.
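
The mode property can be written out in one line. The sketch below assumes a finite set of deterministic policies \Pi, a uniform prior over \Pi, and a positive, strictly increasing map f; the symbols \Pi and f are notation introduced here for illustration and are not taken from the paper. The choice f = \exp corresponds to a log-weight equal to the expected return.

```latex
\tilde p(\pi) = f\bigl(J(\pi)\bigr), \qquad
p(\pi) = \frac{\tilde p(\pi)}{\sum_{\pi' \in \Pi} \tilde p(\pi')}, \qquad
J(\pi_1) > J(\pi_2) \iff p(\pi_1) > p(\pi_2)
\;\Longrightarrow\;
\arg\max_{\pi \in \Pi} p(\pi) = \arg\max_{\pi \in \Pi} J(\pi).
```

Normalization divides every weight by the same constant, so the ordering of policies by J carries over unchanged to the ordering by p. A non-uniform prior would in general break this equivalence.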

If this is right

The inferred posterior directly quantifies uncertainty over which policy is optimal.
Acting by posterior predictive sampling produces a stochastic policy through Thompson sampling without explicit entropy regularization.
The method yields policy distributions whose structure differs qualitatively and statistically from those produced by discrete Soft Actor-Critic on the same domains.
Posterior modes coincide with return-maximizing policies under the monotonicity construction.
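
To make the Thompson-sampling reading concrete, here is a minimal sketch on a toy problem with two states and two actions. The expected returns in `RETURNS`, the state and action names, and the choice of exp(J) as the weight are all illustrative assumptions; in the paper the posterior is approximated by VSMC rather than enumerated.

```python
import math
import random

ACTIONS = ["left", "right"]

# Hypothetical expected returns J(pi) for the four deterministic policies,
# keyed by the actions chosen in (s0, s1). Illustrative numbers only.
RETURNS = {
    ("left", "left"): 0.0,
    ("left", "right"): 1.0,
    ("right", "left"): 1.0,
    ("right", "right"): 2.0,
}

def posterior(returns):
    """Normalize exp(J) over the enumerated policies (assumed weight form)."""
    top = max(returns.values())  # subtract the max for numerical stability
    weights = {pi: math.exp(j - top) for pi, j in returns.items()}
    total = sum(weights.values())
    return {pi: w / total for pi, w in weights.items()}

def sample_policy(post, rng):
    """Thompson step: draw one whole policy from the posterior."""
    policies = list(post)
    return rng.choices(policies, weights=[post[p] for p in policies], k=1)[0]

def induced_action_probs(post, state_index):
    """Marginal action distribution at one state implied by policy sampling."""
    probs = {a: 0.0 for a in ACTIONS}
    for pi, p in post.items():
        probs[pi[state_index]] += p
    return probs

post = posterior(RETURNS)
rng = random.Random(0)
policy = sample_policy(post, rng)          # the policy executed for one episode
marginal_s0 = induced_action_probs(post, 0)  # stochastic controller seen at s0
```

The controller is stochastic only because the policy is redrawn from the posterior; each drawn policy is itself deterministic, and no entropy term appears anywhere.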

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same construction could incorporate prior distributions over policy classes without changing the inference machinery.
Extending the approach to continuous or high-dimensional state spaces would require replacing the discrete VSMC adaptation with a different posterior approximation technique.
The posterior dispersion offers a built-in signal for exploration that is tied directly to policy uncertainty rather than added noise.
Examining the full shape of the posterior (not only its modes) could identify sets of near-optimal policies that differ in secondary objectives.

Load-bearing premise

That an unnormalized probability of optimality defined to be monotone in expected return produces a posterior whose modes are exactly the return-maximizing policies and that the adapted VSMC procedure yields a faithful approximation without bias from simulator noise.

What would settle it

The claim would be refuted if, on a small grid-world MDP whose optimal policy is known by exhaustive search, the modes of the approximated posterior were not the policies that achieve the highest expected return.
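
A minimal sketch of such a check, under stated assumptions: a hypothetical two-state episodic MDP with a three-step horizon stands in for the grid world, policies are stationary and deterministic, and the exact posterior proportional to exp(J) stands in for the VSMC approximation that a real test would plug in. The transition table and all names are invented for illustration.

```python
import math
from itertools import product

HORIZON = 3
START = 0
STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical dynamics: TRANSITIONS[s][a] is a list of (prob, next_state, reward).
TRANSITIONS = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.5)],
        1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
}

def expected_return(policy):
    """Exact finite-horizon evaluation of a stationary policy by backward recursion."""
    value = {s: 0.0 for s in STATES}
    for _ in range(HORIZON):
        value = {
            s: sum(p * (r + value[nxt]) for p, nxt, r in TRANSITIONS[s][policy[s]])
            for s in STATES
        }
    return value[START]

# Exhaustive search over all deterministic stationary policies.
policies = [dict(zip(STATES, acts)) for acts in product(ACTIONS, repeat=len(STATES))]
returns = [expected_return(pi) for pi in policies]

# Exact posterior with weight exp(J); a real check would use the approximation.
top = max(returns)
weights = [math.exp(j - top) for j in returns]
post = [w / sum(weights) for w in weights]

best_by_return = max(range(len(policies)), key=returns.__getitem__)
mode = max(range(len(policies)), key=post.__getitem__)
```

With the exact posterior the two indices agree by construction, so any disagreement observed after swapping in an approximate posterior would point at the approximation rather than at the definition.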

Original abstract

We cast episodic Markov decision process (MDP) planning as Bayesian inference over policies. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.
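
The constraint that the sweep enforces can be illustrated with a much simpler device. The sketch below is not the paper's sweep: it is a hypothetical lazily sampled policy that commits to an action the first time a state is visited and reuses it on every revisit, so that a trajectory is always consistent with a single deterministic policy. The class name, the three-state ring, and the 0.8 success probability are assumptions made for illustration.

```python
import random

class LazyDeterministicPolicy:
    """Deterministic policy whose actions are drawn on first visit and then fixed."""

    def __init__(self, actions, rng):
        self.actions = list(actions)
        self.rng = rng
        self.table = {}  # state -> action committed so far

    def act(self, state):
        # A revisited state must return the action already committed to it;
        # otherwise the trajectory would not come from one deterministic policy.
        if state not in self.table:
            self.table[state] = self.rng.choice(self.actions)
        return self.table[state]

def rollout(policy, start, steps, rng):
    """Walk a hypothetical 3-state ring with noisy moves, recording (state, action)."""
    state, visited = start, []
    for _ in range(steps):
        action = policy.act(state)
        visited.append((state, action))
        move = action if rng.random() < 0.8 else -action  # simulator noise
        state = (state + move) % 3
    return visited

rng = random.Random(1)
policy = LazyDeterministicPolicy(actions=(-1, +1), rng=rng)
trajectory = rollout(policy, start=0, steps=20, rng=rng)
```

Under stochastic dynamics the same state can recur many times within an episode, which is why consistency has to be enforced explicitly rather than assumed.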

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper recasts episodic MDP planning as Bayesian posterior inference over deterministic policies, with modes matching optima by monotonicity in return, and adapts VSMC with consistency sweeps plus coupled transitions for discrete domains.

The central move is to treat the policy as a latent variable and assign it an unnormalized probability of optimality that increases with expected return. This makes the posterior modes coincide with return-maximizing policies by construction, while the dispersion captures uncertainty over which policy is best. Acting then uses posterior predictive sampling, which amounts to Thompson sampling over policies rather than entropy regularization.

The technical contribution is the VSMC adaptation: a sweep that enforces policy consistency on revisited states and coupling of transition randomness across particles to keep simulator noise from confounding the particles. Experiments on grid worlds, Blackjack, Triangle Tireworld, and Academic Advising compare the resulting policy distributions and behavior against discrete Soft Actor-Critic, showing where the uncertainty representation changes the induced stochastic policy. The framing is clean and the sampler details are concrete enough to reproduce on those domains.

The main limitations are scope and verification. All results stay in small episodic discrete settings, with no scaling data or continuous-state tests. The approximation quality is shown empirically rather than with bias bounds, so readers will want to check how sensitive the posterior is to the number of particles and sweep iterations. The construction itself does not appear circular; it rests on an external definition of expected return.

This work is aimed at researchers in model-based RL and planning who already work with discrete MDPs and want a probabilistic handle on policy uncertainty. It is worth sending to peer review because the core claim follows directly from the monotonicity definition, the method is implemented and compared on standard benchmarks, and the differences from soft RL are shown in the data.

Referee Report

2 major / 3 minor

Summary. The paper casts episodic MDP planning as Bayesian inference over policies, treating a policy as the latent variable assigned an unnormalized probability of optimality that is monotone in expected return; the resulting posterior has modes that coincide with return-maximizing policies, with dispersion representing uncertainty over optimal behavior. To approximate the posterior over deterministic policies in discrete domains, the authors adapt variational sequential Monte Carlo (VSMC) by adding a policy-consistency sweep across revisited states and coupling transition randomness across particles. Acting is performed via posterior predictive sampling (Thompson-sampling interpretation). The approach is evaluated on grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, with comparisons to discrete Soft Actor-Critic highlighting qualitative and statistical differences.

Significance. If the VSMC adaptation is faithful, the work supplies a parameter-free construction that directly links optimality to posterior modes without entropy regularization, together with a practical inference procedure and reproducible empirical comparisons across four domains. The Thompson-sampling interpretation of posterior predictive acting offers a distinct mechanism for stochastic control that may improve exploration under policy uncertainty. These elements constitute a coherent contribution to the planning-as-inference literature.

major comments (2)

[Section 3] The central claim that posterior modes coincide with return-maximizing policies follows directly from the monotonicity assumption on the unnormalized optimality probability; however, the manuscript should explicitly state the precise functional form of this probability (e.g., exponential or other strictly increasing map of J(π)) in the main text or an appendix to make the construction fully reproducible.
[Section 4.2] §4.2 (VSMC adaptation): the policy-consistency sweep and coupled-transition mechanism are presented as bias-reducing heuristics; without a formal bias or consistency argument (even asymptotic), it remains unclear whether the particle approximation converges to the true posterior modes under stochastic dynamics, which is load-bearing for the empirical claims.
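
One way to read the coupled-transition mechanism is as common random numbers, a standard variance-reduction device in simulation; that identification is an assumption here, not something the source states. The sketch below abstracts two hypothetical policies into per-step success probabilities (0.6 and 0.5) and compares the variance of the estimated return difference with and without shared simulator noise.

```python
import random
import statistics

def rollout(success_prob, noise):
    """Return of one episode: a step pays 1 when its noise draw is below success_prob."""
    return sum(1.0 for u in noise if u < success_prob)

def difference_samples(coupled, episodes=2000, steps=10, seed=0):
    """Sample return(A) - return(B) for two hypothetical policies A and B."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(episodes):
        noise_a = [rng.random() for _ in range(steps)]
        # Coupled: policy B sees exactly the simulator noise that policy A saw.
        noise_b = noise_a if coupled else [rng.random() for _ in range(steps)]
        diffs.append(rollout(0.6, noise_a) - rollout(0.5, noise_b))
    return diffs

coupled_diffs = difference_samples(coupled=True)
independent_diffs = difference_samples(coupled=False)
var_coupled = statistics.variance(coupled_diffs)
var_independent = statistics.variance(independent_diffs)
```

Both estimators target the same mean difference, but with shared noise the luck of the draw cancels between the two policies, so the comparison reflects the policies rather than the simulator. This illustrates variance reduction only and says nothing about the bias or consistency question the comment raises.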

minor comments (3)

[Figures 2-3] The captions of Figures 2 and 3 should explicitly note the number of particles, sweep iterations, and random seeds used, to allow direct replication of the reported policy-distribution statistics.
[Section 5] The comparison to discrete Soft Actor-Critic would be strengthened by reporting the same performance metric (e.g., average return over N episodes) with standard errors for both methods across all four domains.
[Section 3] Notation: the symbol for the unnormalized optimality probability should be introduced once and used consistently; currently it appears under multiple informal descriptions.
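
The metric requested for Section 5 could be reported with a helper along these lines. This is a minimal sketch: the function name is hypothetical and the example returns are made-up numbers, not results from the paper.

```python
import math
import statistics

def mean_and_standard_error(returns):
    """Sample mean and standard error of the mean for a list of episode returns."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two episodes to estimate a standard error")
    mean = statistics.fmean(returns)
    # statistics.stdev uses the n - 1 (sample) denominator.
    sem = statistics.stdev(returns) / math.sqrt(n)
    return mean, sem

# Hypothetical per-episode returns, not data from the paper.
example_returns = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
mean, sem = mean_and_standard_error(example_returns)
```

Applying the same function to both methods on all four domains, with the same N, would make the comparison directly readable.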

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address each major comment below.

Referee: [Section 3] The central claim that posterior modes coincide with return-maximizing policies follows directly from the monotonicity assumption on the unnormalized optimality probability; however, the manuscript should explicitly state the precise functional form of this probability (e.g., exponential or other strictly increasing map of J(π)) in the main text or an appendix to make the construction fully reproducible.

Authors: We agree that an explicit functional form improves reproducibility. In the revised manuscript we will state in Section 3 that the unnormalized optimality probability is p(π) ∝ exp(β J(π)) for temperature β > 0 (any strictly increasing map of J(π) yields the same mode property). This will be added to the main text with a brief derivation confirming that modes remain at return-maximizing policies.
revision: yes

Referee: [Section 4.2] §4.2 (VSMC adaptation): the policy-consistency sweep and coupled-transition mechanism are presented as bias-reducing heuristics; without a formal bias or consistency argument (even asymptotic), it remains unclear whether the particle approximation converges to the true posterior modes under stochastic dynamics, which is load-bearing for the empirical claims.

Authors: We acknowledge that the policy-consistency sweep and coupled-transition mechanism are presented as practical heuristics to reduce bias under stochastic dynamics. The current manuscript contains no formal convergence or bias analysis. We will add an explicit limitations paragraph in Section 4.2 noting the heuristic nature of these adaptations and that convergence to the true posterior is supported only by the reported empirical results across the four domains. A rigorous asymptotic analysis is left for future work.
revision: partial
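
The functional form proposed in the first response, p(π) ∝ exp(β J(π)), can be probed numerically. The sketch below assumes four hypothetical policies with made-up expected returns and shows that β changes how concentrated the posterior is while leaving its mode in place; the helper names are introduced here for illustration.

```python
import math

def posterior(returns, beta):
    """Normalized weights proportional to exp(beta * J)."""
    top = max(returns)  # subtract the max for numerical stability
    weights = [math.exp(beta * (j - top)) for j in returns]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical expected returns for four policies; illustrative only.
returns = [0.0, 0.5, 1.0, 1.5]
summary = {}
for beta in (0.5, 1.0, 4.0):
    probs = posterior(returns, beta)
    summary[beta] = (probs.index(max(probs)), entropy(probs))
```

Each entry of `summary` holds the index of the mode and the entropy for one β. The mode index stays the same across the three values, while the entropy shrinks as β grows, so β acts on dispersion only.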

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines an unnormalized optimality probability as monotone in expected return J(π) and notes that the resulting posterior modes coincide with return-maximizing policies. This is a direct definitional consequence rather than an independent derivation or fitted prediction. The VSMC adaptation (policy-consistency sweep and coupled transitions) is introduced as a practical approximation technique and evaluated empirically on grid worlds, Blackjack, Triangle Tireworld, and Academic Advising without reducing to self-citations, renamed known results, or parameters fitted to the target quantities. No load-bearing step in the provided abstract or described construction reduces to its own inputs by construction; the central claim rests on the external definition of expected return and standard Bayesian posterior properties.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that monotonicity in expected return defines a useful optimality probability and that the VSMC modifications preserve correctness. No free parameters, axioms, or invented entities are explicitly listed in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear

log p̃(π) = E_{τ_π}[∑ R(s_t, a_t^{(π)}, s_{t+1})] … yielding a posterior distribution whose modes coincide with return-maximizing solutions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.