arXiv: 2405.03888 · v5 · submitted 2024-05-06 · 🧮 math.OC

Measurized Markov Decision Processes

Pith reviewed 2026-05-24 01:22 UTC · model grok-4.3

classification 🧮 math.OC
keywords Markov decision processes · measurized MDPs · algebraic lifting · measure-valued states · semicontinuous assumptions · Borel measurable policies · average reward criterion

The pith

Lifting MDPs to probability measure states creates a deterministic generalization that supports constraints and approximations with Borel-measurable optimal policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces measurized MDPs as deterministic processes on the space of probability measures, showing they generalize stochastic MDPs without losing fidelity. This allows embedding standard MDPs via an algebraic lifting that incorporates external shocks, leading to non-deterministic measure-valued processes. By using the semicontinuous-semicompact framework instead of universal measurability, the approach yields optimal Borel-measurable value functions and policies under milder, easier-to-verify assumptions for both discounted and average reward criteria. The framework further enables direct incorporation of constraints and value function approximations not feasible in the original MDP setting.

Core claim

Measurized MDPs are deterministic MDPs whose states are probability measures on the original state space and whose actions are stochastic kernels; they generalize stochastic MDPs and, when the lifted processes satisfy semicontinuous-semicompact assumptions, admit optimal Borel-measurable value functions and policies under milder conditions than the universally measurable framework, for both discounted infinite-horizon and long-run average reward criteria. Any MDP can be algebraically lifted to such a process, and the setting permits constraints and approximations unavailable in standard MDPs.
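For a finite MDP, the lifted dynamics described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's construction: the 3-state, 2-action MDP (`Q`, `r`) and the function names `lifted_transition` and `lifted_reward` are assumptions made for this example. A state is a probability vector `nu`, an action is a row-stochastic matrix `phi` (a stochastic kernel), and the next state is a deterministic function of the pair.

```python
import numpy as np

# Hypothetical finite MDP, for illustration only: 3 states, 2 actions.
# Q[s, u, :] is the next-state distribution; r[s, u] is the one-step reward.
Q = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.0, 0.8, 0.2], [0.5, 0.0, 0.5]],
    [[0.3, 0.0, 0.7], [0.0, 0.4, 0.6]],
])
r = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.5]])


def lifted_transition(nu, phi):
    """Next state distribution: sum over s, u of nu[s] * phi[s, u] * Q[s, u, :]."""
    return np.einsum("s,su,sut->t", nu, phi, Q)


def lifted_reward(nu, phi):
    """Expected one-step reward under state distribution nu and kernel phi."""
    return float(np.einsum("s,su,su->", nu, phi, r))


nu0 = np.array([1.0, 0.0, 0.0])                        # Dirac measure at state 0
phi = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])   # phi[s, u] = P(action u | state s)

nu1 = lifted_transition(nu0, phi)   # a deterministic function of (nu0, phi)
reward0 = lifted_reward(nu0, phi)
print(nu1, reward0)
```

Starting from a Dirac measure, the lifted state after one step is simply the mixture of the rows `Q[0, 0, :]` and `Q[0, 1, :]` weighted by the kernel, so no sampling is involved.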

What carries the argument

The algebraic lifting procedure that maps any MDP impacted by external random shocks to a non-deterministic measure-valued MDP analyzed under semicontinuous-semicompact assumptions.
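One way to picture why shocks make the lifted process non-deterministic is the finite sketch below. It assumes, purely for illustration, a transition array indexed by a binary shock `z` drawn from a distribution `mu`; the numbers, the array layout `Q[z, s, u, :]`, and the function name are hypothetical and do not reproduce the paper's algebraic lifting. Conditional on `z`, the next measure is a deterministic function of the current measure and kernel, but because `z` is random the next measure is itself random.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP whose kernel depends on a shock z in {0, 1}.
# Q[z, s, u, :] is the next-state distribution given shock z, state s, action u.
Q = np.array([
    [[[0.9, 0.1], [0.6, 0.4]],
     [[0.3, 0.7], [0.1, 0.9]]],   # shock z = 0
    [[[0.5, 0.5], [0.2, 0.8]],
     [[0.8, 0.2], [0.4, 0.6]]],   # shock z = 1
])
mu = np.array([0.7, 0.3])         # assumed shock distribution


def lifted_transition(nu, phi, z):
    """Next state distribution conditional on the realized shock z."""
    return np.einsum("s,su,sut->t", nu, phi, Q[z])


nu = np.array([0.5, 0.5])
phi = np.array([[1.0, 0.0], [0.0, 1.0]])   # action 0 in state 0, action 1 in state 1

# The two possible next measures, one per shock value.
next_given_z = [lifted_transition(nu, phi, z) for z in (0, 1)]

# Averaging over the shock gives the next measure of the shock-free lifting.
averaged = mu[0] * next_given_z[0] + mu[1] * next_given_z[1]

# Sampling a shock makes the measure-valued state random.
rng = np.random.default_rng(0)
z = rng.choice(2, p=mu)
nu_next = lifted_transition(nu, phi, z)
print(next_given_z, averaged, nu_next)
```

Integrating the shock out before lifting recovers a single deterministic next measure, which is the distinction between the deterministic measurized MDP and the measure-valued process with shocks.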

If this is right

  • Optimal policies remain Borel-measurable rather than requiring universal measurability.
  • Constraints can be imposed directly on the measure-valued states.
  • Value function approximations become available in the lifted space.
  • The long-run average reward case is handled within the same framework with similar guarantees.
  • Non-deterministic measure-valued MDPs arise naturally from standard MDPs with shocks.
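As an illustration of what a constraint on the measure-valued state could look like, the sketch below checks a cap on the probability mass sitting in a designated "bad" state along the lifted trajectory. The form of the constraint, the MDP data `Q`, and the helper `satisfies_state_cap` are assumptions made for this example; the source does not specify which constraints the paper treats.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; Q[s, u, :] is the next-state distribution.
Q = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.0, 0.8, 0.2], [0.5, 0.0, 0.5]],
    [[0.3, 0.0, 0.7], [0.0, 0.4, 0.6]],
])


def lifted_transition(nu, phi):
    """Deterministic lifted dynamics on probability vectors."""
    return np.einsum("s,su,sut->t", nu, phi, Q)


def satisfies_state_cap(nu0, phi, bad_state, eps, horizon):
    """Check nu_t[bad_state] <= eps for t = 0, ..., horizon - 1 under a fixed kernel phi."""
    nu = np.array(nu0, dtype=float)
    for _ in range(horizon):
        if nu[bad_state] > eps + 1e-12:
            return False
        nu = lifted_transition(nu, phi)
    return True


nu0 = np.array([1.0, 0.0, 0.0])
always_0 = np.array([[1.0, 0.0]] * 3)   # kernel that always picks action 0
always_1 = np.array([[0.0, 1.0]] * 3)   # kernel that always picks action 1

# Mass on state 2 under always_0 at t = 0, 1, 2 is 0, 0, 0.02.
ok_loose = satisfies_state_cap(nu0, always_0, bad_state=2, eps=0.05, horizon=3)
ok_tight = satisfies_state_cap(nu0, always_0, bad_state=2, eps=0.01, horizon=3)
ok_other = satisfies_state_cap(nu0, always_1, bad_state=2, eps=0.05, horizon=3)
print(ok_loose, ok_tight, ok_other)
```

Because the lifted state is the full distribution, such a cap is a plain inequality on the state vector. In the original MDP it would be a statement about probabilities of sample paths rather than about the state itself.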

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This lifting could facilitate solving MDPs by embedding them into a space where deterministic optimization methods apply more readily.
  • It may enable new connections to problems like distributionally robust control by treating measures explicitly as states.
  • Applying the procedure to a simple MDP with additive shocks would test whether optimality is preserved exactly.
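A small numerical version of such a test can be sketched for a finite MDP without shocks. The code runs value iteration on a hypothetical 3-state MDP, lifts the greedy deterministic policy to a kernel, rolls the lifted deterministic dynamics forward, and compares the accumulated discounted reward with the expectation of the original value function under the initial measure. The MDP data, the discount factor, and the reading of "optimality is preserved" as this equality are assumptions for this sketch. It checks only the lifted greedy policy, not an exhaustive search over kernels.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP, for illustration only.
Q = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.0, 0.8, 0.2], [0.5, 0.0, 0.5]],
    [[0.3, 0.0, 0.7], [0.0, 0.4, 0.6]],
])
r = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.5]])
alpha = 0.9  # assumed discount factor

# Value iteration on the original MDP.
V = np.zeros(3)
for _ in range(2000):
    V = (r + alpha * (Q @ V)).max(axis=1)

# Greedy deterministic policy, written as a stochastic kernel phi[s, u].
greedy = (r + alpha * (Q @ V)).argmax(axis=1)
phi = np.eye(2)[greedy]

# Roll the lifted deterministic dynamics forward from an initial measure.
nu0 = np.array([0.2, 0.5, 0.3])
nu, total, discount = nu0.copy(), 0.0, 1.0
for _ in range(500):
    total += discount * np.einsum("s,su,su->", nu, phi, r)
    nu = np.einsum("s,su,sut->t", nu, phi, Q)
    discount *= alpha

# Gap between the lifted discounted reward and the nu0-expectation of V.
gap = abs(total - nu0 @ V)
print(gap)
```

The gap is at the level of floating-point error here because the lifted reward and transition are linear in the measure. Extending the check to a model with explicit shocks would require the paper's lifting procedure, which this sketch does not attempt.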

Load-bearing premise

The lifted MDPs satisfy the semicontinuous-semicompact assumptions of Hernández-Lerma and Lasserre.

What would settle it

A concrete MDP with external shocks whose algebraic lifting satisfies the semicontinuous-semicompact assumptions yet admits no Borel-measurable optimal policy would falsify the claims of accessibility and milder conditions.

Figures

Figures reproduced from arXiv: 2405.03888 by Alba V. Olivares-Nadal, Daniel Adelman.

Figure 1: Equivalence between the states ν of the measurized MDP and the states s of the original MDP from which it was lifted.
Figure 2: Connection between the original value function ...
Figure 3: Equivalence between the optimal actions φ∗ of the measurized MDP and the optimal actions u∗ of the original MDP it was lifted from.
Figure 4: Visual representation of the relationship between the dual variables ...
Figure 5: Illustration of the relationship between the LP formulation of the stochastic MDP ...
Figure 6: Illustration of the measurizing process; i.e. how one can intuitively lift any stochastic MDP to the measure-valued framework.
Figure 7: Illustration of how the measurizing process lifts different types of MDP into the measure-valued framework.
Figure 8: Relationship between bias function h(·) in (AROE) and the equilibrium constraint in (E).
Figure 9: Connection between stochastic and triplets solving (AROE) and (...
Figure 10: Connection between measurized DIH, AR and equilibrium solutions as ...
Original abstract

In this paper, we explore lifting Markov Decision Processes (MDPs) to the space of probability measures and consider the so-called measurized MDPs: deterministic processes where states are probability measures on the original state space, and actions are stochastic kernels on the original action space. We show that measurized MDPs are a generalization of stochastic MDPs, thus the measurized framework can be deployed without loss of fidelity. Bertsekas and Shreve studied similar deterministic MDPs under the discounted infinite-horizon criterion in the context of universally measurable policies. Here, we also consider the long-run average reward case, but we cast lifted MDPs within the semicontinuous-semicompact framework of Hernández-Lerma and Lasserre. This makes the lifted framework more accessible as it entails (i) optimal Borel-measurable value functions and policies, (ii) reasonably mild assumptions that are easier to verify than those in the universally-measurable framework, and (iii) simpler proofs. In addition, we showcase the untapped potential of lifted MDPs by demonstrating how the measurized framework enables the incorporation of constraints and value function approximations that are not available from the standard MDP setting. Furthermore, we introduce a novel algebraic lifting procedure for any MDP, showing that non-deterministic measure-valued MDPs can emerge from lifting MDPs impacted by external random shocks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces measurized MDPs obtained by lifting standard MDPs to the space of probability measures, where states are probability measures on the original state space and actions are stochastic kernels. It claims that measurized MDPs generalize stochastic MDPs without loss of fidelity, enable incorporation of constraints and value-function approximations unavailable in the standard setting, and that an algebraic lifting procedure applied to MDPs subject to external random shocks produces non-deterministic measure-valued dynamics. By embedding the lifted processes in the semicontinuous-semicompact framework of Hernández-Lerma and Lasserre (rather than the universally measurable setting of Bertsekas-Shreve), the paper asserts that optimal Borel-measurable value functions and policies are obtained under milder, more verifiable assumptions for both discounted and average-reward criteria.

Significance. If the lifting procedure is shown to preserve the required semicontinuity and semicompactness properties, the framework would supply a technically accessible route to measure-valued MDPs that yields Borel-measurable optima while supporting new modeling features such as explicit constraints on the measure-valued state.

major comments (2)
  1. [Algebraic lifting procedure] The section introducing the algebraic lifting procedure: the claim that the lifted processes satisfy the semicontinuous-semicompact assumptions of Hernández-Lerma and Lasserre (upper/lower semicontinuity of the reward and weak continuity of the transition kernel) after the algebraic lifting is asserted but not accompanied by an explicit verification for general (non-compact) state spaces; the weak topology on the space of probability measures does not automatically inherit these properties from the original MDP, and this verification is load-bearing for the Borel-measurability and accessibility claims.
  2. [Framework choice and benefits] Paragraphs on framework choice and benefits: the central assertion that the measurized framework yields optimal Borel-measurable policies and value functions under milder conditions than Bertsekas-Shreve rests entirely on the lifted MDPs satisfying the semicontinuous-semicompact assumptions for both discounted and average-reward criteria; without a detailed argument or counterexample-free demonstration that the assumptions transfer, the comparison to the universally measurable framework cannot be substantiated.
minor comments (1)
  1. The abstract and introduction could more explicitly separate the deterministic measurized MDPs from the non-deterministic measure-valued processes that arise only after the algebraic lifting with external shocks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of the algebraic lifting and framework comparison.

  1. Referee: [Algebraic lifting procedure] The section introducing the algebraic lifting procedure: the claim that the lifted processes satisfy the semicontinuous-semicompact assumptions of Hernández-Lerma and Lasserre (upper/lower semicontinuity of the reward and weak continuity of the transition kernel) after the algebraic lifting is asserted but not accompanied by an explicit verification for general (non-compact) state spaces; the weak topology on the space of probability measures does not automatically inherit these properties from the original MDP, and this verification is load-bearing for the Borel-measurability and accessibility claims.

    Authors: We agree that an explicit verification is required for general (non-compact) state spaces, as the weak topology on probability measures does not automatically preserve the properties. The manuscript asserts the preservation via the algebraic nature of the lift but does not supply the detailed argument. In revision we will add a dedicated lemma and proof showing that if the original MDP satisfies the Hernández-Lerma–Lasserre semicontinuity and weak-continuity conditions, then the measurized process does as well, including the requisite arguments for the weak topology. revision: yes

  2. Referee: [Framework choice and benefits] Paragraphs on framework choice and benefits: the central assertion that the measurized framework yields optimal Borel-measurable policies and value functions under milder conditions than Bertsekas-Shreve rests entirely on the lifted MDPs satisfying the semicontinuous-semicompact assumptions for both discounted and average-reward criteria; without a detailed argument or counterexample-free demonstration that the assumptions transfer, the comparison to the universally measurable framework cannot be substantiated.

    Authors: The comparison to the Bertsekas–Shreve universally measurable setting indeed depends on the lifted processes satisfying the semicontinuous-semicompact assumptions. As indicated in our response to the first comment, the revised manuscript will contain the explicit verification for both criteria. This will substantiate that the assumptions are milder and more readily verifiable, thereby supporting the claimed advantages in accessibility and Borel measurability. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation applies external frameworks to a constructed lifting.

The paper constructs measurized MDPs via an algebraic lifting of standard MDPs to probability-measure states and stochastic-kernel actions, then invokes the semicontinuous-semicompact assumptions of Hernández-Lerma and Lasserre (an external reference) to obtain Borel-measurable value functions and policies. The generalization claim follows directly from the lifting definition rather than any fitted parameter or self-referential loop. No load-bearing self-citations, self-definitional reductions, or renamings of known results appear; the central results rest on verifying the lifted processes satisfy the cited external conditions, which is presented as an independent step rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the transfer of semicontinuous-semicompact assumptions to the lifted space and on the preservation of optimality under the measure-valued lifting; no free parameters are introduced.

axioms (2)
  • [domain assumption] Lifted MDPs satisfy the semicontinuous-semicompact conditions of Hernández-Lerma and Lasserre.
    Invoked to obtain Borel-measurable optimal value functions and policies with milder assumptions than universally measurable policies.
  • [domain assumption] The measure-valued lifting preserves the optimal policies and values of the original stochastic MDP without loss of fidelity.
    Required for the claim that the framework can be deployed without loss of fidelity.
invented entities (1)
  • measurized MDP (no independent evidence)
    Purpose: deterministic process whose states are probability measures and actions are stochastic kernels, used to generalize stochastic MDPs and enable new constraints.
    New construct introduced by the paper; no independent evidence outside the lifting construction is supplied.

pith-pipeline@v0.9.0 · 5767 in / 1675 out tokens · 34222 ms · 2026-05-24T01:22:28.469448+00:00 · methodology

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

  1. [1] Adelman, Daniel, Christiane Barz, and Alba V Olivares-Nadal, 2025, Dynamic basis function generation for network revenue management, INFORMS Journal on Computing.

    Adelman, Daniel, 2007, Dynamic bid prices in revenue management, Operations Research 55, 647–661. Adelman, Daniel, Christiane Barz, and Alba V Olivares-Nadal, 2025, Dynamic basis function generation for network revenue management, INFORMS Journal on Computing. Adelman, Daniel, and Diego Klabjan, 2012, Computing near-optimal policies in generalized joint rep...

  2. [2] Bertsekas, Dimitri, and Steven E Shreve, 1996, Stochastic optimal control: the discrete-time case, volume 5 (Athena Scientific).

    Bellman, Richard, 1966, Dynamic programming, Science 153, 34–37. Bertsekas, Dimitri, and Steven E Shreve, 1996, Stochastic optimal control: the discrete-time case, volume 5 (Athena Scientific). Billingsley, Patrick, 1999, Convergence of probability measures (John Wiley & Sons). Borkar, Vivek, and Rahul Jain, 2014, Risk-constrained Markov decision processes, IEE...

  3. [3] Puterman, Martin L, 2014, Markov decision processes: discrete stochastic dynamic programming (John Wiley & Sons).

    Powell, Warren B, 2007, Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703 (John Wiley & Sons). Puterman, Martin L, 2014, Markov decision processes: discrete stochastic dynamic programming (John Wiley & Sons). Shreve, Steven E, and Dimitri P Bertsekas, 1979, Universally measurable policies in dynamic programming, Mathematics...