pith. sign in

arxiv: 1907.06128 · v1 · pith:ACOHLKBUnew · submitted 2019-07-13 · 🧮 math.OC · cs.SY· eess.SY

Continuous-Time Markov Decision Processes with Controlled Observations

Pith reviewed 2026-05-24 21:52 UTC · model grok-4.3

classification 🧮 math.OC cs.SYeess.SY
keywords continuous-time Markov decision processescontrolled observationsgated queueinginventory controldynamic programmingoptimal observation epochsPoisson arrivals
0
0 comments X

The pith

Decision makers can jointly optimize observation times and control actions in continuous-time discounted jump Markov decision processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for continuous-time discounted jump Markov decision processes in which observations occur only at chosen discrete instants. At each observation the controller must pick both the next observation time and the control trajectory to apply until then, with the state evolving according to controlled jump rates in between. The framework yields dynamic programming equations that characterize the joint optimum. In gated queueing systems the resulting optimal observation schedule is independent of the current state; in an inventory problem with Poisson arrivals the schedule is state-dependent and denser where the optimal action changes frequently.

Core claim

The authors provide a theoretical framework that the decision maker can utilize to find the optimal observation epochs and the optimal actions jointly. Two cases are investigated. One is gated queueing systems in which we explicitly characterize the optimal action and the optimal observation where the optimal observation is shown to be independent of the state. Another is the inventory control problem with Poisson arrival process in which we obtain numerically the optimal action and observation. The results show that it is optimal to observe more frequently at a region of states where the optimal action adapts constantly.

What carries the argument

Dynamic programming equations that jointly optimize the timing of the next observation and the control trajectory between observations.

If this is right

  • In gated queueing systems the optimal observation schedule can be chosen without reference to the current state.
  • In inventory control with Poisson arrivals, observation frequency increases in state regions where the optimal action changes with the state.
  • The value function for the joint problem is characterized by the dynamic programming equations derived from the model.
  • Numerical computation of the joint optimum is feasible for concrete problems such as inventory control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The state-independence result could simplify real-time implementation in queueing applications by removing the need to track state for scheduling decisions.
  • The same joint-optimization approach might be tested on other systems with costly observations, such as maintenance scheduling or sensor activation.
  • Algorithms that solve the dynamic programming equations at scale would make the framework practical for larger state spaces.

Load-bearing premise

The underlying process is a continuous-time discounted jump Markov decision process and an optimal joint policy over actions and observation times exists and satisfies the dynamic programming equations.

What would settle it

An explicit gated queueing example in which the optimal next observation time varies with the current state would falsify the independence result.

Figures

Figures reproduced from arXiv: 1907.06128 by Quanyan Zhu, Veeraruna Kavitha, Yunhan Huang.

Figure 2
Figure 2. Figure 2: The amount of inventory and the corresponding optima [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: The value function v ⋆(x), the optimal action a ⋆ x and the optimal time for next observation T ⋆ x with respect to x. Here, we set the reference for the amount of inventory to be θ = 8. The departure rate of the Poisson process is µ = 2. Here, the Poisson arrival process is homogeneous with upper bound a¯¯ = 5 and lower bound 0. The time for the next observation is within a range [T , T] where T = 2 and T… view at source ↗
read the original abstract

In this paper, we study a continuous-time discounted jump Markov decision process with both controlled actions and observations. The observation is only available for a discrete set of time instances. At each time of observation, one has to select an optimal timing for the next observation and a control trajectory for the time interval between two observation points. We provide a theoretical framework that the decision maker can utilize to find the optimal observation epochs and the optimal actions jointly. Two cases are investigated. One is gated queueing systems in which we explicitly characterize the optimal action and the optimal observation where the optimal observation is shown to be independent of the state. Another is the inventory control problem with Poisson arrival process in which we obtain numerically the optimal action and observation. The results show that it is optimal to observe more frequently at a region of states where the optimal action adapts constantly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper studies continuous-time discounted jump Markov decision processes in which both actions and observation times are controlled. At each observation epoch the decision maker jointly selects the time until the next observation and a control trajectory to be followed until then. A dynamic-programming framework is developed for this joint optimization. Two applications are treated: gated queueing systems, for which an explicit characterization is given and the optimal observation time is proved to be state-independent, and an inventory-control problem with Poisson arrivals, for which numerical solutions are computed showing that observation frequency increases in regions where the optimal action changes rapidly.

Significance. If the derivations are correct, the work supplies a usable DP-based method for trading off observation cost against control performance in CTMDPs. The state-independence result for gated queues is a clean structural property that simplifies computation. The inventory example illustrates how the framework behaves on a standard applied problem. The explicit characterization in one case and the reproducible numerical procedure in the other are positive features.

minor comments (3)
  1. [§3] §3 (or the section presenting the DP equations): the value-function recursion between observation epochs should be written out explicitly, including the integral form of the discounted cost under a fixed control trajectory, so that the optimality equations for the joint choice of next observation time and action are unambiguous.
  2. [Inventory numerical results] Inventory numerical section: the statement that observation is more frequent where the action adapts constantly should be accompanied by a table or plot that reports the computed inter-observation times for representative states, together with the corresponding optimal actions, so the claimed correlation can be verified directly.
  3. [Notation] Notation: the symbol used for the controlled jump rate should be defined once at first use and then employed consistently; currently the same letter appears to denote both the rate and the control in some passages.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly identifies the core contributions: the joint DP framework for actions and observation times, the explicit state-independent observation policy for gated queues, and the numerical behavior in the inventory example. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper develops a DP-based framework for jointly optimizing observation times and controls in a continuous-time discounted jump MDP, with an explicit structural result (state-independent optimal observation) for the gated queueing case. All load-bearing steps rest on standard existence assumptions for optimal policies via the Bellman equations of the model and on explicit characterization from the controlled jump rates and observation constraints; no fitted parameters are renamed as predictions, no self-citation chain is invoked to justify uniqueness, and no ansatz or renaming reduces the claimed results to their inputs by construction. The derivation is therefore self-contained against the model primitives.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; full list of modeling assumptions, existence proofs, and any fitted parameters cannot be extracted. No free parameters, invented entities, or non-standard axioms are mentioned in the provided text.

pith-pipeline@v0.9.0 · 5674 in / 1194 out tokens · 18649 ms · 2026-05-24T21:52:40.103753+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    M. L. Puterman, Markov Decision Processes.: Discrete Stochastic Dy- namic Programming. John Wiley & Sons, 2014

  2. [2]

    Optimal control of admission to a queueing s ystem,

    S. Stidham, “Optimal control of admission to a queueing s ystem,” IEEE Transactions on Automatic Control , vol. 30, no. 8, pp. 705–713, 1985

  3. [3]

    Applications of markov decision processes i n communica- tion networks,

    E. Altman, “Applications of markov decision processes i n communica- tion networks,” in Handbook of Markov decision processes . Springer, 2002, pp. 489–536

  4. [4]

    Uav path planning in a dynamic env ironment via partially observable markov decision process,

    S. Ragi and E. K. Chong, “Uav path planning in a dynamic env ironment via partially observable markov decision process,” IEEE Transactions on Aerospace and Electronic Systems , vol. 49, no. 4, pp. 2397–2412, 2013

  5. [5]

    Krishnamurthy, Partially observed Markov decision processes

    V . Krishnamurthy, Partially observed Markov decision processes. Cam- bridge University Press, 2016

  6. [6]

    Minimax control of switching systems under s ampling,

    T. Bas ¸ar, “Minimax control of switching systems under s ampling,” Systems & Control Letters , vol. 25, no. 5, pp. 315–325, 1995

  7. [7]

    Stochastic opt imal control under poisson-distributed observations,

    M. Ades, P . E. Caines, and R. P . Malham´ e, “Stochastic opt imal control under poisson-distributed observations,” IEEE Transactions on Auto- matic Control, vol. 45, no. 1, pp. 3–13, 2000

  8. [8]

    Optimal control of lti systems over unreliable communication links,

    O. C. Imer, S. Y¨ uksel, and T. Bas ¸ar, “Optimal control of lti systems over unreliable communication links,” Automatica, vol. 42, no. 9, pp. 1429–1439, 2006

  9. [9]

    Durrett, Probability: theory and examples

    R. Durrett, Probability: theory and examples . Cambridge university press, 2019, vol. 49

  10. [10]

    Some generalizations of the theory of cumulat ive sums of random variables,

    A. Wald, “Some generalizations of the theory of cumulat ive sums of random variables,” The Annals of Mathematical Statistics , vol. 16, no. 3, pp. 287–293, 1945

  11. [11]

    Liberzon, Calculus of variations and optimal control theory: a concise introduction

    D. Liberzon, Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011

  12. [12]

    Numerical optimal control,

    M. Diehl and S. Gros, “Numerical optimal control,” 2017

  13. [13]

    D. P . Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming . Athena Scientific Belmont, MA, 1996, vol. 5. 8