Continuous-Time Markov Decision Processes with Controlled Observations

Quanyan Zhu; Veeraruna Kavitha; Yunhan Huang

arxiv: 1907.06128 · v1 · pith:ACOHLKBUnew · submitted 2019-07-13 · 🧮 math.OC · cs.SY· eess.SY

Continuous-Time Markov Decision Processes with Controlled Observations

Yunhan Huang , Veeraruna Kavitha , Quanyan Zhu This is my paper

Pith reviewed 2026-05-24 21:52 UTC · model grok-4.3

classification 🧮 math.OC cs.SYeess.SY

keywords continuous-time Markov decision processescontrolled observationsgated queueinginventory controldynamic programmingoptimal observation epochsPoisson arrivals

0 comments

The pith

Decision makers can jointly optimize observation times and control actions in continuous-time discounted jump Markov decision processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for continuous-time discounted jump Markov decision processes in which observations occur only at chosen discrete instants. At each observation the controller must pick both the next observation time and the control trajectory to apply until then, with the state evolving according to controlled jump rates in between. The framework yields dynamic programming equations that characterize the joint optimum. In gated queueing systems the resulting optimal observation schedule is independent of the current state; in an inventory problem with Poisson arrivals the schedule is state-dependent and denser where the optimal action changes frequently.

Core claim

The authors provide a theoretical framework that the decision maker can utilize to find the optimal observation epochs and the optimal actions jointly. Two cases are investigated. One is gated queueing systems in which we explicitly characterize the optimal action and the optimal observation where the optimal observation is shown to be independent of the state. Another is the inventory control problem with Poisson arrival process in which we obtain numerically the optimal action and observation. The results show that it is optimal to observe more frequently at a region of states where the optimal action adapts constantly.

What carries the argument

Dynamic programming equations that jointly optimize the timing of the next observation and the control trajectory between observations.

If this is right

In gated queueing systems the optimal observation schedule can be chosen without reference to the current state.
In inventory control with Poisson arrivals, observation frequency increases in state regions where the optimal action changes with the state.
The value function for the joint problem is characterized by the dynamic programming equations derived from the model.
Numerical computation of the joint optimum is feasible for concrete problems such as inventory control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The state-independence result could simplify real-time implementation in queueing applications by removing the need to track state for scheduling decisions.
The same joint-optimization approach might be tested on other systems with costly observations, such as maintenance scheduling or sensor activation.
Algorithms that solve the dynamic programming equations at scale would make the framework practical for larger state spaces.

Load-bearing premise

The underlying process is a continuous-time discounted jump Markov decision process and an optimal joint policy over actions and observation times exists and satisfies the dynamic programming equations.

What would settle it

An explicit gated queueing example in which the optimal next observation time varies with the current state would falsify the independence result.

Figures

Figures reproduced from arXiv: 1907.06128 by Quanyan Zhu, Veeraruna Kavitha, Yunhan Huang.

**Figure 1.** Figure 1: The value function v ⋆(x), the optimal action a ⋆ x and the optimal time for next observation T ⋆ x with respect to x. Here, we set the reference for the amount of inventory to be θ = 8. The departure rate of the Poisson process is µ = 2. Here, the Poisson arrival process is homogeneous with upper bound a¯¯ = 5 and lower bound 0. The time for the next observation is within a range [T , T] where T = 2 and T… view at source ↗

read the original abstract

In this paper, we study a continuous-time discounted jump Markov decision process with both controlled actions and observations. The observation is only available for a discrete set of time instances. At each time of observation, one has to select an optimal timing for the next observation and a control trajectory for the time interval between two observation points. We provide a theoretical framework that the decision maker can utilize to find the optimal observation epochs and the optimal actions jointly. Two cases are investigated. One is gated queueing systems in which we explicitly characterize the optimal action and the optimal observation where the optimal observation is shown to be independent of the state. Another is the inventory control problem with Poisson arrival process in which we obtain numerically the optimal action and observation. The results show that it is optimal to observe more frequently at a region of states where the optimal action adapts constantly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a DP framework for jointly picking observation times and controls in continuous-time jump MDPs, plus a clean state-independence result for gated queues.

read the letter

The core contribution is a dynamic programming setup that lets the decision maker choose both the next observation instant and the control trajectory between observations in a discounted continuous-time jump MDP. In the gated queue case they derive that the optimal observation time does not depend on the current state; the inventory example shows numerically that observation frequency should increase where the optimal action changes often. That independence result is the clearest new piece relative to standard continuous-time MDP theory, and it follows from the model structure rather than any post-hoc fitting. The numerical inventory run illustrates the joint optimization in a concrete setting with Poisson arrivals. The work stays inside the usual existence assumptions for optimal policies in jump processes, which keeps the claims grounded. The main limitation is that the derivations and verification steps are not visible in the abstract, so it is difficult to judge how much slack exists in the characterization or whether the numerical policy is robust to parameter changes. The scope is also narrow—only jump processes with controlled rates between observations—so the result does not immediately transfer to diffusion or non-Markov settings. This is useful reading for people already working on observation scheduling in queueing or inventory models; a general control theorist would find the queue independence interesting but not transformative. It is worth sending to a serious referee in operations research or applied probability, provided the full proofs hold up under checking.

Referee Report

0 major / 3 minor

Summary. The paper studies continuous-time discounted jump Markov decision processes in which both actions and observation times are controlled. At each observation epoch the decision maker jointly selects the time until the next observation and a control trajectory to be followed until then. A dynamic-programming framework is developed for this joint optimization. Two applications are treated: gated queueing systems, for which an explicit characterization is given and the optimal observation time is proved to be state-independent, and an inventory-control problem with Poisson arrivals, for which numerical solutions are computed showing that observation frequency increases in regions where the optimal action changes rapidly.

Significance. If the derivations are correct, the work supplies a usable DP-based method for trading off observation cost against control performance in CTMDPs. The state-independence result for gated queues is a clean structural property that simplifies computation. The inventory example illustrates how the framework behaves on a standard applied problem. The explicit characterization in one case and the reproducible numerical procedure in the other are positive features.

minor comments (3)

[§3] §3 (or the section presenting the DP equations): the value-function recursion between observation epochs should be written out explicitly, including the integral form of the discounted cost under a fixed control trajectory, so that the optimality equations for the joint choice of next observation time and action are unambiguous.
[Inventory numerical results] Inventory numerical section: the statement that observation is more frequent where the action adapts constantly should be accompanied by a table or plot that reports the computed inter-observation times for representative states, together with the corresponding optimal actions, so the claimed correlation can be verified directly.
[Notation] Notation: the symbol used for the controlled jump rate should be defined once at first use and then employed consistently; currently the same letter appears to denote both the rate and the control in some passages.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly identifies the core contributions: the joint DP framework for actions and observation times, the explicit state-independent observation policy for gated queues, and the numerical behavior in the inventory example. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper develops a DP-based framework for jointly optimizing observation times and controls in a continuous-time discounted jump MDP, with an explicit structural result (state-independent optimal observation) for the gated queueing case. All load-bearing steps rest on standard existence assumptions for optimal policies via the Bellman equations of the model and on explicit characterization from the controlled jump rates and observation constraints; no fitted parameters are renamed as predictions, no self-citation chain is invoked to justify uniqueness, and no ansatz or renaming reduces the claimed results to their inputs by construction. The derivation is therefore self-contained against the model primitives.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; full list of modeling assumptions, existence proofs, and any fitted parameters cannot be extracted. No free parameters, invented entities, or non-standard axioms are mentioned in the provided text.

pith-pipeline@v0.9.0 · 5674 in / 1194 out tokens · 18649 ms · 2026-05-24T21:52:40.103753+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 (Dynamic programming equation) ... v(x) = sup ... {r̄(x,a(·),T) + β^T ∑ q(x,x';a,T)v(x') + g(T)}
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

gated queueing ... optimal observation is shown to be independent of the state

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

M. L. Puterman, Markov Decision Processes.: Discrete Stochastic Dy- namic Programming. John Wiley & Sons, 2014

work page 2014
[2]

Optimal control of admission to a queueing s ystem,

S. Stidham, “Optimal control of admission to a queueing s ystem,” IEEE Transactions on Automatic Control , vol. 30, no. 8, pp. 705–713, 1985

work page 1985
[3]

Applications of markov decision processes i n communica- tion networks,

E. Altman, “Applications of markov decision processes i n communica- tion networks,” in Handbook of Markov decision processes . Springer, 2002, pp. 489–536

work page 2002
[4]

Uav path planning in a dynamic env ironment via partially observable markov decision process,

S. Ragi and E. K. Chong, “Uav path planning in a dynamic env ironment via partially observable markov decision process,” IEEE Transactions on Aerospace and Electronic Systems , vol. 49, no. 4, pp. 2397–2412, 2013

work page 2013
[5]

Krishnamurthy, Partially observed Markov decision processes

V . Krishnamurthy, Partially observed Markov decision processes. Cam- bridge University Press, 2016

work page 2016
[6]

Minimax control of switching systems under s ampling,

T. Bas ¸ar, “Minimax control of switching systems under s ampling,” Systems & Control Letters , vol. 25, no. 5, pp. 315–325, 1995

work page 1995
[7]

Stochastic opt imal control under poisson-distributed observations,

M. Ades, P . E. Caines, and R. P . Malham´ e, “Stochastic opt imal control under poisson-distributed observations,” IEEE Transactions on Auto- matic Control, vol. 45, no. 1, pp. 3–13, 2000

work page 2000
[8]

Optimal control of lti systems over unreliable communication links,

O. C. Imer, S. Y¨ uksel, and T. Bas ¸ar, “Optimal control of lti systems over unreliable communication links,” Automatica, vol. 42, no. 9, pp. 1429–1439, 2006

work page 2006
[9]

Durrett, Probability: theory and examples

R. Durrett, Probability: theory and examples . Cambridge university press, 2019, vol. 49

work page 2019
[10]

Some generalizations of the theory of cumulat ive sums of random variables,

A. Wald, “Some generalizations of the theory of cumulat ive sums of random variables,” The Annals of Mathematical Statistics , vol. 16, no. 3, pp. 287–293, 1945

work page 1945
[11]

Liberzon, Calculus of variations and optimal control theory: a concise introduction

D. Liberzon, Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011

work page 2011
[12]

Numerical optimal control,

M. Diehl and S. Gros, “Numerical optimal control,” 2017

work page 2017
[13]

D. P . Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming . Athena Scientiﬁc Belmont, MA, 1996, vol. 5. 8

work page 1996

[1] [1]

M. L. Puterman, Markov Decision Processes.: Discrete Stochastic Dy- namic Programming. John Wiley & Sons, 2014

work page 2014

[2] [2]

Optimal control of admission to a queueing s ystem,

S. Stidham, “Optimal control of admission to a queueing s ystem,” IEEE Transactions on Automatic Control , vol. 30, no. 8, pp. 705–713, 1985

work page 1985

[3] [3]

Applications of markov decision processes i n communica- tion networks,

E. Altman, “Applications of markov decision processes i n communica- tion networks,” in Handbook of Markov decision processes . Springer, 2002, pp. 489–536

work page 2002

[4] [4]

Uav path planning in a dynamic env ironment via partially observable markov decision process,

S. Ragi and E. K. Chong, “Uav path planning in a dynamic env ironment via partially observable markov decision process,” IEEE Transactions on Aerospace and Electronic Systems , vol. 49, no. 4, pp. 2397–2412, 2013

work page 2013

[5] [5]

Krishnamurthy, Partially observed Markov decision processes

V . Krishnamurthy, Partially observed Markov decision processes. Cam- bridge University Press, 2016

work page 2016

[6] [6]

Minimax control of switching systems under s ampling,

T. Bas ¸ar, “Minimax control of switching systems under s ampling,” Systems & Control Letters , vol. 25, no. 5, pp. 315–325, 1995

work page 1995

[7] [7]

Stochastic opt imal control under poisson-distributed observations,

M. Ades, P . E. Caines, and R. P . Malham´ e, “Stochastic opt imal control under poisson-distributed observations,” IEEE Transactions on Auto- matic Control, vol. 45, no. 1, pp. 3–13, 2000

work page 2000

[8] [8]

Optimal control of lti systems over unreliable communication links,

O. C. Imer, S. Y¨ uksel, and T. Bas ¸ar, “Optimal control of lti systems over unreliable communication links,” Automatica, vol. 42, no. 9, pp. 1429–1439, 2006

work page 2006

[9] [9]

Durrett, Probability: theory and examples

R. Durrett, Probability: theory and examples . Cambridge university press, 2019, vol. 49

work page 2019

[10] [10]

Some generalizations of the theory of cumulat ive sums of random variables,

A. Wald, “Some generalizations of the theory of cumulat ive sums of random variables,” The Annals of Mathematical Statistics , vol. 16, no. 3, pp. 287–293, 1945

work page 1945

[11] [11]

Liberzon, Calculus of variations and optimal control theory: a concise introduction

D. Liberzon, Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011

work page 2011

[12] [12]

Numerical optimal control,

M. Diehl and S. Gros, “Numerical optimal control,” 2017

work page 2017

[13] [13]

D. P . Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming . Athena Scientiﬁc Belmont, MA, 1996, vol. 5. 8

work page 1996