Stable and practical semi-Markov modelling of intermittently-observed data

Christopher Jackson

arxiv: 2508.20949 · v2 · submitted 2025-08-28 · 📊 stat.ME · stat.CO

Pith reviewed 2026-05-18 20:30 UTC · model grok-4.3

keywords semi-Markov models · phase-type distributions · intermittent observations · multi-state models · hidden Markov models · moment matching · cognitive decline

The pith

A phase-type distribution approximation allows semi-Markov models to handle intermittent observations for any state structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a practical way to fit semi-Markov models to intermittently observed multi-state data by representing sojourn times with phase-type distributions. This turns the semi-Markov model into a hidden Markov model, simplifying likelihood calculation for any state structure. To improve identifiability, the phase-type family is restricted, via moment matching, to distributions that approximate a Gamma or Weibull. The method is implemented in the R package msmbayes, tested with simulation-based calibration, and illustrated on cognitive function decline data.

Core claim

The paper claims that representing each sojourn time with a phase-type distribution expresses a semi-Markov model as a hidden Markov model, so the likelihood for intermittently observed multi-state data can be calculated easily for general state structures. Restricting the phase-type family to moment-matched approximations of Gamma or Weibull distributions then makes the model more stable and better identified, while still letting transition rates depend on the time spent in the current state.
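To make the idea of a time-dependent sojourn concrete, here is a minimal sketch of a two-phase Coxian phase-type distribution, the simplest phase-type family with a non-constant exit hazard. The function names and the rates `a`, `b1`, `b2` are illustrative assumptions, not notation or code from the paper or from msmbayes.

```python
import math

def coxian2_survival(t, a, b1, b2):
    """Phase occupancy probabilities for a two-phase Coxian sojourn.

    Hypothetical parameterisation: phase 1 moves to phase 2 at rate a or
    leaves the state at rate b1; phase 2 leaves the state at rate b2.
    Assumes a + b1 != b2. Returns (p1, p2), whose sum is the survival
    probability of the sojourn at time t.
    """
    r1 = a + b1
    p1 = math.exp(-r1 * t)
    p2 = a / (r1 - b2) * (math.exp(-b2 * t) - math.exp(-r1 * t))
    return p1, p2

def coxian2_hazard(t, a, b1, b2):
    """Exit hazard at time t: phase-specific exit rates weighted by occupancy."""
    p1, p2 = coxian2_survival(t, a, b1, b2)
    return (b1 * p1 + b2 * p2) / (p1 + p2)

if __name__ == "__main__":
    # The hazard starts at b1 and drifts as phase 2 becomes more likely.
    for t in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(t, round(coxian2_hazard(t, a=1.0, b1=0.2, b2=0.5), 4))
```

Each phase is itself Markov with a constant exit rate, but the observed state mixes the two phases, so its exit hazard changes with time since entry. That is the sense in which a phase-type sojourn can mimic semi-Markov behaviour.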

What carries the argument

A moment-matched phase-type distribution for each state's sojourn time, which converts the semi-Markov model into a hidden Markov model for likelihood computation.
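The conversion to a hidden Markov model can be sketched as follows. This is a toy illustration under stated assumptions, not the msmbayes implementation: the model has observed state 1 (split into two latent phases) and an absorbing observed state 2, the process is assumed to enter the first phase at the first observation time, and the helpers `mat_exp`, `phase_type_generator` and `forward_likelihood` are hypothetical names.

```python
import math

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_exp(Q, t, squarings=10, terms=20):
    """exp(Q * t) by scaling and squaring with a truncated Taylor series."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def phase_type_generator(a, b1, b2):
    # Latent phases: 0 and 1 both belong to observed state 1; 2 is state 2.
    return [[-(a + b1), a, b1],
            [0.0, -b2, b2],
            [0.0, 0.0, 0.0]]

PHASES_OF = {1: [0, 1], 2: [2]}  # observed state -> compatible latent phases

def forward_likelihood(obs, Q):
    """Forward algorithm for obs = [(time, observed_state), ...]."""
    n = len(Q)
    t_prev, s_first = obs[0]
    alpha = [0.0] * n
    alpha[PHASES_OF[s_first][0]] = 1.0  # assumption: state just entered
    for t, s in obs[1:]:
        P = mat_exp(Q, t - t_prev)
        moved = [sum(alpha[i] * P[i][j] for i in range(n)) for j in range(n)]
        # Keep only the latent phases compatible with the observed state.
        alpha = [moved[j] if j in PHASES_OF[s] else 0.0 for j in range(n)]
        t_prev = t
    return sum(alpha)

if __name__ == "__main__":
    Q = phase_type_generator(a=1.0, b1=0.2, b2=0.5)
    print(forward_likelihood([(0.0, 1), (1.0, 1), (2.5, 2)], Q))
```

The observed state only reveals which group of phases the process occupies, so the likelihood sums over the unobserved phase at each visit. Nothing in the recursion depends on the shape of the state graph, which is why the same machinery would extend to larger structures by enlarging `Q` and `PHASES_OF`.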

If this is right

General multi-state structures become feasible without custom restrictions.
Bayesian and maximum likelihood estimation are both supported in the new software.
Applications like modeling cognitive function decline can use time-dependent transitions.
Simulation-based calibration validates the method's performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar moment-matching could apply to other distributions or observation types in survival analysis.
Future work might compare approximation error to exact methods in small state spaces.
The software could integrate with existing multi-state modeling tools for broader adoption.

Load-bearing premise

The moment-matching approximation to Gamma or Weibull is sufficiently accurate to maintain the semi-Markov behavior and model stability without significant loss of fidelity.
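One way to probe this premise numerically is sketched below. Note the substitution: this is not the paper's moment-matching method but a simpler, classical two-moment fit (a mixture of Erlang(k-1) and Erlang(k) distributions with a common rate, valid when the squared coefficient of variation is below 1), compared against a Weibull target because its survival function has a closed form. All function names are hypothetical.

```python
import math

def weibull_moments(shape, scale):
    """Mean and variance of a Weibull(shape, scale) distribution."""
    m1 = scale * math.gamma(1 + 1 / shape)
    m2 = scale ** 2 * math.gamma(1 + 2 / shape)
    return m1, m2 - m1 ** 2

def fit_erlang_mixture(mean, var):
    """Two-moment fit: Erlang(k-1) w.p. p, Erlang(k) w.p. 1-p, common rate."""
    cv2 = var / mean ** 2
    if not 0 < cv2 < 1:
        raise ValueError("this sketch only covers 0 < cv^2 < 1")
    k = math.ceil(1 / cv2)
    disc = max(0.0, k * (1 + cv2) - k * k * cv2)  # guard against rounding
    p = (k * cv2 - math.sqrt(disc)) / (1 + cv2)
    rate = (k - p) / mean
    return k, p, rate

def erlang_survival(t, k, rate):
    x, term, total = rate * t, 1.0, 1.0
    for j in range(1, k):
        term *= x / j
        total += term
    return math.exp(-x) * total

def mixture_survival(t, k, p, rate):
    return (p * erlang_survival(t, k - 1, rate)
            + (1 - p) * erlang_survival(t, k, rate))

def weibull_survival(t, shape, scale):
    return math.exp(-((t / scale) ** shape))

def relative_survival_error(shape, scale, multiples=(0.5, 1, 2, 5, 10)):
    """Relative error of the fitted survival at multiples of the mean."""
    mean, var = weibull_moments(shape, scale)
    k, p, rate = fit_erlang_mixture(mean, var)
    errors = {}
    for m in multiples:
        target = weibull_survival(m * mean, shape, scale)
        approx = mixture_survival(m * mean, k, p, rate)
        errors[m] = abs(approx - target) / target
    return errors

if __name__ == "__main__":
    for m, err in relative_survival_error(shape=1.5, scale=1.0).items():
        print(f"t = {m} x mean: relative error {err:.3g}")
```

Any finite phase-type distribution has an asymptotically exponential tail, whereas a Weibull with shape above 1 decays faster than exponentially, so the relative error in the far tail grows without bound even when the absolute error is negligible. Whether that matters for the likelihood depends on how often observation gaps reach that far into the tail.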

What would settle it

Generate data from an exact semi-Markov model with known Gamma sojourns under intermittent observation, fit the phase-type approximated model, and check if recovered parameters match the true values within expected error; large discrepancies would falsify the practicality claim.
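The data-generating half of such a check might look like the sketch below. It assumes a deliberately simple structure that the paper does not prescribe: two states that alternate, Gamma sojourns with the same shape and mean in both states, and a fixed observation schedule. The function name and arguments are hypothetical, and the fitting step is left to whichever software is under test.

```python
import random

def simulate_intermittent(shape, mean_sojourn, obs_times, seed=0):
    """Alternating two-state semi-Markov process seen only at obs_times.

    Sojourn times are Gamma(shape, scale) with scale = mean_sojourn / shape.
    The process starts in state 1 at time 0. Returns [(time, state), ...].
    """
    rng = random.Random(seed)
    scale = mean_sojourn / shape
    state = 1
    t_next = rng.gammavariate(shape, scale)  # time of the next transition
    observed = []
    for t in sorted(obs_times):
        # Advance through every transition that happens before this visit.
        while t_next <= t:
            state = 2 if state == 1 else 1
            t_next += rng.gammavariate(shape, scale)
        observed.append((t, state))
    return observed

if __name__ == "__main__":
    visits = [0.0, 0.7, 1.5, 3.0, 4.2, 6.0]
    print(simulate_intermittent(shape=2.0, mean_sojourn=1.0,
                                obs_times=visits, seed=42))
```

Transitions between visits are discarded, which is what makes the data intermittent: two consecutive observations of the same state do not rule out a round trip in between.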

Original abstract

Multi-state models are commonly used for intermittent observations of a state over time, but these are generally based on the Markov assumption, that transition rates are independent of the time spent in current and previous states. In a semi-Markov model, the rates can depend on the time spent in the current state, though available methods for this are either restricted to specific state structures or lack general software. This paper develops the approach of using a "phase-type" distribution for the sojourn time in a state, which expresses a semi-Markov model as a hidden Markov model, allowing the likelihood to be calculated easily for any state structure. While this approach involves a proliferation of latent parameters, identifiability can be improved by restricting the phase-type family to one which approximates a simpler distribution such as the Gamma or Weibull. This paper proposes a moment-matching method to obtain this approximation, making general semi-Markov models for intermittent data accessible in software for the first time. The method is implemented in a new R package, "msmbayes", which implements Bayesian or maximum likelihood estimation for multi-state models with general state structures and covariates. The software is tested using simulation-based calibration, and an application to cognitive function decline illustrates the use of the method in a typical modelling workflow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Software for general semi-Markov models with intermittent data via phase-type and moment matching, though tail accuracy needs checking.

The one or two things to know: this paper supplies working software for fitting semi-Markov multi-state models to intermittently observed data across general state structures, using a phase-type representation turned into a hidden Markov model plus moment matching for the sojourn distributions. It is new in delivering a general implementation in the msmbayes R package that handles Bayesian or maximum likelihood fits with covariates. The simulation-based calibration and the cognitive function example show how the method fits into a standard workflow. That is the practical advance over earlier restricted approaches.

The approximation step is the main area to watch. By matching moments to approximate a Gamma or Weibull with a phase-type distribution, the method reduces parameters and aids identifiability. However, this may not fully preserve the tail probabilities that matter when observation intervals vary and the state graph has cycles or multiple exits. The paper does not provide a quantitative assessment of any resulting bias in the likelihood or estimates, so that remains a point for closer examination.

Readers working in medical statistics or reliability analysis who need flexible sojourn time modeling will find this useful, especially if they can use the package directly. It shows clear engagement with the computational challenges in the literature. The work deserves peer review to test the approximation's accuracy in more detail and confirm the software's reliability. I recommend putting it through review rather than desk rejecting it.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a method for semi-Markov modeling of intermittently observed multi-state data. It uses phase-type distributions for state sojourn times, approximated by moment-matching to Gamma or Weibull distributions, to represent the semi-Markov process as a hidden Markov model. This facilitates likelihood computation for general state structures. The approach is implemented in the R package msmbayes for Bayesian or maximum likelihood estimation, tested via simulation-based calibration, and applied to data on cognitive function decline.

Significance. Should the moment-matching approximation prove robust for preserving essential semi-Markov dynamics under intermittent observation, the paper would provide a valuable practical tool for fitting flexible semi-Markov models where existing methods are restrictive or lack software support. The open-source implementation in msmbayes and the simulation-based calibration for validation are strengths that support reproducibility and usability in the field.

major comments (2)

[Section 3] The moment-matching approximation to Gamma or Weibull distributions is central to improving identifiability, but the manuscript does not provide a quantitative bound on the approximation error for the tail of the sojourn time distribution. This is particularly relevant for intermittent observations where inter-observation times can be long, potentially leading to inaccurate integrated hazards in the likelihood.
[Likelihood derivation] The assumption that matching the first three moments suffices to control transition probabilities over arbitrary intervals is not supported by error analysis. For state graphs with cycles or competing exits, the phase-type restriction may distort the time-inhomogeneous behavior, affecting the stability and identifiability claims. Reporting the condition number of the observed-data information matrix or bias in simulated likelihoods would be necessary.

minor comments (1)

[Abstract] The abstract could benefit from a brief mention of the typical number of phases used in the phase-type distributions or how the number is selected.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive summary and constructive major comments, which highlight important aspects of the moment-matching approximation and its implications for the likelihood. We address each point below and have planned revisions to strengthen the manuscript's rigor while preserving its focus on practical implementation.

Referee: [Section 3] The moment-matching approximation to Gamma or Weibull distributions is central to improving identifiability, but the manuscript does not provide a quantitative bound on the approximation error for the tail of the sojourn time distribution. This is particularly relevant for intermittent observations where inter-observation times can be long, potentially leading to inaccurate integrated hazards in the likelihood.

Authors: We agree that a quantitative assessment of tail approximation error would enhance the manuscript, particularly for long inter-observation intervals. The current validation relies on simulation-based calibration showing overall good performance, but we acknowledge the absence of explicit bounds. In the revised manuscript, we will add numerical comparisons in Section 3 of the relative error in the survival function (and thus integrated hazards) between the phase-type approximation and target Gamma/Weibull distributions across a grid of time intervals up to 10 times the mean sojourn time, reporting maximum relative errors for representative parameter values.
revision: yes

Referee: [Likelihood derivation] The assumption that matching the first three moments suffices to control transition probabilities over arbitrary intervals is not supported by error analysis. For state graphs with cycles or competing exits, the phase-type restriction may distort the time-inhomogeneous behavior, affecting the stability and identifiability claims. Reporting the condition number of the observed-data information matrix or bias in simulated likelihoods would be necessary.

Authors: We appreciate the call for explicit error analysis in the context of the full multi-state likelihood. Moment matching approximates the marginal sojourn distribution, after which the phase-type representation yields an exact likelihood for the approximating model. We agree this does not automatically guarantee control of transition probabilities in graphs with cycles or competing exits. In revision, we will expand the simulation studies to include bias and coverage for parameter estimates in cyclic and competing-risk structures, and report condition numbers of the observed information matrix for the simulated datasets where computation is feasible. A full theoretical error bound on the likelihood for arbitrary intervals lies beyond the paper's scope but will be noted as a limitation with the strengthened empirical results.
revision: partial
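The condition-number diagnostic mentioned in this exchange could be computed along the following lines. This is a sketch under simplifying assumptions: the likelihood is a two-parameter Weibull fitted to fully observed sojourn times (not the intermittently observed multi-state likelihood), the Hessian is taken by central finite differences, and the helper names are hypothetical.

```python
import math

def numerical_hessian(f, x, h=1e-4):
    """Central-difference Hessian of f at the point x (a list of floats)."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di * h
                y[j] += dj * h
                return f(y)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

def condition_number_2x2(H):
    """Ratio of largest to smallest absolute eigenvalue of a symmetric 2x2."""
    a, d = H[0][0], H[1][1]
    b = 0.5 * (H[0][1] + H[1][0])  # symmetrise away finite-difference noise
    mid = 0.5 * (a + d)
    rad = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    eigs = sorted(abs(v) for v in (mid - rad, mid + rad))
    return math.inf if eigs[0] == 0.0 else eigs[1] / eigs[0]

def weibull_nll(params, times):
    """Negative log-likelihood on the (log shape, log scale) scale."""
    k, lam = math.exp(params[0]), math.exp(params[1])
    return -sum(math.log(k) - math.log(lam)
                + (k - 1) * math.log(t / lam) - (t / lam) ** k
                for t in times)

if __name__ == "__main__":
    times = [0.5, 1.2, 0.8, 2.0, 1.5, 0.9, 1.1]
    H = numerical_hessian(lambda v: weibull_nll(v, times), [0.5, 0.2])
    print("condition number:", condition_number_2x2(H))
```

Evaluated at the maximum likelihood estimate, the Hessian of the negative log-likelihood is the observed information matrix. A very large condition number would suggest that some combination of parameters is only weakly identified by the data.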

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper introduces a phase-type distribution representation to recast semi-Markov sojourn times as a hidden Markov model, enabling standard likelihood evaluation for arbitrary state structures under intermittent observation. The moment-matching restriction to approximate Gamma or Weibull distributions is presented as an explicit modeling choice to control parameter proliferation and improve identifiability, rather than a fitted quantity redefined as a prediction or a self-referential definition. No load-bearing equation or step reduces the claimed result to its own inputs by construction, and the abstract frames the contribution as a new computational device with software implementation and simulation-based calibration. This is the most common honest non-finding for a methods paper that builds on established HMM likelihood machinery without circular self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the phase-type representation converting semi-Markov to hidden Markov models and on the moment-matching approximation being adequate for identifiability in general structures.

free parameters (1)

phase-type parameters
Latent parameters of the phase-type distribution that are chosen to match moments of a target Gamma or Weibull distribution.

axioms (2)

domain assumption: Phase-type distributions can express semi-Markov sojourn times as hidden Markov models whose likelihood is tractable for arbitrary state structures.
Core technical step described in the abstract.
domain assumption: Moment-matching produces a phase-type approximation close enough to Gamma or Weibull to improve identifiability without distorting the semi-Markov dynamics.
Justification given for restricting the phase-type family.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: This paper proposes a moment-matching method to obtain this approximation... phase-type distribution of a particular family whose first three moments agree with those of the Gamma or Weibull.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear
Paper passage: A multi-state model with a phase-type sojourn distribution is an example of a hidden Markov model... likelihood can therefore be evaluated easily using the forward algorithm.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.