Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction

Fukang Wen; Jingyuan Li; Pipi Hu; Renqian Luo; Wei Liu; Xiaoyi Jiang; Yi Zhu; Zuoqiang Shi

arxiv: 2604.15694 · v2 · submitted 2026-04-17 · 💻 cs.LG · math.PR

Pith reviewed 2026-05-11 01:46 UTC · model grok-4.3

classification 💻 cs.LG math.PR

keywords discrete diffusion · continuous-time Markov chains · neural CTMC · jump timing · jump direction · ELBO factorization · language generation · Poisson process

The pith

Neural CTMC parameterizes the reverse CTMC process with separate heads for jump timing and direction, turning the ELBO into independent Poisson and categorical KL terms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that monolithic parameterization of the reverse rate matrix in discrete diffusion CTMCs can be replaced by two dedicated heads—one for the exit rate that controls when jumps occur and one for the categorical distribution over jump targets. This decomposition exploits the Poisson structure of CTMC paths so that the evidence lower bound factors into a timing-specific KL and a direction-specific KL, each with a simple, gradient-equivalent loss. On language tasks the resulting model produces lower generative perplexity than prior monolithic approaches while using the same training budget.

Core claim

The ELBO for a learned reverse CTMC process reduces to a path-space KL divergence between the true and approximate reverse dynamics; when the reverse rate matrix is factored into an exit-rate function and a jump distribution, this KL separates additively into a Poisson KL over jump times and a categorical KL over jump destinations, yielding a tractable, consistent training objective that does not require monolithic score or data-prediction proxies.
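
The separation can be written out at the level of instantaneous rates. The block below is a sketch of the standard identity, not a reproduction of the paper's derivation. The notation is assumed here: $\lambda(x)$ and $p(y \mid x)$ denote the true reverse exit rate and jump distribution, $\hat\lambda(x)$ and $\hat p(y \mid x)$ the learned heads, and time dependence is suppressed for brevity.

```latex
% Rates factor as q(x,y) = \lambda(x)\,p(y \mid x) and
% \hat q(x,y) = \hat\lambda(x)\,\hat p(y \mid x) for y \neq x.
\begin{align*}
\sum_{y \neq x} \Big[ q(x,y) \log \frac{q(x,y)}{\hat q(x,y)} - q(x,y) + \hat q(x,y) \Big]
  &= \sum_{y \neq x} \lambda(x)\, p(y \mid x)
     \Big[ \log \frac{\lambda(x)}{\hat\lambda(x)} + \log \frac{p(y \mid x)}{\hat p(y \mid x)} \Big]
     - \lambda(x) + \hat\lambda(x) \\
  &= \underbrace{\lambda(x) \log \frac{\lambda(x)}{\hat\lambda(x)} - \lambda(x) + \hat\lambda(x)}_{\text{Poisson KL rate (timing)}}
   \;+\; \lambda(x)\, \underbrace{\mathrm{KL}\big( p(\cdot \mid x) \,\big\|\, \hat p(\cdot \mid x) \big)}_{\text{categorical KL (direction)}}
\end{align*}
```

In this form the direction term is weighted by the true exit rate $\lambda(x)$; whether the paper's final loss keeps that weight or absorbs it elsewhere is not stated in the material reproduced here.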

What carries the argument

Decoupled parameterization of the reverse rate matrix via an exit-rate head (Poisson intensity governing jump timing) and a jump-distribution head (categorical probabilities over destination states).
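
To make the decomposition concrete, here is a minimal NumPy sketch that splits a small generator matrix into exit rates and jump distributions and then rebuilds it. The function names `split_rate_matrix` and `merge_rate_matrix` and the 3-state matrix are illustrative assumptions; in the paper the two quantities are produced by neural network heads rather than computed from a known matrix.

```python
import numpy as np

def split_rate_matrix(Q):
    """Split a CTMC generator Q into exit rates and jump distributions."""
    Q = np.asarray(Q, dtype=float)
    off = Q - np.diag(np.diag(Q))        # off-diagonal rates q(x, y), y != x
    exit_rate = off.sum(axis=1)          # lambda(x) = sum_{y != x} q(x, y)
    # p(y | x) = q(x, y) / lambda(x); rows with zero exit rate
    # (absorbing states) are left as all zeros.
    jump = np.divide(off, exit_rate[:, None],
                     out=np.zeros_like(off), where=exit_rate[:, None] > 0)
    return exit_rate, jump

def merge_rate_matrix(exit_rate, jump):
    """Rebuild the generator from exit rates and jump distributions."""
    off = exit_rate[:, None] * jump
    return off - np.diag(off.sum(axis=1))

# Hypothetical 3-state generator: rows sum to zero, off-diagonals are non-negative.
Q = np.array([[-3.0, 1.0, 2.0],
              [0.5, -0.5, 0.0],
              [4.0, 4.0, -8.0]])
exit_rate, jump = split_rate_matrix(Q)
Q_rebuilt = merge_rate_matrix(exit_rate, jump)
print(exit_rate)                  # timing information
print(jump)                       # direction information
print(np.allclose(Q, Q_rebuilt))
```

The round trip loses nothing as long as every exit rate is positive, which is the sense in which the two-head form is as expressive as a single rate matrix.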

If this is right

The training objective becomes a sum of independent Poisson and categorical losses that are gradient-equivalent to the full path KL.
The pure-uniform variant reaches 16.36 generative perplexity on TinyStories as scored by Gemma2-9B, against 37.60 for GIDD and 42.66 for MDLM.
On OpenWebText the method records the lowest perplexity among the baselines trained on the same token budget at every tested sampling step count from 16 to 128 (at 128 steps: 183.6 versus 210.5 for MDLM and 249.8 for GIDD).
Pretrained weights are released at https://huggingface.co/Jiangxy1117/Neural-CTMC for reproducibility.
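
The additive split of the loss can be checked numerically for a single state. The sketch below assumes the usual per-state KL rate between two jump processes, `sum_y [q log(q / q_hat) - q + q_hat]`, and compares it with a Poisson term on the exit rates plus an exit-rate-weighted categorical KL. The function names and the random toy rates are assumptions for illustration, not the paper's training code.

```python
import numpy as np

def monolithic_rate_kl(q, q_hat):
    """KL rate computed directly on the off-diagonal rates of one state."""
    return float(np.sum(q * np.log(q / q_hat) - q + q_hat))

def poisson_term(lam, lam_hat):
    """KL rate between Poisson intensities lam (true) and lam_hat (learned)."""
    return float(lam * np.log(lam / lam_hat) - lam + lam_hat)

def categorical_kl(p, p_hat):
    """KL between jump distributions p (true) and p_hat (learned)."""
    return float(np.sum(p * np.log(p / p_hat)))

rng = np.random.default_rng(0)
q = rng.uniform(0.1, 2.0, size=5)        # toy true rates out of one state
q_hat = rng.uniform(0.1, 2.0, size=5)    # toy learned rates out of the same state

lam, lam_hat = q.sum(), q_hat.sum()      # exit rates
p, p_hat = q / lam, q_hat / lam_hat      # jump distributions

full = monolithic_rate_kl(q, q_hat)
split = poisson_term(lam, lam_hat) + lam * categorical_kl(p, p_hat)
print(abs(full - split) < 1e-9)
```

The two quantities agree up to floating-point error, so in this toy setting optimizing the split form is the same as optimizing the monolithic one.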

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The timing-direction split may allow independent scaling or regularization of each head, something monolithic models cannot do directly.
Because the loss already separates timing from direction, one could in principle modulate sampling speed by adjusting only the exit-rate head without retraining the direction head.
The same factorization could be applied to other continuous-time discrete processes whose generator matrix admits a similar rate-and-target decomposition.
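
The idea of adjusting timing without touching direction can be illustrated with a toy Euler sampler. This is a sketch under strong assumptions: fixed arrays stand in for the two network heads, the chain is time-homogeneous, and `rate_scale` is a hypothetical knob that the paper does not describe or evaluate.

```python
import numpy as np

def euler_step(x, exit_rate, jump, dt, rate_scale, rng):
    """One Euler step: jump with probability rate_scale * lambda(x) * dt."""
    p_jump = min(1.0, rate_scale * exit_rate[x] * dt)
    if rng.random() < p_jump:
        # Direction is drawn from the unchanged jump distribution p(. | x).
        return int(rng.choice(len(exit_rate), p=jump[x])), True
    return x, False

def count_jumps(rate_scale, n_steps=20000, dt=0.01, seed=0):
    """Run the toy chain and count how many jumps occur."""
    rng = np.random.default_rng(seed)
    exit_rate = np.array([1.0, 1.0, 1.0])     # stand-in for the exit-rate head
    jump = np.array([[0.0, 0.5, 0.5],
                     [0.5, 0.0, 0.5],
                     [0.5, 0.5, 0.0]])        # stand-in for the direction head
    x, n_jumps = 0, 0
    for _ in range(n_steps):
        x, jumped = euler_step(x, exit_rate, jump, dt, rate_scale, rng)
        n_jumps += jumped
    return n_jumps

slow = count_jumps(rate_scale=1.0)
fast = count_jumps(rate_scale=2.0)
print(slow, fast)
```

Doubling `rate_scale` roughly doubles the expected number of jumps over the same horizon while the destination probabilities stay as they were. Whether such post-hoc rescaling preserves sample quality in a trained model is an open question.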

Load-bearing premise

Separately parameterizing the exit rate and jump distribution with two network heads still captures the full joint reverse dynamics without adding approximation error or optimization problems that cancel the factorization gain.

What would settle it

A direct comparison in which the factorized loss is replaced by a monolithic parameterization of the same rate matrix and the resulting perplexity on TinyStories or OpenWebText is measured; if the monolithic version matches or beats the factorized version under identical compute, the claimed advantage of the decomposition is refuted.

Figures

Figures reproduced from arXiv: 2604.15694 by Fukang Wen, Jingyuan Li, Pipi Hu, Renqian Luo, Wei Liu, Xiaoyi Jiang, Yi Zhu, Zuoqiang Shi.

**Figure 1.** Sample quality comparison. (a) On TinyStories, Neural CTMC (Euler and τ-leaping) converges to substantially lower perplexity than MDLM and GIDD under identical training conditions. (b) On OpenWebText, Neural CTMC performs best among the equal-budget baselines across the step counts we evaluate, and remains competitive with SEDD despite using 2.6× fewer training tokens.

**Figure 2.** Unconditional MNIST samples generated by Neural CTMC at epoch 80 with 128 sampling steps.

Original abstract

Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix monolithically, through proxies such as concrete scores (SEDD) or clean-data predictions (MDLM, GIDD), rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. We propose Neural CTMC, which exploits the underlying Poisson structure of CTMC dynamics by separately parameterizing the reverse process through an exit rate (when to jump) and a jump distribution (where to jump) via two dedicated network heads. We show that the evidence lower bound (ELBO) reduces to a path-space KL divergence between the true and learned reverse processes that factorizes into a Poisson KL for timing and a categorical KL for direction, and admits a tractable, gradient-equivalent and consistent loss. Experimentally, scored by Gemma2-9B, our pure-uniform Neural CTMC achieves 16.36 generative perplexity on TinyStories (vs. GIDD 37.60 and MDLM 42.66). On OpenWebText, it attains the best perplexity at the same training-token budget across 16–128 sampling steps among the methods we compare (e.g., at 128 steps: Neural CTMC 183.6 vs. MDLM 210.5 and GIDD 249.8). To facilitate reproducibility, we release our pretrained weights at https://huggingface.co/Jiangxy1117/Neural-CTMC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper splits the reverse CTMC into separate exit-rate and jump-distribution heads to factorize the ELBO into Poisson and categorical terms, and reports stronger perplexity than GIDD or MDLM on language tasks.

The main takeaway is that reparameterizing the reverse process with two dedicated heads—one for when to jump and one for where—lets the path-space KL split cleanly into independent timing and direction losses. This is not just a trick; any valid rate matrix can be recovered from a positive exit rate and a categorical jump distribution, so the move preserves expressivity while making the objective tractable and gradient-equivalent to the original ELBO. The factorization follows from standard CTMC path properties rather than any new approximation.

Referee Report

2 major / 3 minor

Summary. The paper proposes Neural CTMC, a reparameterization of the reverse process in continuous-time Markov chain (CTMC) discrete diffusion models. Instead of monolithic rate-matrix proxies, it uses two dedicated network heads to separately predict the exit rate (jump timing) and the jump distribution (jump direction). The central claim is that the evidence lower bound (ELBO) then reduces exactly to a path-space KL divergence that factorizes into an independent Poisson KL term for timing and a categorical KL term for direction; the resulting loss is tractable, gradient-equivalent to the original ELBO, and consistent. Experiments report that a pure-uniform variant achieves 16.36 generative perplexity on TinyStories (versus 37.60 for GIDD and 42.66 for MDLM) and the lowest perplexity on OpenWebText across 16–128 sampling steps at fixed training budget.

Significance. If the factorization and gradient equivalence hold without hidden approximation error, the work provides a structurally aligned parameterization for CTMC diffusion that exploits the independent exponential waiting times and categorical jump targets inherent to CTMC paths. This could simplify training dynamics and improve sample quality for discrete data. The release of pretrained weights at the cited Hugging Face repository is a concrete reproducibility strength.

major comments (2)

The abstract and reader's summary assert that the ELBO 'reduces to' a factorized path-space KL with 'gradient-equivalent' loss, yet no explicit derivation, intermediate steps, or proof sketch is referenced in the provided material. Because this factorization is load-bearing for the tractability claim, the full derivation (including how the reparameterization preserves the original measure and why the gradients match) must be supplied in the main text or appendix.
Experimental results report point estimates (e.g., 16.36 perplexity on TinyStories, 183.6 at 128 steps on OpenWebText) without error bars, multiple random seeds, or ablation studies on the two-head architecture. If the central claim is that the decoupled parameterization yields both theoretical and practical gains, these controls are necessary to establish that the reported improvements are robust rather than artifacts of a single run or hyper-parameter choice.

minor comments (3)

Notation for the exit-rate head and jump-distribution head should be introduced with explicit symbols (e.g., λ_θ(x_t) and p_θ(·|x_t)) and contrasted with the monolithic rate matrix Q_θ used in prior work (SEDD, MDLM, GIDD) to make the reparameterization transparent.
The abstract states 'pure-uniform Neural CTMC'; the precise meaning of 'pure-uniform' (e.g., whether the jump distribution is fixed to uniform or learned) should be clarified in the methods section.
The link to released weights is useful; the paper should also indicate the exact training-token budget, model size, and optimizer settings used for the reported OpenWebText and TinyStories runs so that the comparison is fully reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We address each major comment below and commit to revisions that strengthen the theoretical exposition and experimental robustness.

Referee: The abstract and reader's summary assert that the ELBO 'reduces to' a factorized path-space KL with 'gradient-equivalent' loss, yet no explicit derivation, intermediate steps, or proof sketch is referenced in the provided material. Because this factorization is load-bearing for the tractability claim, the full derivation (including how the reparameterization preserves the original measure and why the gradients match) must be supplied in the main text or appendix.

Authors: We agree that an explicit, self-contained derivation is necessary to support the central claims. While the manuscript states the ELBO reduction, the intermediate steps were condensed. In the revised version we will insert a complete derivation in Appendix A (with forward references from Section 3), covering: (i) the path-space KL between the true and learned reverse CTMC processes, (ii) the exact factorization into independent Poisson and categorical KL terms under the two-head parameterization, (iii) preservation of the original path measure, and (iv) the proof that the resulting loss is gradient-equivalent to the original ELBO with no hidden approximations. This will make the tractability argument fully rigorous. revision: yes
Referee: Experimental results report point estimates (e.g., 16.36 perplexity on TinyStories, 183.6 at 128 steps on OpenWebText) without error bars, multiple random seeds, or ablation studies on the two-head architecture. If the central claim is that the decoupled parameterization yields both theoretical and practical gains, these controls are necessary to establish that the reported improvements are robust rather than artifacts of a single run or hyper-parameter choice.

Authors: We concur that statistical controls and targeted ablations are required to substantiate the practical gains. In the revision we will report means and standard deviations over at least three independent random seeds for all key perplexity numbers on both TinyStories and OpenWebText. We will also add an ablation comparing the two-head (exit-rate + jump-distribution) architecture against an otherwise identical monolithic rate-matrix baseline, keeping training budget and architecture size fixed. These additions will isolate the contribution of the decoupled parameterization. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The derivation begins from the standard CTMC path measure (independent exponential waiting times and categorical jump targets) and shows that the ELBO equals a path-space KL that separates into a Poisson term on exit times and a categorical term on jump targets. This separation is an algebraic identity once the rate matrix is reparameterized as exit rate times jump distribution; the reparameterization itself is one-to-one for any valid Q-matrix whose exit rates are positive and introduces no fitted quantity that is later renamed as a prediction. No self-citation is invoked to justify uniqueness or to close the argument, and the resulting loss is shown to be gradient-equivalent to the original ELBO without additional assumptions that embed the target result. The construction is therefore self-contained against the external definition of CTMC dynamics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the standard decomposition of CTMC dynamics into independent timing and direction components; no additional free parameters or invented entities are introduced beyond ordinary neural-network parameterization.

axioms (1)

domain assumption: Continuous-time Markov chain reverse dynamics decompose into a Poisson process governing jump timing and an independent categorical distribution governing jump direction.
This decomposition is invoked to justify separate network heads and the factorized KL loss.
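
The assumption can be seen directly in how a CTMC path is simulated: draw a waiting time from the exit rate, then draw a destination from the jump distribution. The sketch below assumes a time-homogeneous chain with fixed toy arrays; the reverse process in the paper is time-dependent and neural, so this shows only the structure, and `simulate_ctmc` is a hypothetical name.

```python
import numpy as np

def simulate_ctmc(x0, exit_rate, jump, horizon, rng):
    """Exact simulation of a time-homogeneous CTMC up to time `horizon`."""
    t, x = 0.0, x0
    times, states = [0.0], [x0]
    while exit_rate[x] > 0:
        # Timing: the holding time in state x is exponential with rate lambda(x).
        t += rng.exponential(1.0 / exit_rate[x])
        if t >= horizon:
            break
        # Direction: the destination is categorical, drawn independently of the holding time.
        x = int(rng.choice(len(exit_rate), p=jump[x]))
        times.append(t)
        states.append(x)
    return times, states

rng = np.random.default_rng(0)
exit_rate = np.array([2.0, 1.0, 3.0])         # toy exit rates
jump = np.array([[0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5],
                 [0.5, 0.5, 0.0]])            # toy jump distributions, zero diagonal
times, states = simulate_ctmc(0, exit_rate, jump, horizon=5.0, rng=rng)
print(len(states) - 1, "jumps before the horizon")
```

The two random draws inside the loop never share information, which is the independence the separate network heads and the factorized loss rely on.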

pith-pipeline@v0.9.0 · 5622 in / 1408 out tokens · 47893 ms · 2026-05-11T01:46:39.625562+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space
cs.LG 2026-05 unverdicted novelty 7.0

Introduces adjoint-equation framework establishing dimension-free convergence bounds in any IPM for discrete diffusion models under masked and uniform priors.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

doi: 10.1101/2023.09.11.556673. David F. Anderson. A modified next reaction method for simulating chemical systems with time dependent propensities and delays.The Journal of Chemical Physics, 127(21),

work page doi:10.1101/2023.09.11.556673 2023
[2]

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

URLhttps://proceedings.mlr.press/v235/campbell24a.html. Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english?arXiv preprint arXiv:2305.07759,

work page internal anchor Pith review arXiv
[3]

Both are tokenized with the GPT-2 BPE tokenizer (vocabulary size 50257) and truncated or padded to a maximum sequence length of

D Additional Experimental Details Datasets.We evaluate on two datasets: (1) TinyStories [Eldan and Li, 2023], a corpus of simple short stories for language modeling; (2) OpenWebText (OWT) [Radford et al., 2019], a large-scale web text corpus. Both are tokenized with the GPT-2 BPE tokenizer (vocabulary size 50257) and truncated or padded to a maximum seque...

work page 2023
[4]

under far greater strain

The Lab has committed to introducing autonomy and run AI development programs at warehouse laboratories in 28 villages, comparable to an ITTA cooperative. She also continued to pursue efforts to install local government staff that can better handle high-end robots, like the AITs. Sample 4.Chief thing here today from my employer, Barlan Yblin. You’ve got t...

work page 2009
