Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
Pith reviewed 2026-05-11 01:46 UTC · model grok-4.3
The pith
Neural CTMC parameterizes the reverse CTMC with two separate heads, one for jump timing and one for jump direction, so that the ELBO splits into a Poisson KL term and a categorical KL term.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The ELBO for a learned reverse CTMC reduces to a path-space KL divergence between the true and approximate reverse dynamics. When the reverse rate matrix is factored into an exit-rate function and a jump distribution, this KL separates additively into a Poisson KL over jump times and a categorical KL over jump destinations. The result is a tractable, consistent training objective that does not require monolithic score or data-prediction proxies.
What carries the argument
Decoupled parameterization of the reverse rate matrix via an exit-rate head (Poisson intensity governing jump timing) and a jump-distribution head (categorical probabilities over destination states).
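As a worked illustration of this split, the identity below shows how the per-state integrand of a path-space KL between two CTMCs separates once each rate matrix is written as exit rate times jump distribution. The notation is an assumption made here, not taken from the paper: $Q_t$ is the true reverse rate matrix, hats mark the learned quantities, and the paper's own loss may weight or integrate the two terms differently.

```latex
% Assumed notation: rate matrix = exit rate x jump distribution (off-diagonal).
Q_t(x,y) = \lambda_t(x)\, p_t(y \mid x) \quad (y \neq x), \qquad
Q_t(x,x) = -\lambda_t(x), \qquad
\sum_{y \neq x} p_t(y \mid x) = 1 .

% Per-state integrand of the path-space KL, before and after the split:
\sum_{y \neq x} \Big[ Q_t(x,y) \log \frac{Q_t(x,y)}{\hat{Q}_t(x,y)}
      - Q_t(x,y) + \hat{Q}_t(x,y) \Big]
= \underbrace{\lambda_t(x) \log \frac{\lambda_t(x)}{\hat{\lambda}_t(x)}
      - \lambda_t(x) + \hat{\lambda}_t(x)}_{\mathrm{KL}\left(\mathrm{Poisson}(\lambda_t)\,\|\,\mathrm{Poisson}(\hat{\lambda}_t)\right)}
\;+\; \lambda_t(x)
  \underbrace{\sum_{y \neq x} p_t(y \mid x)
      \log \frac{p_t(y \mid x)}{\hat{p}_t(y \mid x)}}_{\mathrm{KL}\left(p_t(\cdot \mid x)\,\|\,\hat{p}_t(\cdot \mid x)\right)} .
```

The first term depends only on the two exit rates and the second only on the two jump distributions, scaled by the true exit rate.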
If this is right
- The training objective becomes a sum of independent Poisson and categorical losses that are gradient-equivalent to the full path KL.
- The pure-uniform variant reaches 16.36 generative perplexity on TinyStories as scored by Gemma2-9B, against 37.60 for GIDD and 42.66 for MDLM.
- On OpenWebText, at the same training-token budget, the method records the lowest perplexity among the compared models at every tested sampling step count from 16 to 128 (at 128 steps: 183.6, against 210.5 for MDLM and 249.8 for GIDD).
- Pretrained weights are released at https://huggingface.co/Jiangxy1117/Neural-CTMC to facilitate reproducibility.
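The first consequence in this list can be sanity-checked numerically. The sketch below is a minimal illustration under stated assumptions, not the paper's loss: it compares a monolithic rate-level KL integrand for one state with the sum of a Poisson KL on the exit rates and a categorical KL on the jump distributions, the latter weighted by the true exit rate. All function names and the numeric values are hypothetical.

```python
import math

def rate_kl(q, q_hat):
    """Per-state KL integrand between two sets of off-diagonal jump rates."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(q, q_hat))

def poisson_kl(lam, lam_hat):
    """KL(Poisson(lam) || Poisson(lam_hat))."""
    return lam * math.log(lam / lam_hat) - lam + lam_hat

def categorical_kl(p, p_hat):
    """KL between two categorical distributions on the same support."""
    return sum(a * math.log(a / b) for a, b in zip(p, p_hat) if a > 0)

# Hypothetical values for one current state with three possible destinations.
lam, lam_hat = 2.0, 1.5          # true and learned exit rates
p = [0.5, 0.3, 0.2]              # true jump distribution
p_hat = [0.4, 0.4, 0.2]          # learned jump distribution

# Monolithic view: off-diagonal rates are exit rate times jump probability.
q = [lam * pi for pi in p]
q_hat = [lam_hat * pi for pi in p_hat]

monolithic = rate_kl(q, q_hat)
factorized = poisson_kl(lam, lam_hat) + lam * categorical_kl(p, p_hat)

# The two quantities agree up to floating-point error.
assert abs(monolithic - factorized) < 1e-12
```

Because the agreement is an algebraic identity, it holds for any positive rates and any pair of distributions with matching support, not only for the values chosen here.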
Where Pith is reading between the lines
- The timing-direction split may allow independent scaling or regularization of each head, something monolithic models cannot do directly.
- Because the loss already separates timing from direction, one could in principle modulate sampling speed by adjusting only the exit-rate head without retraining the direction head.
- The same factorization could be applied to other continuous-time discrete processes whose generator matrix admits a similar rate-and-target decomposition.
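The speed-modulation idea in the second point can be sketched with a Gillespie-style sampler in which a multiplier touches only the exit rate. This is a toy illustration under simplifying assumptions: the chain is time-homogeneous (the reverse process in the paper is not), and `sample_path`, `speed`, and the toy chain are hypothetical names introduced here.

```python
import random

def sample_path(x0, exit_rate, jump_dist, horizon, speed=1.0, rng=None):
    """Simulate a time-homogeneous CTMC on [0, horizon).

    exit_rate(x) returns the rate of leaving state x.
    jump_dist(x) returns {destination: probability}, excluding x itself.
    speed rescales the exit rate only; the jump distribution is untouched.
    """
    if rng is None:
        rng = random.Random(0)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        rate = speed * exit_rate(x)
        if rate <= 0.0:
            break                        # absorbing state: no further jumps
        t += rng.expovariate(rate)       # exponential waiting time (timing)
        if t >= horizon:
            break
        targets, probs = zip(*jump_dist(x).items())
        x = rng.choices(targets, weights=probs, k=1)[0]  # destination (direction)
        path.append((t, x))
    return path

# Toy 3-state chain (hypothetical): leave state x at rate 1 + x, then move to
# one of the other two states with equal probability.
def toy_exit_rate(x):
    return 1.0 + x

def toy_jump_dist(x):
    return {y: 0.5 for y in range(3) if y != x}

slow = sample_path(0, toy_exit_rate, toy_jump_dist, horizon=5.0, speed=0.5,
                   rng=random.Random(0))
fast = sample_path(0, toy_exit_rate, toy_jump_dist, horizon=5.0, speed=4.0,
                   rng=random.Random(0))
print(len(slow) - 1, "jumps at speed 0.5;", len(fast) - 1, "jumps at speed 4.0")
```

Whether rescaling the exit-rate head alone preserves sample quality in a trained model is an open question that this sketch does not address.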
Load-bearing premise
Separately parameterizing the exit rate and jump distribution with two network heads still captures the full joint reverse dynamics without adding approximation error or optimization problems that cancel the factorization gain.
What would settle it
A direct comparison that replaces the factorized parameterization with a monolithic parameterization of the same rate matrix and measures the resulting perplexity on TinyStories or OpenWebText. If the monolithic version matches or beats the factorized version under identical compute, the claimed advantage of the decomposition is refuted.
Original abstract
Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix monolithically, through proxies such as concrete scores (SEDD) or clean-data predictions (MDLM, GIDD), rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. We propose Neural CTMC, which exploits the underlying Poisson structure of CTMC dynamics by separately parameterizing the reverse process through an exit rate (when to jump) and a jump distribution (where to jump) via two dedicated network heads. We show that the evidence lower bound (ELBO) reduces to a path-space KL divergence between the true and learned reverse processes that factorizes into a Poisson KL for timing and a categorical KL for direction, and admits a tractable, gradient-equivalent and consistent loss. Experimentally, scored by Gemma2-9B, our pure-uniform Neural CTMC achieves 16.36 generative perplexity on TinyStories (vs. GIDD 37.60 and MDLM 42.66). On OpenWebText, it attains the best perplexity at the same training-token budget across 16–128 sampling steps among the methods we compare (e.g., at 128 steps: Neural CTMC 183.6 vs. MDLM 210.5 and GIDD 249.8). To facilitate reproducibility, we release our pretrained weights at https://huggingface.co/Jiangxy1117/Neural-CTMC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Neural CTMC, a reparameterization of the reverse process in continuous-time Markov chain (CTMC) discrete diffusion models. Instead of monolithic rate-matrix proxies, it uses two dedicated network heads to separately predict the exit rate (jump timing) and the jump distribution (jump direction). The central claim is that the evidence lower bound (ELBO) then reduces exactly to a path-space KL divergence that factorizes into an independent Poisson KL term for timing and a categorical KL term for direction; the resulting loss is tractable, gradient-equivalent to the original ELBO, and consistent. Experiments report that a pure-uniform variant achieves 16.36 generative perplexity on TinyStories (versus 37.60 for GIDD and 42.66 for MDLM) and the lowest perplexity on OpenWebText across 16–128 sampling steps at fixed training budget.
Significance. If the factorization and gradient equivalence hold without hidden approximation error, the work provides a structurally aligned parameterization for CTMC diffusion that exploits the independent exponential waiting times and multinomial jump targets inherent to CTMC paths. This could simplify training dynamics and improve sample quality for discrete data. The release of pretrained weights at the cited Hugging Face repository is a concrete reproducibility strength.
major comments (2)
- The abstract and reader's summary assert that the ELBO 'reduces to' a factorized path-space KL with 'gradient-equivalent' loss, yet no explicit derivation, intermediate steps, or proof sketch is referenced in the provided material. Because this factorization is load-bearing for the tractability claim, the full derivation (including how the reparameterization preserves the original measure and why the gradients match) must be supplied in the main text or appendix.
- Experimental results report point estimates (e.g., 16.36 perplexity on TinyStories, 183.6 at 128 steps on OpenWebText) without error bars, multiple random seeds, or ablation studies on the two-head architecture. If the central claim is that the decoupled parameterization yields both theoretical and practical gains, these controls are necessary to establish that the reported improvements are robust rather than artifacts of a single run or hyper-parameter choice.
minor comments (3)
- Notation for the exit-rate head and jump-distribution head should be introduced with explicit symbols (e.g., λ_θ(x_t) and p_θ(·|x_t)) and contrasted with the monolithic rate matrix Q_θ used in prior work (SEDD, MDLM, GIDD) to make the reparameterization transparent.
- The abstract states 'pure-uniform Neural CTMC'; the precise meaning of 'pure-uniform' (e.g., whether the jump distribution is fixed to uniform or learned) should be clarified in the methods section.
- The link to released weights is useful; the paper should also indicate the exact training-token budget, model size, and optimizer settings used for the reported OpenWebText and TinyStories runs so that the comparison is fully reproducible.
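The notation requested in the first minor comment can be made concrete with a small sketch of what two output heads might compute for a single current state. Everything here is an assumption for illustration: the softplus activation for the exit rate, the masked softmax for the jump distribution, and the function names are not taken from the paper, and a real model would operate on token sequences rather than a single state.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps the exit rate positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def masked_softmax(logits, current):
    """Softmax over destination states, with the current state excluded."""
    m = max(z for j, z in enumerate(logits) if j != current)
    exps = [0.0 if j == current else math.exp(z - m)
            for j, z in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]

def two_head_output(rate_logit, jump_logits, current):
    """Combine an exit-rate head and a jump-distribution head into a rate row.

    Returns (lam, p, q_row), where q_row is the row of the implied rate
    matrix for the current state: lam * p off the diagonal, -lam on it.
    """
    lam = softplus(rate_logit)                      # when to jump
    p = masked_softmax(jump_logits, current)        # where to jump
    q_row = [-lam if j == current else lam * pj for j, pj in enumerate(p)]
    return lam, p, q_row

# Hypothetical head outputs for a 4-state vocabulary, currently in state 2.
lam, p, q_row = two_head_output(0.3, [1.0, -0.5, 2.0, 0.0], current=2)
```

In this sketch the monolithic object of prior work corresponds to `q_row`, while the two heads produce `lam` and `p` separately; the row sums to zero by construction.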
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. We address each major comment below and commit to revisions that strengthen the theoretical exposition and experimental robustness.
- Referee: The abstract and reader's summary assert that the ELBO 'reduces to' a factorized path-space KL with 'gradient-equivalent' loss, yet no explicit derivation, intermediate steps, or proof sketch is referenced in the provided material. Because this factorization is load-bearing for the tractability claim, the full derivation (including how the reparameterization preserves the original measure and why the gradients match) must be supplied in the main text or appendix.
  Authors: We agree that an explicit, self-contained derivation is necessary to support the central claims. While the manuscript states the ELBO reduction, the intermediate steps were condensed. In the revised version we will insert a complete derivation in Appendix A (with forward references from Section 3), covering: (i) the path-space KL between the true and learned reverse CTMC processes, (ii) the exact factorization into independent Poisson and categorical KL terms under the two-head parameterization, (iii) preservation of the original path measure, and (iv) the proof that the resulting loss is gradient-equivalent to the original ELBO with no hidden approximations. This will make the tractability argument fully rigorous.
  Revision: yes
- Referee: Experimental results report point estimates (e.g., 16.36 perplexity on TinyStories, 183.6 at 128 steps on OpenWebText) without error bars, multiple random seeds, or ablation studies on the two-head architecture. If the central claim is that the decoupled parameterization yields both theoretical and practical gains, these controls are necessary to establish that the reported improvements are robust rather than artifacts of a single run or hyper-parameter choice.
  Authors: We concur that statistical controls and targeted ablations are required to substantiate the practical gains. In the revision we will report means and standard deviations over at least three independent random seeds for all key perplexity numbers on both TinyStories and OpenWebText. We will also add an ablation comparing the two-head (exit-rate + jump-distribution) architecture against an otherwise identical monolithic rate-matrix baseline, keeping training budget and architecture size fixed. These additions will isolate the contribution of the decoupled parameterization.
  Revision: yes
Circularity Check
No significant circularity
The derivation begins from the standard CTMC path measure (independent exponential waiting times and multinomial jump targets) and shows that the ELBO equals a path-space KL that separates into a Poisson term on exit times and a categorical term on jump targets. This separation is an algebraic identity once the rate matrix is reparameterized as exit-rate times jump-distribution; the reparameterization itself is bijective for any valid Q-matrix and introduces no fitted quantity that is later renamed as a prediction. No self-citation is invoked to justify uniqueness or to close the argument, and the resulting loss is shown to be gradient-equivalent to the original ELBO without additional assumptions that embed the target result. The construction is therefore self-contained against the external definition of CTMC dynamics.
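The bijectivity statement in this rationale can be checked on a toy example. The sketch below assumes a small, hand-written rate matrix in which every state has a strictly positive exit rate (for an absorbing state the jump distribution would be undefined, so the round trip is only claimed under that assumption). The helper names are hypothetical.

```python
def split_rate_matrix(Q):
    """Map a rate matrix to (exit rates, jump distributions).

    Assumes every diagonal entry is strictly negative, i.e. no absorbing state.
    """
    n = len(Q)
    lam = [-Q[i][i] for i in range(n)]
    P = [[0.0 if i == j else Q[i][j] / lam[i] for j in range(n)]
         for i in range(n)]
    return lam, P

def merge_heads(lam, P):
    """Inverse map: rebuild the rate matrix from exit rates and jump rows."""
    n = len(lam)
    return [[-lam[i] if i == j else lam[i] * P[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical 3-state rate matrix: non-negative off-diagonals, rows sum to 0.
Q = [[-2.0, 1.5, 0.5],
     [1.0, -1.0, 0.0],
     [0.3, 0.7, -1.0]]

lam, P = split_rate_matrix(Q)
Q_back = merge_heads(lam, P)

# Each jump row is a probability distribution, and the round trip recovers Q.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
assert all(abs(Q[i][j] - Q_back[i][j]) < 1e-12
           for i in range(3) for j in range(3))
```

The check confirms only that no information is lost in the reparameterization; it says nothing about whether two separately trained heads optimize as well as a single monolithic one.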
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: continuous-time Markov chain reverse dynamics decompose into a Poisson process governing jump timing and an independent categorical distribution governing jump direction.
Forward citations
Cited by 1 Pith paper
- Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space
  Introduces an adjoint-equation framework establishing dimension-free convergence bounds in any IPM for discrete diffusion models under masked and uniform priors.