Differentiable Filtering for Learning Hidden Markov Models

Heng-Sheng Chang; Prashant G. Mehta; Reginald Zhiyan Chen

arXiv: 2511.10571 · v2 · submitted 2025-11-13 · 💻 cs.LG · cs.SY · eess.SY · math.PR

Pith reviewed 2026-05-17 22:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY · math.PR

keywords Hidden Markov Models · parameter learning · differentiable filtering · forward recursion · neural networks · Baum-Welch · spectral learning · autoregressive training

The pith

Belief Net recovers Hidden Markov Model parameters by representing the forward filter as an interpretable neural network optimized via gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new method to learn the parameters of Hidden Markov Models from observation sequences. It casts the standard forward filtering recursion as a neural network whose weights are directly the logits of the HMM parameters, allowing the use of stochastic gradient descent for optimization. This is trained using a simple autoregressive loss that predicts the next observation. The approach is shown to converge faster than the classical Baum-Welch algorithm on synthetic data and to work in settings where the number of possible observations is larger than the number of hidden states, a case where spectral learning methods break down.

Core claim

Belief Net formulates the forward filter as a structured neural network in which the learnable weights correspond exactly to the logits of the initial state distribution, transition probabilities, and emission probabilities of the underlying HMM. Training proceeds end-to-end by minimizing the autoregressive next-observation prediction loss on sequences generated from the model. Experiments on synthetic HMM data demonstrate faster convergence compared with Baum-Welch and successful parameter recovery in both undercomplete and overcomplete regimes, while spectral methods fail in the overcomplete case.
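The recursion described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names `softmax` and `belief_filter_nll`, the row-stochastic layout of `A` and `C`, and the toy logits are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Map a list of real-valued logits to a probability vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def belief_filter_nll(mu_logits, A_logits, C_logits, observations):
    """Run the forward filter on a non-empty observation sequence.

    Returns (mean next-observation negative log-likelihood, final belief).
    """
    mu = softmax(mu_logits)                  # initial state distribution
    A = [softmax(row) for row in A_logits]   # A[i][j] = P(next state j | state i)
    C = [softmax(row) for row in C_logits]   # C[i][k] = P(observation k | state i)
    n = len(mu)
    prior = mu                               # belief before seeing the current observation
    nll = 0.0
    for z in observations:
        joint = [prior[i] * C[i][z] for i in range(n)]
        p_z = sum(joint)                     # predictive probability of z given the history
        nll -= math.log(p_z)
        belief = [w / p_z for w in joint]    # posterior over hidden states, sums to 1
        prior = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
    return nll / len(observations), belief

# Toy example: 2 hidden states, 3 observation symbols (hypothetical values).
loss, belief = belief_filter_nll(
    [0.0, 0.0],
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]],
    [0, 0, 2, 2, 1],
)
```

Because the only free quantities are the three logit arrays, whatever an optimizer changes can be read back as an initial distribution, a transition matrix, and an emission matrix by applying `softmax` row by row.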

What carries the argument

Belief Net: a decoder-only neural architecture that recursively updates the belief state (posterior over hidden states) using weights set to the HMM parameter logits.

If this is right

Standard backpropagation and stochastic gradient descent can be used to learn HMM parameters without custom algorithms.
The learned parameters remain fully interpretable as they are explicitly the model matrices.
Parameter recovery succeeds even when the observation space is larger than the hidden state space.
The same framework can be applied to real-world sequential data such as language modeling.
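To make the first consequence concrete, the sketch below fits the logits of a tiny HMM by plain gradient descent on the mean next-observation loss. Two substitutions are made so that it runs with the standard library alone, and neither reflects the paper's setup: gradients come from central finite differences rather than backpropagation, and the descent is full-batch on a single sequence with step halving rather than stochastic. The names `mean_nll`, `numeric_grad`, and `fit`, and the flat parameter layout, are hypothetical.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def mean_nll(theta, n_states, n_obs, seq):
    """Mean next-observation NLL of seq under the HMM encoded by the flat logit list theta."""
    mu = softmax(theta[:n_states])
    off = n_states
    A = [softmax(theta[off + i * n_states: off + (i + 1) * n_states]) for i in range(n_states)]
    off += n_states * n_states
    C = [softmax(theta[off + i * n_obs: off + (i + 1) * n_obs]) for i in range(n_states)]
    prior, nll = mu, 0.0
    for z in seq:
        joint = [prior[i] * C[i][z] for i in range(n_states)]
        p_z = sum(joint)
        nll -= math.log(p_z)
        prior = [sum(joint[i] / p_z * A[i][j] for i in range(n_states)) for j in range(n_states)]
    return nll / len(seq)

def numeric_grad(f, theta, eps=1e-5):
    """Central finite differences; a stand-in for backpropagation in this sketch."""
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
        dn = theta[:k] + [theta[k] - eps] + theta[k + 1:]
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad

def fit(seq, n_states, n_obs, theta0, steps=30, lr=1.0):
    """Gradient descent with step halving: a step is accepted only if the loss drops."""
    f = lambda th: mean_nll(th, n_states, n_obs, seq)
    theta, loss = list(theta0), f(list(theta0))
    history = [loss]
    for _ in range(steps):
        g = numeric_grad(f, theta)
        step = lr
        while step > 1e-8:
            cand = [t - step * gk for t, gk in zip(theta, g)]
            cand_loss = f(cand)
            if cand_loss < loss:
                theta, loss = cand, cand_loss
                break
            step *= 0.5
        history.append(loss)
    return theta, history

# Hypothetical toy data: a periodic sequence over 2 symbols, fitted with 2 hidden states.
seq = [0, 0, 0, 1, 1, 1] * 5
theta0 = [0.1 * ((7 * k) % 5 - 2) for k in range(2 + 2 * 2 + 2 * 2)]
theta, history = fit(seq, 2, 2, theta0)
```

With an automatic-differentiation library, `numeric_grad` would be replaced by a single backward pass through the same recursion; the loss function itself would not change.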

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the architecture to handle continuous observations or non-stationary transitions could broaden its applicability to more complex time series.
Hybrid models combining this interpretable core with larger black-box networks might improve performance on tasks requiring both structure and capacity.
This differentiable formulation opens the door to joint learning of HMM parameters alongside other model components in an end-to-end pipeline.

Load-bearing premise

The assumption that training the structured network with next-observation loss alone will recover the exact underlying transition and emission matrices of the true HMM.

What would settle it

Generate synthetic sequences from a known HMM, train Belief Net to convergence, and check whether the extracted parameters match the ground-truth matrices within numerical tolerance.
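A sketch of the two bookends of that test follows: drawing sequences from a known HMM, and scoring an estimate against the ground truth. Hidden-state labels are arbitrary, so the comparison below takes the smallest Frobenius error over all relabelings of the states. The helper names `sample_hmm`, `frobenius`, and `aligned_error` and the example matrices are assumptions for illustration; the training step in between is left out.

```python
import itertools
import math
import random

def sample_hmm(mu, A, C, length, rng):
    """Draw one observation sequence of the given length from a discrete HMM."""
    def draw(p):
        return rng.choices(range(len(p)), weights=p)[0]
    state = draw(mu)
    obs = []
    for _ in range(length):
        obs.append(draw(C[state]))   # emit from the current state
        state = draw(A[state])       # then move to the next state
    return obs

def frobenius(X, Y):
    """Frobenius norm of X - Y for two equally sized nested lists."""
    return math.sqrt(sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)))

def aligned_error(A_true, C_true, A_est, C_est):
    """Smallest combined Frobenius error over relabelings of the hidden states.

    Enumerates all n! permutations, so it is only practical for small n.
    """
    n = len(A_true)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        A_p = [[A_est[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        C_p = [C_est[perm[i]] for i in range(n)]
        best = min(best, frobenius(A_true, A_p) + frobenius(C_true, C_p))
    return best

# Hypothetical ground truth and an estimate that differs only by swapped state labels.
A_true = [[0.9, 0.1], [0.2, 0.8]]
C_true = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
A_swapped = [[0.8, 0.2], [0.1, 0.9]]
C_swapped = [[0.1, 0.3, 0.6], [0.7, 0.2, 0.1]]
observations = sample_hmm([0.5, 0.5], A_true, C_true, 50, random.Random(0))
error = aligned_error(A_true, C_true, A_swapped, C_swapped)
```

Aligning over permutations handles label switching only. It does not address the other sources of non-identifiability that arise when different parameter sets induce the same observation distribution, so a small aligned error is evidence of recovery while a large one is not by itself evidence of a poor predictive fit.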

Figures

Figures reproduced from arXiv: 2511.10571 by Heng-Sheng Chang, Prashant G. Mehta, Reginald Zhiyan Chen.

**Figure 3.** Language modeling results on Federalist Papers.

**Figure 5.** Overcomplete results on synthetic data. The …

**Figure 6.** nanoGPT-s vs. nanoGPT-m on synthetic HMM data with state dimension …

**Figure 7.** Learned transition and emission matrices of the HMM with Belief Net trained on the real-world text data: the …

Original abstract

Hidden Markov Models (HMMs) are fundamental for modeling sequential data, yet learning their parameters from observations remains challenging. Classical methods like the Baum-Welch algorithm are computationally intensive and prone to local optima, while modern spectral algorithms offer provable guarantees but may produce probability outputs outside valid ranges. This work introduces Belief Net, a differentiable filtering framework that learns HMM parameters by formulating the forward filter as a structured neural network and optimizing it with stochastic gradient descent. This architecture recursively updates the belief state, which represents the posterior probability distribution over hidden states based on the observation history. Unlike black-box transformer models, Belief Net's learnable weights are explicitly the logits of the initial distribution, transition matrix, and emission matrix, ensuring full interpretability. The model processes observation sequences using a decoder-only (causal) architecture and is trained end-to-end with standard autoregressive next-observation prediction loss. On synthetic HMM data, Belief Net achieves faster convergence than Baum-Welch while successfully recovering parameters in both undercomplete and overcomplete settings, whereas spectral methods prove ineffective in the latter. Comparisons with transformer-based models are also presented on real-world language data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Belief Net embeds the HMM forward filter in a structured net with weights tied directly to the model logits and trains it via autoregressive loss, showing reported gains over Baum-Welch and spectral methods on synthetic data.

The main point is that this paper turns the standard HMM forward recursion into a decoder-only neural net whose weights are literally the logits of the initial, transition, and emission parameters. Training uses ordinary next-observation cross-entropy, which keeps the whole thing differentiable and fully interpretable without black-box layers. On their synthetic tests it reportedly converges faster than Baum-Welch and recovers parameters even when the model is overcomplete, where spectral methods fall apart.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Belief Net, a differentiable filtering framework for learning HMM parameters. It represents the forward filter recursion as a structured neural network whose weights are explicitly the logits of the initial distribution, transition matrix, and emission matrix. The model is trained end-to-end via stochastic gradient descent on an autoregressive next-observation prediction loss. On synthetic HMM data the method is reported to converge faster than Baum-Welch, to recover the generating parameters in both undercomplete and overcomplete regimes, and to outperform spectral methods in the overcomplete case; comparisons with transformer models on real-world language data are also presented.

Significance. If the empirical recovery of true parameters in overcomplete settings can be shown to be robust rather than initialization-dependent, the approach would supply an interpretable, gradient-based alternative to classical HMM learning that preserves the probabilistic structure while enabling end-to-end optimization. The explicit parameterization of network weights as HMM logits is a clear strength for interpretability.

major comments (2)

[§4] §4 (Synthetic Experiments): The central claim that Belief Net recovers the true transition and emission matrices in overcomplete regimes rests on the autoregressive loss being exactly the marginal likelihood of the observations. This likelihood surface is invariant under state redundancy and rank reduction, so multiple parameter sets realize the same observation distribution. The manuscript provides no identifiability analysis, uniqueness penalty, or regularization; therefore observed recovery on synthetic data may be an artifact of initialization or data length rather than a general property. Please report parameter recovery error (e.g., Frobenius norm to ground truth) across at least 10 random seeds with error bars, and include an ablation on initialization strategies.
[§3] §3 (Architecture): The description of the decoder-only causal architecture and the differentiable implementation of the filtering recursion should contain explicit update equations (including normalization) that demonstrate how the belief state remains a valid probability distribution at every step and how gradients flow through the recursion without additional constraints.
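For reference, the textbook form of the recursion requested in the second major comment is written out below. This is a sketch in assumed notation, not the manuscript's own equations: $\theta_\mu, \theta_A, \theta_C$ denote the logits, $A$ and $C$ the row-stochastic transition and emission matrices, $\pi_{t|t-1}$ and $\pi_{t|t}$ the predicted and filtered beliefs, and $z_{0:T}$ the observed sequence.

```latex
\begin{align*}
\mu &= \operatorname{softmax}(\theta_\mu), \quad
A(i,\cdot) = \operatorname{softmax}\bigl(\theta_A(i,\cdot)\bigr), \quad
C(i,\cdot) = \operatorname{softmax}\bigl(\theta_C(i,\cdot)\bigr), \\
\pi_{0|-1} &= \mu, \\
p(z_t \mid z_{0:t-1}) &= \sum_{i} \pi_{t|t-1}(i)\, C(i, z_t), \\
\pi_{t|t}(i) &= \frac{\pi_{t|t-1}(i)\, C(i, z_t)}{p(z_t \mid z_{0:t-1})}, \\
\pi_{t+1|t}(j) &= \sum_{i} \pi_{t|t}(i)\, A(i,j), \\
\mathcal{L}(\theta) &= -\frac{1}{T+1} \sum_{t=0}^{T} \log p(z_t \mid z_{0:t-1}).
\end{align*}
```

In this form the filtered belief sums to one because it is divided by the predictive probability, and the predicted belief sums to one because each row of $A$ does. Every operation is differentiable wherever the predictive probability is positive, which the softmax parameterization guarantees since all entries of $\mu$, $A$, and $C$ are strictly positive.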

minor comments (2)

[Abstract and §4] The abstract and experimental sections mention quantitative comparisons but do not list concrete metrics, exact data lengths, or protocol details; adding these would improve reproducibility.
[Figures in §4] All figures reporting convergence or recovery should include error bars and state the number of independent trials.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below and outline the revisions we will make to the manuscript.

Referee: [§4] §4 (Synthetic Experiments): The central claim that Belief Net recovers the true transition and emission matrices in overcomplete regimes rests on the autoregressive loss being exactly the marginal likelihood of the observations. This likelihood surface is invariant under state redundancy and rank reduction, so multiple parameter sets realize the same observation distribution. The manuscript provides no identifiability analysis, uniqueness penalty, or regularization; therefore observed recovery on synthetic data may be an artifact of initialization or data length rather than a general property. Please report parameter recovery error (e.g., Frobenius norm to ground truth) across at least 10 random seeds with error bars, and include an ablation on initialization strategies.

Authors: We acknowledge the referee's point regarding the potential non-identifiability of HMM parameters due to the invariance of the marginal likelihood under state permutations and redundancies. Our experiments were conducted with random initializations and showed recovery of the true parameters, suggesting that the optimization landscape allows convergence to the generating model in practice. To provide stronger evidence and address concerns about initialization dependence, we will revise the manuscript to include parameter recovery errors measured by the Frobenius norm to the ground truth, averaged over at least 10 random seeds with standard deviation error bars. We will also add an ablation study examining different initialization strategies to demonstrate the robustness of the recovery.
revision: yes
Referee: [§3] §3 (Architecture): The description of the decoder-only causal architecture and the differentiable implementation of the filtering recursion should contain explicit update equations (including normalization) that demonstrate how the belief state remains a valid probability distribution at every step and how gradients flow through the recursion without additional constraints.

Authors: We agree that explicit equations would improve the clarity of the architecture description. In the revised version of the manuscript, we will provide the precise update equations for the belief state in the decoder-only architecture. These will include the normalization step (via softmax) to ensure the belief state is a valid probability distribution after each update. Regarding gradient flow, we will explain that the recursion is implemented using differentiable operations such as matrix multiplications and normalizations, allowing gradients to flow end-to-end through standard backpropagation without the need for additional constraints or special handling.
revision: yes

Circularity Check

0 steps flagged

No circularity: parameterization is explicit MLE via differentiable filter; recovery claims are empirical

The paper defines Belief Net by setting its weights directly to the logits of the HMM initial, transition, and emission parameters and optimizes them end-to-end with the standard autoregressive next-observation loss (equivalent to the marginal likelihood). This is a reparameterized implementation of maximum-likelihood estimation using gradient descent on a structured recursion, not a derivation that reduces to its own inputs by construction. No uniqueness theorem, self-citation chain, or ansatz is invoked to force the result; the reported faster convergence and parameter recovery on synthetic data (under- and over-complete) are presented as experimental outcomes. The architecture supplies an independent training signal through the filtering recursion and loss, and the central claims do not collapse to tautology. The method is therefore self-contained against external benchmarks with no load-bearing circular steps.
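The equivalence invoked above, that the summed next-observation log-probabilities equal the log marginal likelihood of the whole sequence, follows from the chain rule and can be checked numerically on a small model. The sketch below compares the filter-based sum against a brute-force sum over every hidden path. The function names and the example matrices are assumptions for illustration, and the brute-force routine is exponential in sequence length, so it is only usable on toy inputs.

```python
import itertools
import math

def filter_log_likelihood(mu, A, C, obs):
    """Sum of log p(z_t | z_0..z_{t-1}) computed with the forward filter."""
    n = len(mu)
    prior, total = mu, 0.0
    for z in obs:
        joint = [prior[i] * C[i][z] for i in range(n)]
        p_z = sum(joint)
        total += math.log(p_z)
        prior = [sum(joint[i] / p_z * A[i][j] for i in range(n)) for j in range(n)]
    return total

def brute_force_log_likelihood(mu, A, C, obs):
    """log p(z_0..z_T) by summing the joint probability over all hidden paths."""
    n = len(mu)
    total = 0.0
    for path in itertools.product(range(n), repeat=len(obs)):
        p = mu[path[0]] * C[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * C[path[t]][obs[t]]
        total += p
    return math.log(total)

# Hypothetical 2-state, 3-symbol HMM and a short observation sequence.
mu = [0.6, 0.4]
A = [[0.9, 0.1], [0.2, 0.8]]
C = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 2, 2, 1, 0]
via_filter = filter_log_likelihood(mu, A, C, obs)
via_paths = brute_force_log_likelihood(mu, A, C, obs)
```

The two quantities agree up to floating-point error, which is why minimizing the autoregressive loss is a form of maximum-likelihood estimation rather than a separate objective.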

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that the HMM forward filter can be exactly realized as a neural-network layer whose parameters are the model probabilities, and that gradient descent on next-observation prediction will recover those probabilities.

axioms (2)

domain assumption: The forward filter recursion for an HMM can be implemented as a structured neural network layer.
The paper formulates the forward filter as a structured neural network.
domain assumption: Training with autoregressive next-observation prediction loss on observation sequences will yield the correct HMM parameters.
The model is trained end-to-end with standard autoregressive next-observation prediction loss.

invented entities (1)

Belief Net · no independent evidence
purpose: Differentiable filtering framework that learns HMM parameters via structured neural network
New architecture introduced whose learnable weights are the logits of the initial distribution, transition matrix, and emission matrix.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 2 internal anchors

[1]

Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon

Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966,

work page 2023
[2]

Large-scale machine learning with stochastic gradient descent

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, pages 177–186. Springer,

work page 2010
[3]

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A

doi: 10.1137/16M1080173. John-Joseph Brady, Yuhui Luo, Wenwu Wang, Víctor Elvira, and Yunpeng Li. Regime learning for differentiable particle filters. In 2024 27th International Conference on Information Fusion (FUSION), pages 1–6. IEEE,

work page doi:10.1137/16m1080173
[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901
[5]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[6]

Transformer feed-forward layers are key-value memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495,

work page 2021
[7]

On limitation of transformer for learning HMMs

Jiachen Hu, Qinghua Liu, and Chi Jin. On limitation of transformer for learning HMMs. arXiv preprint arXiv:2406.04089,

work page arXiv
[8]

From small to large language models: Revisiting the Federalist papers

So Won Jeong and Veronika Ročková. From small to large language models: Revisiting the Federalist papers. arXiv preprint arXiv:2503.01869,

work page arXiv
[9]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Long range arena: A benchmark for efficient transformers

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006,

work page arXiv 2011
[11]

Yuan Wu and Sicheng He

Yuan Wu and Sicheng He. Dkfnet: Differentiable kalman filter for field inversion and machine learning. arXiv preprint arXiv:2509.07474,

work page arXiv
[12]

Related Work: A review of related literature on learning state-space models and connections to neural network architectures

Appendix. The appendix provides additional details on the related work, implementation details, and additional experimental results.
• Related Work: A review of related literature on learning state-space models and connections to neural network architectures.
• Implementation: Details on the implementation of each model, including library usa...

work page 2008
[13]

and, more recently, under the paradigm of differentiable filtering (Kloss et al., 2021; Wu and He, 2025). For more general non-linear, non-Gaussian systems, similar approaches have been developed using Differentiable Particle Filters, which leverage sequential Monte Carlo techniques within a deep learning framework to perform state-space inference Brady ...

work page 2021
[14]

Section B.2 describes the spectral algorithm adapted from the spectral-learning repository, extended with a custom probability prediction function

Section B.1 outlines the Baum-Welch algorithm implemented using the hmmlearn library. Section B.2 describes the spectral algorithm adapted from the spectral-learning repository, extended with a custom probability prediction function. Section B.3 details the Transformer-based models implemented using the nanoGPT repository, which offers a minimal implementatio...

work page 2024
[15]

B.3

Prediction. The conditional probability for the next observation $Z_{t+1}$ given history $Z_{0:t}$ is: $\mathsf{P}[Z_{t+1} = z_k \mid Z_{0:t}] \propto b_\infty^\top B_k b_t$. To handle the cases of negative probabilities, all negative entries were replaced with the minimum positive value at the current step, and the resulting vector was renormalized to sum to unity. B.3. Transforme...

work page 2024
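The repair step described in this excerpt (replace negative entries with the smallest positive entry, then renormalize) can be sketched as follows. The function name `repair_probabilities` and the choice to raise an error when no entry is positive are assumptions for this illustration; the excerpt does not say how that case was handled.

```python
def repair_probabilities(scores):
    """Replace negative entries with the smallest positive entry, then renormalize to sum to 1."""
    positives = [s for s in scores if s > 0]
    if not positives:
        # Assumed behavior: the source excerpt does not cover this case.
        raise ValueError("no positive entry to fall back on")
    floor = min(positives)
    clipped = [floor if s < 0 else s for s in scores]
    total = sum(clipped)
    return [c / total for c in clipped]

# Hypothetical unnormalized spectral prediction with one negative entry.
repaired = repair_probabilities([0.5, -0.1, 0.25, 0.35])
```

In this example the negative second entry is raised to 0.25, the smallest positive value, before the vector is rescaled, so it ends up with the same probability as the third entry.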
[16]

The transition matrix A shows that each state only transitions to a few other states, indicating that the model is learning a sparse transition structure. The emission matrix C shows more ...

| Model | Baum-Welch | Spectral | nanoGPT-s | nanoGPT-m | Belief Net | Random |
|---|---|---|---|---|---|---|
| Perplexity | 9.835 | 20.491 | 4.446 | 3.861 | 7.404 | 82 |

Table 3: Perplexity of each trained model on Federalis...

work page arXiv
