pith. sign in

arxiv: 2605.21108 · v1 · pith:7FGF6ENJnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Efficient Learning of Deep State Space Models via Importance Smoothing

Pith reviewed 2026-05-21 06:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords deep state space modelsimportance smoothingparallel variational Monte Carlosequential Monte Carlovariational inferencetime series modelinggenerative modelsdiscriminative models
0
0 comments X

The pith

A parallel variational Monte Carlo method trains deep state space models for both generative and discriminative tasks while running ten times faster than sequential alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep state space models capture time series observed through noise but remain hard to train at scale. Existing approaches split into variational auto-encoding limited mostly to generation and sequential Monte Carlo methods that do not parallelize on modern hardware. The paper introduces parallel variational Monte Carlo that uses importance smoothing to remove the sequential bottleneck. This lets the same training procedure handle both prediction from observations and generation of new sequences. A reader would care because the change makes these models practical for larger datasets and longer series without sacrificing reported accuracy or stability.

Core claim

Parallel variational Monte Carlo (PVMC) bridges auto-encoding and SMC-based training of deep state space models by replacing the sequential forward pass with a parallelizable importance-smoothed estimator, achieving state-of-the-art or better results on baseline experiments while training approximately 10 times faster than the fastest competing SMC approach for both discriminative and generative tasks.

What carries the argument

parallel variational Monte Carlo via importance smoothing, which parallelizes the Monte Carlo estimation used to train deep latent state space systems

If this is right

  • DSSM training becomes feasible on parallel hardware for both prediction and data generation tasks.
  • Training time drops by a factor of roughly 10 compared with the fastest sequential SMC competitors.
  • Accuracy and numerical stability are maintained across the two task types on the tested baselines.
  • The same estimator supports scaling DSSMs without separate code paths for generative versus discriminative use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The removal of sequential dependencies may extend naturally to other Monte Carlo estimators used in sequential modeling.
  • Shorter training cycles could allow more frequent retraining or larger-scale hyperparameter exploration in practice.
  • The approach opens the possibility of hybrid models that switch between generation and prediction within a single trained network.

Load-bearing premise

The proposed PVMC method can be used robustly to train DSSMs for both discriminative and generative tasks while preserving accuracy and numerical stability.

What would settle it

If direct runs on the paper's baseline experiments show neither state-of-the-art accuracy nor a clear order-of-magnitude training speedup relative to prior SMC methods, the central efficiency and bridging claims would not hold.

Figures

Figures reproduced from arXiv: 2605.21108 by John-Joseph Brady, Nikolas Nusken, Yunpeng Li.

Figure 1
Figure 1. Figure 1: Comparison of the sampling and weighting strategies. The black dots represent the sampled particles. Dependence via sampling and weighting are represented by red and blue arrows respectively. Red-blue dashed arrows show bi-directional dependence weighting and forward dependence via sampling. Grey arrows denote deterministic dependence. h is a high-dimensional hidden variable that aggregates information ove… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of the mean autocorrelation of absolute (above) and squared (below) daily returns over 1000 generated trajectories of 360 days to 6 non-overlapping real SPX trajectories of the same length. −1.50 −1.25 −1.00 −0.75 −0.50 −0.25 0.00 0.25 0.50 Skewness 0.0 0.1 0.2 0.3 0.4 Frequency PVMC DMM TCVAE P-VAE Soft 2 4 6 8 10 12 14 16 18 20 Kurtosis 0.00 0.05 0.10 0.15 Frequency PVMC DMM TCVAE P-VAE Soft … view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of the distribution of the skewness and kurtosis between the 1000 generated 360 day trajectories from each model. The black bars indicate the skewness and kurtosis for 6 non-overlapping 360 day trajectories from the real SPX. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The proposal network we use in our experiments. The sampling distribution of the particles at time t is a Gaussian with mean µt and a covariance matrix with σt Jσt on the diagonal where J is the Hadamard (element-wise) product and zeros off-diagonal. The width of the convolutional kernel; the length of the input to the feed forward neural networks, s; the depth, d; the dimensionality of each hidden state, … view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of runtimes of particle based methods for the Linear and Gaussian SSM specified in Equation (31). selection length of s = 9 to input to the extremal feed-forward blocks, and stack d = 4 layers. Every layer uses a RELU activation, apart from the last which uses the identity function. The log standard deviation scaling factor is 1. The proposal has a total of 46, 278 parameters. We train our model… view at source ↗
Figure 6
Figure 6. Figure 6: Box plots showing the spread of MSEs and squared sliced 2-Wasserstein distances achieved by each training approach for the different training runs for the prey-predator experiment. Failed runs are not plotted. empirical sliced 2-Wasserstein distance between importance weighted approximations, N = 1, 000, of the posterior of the latent state at the last time-step under the learned SSM and under the true dat… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of the mean autocorrelation of daily returns over 1000 generated trajectories of 360 days to 6 non-overlapping real SPX trajectories of the same length. of the individual daily returns generated by each method in [PITH_FULL_IMAGE:figures/full_fig_p031_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Histograms to compare the frequencies of the daily returns between each method and the SPX. The SPX’s distribution plotted in orange and the generated data’s in blue. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_8.png] view at source ↗
read the original abstract

Latent state space systems are ubiquitous in statistical modelling, arising naturally when a time series is observed through a noisy measurement function, however training deep state space models (DSSM) at scale remains difficult. Two largely distinct strategies and literatures have developed around the training of DSSMs. Firstly, auto-encoding DSSMs train generative DSSMs by optimising a variational lower bound. Secondly, DSSMs trained by back-propagating the outputs of a classical sequential Monte Carlo algorithm (SMC). Such approaches can train DSSMs for discriminative as well as generative tasks, however, due to the sequentiality of their forward pass, scale poorly on modern hardware. We propose a new training method \emph{parallel variational Monte Carlo} (PVMC) that bridges the gap between the paradigms, and can be used robustly to train DSSMs for both discriminative and generative tasks. Our method achieves state-of-the-art or better results on a set of baseline experiments and trains $10\times$ faster than the fastest competing SMC approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Parallel Variational Monte Carlo (PVMC), a training method for deep state space models (DSSMs) that combines variational inference with importance smoothing to enable parallel computation. It bridges auto-encoding variational bounds and classical SMC training, supporting both generative and discriminative tasks, and reports state-of-the-art or better empirical results together with a claimed 10× training speedup over the fastest competing SMC baselines.

Significance. If the performance and speedup claims hold under standard experimental protocols, the work would provide a practical bridge between two previously separate DSSM training literatures and improve scalability on modern hardware for time-series modeling tasks.

major comments (2)
  1. [§3.2] §3.2, Algorithm 1: the parallel importance smoothing step is presented as preserving the same marginal likelihood estimator as sequential SMC, but the variance analysis only bounds the per-particle contribution; it is unclear whether the overall estimator remains unbiased when the smoothing kernel is applied in parallel across time steps.
  2. [Table 2] Table 2, PVMC row: the reported 10× wall-clock speedup is measured against a single-threaded SMC baseline; no comparison is given against a properly parallelized SMC implementation using the same hardware resources, which weakens the central efficiency claim.
minor comments (2)
  1. [§2.3] Notation for the smoothing kernel K_t is introduced in §2.3 but reused without redefinition in the PVMC derivation; a brief reminder or forward reference would improve readability.
  2. [Figure 3] Figure 3 caption does not state the number of independent runs or whether error bars represent standard deviation or standard error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the recommendation for minor revision. We address each major comment below and describe the corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2, Algorithm 1: the parallel importance smoothing step is presented as preserving the same marginal likelihood estimator as sequential SMC, but the variance analysis only bounds the per-particle contribution; it is unclear whether the overall estimator remains unbiased when the smoothing kernel is applied in parallel across time steps.

    Authors: We thank the referee for this observation. The parallel importance smoothing in Algorithm 1 is constructed so that the marginal likelihood estimator remains unbiased: the smoothing kernel is applied to the particle trajectories such that the joint proposal distribution factors consistently across time steps, and the product of the incremental importance weights continues to have expectation equal to the true marginal likelihood (see the derivation leading to Equation 7). The per-particle variance bound is an intermediate step whose aggregation yields the total variance of the unbiased estimator. To make this explicit, we have added a short paragraph and proof sketch in the revised §3.2 clarifying that unbiasedness is preserved under the parallel schedule. revision: yes

  2. Referee: [Table 2] Table 2, PVMC row: the reported 10× wall-clock speedup is measured against a single-threaded SMC baseline; no comparison is given against a properly parallelized SMC implementation using the same hardware resources, which weakens the central efficiency claim.

    Authors: The referee correctly notes that the reported speedup uses the standard single-threaded SMC implementation from prior work. Classical SMC is sequential by construction, and any attempt to parallelize it across time steps requires approximations or re-derivations that are essentially the contribution of PVMC. The 10× figure therefore compares against the fastest published SMC baseline under the protocols used in the literature. We have added a clarifying paragraph in the experimental section of the revised manuscript explaining this point and noting that a fully parallel SMC would need techniques comparable to those introduced here. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces PVMC as a new algorithmic bridge between variational auto-encoding and SMC-based training for DSSMs, with performance claims resting on experimental benchmarks and implementation speedups rather than any derivation that reduces to fitted parameters or self-referential definitions. No equations or steps in the provided abstract or description equate a claimed prediction or result to an input quantity by construction, and the method is positioned as following from external variational Monte Carlo literature without load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The central claims remain independently verifiable through the reported baselines and timing comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, invented entities, or paper-specific axioms beyond the standard background assumptions of variational inference and Monte Carlo sampling for latent variable models.

axioms (1)
  • domain assumption Standard assumptions of variational inference and sequential Monte Carlo for latent state-space models hold and can be parallelized via importance smoothing.
    The proposal of PVMC implicitly relies on these established statistical modeling techniques.

pith-pipeline@v0.9.0 · 5704 in / 1201 out tokens · 44241 ms · 2026-05-21T06:15:28.786234+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 6 internal anchors

  1. [1]

    Andersson, J. R. and Zhao, Z. Diffusion differentiable resampling.arXiv preprint arXiv:2512.10401,

  2. [2]

    Pydpf: A python package for differentiable particle filtering.arXiv preprint arXiv:2510.25693,

    Brady, J.-J., Cox, B., Elvira, V ., and Li, Y . Pydpf: A python package for differentiable particle filtering.arXiv preprint arXiv:2510.25693,

  3. [3]

    Importance Weighted Autoencoders

    Burda, Y ., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders.arXiv preprint arXiv:1509.00519,

  4. [4]

    Large sample analysis of the median heuristic

    9 Efficient Learning of Deep State Space Models via Importance Smoothing Garreau, D., Jitkrittum, W., and Kanagawa, M. Large sample analysis of the median heuristic.arXiv preprint arXiv:1707.07269,

  5. [5]

    Butterfly resampling: asymptotics for particle filters with constrained interactions

    Heine, K., Whiteley, N., Cemgil, A. T., and Guldas, H. Butterfly resampling: asymptotics for particle filters with constrained interactions.arXiv preprint arXiv:1411.5876,

  6. [6]

    Auto-Encoding Variational Bayes

    PMLR. Kingma, D. P. and Welling, M. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114,

  7. [7]

    Deep Kalman Filters

    Krishnan, R. G., Shalit, U., and Sontag, D. Deep kalman filters.arXiv preprint arXiv:1511.05121,

  8. [8]

    Meldrum, J., Suleiman, B., Rabhi, F., and Alibasa, M. J. New money: A systematic review of synthetic data gen- eration for finance.arXiv preprint arXiv:2510.26076,

  9. [9]

    Ramachandran, P., Zoph, B., and Le, Q. V . Searching for activation functions.arXiv preprint arXiv:1710.05941,

  10. [10]

    URL https://doi.org/10.2514/3.3166

    doi: 10.2514/3.3166. Särkkä, S. and García-Fernández, A. F. Temporal paral- lelization of bayesian smoothers.IEEE Trans. Automatic Control, 66(1):299–306,

  11. [11]

    and Wood, F

    ´Scibior, A. and Wood, F. Differentiable particle filtering without modifying the forward pass.arXiv preprint arXiv:2106.10314,

  12. [12]

    log TY t=0 Kt X 1 t , X1 t−1 # = T X t=0 E logK t X 1 t , X1 t−1 ,(54) LN V AE= 1 N NX n=1 E

    Lemma A.2.For the sequence of square matricesA i:j   NX ni,...,nj−1 =1 jY t=i Ant,nt−1 t   ni−1,nj = jO t=i At !ni−1,nj , i < j,(34) whereN is the order-preserving product of a sequence of matrices, and the upper indices refer to matrix entries. Proof. For convenience, we defineNi t=i At ≡A i. We proceed by induction: The base case is immediate, reduc...

  13. [13]

    The likelihood estimate,exp LN PVMC , is estimated from an empirical mean of these terms

    TY t=1 Pt (dXt|Xt−1)H t (yt |X t)(73) =p(y 0:T ).(74) Therefore, the expected value of the term contributed by any of the trajectories summed over in Equation (19) is equal to the likelihood. The likelihood estimate,exp LN PVMC , is estimated from an empirical mean of these terms. So, it is unbiased. C.2. Proof of Proposition 3.2 According to Proposition ...

  14. [14]

    E.2. Non-factorisable proposal distributions It is common for V AE-DSSMs to use proposal distributions where the particles at each time-step are non-independent, such as the structured proposals introduced in (Krishnan et al., 2017). Furthermore, the proposal structure of the classical particle filter has that each particle at time t is dependent on the s...

  15. [15]

    However, we wish to include every possible trajectory through the set of sampled particles in our set of targets

    then no problem arises. However, we wish to include every possible trajectory through the set of sampled particles in our set of targets. 23 Efficient Learning of Deep State Space Models via Importance Smoothing To restore domination whilst preserving a state-causal proposal we consider the following Markovian proposal distribution V X 1:N 0:T = NY n0,......

  16. [16]

    Remark E.2.We can further extend Equation(96)to weighted mixtures where the weights of each mixture component, w1:N 1:T , are almost never0

    to parameterise the proposal distribution. Remark E.2.We can further extend Equation(96)to weighted mixtures where the weights of each mixture component, w1:N 1:T , are almost never0. wT ∝ P X T0 0 V0 X T0 0 H0 y0 |X T0 0 TY t=1 Mt X Tt t |X Tt−1 t−1 Ht yt |X Tt t PN i=1 ˜wi tVt X Tt t |X i t−1 .(97) In doing so, we permit an algorithm that samples from a...

  17. [17]

    No implemented models make use of just in time (jit) compilation

    as the numerical and auto-differentiation library. No implemented models make use of just in time (jit) compilation. G.2. Baselines In total, we consider 1 ablation and 10 baseline approaches. Many baselines are not suitable for all the tasks that we wish to test PVMC on, so we vary the baselines between the experiments. We justify our choices individuall...

  18. [18]

    G.4. Prey-predator mode — state estimation The goal of this experiment is to demonstrate PVMC’s ability to learn a SSM that accurately represents the distribution of the latent state by supervised training. The auto-encoding baselines DMM and TCV AE do not admit supervised losses so are excluded from this experiment, as are the non-differentiable methods....

  19. [19]

    29 Efficient Learning of Deep State Space Models via Importance Smoothing G.5

    These plots demonstrate that whilst established DSMC methods, Stop-Grad, Soft DPF and MDPS, occasionally converge to a good representation, they suffer from much greater random seed dependence than PVMC. 29 Efficient Learning of Deep State Space Models via Importance Smoothing G.5. Financial time series — generative modelling This final experiment is incl...

  20. [20]

    The observation model represents the distribution of log daily returns given the latent state as a mixture of Gaussians

    layers each with a hidden dimension of 32 (Chen & Li, 2024), with a total of 28,824 learnable parameters. The observation model represents the distribution of log daily returns given the latent state as a mixture of Gaussians. All components of the mixture at all time-steps have the same learned variance, but the weights and locations of each mixture comp...

  21. [21]

    The DMM similarly represents the proposal by a Gaussian

    The total number of learnable parameters of the proposal distribution is 40,408 . The DMM similarly represents the proposal by a Gaussian. Instead of a 1D convolution the DMM parameterises the mean and log covariance by assimilating the results of foward and backwards in time LSTMs (Hochreiter & Schmidhuber, 1997), full details can be found in (Krishnan et al.,

  22. [22]

    For PVMC, DMM, and P-V AE we decrease the suppression from a factor of 0.05 to 1 over the course of training

    where we artificially suppress the gradient signal due to the prior and dynamic models. For PVMC, DMM, and P-V AE we decrease the suppression from a factor of 0.05 to 1 over the course of training. For Soft DPF it is unclear how one would implement such suppression and it is, to the best of our knowledge, not used in prior work in DPFs. For TCV AE we adop...