Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

P. R. Kumar, Romil V. Sonigra (Texas A&M University)

arXiv: 2605.23089 · v1 · pith:EG5H5ITLnew · submitted 2026-05-21 · cs.LG · cs.AI

Pith reviewed 2026-05-25 05:22 UTC · model grok-4.3

classification: cs.LG, cs.AI

keywords: model-based reinforcement learning, latent dynamics, smoothness regularization, DreamerV3, sample efficiency, continuous control, Jacobian penalty

The pith

A Jacobian penalty on latent dynamics in DreamerV3 produces smoother transitions and higher sample efficiency in continuous control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an explicit smoothness regularizer into latent world models to address the lack of local smoothness enforcement in existing approaches like DreamerV3. The regularizer penalizes the row-wise Jacobian of the posterior latent distribution and is estimated with stochastic probes, acting as the continuous analog to finite-difference smoothing of transitions. This addition is tested on proprioceptive DeepMind Control tasks, where it improves aggregate sample efficiency with larger gains on complex locomotion environments and more stable long-horizon learning on quadrupeds. A reader would care because smoother learned dynamics could support more reliable planning and faster policy improvement in model-based reinforcement learning.
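
The finite-difference reading can be illustrated on a toy map. The sketch below is not the paper's code: the function `f`, the weights `W`, the point `x`, and the direction `d` are hypothetical, and a single tanh layer is used only because its Jacobian is known in closed form. It checks that the squared finite-difference quotient along `d` approaches the squared norm of the Jacobian-vector product as the step `eps` shrinks.

```python
import math

# Hypothetical toy map f(x) = tanh(W x); W is an illustrative 2x3 weight matrix.
W = [[0.5, -1.0, 0.3],
     [1.2, 0.4, -0.7]]

def f(x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def jvp(x, d):
    """Exact Jacobian-vector product J(x) d for the tanh layer."""
    h = f(x)
    return [(1.0 - hi * hi) * sum(w * di for w, di in zip(row, d))
            for hi, row in zip(h, W)]

def fd_quotient_sq(x, d, eps):
    """Squared finite-difference quotient ||f(x + eps*d) - f(x)||^2 / eps^2."""
    fx = f(x)
    fxe = f([xi + eps * di for xi, di in zip(x, d)])
    return sum((a - b) ** 2 for a, b in zip(fxe, fx)) / eps ** 2

x = [0.1, -0.2, 0.4]
d = [1.0, 0.0, -1.0]
jvp_sq = sum(v * v for v in jvp(x, d))
# Gap between the finite-difference quotient and the Jacobian term per step size.
gaps = [abs(fd_quotient_sq(x, d, eps) - jvp_sq) for eps in (1e-1, 1e-2, 1e-3)]
print(jvp_sq, gaps)
```

The gap shrinks roughly linearly in `eps`, which is the sense in which a gradient penalty acts as the small-step limit of penalizing differences between nearby states.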

Core claim

GPLD applies a gradient-penalized latent dynamics regularizer to DreamerV3 by imposing a row-wise Jacobian penalty on the posterior latent distribution to encourage locally smooth transition learning. The penalty is estimated efficiently via Hutchinson-style stochastic probes and interpreted as the continuous-latent counterpart of finite-difference smoothing in discrete embedded-state MDPs. On DeepMind Control proprioceptive tasks, the method improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments, earlier high-return behavior on quadruped tasks, and more consistent late-stage learning over longer horizons.
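
A minimal sketch of a Hutchinson-style estimate, under stated assumptions: the toy layer `f(x) = tanh(W x)`, its hand-written vector-Jacobian product, and every name below are hypothetical stand-ins and not the authors' implementation, which would presumably rely on automatic differentiation through the posterior network. The underlying identity is generic: for probes `v` with E[v vᵀ] = I, the expectation of ||vᵀJ||² equals the squared Frobenius norm of J, which is also the sum of squared row norms.

```python
import itertools
import math
import random

W = [[0.5, -1.0, 0.3],
     [1.2, 0.4, -0.7]]  # hypothetical 2x3 weights of the toy layer

def vjp(x, v):
    """Vector-Jacobian product v^T J(x) for f(x) = tanh(W x)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    scaled = [vi * (1.0 - hi * hi) for vi, hi in zip(v, h)]
    return [sum(s * row[j] for s, row in zip(scaled, W)) for j in range(len(x))]

def frob_sq_exact(x):
    """Sum of squared row norms of J(x), using one one-hot probe per output row."""
    m = len(W)
    rows = [vjp(x, [1.0 if i == k else 0.0 for i in range(m)]) for k in range(m)]
    return sum(g * g for row in rows for g in row)

def frob_sq_hutchinson(x, num_probes, rng):
    """Monte Carlo estimate of ||J(x)||_F^2 with Rademacher (+1/-1) probes."""
    total = 0.0
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in W]
        total += sum(g * g for g in vjp(x, v))
    return total / num_probes

x = [0.1, -0.2, 0.4]
exact = frob_sq_exact(x)
# Averaging over all 2^m sign vectors gives the Rademacher expectation exactly.
all_signs = list(itertools.product((-1.0, 1.0), repeat=len(W)))
expectation = sum(sum(g * g for g in vjp(x, list(v))) for v in all_signs) / len(all_signs)
estimate = frob_sq_hutchinson(x, num_probes=2000, rng=random.Random(0))
print(exact, expectation, estimate)
```

The appeal of the probe-based form is cost: each probe needs one vector-Jacobian product, whereas the exact version needs one per output dimension.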

What carries the argument

GPLD, the gradient-penalized latent dynamics regularizer that applies a row-wise Jacobian penalty to the posterior latent distribution to enforce local smoothness in learned transition dynamics.

If this is right

Higher sample efficiency across proprioceptive control tasks
Faster achievement of high-return policies on complex locomotion environments
More consistent performance during extended training on quadruped tasks
Effective regularization that requires only efficient stochastic estimation

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regularization could be combined with other transition regularizers to test additive effects on planning accuracy.
Smoother dynamics might reduce error accumulation in longer-horizon planning problems not covered in the experiments.
The approach may require task-specific tuning of the penalty strength when moving from proprioceptive to pixel-based observations.
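
The error-accumulation point can be made concrete with a standard bound, given here as a textbook-style illustration and not as a result from the paper. If a learned one-step model is L-Lipschitz in the state and its one-step error is at most eps, the rollout error obeys e_{t+1} <= L * e_t + eps with e_0 = 0, so after H steps it is at most eps * (1 + L + ... + L^(H-1)). The helper name and the numbers below are hypothetical.

```python
def rollout_error_bound(lipschitz, one_step_error, horizon):
    """Upper bound from the recursion e_{t+1} <= L * e_t + eps, starting at e_0 = 0."""
    bound = 0.0
    for _ in range(horizon):
        bound = lipschitz * bound + one_step_error
    return bound

# Illustrative values only: same one-step error, three Lipschitz constants.
for lip in (0.9, 1.0, 1.1):
    print(lip, rollout_error_bound(lip, one_step_error=0.01, horizon=15))
```

A smaller Lipschitz constant gives a tighter bound at the same one-step error, which is one way a smoothness penalty could matter more as the horizon grows.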

Load-bearing premise

Penalizing the row-wise Jacobian of the posterior latent distribution will produce locally smooth transition dynamics that improve policy performance without harmful side effects on representation learning or planning.

What would settle it

Applying the Jacobian penalty to DreamerV3 on the same DeepMind Control proprioceptive and locomotion tasks and observing no improvement or a decrease in sample efficiency relative to the baseline.

Figures

Figures reproduced from arXiv: 2605.23089 by P. R. Kumar and Romil V. Sonigra (Texas A&M University).

Figure 2: Aggregate proprioceptive-control performance on DMC tasks. GPLD-DreamerV3 improves …

Figure 3: Long-horizon quadruped performance. Mean episodic return is reported across seeds, …

Figure 4: Pixel-observation DMC results. With image-frame observations, GPLD-DreamerV3 …

Figure 5: GPLD ablations over sampling fraction, posterior/prior regularization, and penalty schedul…

Figure 6: DMC Proprioceptive Individual Tasks. (Text spilled from Appendix E, Computational Cost Analysis:) Avg baseline time = 2:01:29/seed for 500k env steps. GPLD introduces additional computation because the gradient penalty requires vector-Jacobian products. The main algorithmic cost driver is the sampling fraction ρ, which determines the fraction of batch states on which the penalty is evaluated. In contrast, cha…

Figure 7: DMC Pixel Individual Tasks

Figure 8: Encoder-decoder warm-start diagnostic for pixel observations. The encoder and decoder are …

Figure 9: Aggregate local sensitivity of the learned posterior and prior distributions for Walker walk …
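
As a rough illustration of how the sampling fraction ρ from the Figure 6 text could bound the extra cost, the sketch below picks a random subset of batch indices on which a penalty would be evaluated. The function name, the rounding rule, and the batch size are assumptions made for illustration; the paper's actual subsampling scheme is not described in the text above.

```python
import math
import random

def penalty_indices(batch_size, rho, rng):
    """Pick ceil(rho * batch_size) distinct batch indices (hypothetical rounding rule)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    k = math.ceil(rho * batch_size)
    return sorted(rng.sample(range(batch_size), k))

idx = penalty_indices(batch_size=16, rho=0.25, rng=random.Random(0))
print(idx)  # 4 indices; the number of vector-Jacobian products scales with this count
```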

Original abstract

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at github.com/romils9/gpld-mbrl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Jacobian penalty targets the posterior inference model rather than the transition prior, so the claimed smoothness benefit for planning does not clearly follow.

GPLD applies a row-wise Jacobian penalty to the posterior latent distribution inside DreamerV3 and reports better sample efficiency on DeepMind Control tasks. The difficulty is that the penalty sits on the inference network q(z_t | o_t, z_{t-1}), while the dynamics used for imagined rollouts are the prior p(z_{t+1} | z_t, a_t). The abstract gives no derivation showing how the penalty reaches the transition parameters or the planning trajectories, so the link between the regularizer and smoother transitions remains unshown. If the effect stays inside representation learning, the claim that the policy gains come from smoother dynamics does not follow.

On the positive side, the authors interpret the penalty as the continuous analog of finite-difference smoothing and use Hutchinson probes for the estimate, which is a reasonable engineering choice. Code is released, and the reported aggregate improvements, especially on locomotion, are the concrete positive result. The abstract supplies no numbers on statistical significance, no ablation tables, and no exact baseline comparisons, so the empirical claim rests on limited visible evidence.

The paper is aimed at researchers already using Dreamer-style world models who want to test a simple regularizer on continuous control problems. A reader looking for incremental practical tweaks might find it worth trying if the full experiments clarify the mechanism. The paper shows clear engagement with the RSSM architecture and existing regularization ideas, so it deserves a serious referee to check whether the full text supplies the missing derivation and controls.
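
For orientation on the posterior/prior question raised in this letter: in an RSSM-style objective the prior is trained through a KL term toward the posterior, so a change in the posterior can reach the prior only indirectly. The sketch below assumes diagonal-Gaussian latents and toy parameter values purely for illustration (all names and numbers are hypothetical; this is neither the paper's nor DreamerV3's code). It shows one gradient step on the prior mean reducing KL(q || p) while the posterior is held fixed.

```python
import math

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ), summed over independent dimensions."""
    return sum(
        math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sig_q, mu_p, sig_p)
    )

def grad_wrt_prior_mean(mu_q, mu_p, sig_p):
    """d KL / d mu_p = (mu_p - mu_q) / sig_p^2, with the posterior held fixed."""
    return [(mp - mq) / (sp * sp) for mq, mp, sp in zip(mu_q, mu_p, sig_p)]

mu_q, sig_q = [0.3, -0.1], [0.5, 0.5]  # posterior, treated as a fixed target
mu_p, sig_p = [0.0, 0.4], [1.0, 1.0]   # prior before the update

before = kl_diag_gauss(mu_q, sig_q, mu_p, sig_p)
step = 0.5  # illustrative learning rate
grads = grad_wrt_prior_mean(mu_q, mu_p, sig_p)
mu_p = [mp - step * g for mp, g in zip(mu_p, grads)]
after = kl_diag_gauss(mu_q, sig_q, mu_p, sig_p)
print(before, after)
```

This only shows that the prior tracks whatever the posterior produces; it does not show that a smoother posterior yields a smoother prior, which is the derivation the letter says is missing.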

Referee Report

2 major / 2 minor

Summary. The paper proposes GPLD, a gradient-penalized latent dynamics regularizer added to DreamerV3. It applies a row-wise Jacobian penalty to the posterior latent distribution q(z_t | o_t, z_{t-1}) to encourage locally smooth transition learning, interprets the penalty as the continuous-latent analog of finite-difference smoothing, and estimates it via Hutchinson-style probes. Empirically, it reports improved aggregate sample efficiency on DeepMind Control proprioceptive tasks, with stronger gains on complex locomotion and quadruped environments, and releases code at github.com/romils9/gpld-mbrl.

Significance. If the results hold, the work demonstrates that explicit local smoothness regularization is a simple inductive bias that can improve latent world models for continuous control without major architectural changes. The public code release is a clear strength that supports reproducibility and further investigation.

major comments (2)

[Method (GPLD definition and RSSM integration)] The central claim is that the row-wise Jacobian penalty on the posterior encourages 'locally smooth transition learning' that improves policy performance via better imagined rollouts. However, in the RSSM architecture the transition dynamics are defined by the prior p(z_{t+1}|z_t,a_t), while the penalty is applied to the inference model q(z_t|o_t,z_{t-1}). No derivation or propagation argument is given showing how gradients from the posterior penalty affect the dynamics parameters or the quality of model-based planning trajectories.
[Abstract and Experiments] The abstract states that GPLD 'improves aggregate sample efficiency' with 'particularly strong gains' on locomotion tasks, yet supplies no quantitative details on statistical significance, number of random seeds, exact baseline comparisons, or ablation controls. This leaves the empirical support for the central claim difficult to evaluate.

minor comments (2)

Clarify whether the Hutchinson probe is applied only at training time or also during imagination; the current description leaves the computational overhead for planning unclear.
The phrase 'row-wise Jacobian penalty' should be accompanied by the explicit mathematical expression (including the norm and expectation) in the main text rather than deferred to an appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the two major comments point by point below, indicating planned revisions where appropriate.

Referee: [Method (GPLD definition and RSSM integration)] The central claim is that the row-wise Jacobian penalty on the posterior encourages 'locally smooth transition learning' that improves policy performance via better imagined rollouts. However, in the RSSM architecture the transition dynamics are defined by the prior p(z_{t+1}|z_t,a_t), while the penalty is applied to the inference model q(z_t|o_t,z_{t-1}). No derivation or propagation argument is given showing how gradients from the posterior penalty affect the dynamics parameters or the quality of model-based planning trajectories.

Authors: We agree that an explicit derivation of how the posterior penalty propagates to the prior dynamics would strengthen the presentation. The posterior and prior are coupled through the KL term in the ELBO and share latent representations used for planning; the row-wise Jacobian penalty on q encourages smoother latent trajectories that the prior must approximate. Nevertheless, we acknowledge the absence of a detailed gradient-flow argument in the current manuscript. We will add a dedicated subsection deriving the effect on the transition parameters and on imagined rollouts. revision: yes
Referee: [Abstract and Experiments] The abstract states that GPLD 'improves aggregate sample efficiency' with 'particularly strong gains' on locomotion tasks, yet supplies no quantitative details on statistical significance, number of random seeds, exact baseline comparisons, or ablation controls. This leaves the empirical support for the central claim difficult to evaluate.

Authors: We accept that the abstract would be clearer with quantitative anchors. The full manuscript reports results aggregated over five random seeds with standard errors, direct comparison to the DreamerV3 baseline, and ablations of the penalty term. We will revise the abstract to include the number of seeds, a brief statement on statistical significance of the aggregate improvement, and reference to the ablation studies, while preserving brevity. revision: yes
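
For readers unfamiliar with the reporting convention mentioned in the second response, a mean with standard error over seeds can be computed as below. The returns are made-up placeholder numbers, not results from the paper, and the function name is hypothetical.

```python
import math
import statistics

def mean_and_sem(values):
    """Sample mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two seeds")
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(n)

# Placeholder final returns for five seeds (illustrative only).
returns = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, sem = mean_and_sem(returns)
print(mean, sem)
```

With only five seeds the standard error is itself a noisy estimate, which is why the referee also asks for an explicit statement on statistical significance.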

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines GPLD explicitly as a row-wise Jacobian penalty on the posterior latent distribution, provides an interpretation as continuous analog of finite-difference smoothing, and reports empirical sample-efficiency gains on held-out DeepMind Control tasks. No equation or claim reduces a reported prediction or performance result to a quantity fitted inside the same loop by construction. No load-bearing self-citation chain or uniqueness theorem is invoked. The regularization term is an added loss component whose effect is measured externally via policy return, satisfying the criteria for an independent derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The penalty coefficient is implicitly present but its value and selection procedure are not reported.
