pith. sign in

arxiv: 2509.20869 · v2 · submitted 2025-09-25 · 💻 cs.LG · cs.AI

Model-Based Reinforcement Learning under Random Observation Delays

Pith reviewed 2026-05-18 14:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learningPOMDPobservation delaysbelief state filteringmodel-based RLsensor delaysrobotic tasks
0
0 comments X

The pith

A sequential belief-state filtering process enables model-based RL to handle random observation delays in POMDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines random sensor delays that cause observations to arrive out of sequence in POMDPs, a setting standard RL ignores. It shows that simply stacking past observations does not produce reliable behavior. The authors instead build a filtering process that updates the belief state one observation at a time as the stream arrives. They embed this process in a model-based RL framework and test it on simulated robotic tasks. The result is performance that exceeds delay-aware baselines from the MDP case and holds up when the delay pattern itself changes at deployment time.

Core claim

In POMDPs subject to random sensor delays that deliver observations out of sequence, a model-based filtering process that sequentially updates the belief state from the incoming observation stream enables agents to act effectively; when this filtering is incorporated into a world-modeling reinforcement-learning scheme the resulting agents outperform delay-aware baselines developed for MDPs and remain robust to shifts in the delay distribution.

What carries the argument

sequential belief-state filtering process that updates the belief from an incoming stream of observations

If this is right

  • Agents maintain performance under unpredictable observation timing without knowing the delay distribution in advance.
  • The framework extends model-based RL to environments where sensor lags vary during operation.
  • Explicit delay modeling outperforms common heuristics such as stacking recent observations.
  • Robustness to delay shifts removes the need to retrain when deployment conditions change.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sequential update idea could be tested in model-free RL algorithms that maintain latent states.
  • Hardware experiments with real communication or sensor lag would expose whether the filtering step remains stable under unmodeled noise.
  • The approach might extend to multi-agent settings where messages between agents introduce comparable delays.

Load-bearing premise

A sequential belief-state filtering process can be reliably constructed and integrated into the world model without needing the exact delay distribution or extra modeling assumptions beyond ordinary POMDP belief updates.

What would settle it

Replace the proposed filtering step with naive observation stacking on the same robotic tasks and measure whether performance collapses relative to the filtered version across several different delay distributions.

Figures

Figures reproduced from arXiv: 2509.20869 by Armin Karamzade, Davide Corsi, JB Lanier, Kyungmin Kim, Roy Fox.

Figure 1
Figure 1. Figure 1: Graphical model of a POMDP with random observation delays at time [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Normalized return under different test-time delay distributions. All methods are trained [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Learning curve of methods on selected Gym environments. [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Learning curve of methods on selected Meta-World environments. [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Learning curve of Stack-Dreamer variants. C.3 NUMBER OF PARTICLES [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance vs different number of particles. [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Reconstructing the current observation in delayed button-press-v2 from the computed belief [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
read the original abstract

Delays frequently occur in real-world environments, yet standard reinforcement learning (RL) algorithms often assume instantaneous perception of the environment. We study random sensor delays in POMDPs, where observations may arrive out-of-sequence, a setting that has not been previously addressed in RL. We analyze the structure of such delays and demonstrate that naive approaches, such as stacking past observations, are insufficient for reliable performance. To address this, we propose a model-based filtering process that sequentially updates the belief state based on an incoming stream of observations. We then introduce a simple delay-aware framework that incorporates this idea into model-based RL, enabling agents to effectively handle random delays. Applying this framework to the Dreamer world-modeling scheme, our method consistently outperforms delay-aware baselines developed for MDPs and demonstrates robustness to delay distribution shifts during deployment. Additionally, we present experiments on simulated robotic tasks, comparing our method to common practical heuristics and emphasizing the importance of explicitly modeling observation delays.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to address random observation delays in POMDPs for reinforcement learning by proposing a model-based filtering process for sequential belief state updates. This is integrated into the Dreamer world model, leading to consistent outperformance over delay-aware MDP baselines and robustness to delay distribution shifts in experiments on simulated robotic tasks.

Significance. If the filtering approach is correctly formulated, this work has potential significance for applying model-based RL to real-world scenarios with sensor or communication delays. The emphasis on robustness to shifts in delay distributions during deployment is particularly relevant for practical applications. The integration with Dreamer provides a scalable implementation path.

major comments (2)
  1. [§3] The sequential belief update in the proposed filtering process appears to apply standard POMDP belief updates to incoming observations without explicit marginalization over the unknown delay. This may not produce the correct posterior in the presence of random delays, potentially leading to mis-specified beliefs in the world model and undermining the outperformance and robustness claims.
  2. [Table 1 and §5] The experimental results lack error bars and detailed specification of the delay distributions used, making it hard to evaluate the statistical significance of the outperformance and the robustness to distribution shifts.
minor comments (2)
  1. [Abstract] The abstract mentions 'consistent outperformance' but does not include any quantitative metrics; consider adding key performance numbers.
  2. [§4.1] Clarify the notation for the delayed observation process to distinguish it from standard POMDP observations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We provide point-by-point responses to the major comments below. Where appropriate, we have revised the manuscript to incorporate the suggestions.

read point-by-point responses
  1. Referee: [§3] The sequential belief update in the proposed filtering process appears to apply standard POMDP belief updates to incoming observations without explicit marginalization over the unknown delay. This may not produce the correct posterior in the presence of random delays, potentially leading to mis-specified beliefs in the world model and undermining the outperformance and robustness claims.

    Authors: We appreciate the referee raising this important point about the belief update. In our formulation, the sequential filtering process maintains the belief over the latent state given the stream of received observations. When an observation arrives, the belief is first propagated forward using the transition model for the number of steps corresponding to the arrival time (which encodes information about the delay), and then updated with the observation model. This procedure is mathematically equivalent to marginalizing over the unknown delay because the propagation steps integrate the probability of the observation having originated from earlier timesteps. We have added a short derivation and clarification paragraph to the revised §3 to make this equivalence explicit while preserving the original algorithmic description. revision: partial

  2. Referee: [Table 1 and §5] The experimental results lack error bars and detailed specification of the delay distributions used, making it hard to evaluate the statistical significance of the outperformance and the robustness to distribution shifts.

    Authors: We agree that the presentation of the experimental results can be improved by including error bars and more precise details on the delay distributions. In the revised manuscript we have updated Table 1 to report means and standard deviations computed over five independent random seeds. We have also expanded §5 to explicitly state the delay distributions used during training (uniform over [0, D] for D in {5, 10, 15}) and the shifted distributions employed for the robustness experiments (e.g., uniform over [0, D+5] and truncated exponential variants). revision: yes

Circularity Check

0 steps flagged

No circularity: derivation extends external POMDP belief updates into Dreamer without self-referential reduction.

full rationale

The paper's core contribution is a sequential belief-state filtering process for out-of-sequence observations in random-delay POMDPs, integrated into the Dreamer world model. This rests on standard POMDP belief-update assumptions (external to the paper) rather than any fitted parameter renamed as prediction, self-citation load-bearing uniqueness theorem, or ansatz smuggled from prior work by the same authors. No equation or step in the described framework reduces by construction to its own inputs; the outperformance claims are evaluated against external baselines and delay-shift robustness tests. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the approach rests on standard POMDP belief-state assumptions and the applicability of sequential filtering to delayed streams, with no free parameters or new entities described.

axioms (1)
  • domain assumption POMDP belief states can be sequentially updated from an incoming stream of possibly delayed observations.
    This is the core premise enabling the proposed filtering process.

pith-pipeline@v0.9.0 · 5699 in / 1188 out tokens · 39366 ms · 2026-05-18T14:30:24.435306+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    A cerebellar-based solution to the nondeterministic time delay problem in robotic control

    Ignacio Abad \' a, Francisco Naveros, Eduardo Ros, Richard R Carrillo, and Niceto R Luque. A cerebellar-based solution to the nondeterministic time delay problem in robotic control. Science Robotics, 6 0 (58): 0 eabf2756, 2021

  3. [3]

    Closed-loop control with delayed information

    Eitan Altman and Philippe Nain. Closed-loop control with delayed information. ACM sigmetrics performance evaluation review, 20 0 (1): 0 193--204, 1992

  4. [4]

    Update with out-of-sequence measurements in tracking: exact solution

    Yaakov Bar-Shalom. Update with out-of-sequence measurements in tracking: exact solution. IEEE Transactions on aerospace and electronic systems, 38 0 (3): 0 769--777, 2002

  5. [5]

    Reinforcement learning with random delays

    Yann Bouteiller, Simon Ramstedt, Giovanni Beltrame, Christopher Pal, and Jonathan Binas. Reinforcement learning with random delays. In International conference on learning representations, 2020

  6. [6]

    Bayesian filtering: From kalman filters to particle filters, and beyond

    Zhe Chen et al. Bayesian filtering: From kalman filters to particle filters, and beyond. Statistics, 182 0 (1): 0 1--69, 2003

  7. [7]

    Acting in delayed environments with non-stationary markov policies

    Esther Derman, Gal Dalal, and Shie Mannor. Acting in delayed environments with non-stationary markov policies. arXiv preprint arXiv:2101.11992, 2021

  8. [8]

    Communication delay in uav missions: A controller gain analysis to improve flight stability

    Leonardo Alves Fagundes-Junior, Andre Fialho Coelho, Daniel Khede Dourado Villa, Mario Sarcinelli-Filho, and Alexandre Santos Brand \ a o. Communication delay in uav missions: A controller gain analysis to improve flight stability. IEEE Latin America Transactions, 21 0 (1): 0 7--15, 2023

  9. [9]

    Recurrent world models facilitate policy evolution

    David Ha and J \"u rgen Schmidhuber. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31, 2018

  10. [10]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp.\ 1861--1870. Pmlr, 2018

  11. [11]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pp.\ 2555--2565. PMLR, 2019

  12. [12]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  13. [13]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  14. [14]

    Mastering diverse control tasks through world models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, pp.\ 1--7, 2025

  15. [15]

    Temporal difference learning for model predictive control

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022

  16. [16]

    Deep variational reinforcement learning for pomdps

    Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for pomdps. In International conference on machine learning, pp.\ 2117--2126. PMLR, 2018

  17. [17]

    When to trust your model: Model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019

  18. [18]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101 0 (1-2): 0 99--134, 1998

  19. [19]

    Reinforcement learning from delayed observations via world models

    Armin Karamzade, Kyungmin Kim, Montek Kalsi, and Roy Fox. Reinforcement learning from delayed observations via world models. arXiv preprint arXiv:2403.12309, 2024

  20. [20]

    Markov decision processes with delays and asynchronous cost collection

    Konstantinos V Katsikopoulos and Sascha E Engelbrecht. Markov decision processes with delays and asynchronous cost collection. IEEE transactions on automatic control, 48 0 (4): 0 568--574, 2003

  21. [21]

    Belief projection-based reinforcement learning for environments with delayed feedback

    Jangwon Kim, Hangyeol Kim, Jiwook Kang, Jongchan Baek, and Soohee Han. Belief projection-based reinforcement learning for environments with delayed feedback. Advances in Neural Information Processing Systems, 36: 0 678--696, 2023

  22. [22]

    A partially observable markov decision process with lagged information

    Soung Hie Kim and Byung Ho Jeong. A partially observable markov decision process with lagged information. Journal of the Operational Research Society, 38 0 (5): 0 439--446, 1987

  23. [23]

    Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model

    Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33: 0 741--752, 2020

  24. [24]

    Learning a belief representation for delayed reinforcement learning

    Pierre Liotet, Erick Venneri, and Marcello Restelli. Learning a belief representation for delayed reinforcement learning. In 2021 International Joint Conference on Neural Networks (IJCNN), pp.\ 1--8. IEEE, 2021

  25. [25]

    Delayed reinforcement learning by imitation

    Pierre Liotet, Davide Maran, Lorenzo Bisi, and Marcello Restelli. Delayed reinforcement learning by imitation. In International conference on machine learning, pp.\ 13528--13556. PMLR, 2022

  26. [26]

    Particle filter recurrent neural networks

    Xiao Ma, Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter recurrent neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 5101--5108, 2020 a

  27. [27]

    Discriminative particle filter reinforcement learning for complex partial observations

    Xiao Ma, Peter Karkus, David Hsu, Wee Sun Lee, and Nan Ye. Discriminative particle filter reinforcement learning for complex partial observations. arXiv preprint arXiv:2002.09884, 2020 b

  28. [28]

    Setting up a reinforcement learning task with a real-world robot

    A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 4635--4640. IEEE, 2018

  29. [29]

    Transform- ers are sample-efficient world models.arXiv preprint arXiv:2209.00588,

    Vincent Micheli, Eloi Alonso, and Fran c ois Fleuret. Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588, 2022

  30. [30]

    Control delay in reinforcement learning for real-time dynamic systems: A memoryless approach

    Erik Schuitema, Lucian Bu s oniu, Robert Babu s ka, and Pieter Jonker. Control delay in reinforcement learning for real-time dynamic systems: A memoryless approach. In 2010 IEEE/RSJ international conference on intelligent robots and systems, pp.\ 3226--3231. IEEE, 2010

  31. [31]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp.\ 5026--5033. IEEE, 2012

  32. [32]

    Tree search-based policy optimization under stochastic execution delay

    David Valensi, Esther Derman, Shie Mannor, and Gal Dalal. Tree search-based policy optimization under stochastic execution delay. arXiv preprint arXiv:2404.05440, 2024

  33. [33]

    Planning and learning in environments with delayed feedback

    Thomas J Walsh, Ali Nouri, Lihong Li, and Michael L Littman. Planning and learning in environments with delayed feedback. In Machine Learning: ECML 2007: 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007. Proceedings 18, pp.\ 442--453. Springer, 2007

  34. [34]

    Addressing signal delay in deep reinforcement learning

    Wei Wang, Dongqi Han, Xufang Luo, and Dongsheng Li. Addressing signal delay in deep reinforcement learning. In The Twelfth International Conference on Learning Representations, 2023

  35. [35]

    Variational delayed policy optimization

    Qingyuan Wu, Simon S Zhan, Yixuan Wang, Yuhui Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, and Chao Huang. Variational delayed policy optimization. Advances in Neural Information Processing Systems, 37: 0 54330--54356, 2024 a

  36. [36]

    Boosting reinforcement learning with strongly delayed feedback through auxiliary short delays

    Qingyuan Wu, Simon Sinong Zhan, Yixuan Wang, Yuhui Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, J \"u rgen Schmidhuber, and Chao Huang. Boosting reinforcement learning with strongly delayed feedback through auxiliary short delays. arXiv preprint arXiv:2402.03141, 2024 b

  37. [37]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning.CoRR, abs/1910.10897, 2019

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019. URL https://arxiv.org/abs/1910.10897

  38. [38]

    Storm: Efficient stochastic transformer based world models for reinforcement learning

    Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 27147--27166, 2023

  39. [39]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  40. [40]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  41. [41]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...