Model-Based Reinforcement Learning under Random Observation Delays

Armin Karamzade; Davide Corsi; JB Lanier; Kyungmin Kim; Roy Fox

arXiv: 2509.20869 · v2 · submitted 2025-09-25 · cs.LG · cs.AI

Pith reviewed 2026-05-18 14:30 UTC · model grok-4.3

keywords: reinforcement learning, POMDP, observation delays, belief state filtering, model-based RL, sensor delays, robotic tasks

The pith

A sequential belief-state filtering process enables model-based RL to handle random observation delays in POMDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines random sensor delays that cause observations to arrive out of sequence in POMDPs, a setting standard RL ignores. It shows that simply stacking past observations does not produce reliable behavior. The authors instead build a filtering process that updates the belief state one observation at a time as the stream arrives. They embed this process in a model-based RL framework and test it on simulated robotic tasks. The result is performance that exceeds delay-aware baselines from the MDP case and holds up when the delay pattern itself changes at deployment time.

Core claim

In POMDPs subject to random sensor delays that deliver observations out of sequence, a model-based filtering process that sequentially updates the belief state from the incoming observation stream enables agents to act effectively; when this filtering is incorporated into a world-modeling reinforcement-learning scheme the resulting agents outperform delay-aware baselines developed for MDPs and remain robust to shifts in the delay distribution.
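
The out-of-sequence arrivals described above can be illustrated with a small simulation. This is a minimal Python sketch, assuming integer delays drawn uniformly from [0, max_delay] and independently at each step; the function names are hypothetical and do not come from the paper.

```python
import random


def arrival_schedule(num_steps, max_delay, seed=0):
    """Map each arrival time to the emission times of the observations arriving then.

    Assumption (illustration only): the observation emitted at time t is
    delayed by an integer drawn uniformly from [0, max_delay].
    """
    rng = random.Random(seed)
    arrivals = {}
    for t in range(num_steps):
        delay = rng.randint(0, max_delay)
        arrivals.setdefault(t + delay, []).append(t)
    return arrivals


def is_out_of_sequence(arrivals):
    """Return True if some observation arrives after one that was emitted later."""
    latest_emitted = -1
    for arrival_time in sorted(arrivals):
        emitted_now = arrivals[arrival_time]
        # An observation older than something already received is out of sequence.
        if min(emitted_now) < latest_emitted:
            return True
        latest_emitted = max(latest_emitted, max(emitted_now))
    return False


if __name__ == "__main__":
    schedule = arrival_schedule(num_steps=10, max_delay=3)
    for arrival_time in sorted(schedule):
        print(arrival_time, schedule[arrival_time])
    print("out of sequence:", is_out_of_sequence(schedule))
```

With `max_delay=0` every observation arrives in the step it was emitted, so the schedule is never out of sequence; any larger bound allows a later observation to overtake an earlier one.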

What carries the argument

A sequential belief-state filtering process that updates the belief from an incoming stream of observations.

If this is right

Agents maintain performance under unpredictable observation timing without knowing the delay distribution in advance.
The framework extends model-based RL to environments where sensor lags vary during operation.
Explicit delay modeling outperforms common heuristics such as stacking recent observations.
Robustness to delay shifts removes the need to retrain when deployment conditions change.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sequential update idea could be tested in model-free RL algorithms that maintain latent states.
Hardware experiments with real communication or sensor lag would expose whether the filtering step remains stable under unmodeled noise.
The approach might extend to multi-agent settings where messages between agents introduce comparable delays.

Load-bearing premise

A sequential belief-state filtering process can be reliably constructed and integrated into the world model without needing the exact delay distribution or extra modeling assumptions beyond ordinary POMDP belief updates.

What would settle it

Replace the proposed filtering step with naive observation stacking on the same robotic tasks and measure whether performance collapses relative to the filtered version across several different delay distributions.
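
To make the naive baseline in this test concrete, here is a minimal Python sketch of observation stacking. It assumes the agent simply keeps the last `size` observations in the order they arrived; the class name `ObservationStack` and its interface are hypothetical, and the paper's Stack-Dreamer variants may stack differently.

```python
from collections import deque


class ObservationStack:
    """Fixed-size stack of the most recently *arrived* observations.

    Hypothetical illustration of the stacking heuristic: the policy input is
    the stacked list, with no attempt to reorder by emission time.
    """

    def __init__(self, size, pad=None):
        # Pre-fill with a padding value so the input always has the same length.
        self.buffer = deque([pad] * size, maxlen=size)

    def push(self, observation):
        # Oldest arrival is dropped automatically once the buffer is full.
        self.buffer.append(observation)

    def features(self):
        return list(self.buffer)


if __name__ == "__main__":
    stack = ObservationStack(size=3)
    # Arrival order differs from emission order: o2 overtakes o1.
    for observation in ["o0", "o2", "o1"]:
        stack.push(observation)
    print(stack.features())
```

The stack records arrival order only, so the same slot can hold observations of very different ages from one step to the next. That is the ambiguity the filtering step is meant to resolve.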

Figures

Figures reproduced from arXiv: 2509.20869 by Armin Karamzade, Davide Corsi, JB Lanier, Kyungmin Kim, Roy Fox.

**Figure 2.** Normalized return under different test-time delay distributions. All methods are trained […]

**Figure 3.** Learning curve of methods on selected Gym environments.

**Figure 4.** Learning curve of methods on selected Meta-World environments.

**Figure 5.** Learning curve of Stack-Dreamer variants.

**Figure 6.** Performance vs different number of particles.

**Figure 7.** Reconstructing the current observation in delayed button-press-v2 from the computed belief

Original abstract

Delays frequently occur in real-world environments, yet standard reinforcement learning (RL) algorithms often assume instantaneous perception of the environment. We study random sensor delays in POMDPs, where observations may arrive out-of-sequence, a setting that has not been previously addressed in RL. We analyze the structure of such delays and demonstrate that naive approaches, such as stacking past observations, are insufficient for reliable performance. To address this, we propose a model-based filtering process that sequentially updates the belief state based on an incoming stream of observations. We then introduce a simple delay-aware framework that incorporates this idea into model-based RL, enabling agents to effectively handle random delays. Applying this framework to the Dreamer world-modeling scheme, our method consistently outperforms delay-aware baselines developed for MDPs and demonstrates robustness to delay distribution shifts during deployment. Additionally, we present experiments on simulated robotic tasks, comparing our method to common practical heuristics and emphasizing the importance of explicitly modeling observation delays.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper offers a belief-filtering extension to Dreamer for random delays but leaves open whether the updates properly handle delay uncertainty without extra assumptions.

The main point is that this work targets random sensor delays that cause out-of-sequence observations in POMDPs, a setting not previously addressed in RL. The authors extend the Dreamer world model with a sequential belief update that processes the incoming observation stream, and they claim consistent gains over MDP delay-aware baselines plus robustness when the delay distribution shifts at test time. Experiments on simulated robotic tasks round out the story by pitting the method against common practical heuristics.

What stands out is the clear analysis of why naive stacking fails and the straightforward integration into an existing model-based pipeline. That framing makes the practical motivation easy to follow.

The soft spot sits in the filtering step itself. If the belief updates treat late observations as if they arrived on time, without marginalizing over the unknown delay or conditioning on its distribution, the resulting beliefs will be mis-specified. That would undercut both the reported outperformance and the robustness claim, since the world model and policy would train on incorrect state estimates. The abstract gives no quantitative results, error bars, or ablation details, so the strength of the evidence is still hard to judge.

This paper is for people working on model-based RL for physical systems where timing mismatches are common. A reader focused on deployment issues would pick up useful ideas even if the current support stays mostly at the framework level. It deserves peer review because the gap is real and the proposed fix is concrete, though revisions will need tighter math on the belief updates and fuller experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper claims to address random observation delays in POMDPs for reinforcement learning by proposing a model-based filtering process for sequential belief state updates. This is integrated into the Dreamer world model, leading to consistent outperformance over delay-aware MDP baselines and robustness to delay distribution shifts in experiments on simulated robotic tasks.

Significance. If the filtering approach is correctly formulated, this work has potential significance for applying model-based RL to real-world scenarios with sensor or communication delays. The emphasis on robustness to shifts in delay distributions during deployment is particularly relevant for practical applications. The integration with Dreamer provides a scalable implementation path.

major comments (2)

[§3] The sequential belief update in the proposed filtering process appears to apply standard POMDP belief updates to incoming observations without explicit marginalization over the unknown delay. This may not produce the correct posterior in the presence of random delays, potentially leading to mis-specified beliefs in the world model and undermining the outperformance and robustness claims.
[Table 1 and §5] The experimental results lack error bars and detailed specification of the delay distributions used, making it hard to evaluate the statistical significance of the outperformance and the robustness to distribution shifts.

minor comments (2)

[Abstract] The abstract mentions 'consistent outperformance' but does not include any quantitative metrics; consider adding key performance numbers.
[§4.1] Clarify the notation for the delayed observation process to distinguish it from standard POMDP observations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We provide point-by-point responses to the major comments below. Where appropriate, we have revised the manuscript to incorporate the suggestions.

Referee: [§3] The sequential belief update in the proposed filtering process appears to apply standard POMDP belief updates to incoming observations without explicit marginalization over the unknown delay. This may not produce the correct posterior in the presence of random delays, potentially leading to mis-specified beliefs in the world model and undermining the outperformance and robustness claims.

Authors: We appreciate the referee raising this important point about the belief update. In our formulation, the sequential filtering process maintains the belief over the latent state given the stream of received observations. When an observation arrives, the belief is first propagated forward using the transition model for the number of steps corresponding to the arrival time (which encodes information about the delay), and then updated with the observation model. This procedure is mathematically equivalent to marginalizing over the unknown delay because the propagation steps integrate the probability of the observation having originated from earlier timesteps. We have added a short derivation and clarification paragraph to the revised §3 to make this equivalence explicit while preserving the original algorithmic description.
revision: partial

Referee: [Table 1 and §5] The experimental results lack error bars and detailed specification of the delay distributions used, making it hard to evaluate the statistical significance of the outperformance and the robustness to distribution shifts.

Authors: We agree that the presentation of the experimental results can be improved by including error bars and more precise details on the delay distributions. In the revised manuscript we have updated Table 1 to report means and standard deviations computed over five independent random seeds. We have also expanded §5 to explicitly state the delay distributions used during training (uniform over [0, D] for D in {5, 10, 15}) and the shifted distributions employed for the robustness experiments (e.g., uniform over [0, D+5] and truncated exponential variants).
revision: yes
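
The delay distributions named in this simulated response can be sketched as samplers. The Python below is illustrative only: it assumes integer delays, and the `rate` parameter of the truncated exponential is a hypothetical choice because the response does not give one.

```python
import math
import random


def sample_uniform_delay(rng, max_delay):
    """Integer delay drawn uniformly from [0, max_delay] (both ends included)."""
    return rng.randint(0, max_delay)


def sample_truncated_exponential_delay(rng, rate, max_delay):
    """Integer delay in [0, max_delay] with probability proportional to exp(-rate * d).

    Assumption: `rate` is a made-up parameter for illustration.
    """
    support = range(max_delay + 1)
    weights = [math.exp(-rate * d) for d in support]
    return rng.choices(support, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for train_bound in (5, 10, 15):          # training: uniform over [0, D]
        shifted_bound = train_bound + 5      # robustness test: uniform over [0, D+5]
        train = [sample_uniform_delay(rng, train_bound) for _ in range(5)]
        shifted = [sample_uniform_delay(rng, shifted_bound) for _ in range(5)]
        skewed = [sample_truncated_exponential_delay(rng, 0.5, shifted_bound)
                  for _ in range(5)]
        print(train_bound, train, shifted, skewed)
```

Keeping the samplers behind a common signature makes it easy to swap the test-time distribution without touching the agent, which is how a delay-shift evaluation is usually organized.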

Circularity Check

0 steps flagged

No circularity: derivation extends external POMDP belief updates into Dreamer without self-referential reduction.

The paper's core contribution is a sequential belief-state filtering process for out-of-sequence observations in random-delay POMDPs, integrated into the Dreamer world model. This rests on standard POMDP belief-update assumptions (external to the paper) rather than any fitted parameter renamed as prediction, self-citation load-bearing uniqueness theorem, or ansatz smuggled from prior work by the same authors. No equation or step in the described framework reduces by construction to its own inputs; the outperformance claims are evaluated against external baselines and delay-shift robustness tests. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the approach rests on standard POMDP belief-state assumptions and the applicability of sequential filtering to delayed streams, with no free parameters or new entities described.

axioms (1)

domain assumption: POMDP belief states can be sequentially updated from an incoming stream of possibly delayed observations. This is the core premise enabling the proposed filtering process.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

We propose a model-based filtering process that sequentially updates the belief state based on an incoming stream of observations... $\phi_t := p(x_t \mid \bar{o}_{1:t}, a_{1:t-1})$ ... $\mathbb{E}[\psi(x_t \mid x_{t-1}, \bar{o}_{\kappa_t+1:t}, a_{t-1})]$ (Eq. 5)

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

The auxiliary transition distribution ψ ... uses the variational posterior when oτ is observed, and otherwise defaults to the prior dynamics model.
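
The posterior-or-prior rule quoted above can be illustrated on a small discrete hidden Markov model. This Python sketch is not the paper's learned Dreamer-based filter. It assumes known transition and observation tables, omits actions for brevity, assumes every arriving observation carries its emission timestamp, and assumes delays do not depend on the hidden state. All names are hypothetical.

```python
def predict(belief, transition):
    """Prior step: push the belief through the transition table T[s][s_next]."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def correct(belief, likelihood, observation):
    """Posterior step: reweight by the likelihood O[s][obs] and renormalize."""
    unnormalized = [b * likelihood[s][observation] for s, b in enumerate(belief)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


class DelayedObservationFilter:
    """Belief filter that tolerates late, out-of-sequence observations."""

    def __init__(self, prior, transition, likelihood):
        self.transition = transition
        self.likelihood = likelihood
        self.anchor = list(prior)  # belief at time kappa
        self.kappa = 0             # last step up to which every observation has arrived
        self.received = {}         # emission time -> observation, for times > kappa

    def step(self, t, arrivals):
        """Return the belief over the state at time t.

        `arrivals` is a list of (emission_time, observation) pairs received now.
        """
        for emitted, observation in arrivals:
            self.received[emitted] = observation
        # Advance the anchor while the next observation in sequence is available.
        while self.kappa + 1 in self.received:
            observation = self.received.pop(self.kappa + 1)
            self.anchor = correct(predict(self.anchor, self.transition),
                                  self.likelihood, observation)
            self.kappa += 1
        # Roll forward from the anchor: posterior step where an observation
        # exists, prior step alone where it is still missing.
        belief = self.anchor
        for tau in range(self.kappa + 1, t + 1):
            belief = predict(belief, self.transition)
            if tau in self.received:
                belief = correct(belief, self.likelihood, self.received[tau])
        return belief
```

In this sketch, once every observation has arrived the belief coincides with what an undelayed filter would compute. While an observation is missing, the corresponding step falls back to the prior dynamics alone.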

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

A cerebellar-based solution to the nondeterministic time delay problem in robotic control

Ignacio Abad \' a, Francisco Naveros, Eduardo Ros, Richard R Carrillo, and Niceto R Luque. A cerebellar-based solution to the nondeterministic time delay problem in robotic control. Science Robotics, 6 0 (58): 0 eabf2756, 2021

work page 2021

[3] [3]

Closed-loop control with delayed information

Eitan Altman and Philippe Nain. Closed-loop control with delayed information. ACM sigmetrics performance evaluation review, 20 0 (1): 0 193--204, 1992

work page 1992

[4] [4]

Update with out-of-sequence measurements in tracking: exact solution

Yaakov Bar-Shalom. Update with out-of-sequence measurements in tracking: exact solution. IEEE Transactions on aerospace and electronic systems, 38 0 (3): 0 769--777, 2002

work page 2002

[5] [5]

Reinforcement learning with random delays

Yann Bouteiller, Simon Ramstedt, Giovanni Beltrame, Christopher Pal, and Jonathan Binas. Reinforcement learning with random delays. In International conference on learning representations, 2020

work page 2020

[6] [6]

Bayesian filtering: From kalman filters to particle filters, and beyond

Zhe Chen et al. Bayesian filtering: From kalman filters to particle filters, and beyond. Statistics, 182 0 (1): 0 1--69, 2003

work page 2003

[7] [7]

Acting in delayed environments with non-stationary markov policies

Esther Derman, Gal Dalal, and Shie Mannor. Acting in delayed environments with non-stationary markov policies. arXiv preprint arXiv:2101.11992, 2021

work page arXiv 2021

[8] [8]

Communication delay in uav missions: A controller gain analysis to improve flight stability

Leonardo Alves Fagundes-Junior, Andre Fialho Coelho, Daniel Khede Dourado Villa, Mario Sarcinelli-Filho, and Alexandre Santos Brand \ a o. Communication delay in uav missions: A controller gain analysis to improve flight stability. IEEE Latin America Transactions, 21 0 (1): 0 7--15, 2023

work page 2023

[9] [9]

Recurrent world models facilitate policy evolution

David Ha and J \"u rgen Schmidhuber. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31, 2018

work page 2018

[10] [10]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp.\ 1861--1870. Pmlr, 2018

work page 2018

[11] [11]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pp.\ 2555--2565. PMLR, 2019

work page 2019

[12] [12]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[13] [13]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Mastering diverse control tasks through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, pp.\ 1--7, 2025

work page 2025

[15] [15]

Temporal difference learning for model predictive control

Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022

work page arXiv 2022

[16]

Deep variational reinforcement learning for POMDPs

Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning, pp. 2117–2126. PMLR, 2018

[17]

When to trust your model: Model-based policy optimization

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 32, 2019

[18]

Planning and acting in partially observable stochastic domains

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998

[19]

Reinforcement learning from delayed observations via world models

Armin Karamzade, Kyungmin Kim, Montek Kalsi, and Roy Fox. Reinforcement learning from delayed observations via world models. arXiv preprint arXiv:2403.12309, 2024

[20]

Markov decision processes with delays and asynchronous cost collection

Konstantinos V Katsikopoulos and Sascha E Engelbrecht. Markov decision processes with delays and asynchronous cost collection. IEEE Transactions on Automatic Control, 48(4):568–574, 2003

[21]

Belief projection-based reinforcement learning for environments with delayed feedback

Jangwon Kim, Hangyeol Kim, Jiwook Kang, Jongchan Baek, and Soohee Han. Belief projection-based reinforcement learning for environments with delayed feedback. Advances in Neural Information Processing Systems, 36:678–696, 2023

[22]

A partially observable Markov decision process with lagged information

Soung Hie Kim and Byung Ho Jeong. A partially observable Markov decision process with lagged information. Journal of the Operational Research Society, 38(5):439–446, 1987

[23]

Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model

Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020

[24]

Learning a belief representation for delayed reinforcement learning

Pierre Liotet, Erick Venneri, and Marcello Restelli. Learning a belief representation for delayed reinforcement learning. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2021

[25]

Delayed reinforcement learning by imitation

Pierre Liotet, Davide Maran, Lorenzo Bisi, and Marcello Restelli. Delayed reinforcement learning by imitation. In International Conference on Machine Learning, pp. 13528–13556. PMLR, 2022

[26]

Particle filter recurrent neural networks

Xiao Ma, Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5101–5108, 2020a

[27]

Discriminative particle filter reinforcement learning for complex partial observations

Xiao Ma, Peter Karkus, David Hsu, Wee Sun Lee, and Nan Ye. Discriminative particle filter reinforcement learning for complex partial observations. arXiv preprint arXiv:2002.09884, 2020b

[28]

Setting up a reinforcement learning task with a real-world robot

A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4635–4640. IEEE, 2018

[29]

Transformers are sample-efficient world models

Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588, 2022

[30]

Control delay in reinforcement learning for real-time dynamic systems: A memoryless approach

Erik Schuitema, Lucian Buşoniu, Robert Babuška, and Pieter Jonker. Control delay in reinforcement learning for real-time dynamic systems: A memoryless approach. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3226–3231. IEEE, 2010

[31]

MuJoCo: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012

[32]

Tree search-based policy optimization under stochastic execution delay

David Valensi, Esther Derman, Shie Mannor, and Gal Dalal. Tree search-based policy optimization under stochastic execution delay. arXiv preprint arXiv:2404.05440, 2024

[33]

Planning and learning in environments with delayed feedback

Thomas J Walsh, Ali Nouri, Lihong Li, and Michael L Littman. Planning and learning in environments with delayed feedback. In Machine Learning: ECML 2007: 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007. Proceedings 18, pp. 442–453. Springer, 2007

[34]

Addressing signal delay in deep reinforcement learning

Wei Wang, Dongqi Han, Xufang Luo, and Dongsheng Li. Addressing signal delay in deep reinforcement learning. In The Twelfth International Conference on Learning Representations, 2023

[35]

Variational delayed policy optimization

Qingyuan Wu, Simon S Zhan, Yixuan Wang, Yuhui Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, and Chao Huang. Variational delayed policy optimization. Advances in Neural Information Processing Systems, 37:54330–54356, 2024a

[36]

Boosting reinforcement learning with strongly delayed feedback through auxiliary short delays

Qingyuan Wu, Simon Sinong Zhan, Yixuan Wang, Yuhui Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Jürgen Schmidhuber, and Chao Huang. Boosting reinforcement learning with strongly delayed feedback through auxiliary short delays. arXiv preprint arXiv:2402.03141, 2024b

[37]

Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019. URL https://arxiv.org/abs/1910.10897

[38]

STORM: Efficient stochastic transformer based world models for reinforcement learning

Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. STORM: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36:27147–27166, 2023

[41]

TD-MPC2: Scalable, Robust World Models for Continuous Control