DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions

Matthias Schubert; Niklas Strauss; Xiao Han; Yusong Li; Zongyue Li

arxiv: 2509.19538 · v2 · pith:EDJTSMWDnew · submitted 2025-09-23 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-18 13:51 UTC · model grok-4.3

classification cs.LG · cs.AI

keywords offline reinforcement learning · diffusion models · world models · inverse dynamics · data augmentation · D4RL benchmark · temporal difference learning

The pith

A diffusion model generates state-reward trajectories conditioned on the current state, action, and return-to-go, then an inverse dynamics model infers the missing actions to produce complete transitions for standard offline RL training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that diffusion-based world models can be made directly compatible with one-step temporal difference learning by splitting the generation of future states and rewards from the recovery of actions. It does this through a conditional diffusion process that takes current state, chosen action, and return-to-go as input, followed by a separate inverse dynamics model that fills in the action for each synthetic transition. A sympathetic reader would care because most practical offline RL algorithms rely on simple one-step TD updates rather than long-horizon planning, and prior diffusion world models either omitted actions or required complex joint modeling that hurt performance. If the separation works, offline agents can train on much larger synthetic datasets without suffering from the distribution shift that usually arises when actions are missing or poorly predicted.
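The return-to-go used as a conditioning signal can be illustrated with a short sketch. This is not taken from the paper: the function name `returns_to_go` and the optional discount `gamma` are assumptions made here for illustration, and whether DAWM uses a discounted or undiscounted return is not stated in this summary.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return g_t = sum_{k >= t} gamma^(k - t) * r_k for every step t.

    gamma = 1.0 gives the plain (undiscounted) sum of future rewards.
    The discount is an assumption of this sketch, not a detail from the paper.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Toy rewards, invented for illustration only.
example = returns_to_go([1.0, 2.0, 3.0])
```

Each entry of the result would be paired with the state and action at the same time step when conditioning the generator.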

Core claim

DAWM generates future state-reward trajectories with a diffusion model conditioned on the current state, action, and return-to-go, then uses an inverse dynamics model to infer the actions that complete each transition. This modular construction supplies ready-to-use synthetic transitions that conservative offline RL methods such as TD3BC and IQL can consume with ordinary one-step TD learning. On the D4RL benchmark the resulting augmented datasets produce consistent gains over earlier diffusion-based world models across multiple tasks.
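The two-stage construction described above can be sketched as follows. Everything here is a hypothetical stand-in: `build_transitions`, `toy_world_model`, and `toy_idm` are names invented for this example, the toy dynamics replace the real diffusion sampler and learned inverse dynamics model, and the alignment of each reward with the state it leaves is an assumed convention. The sketch also infers every action with the IDM, including the first; whether the paper reuses the conditioning action for the first step is not specified in this summary.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
Transition = Tuple[State, Action, float, State]  # (s, a, r, s_next)


def build_transitions(
    s0: State,
    a0: Action,
    rtg: float,
    world_model: Callable[[State, Action, float], List[Tuple[State, float]]],
    idm: Callable[[State, State], Action],
) -> List[Transition]:
    """Turn one generated state-reward rollout into (s, a, r, s_next) tuples."""
    # world_model stands in for the conditional diffusion sampler and is
    # assumed to return [(s_0, r_0), (s_1, r_1), ..., (s_H, r_H)].
    rollout = world_model(s0, a0, rtg)
    transitions: List[Transition] = []
    for (s, r), (s_next, _) in zip(rollout, rollout[1:]):
        a_hat = idm(s, s_next)  # action inferred from consecutive states
        transitions.append((s, a_hat, r, s_next))
    return transitions


def toy_world_model(s0: State, a0: Action, rtg: float, horizon: int = 3):
    # Hypothetical 1-D dynamics: the state drifts by a0 at every step
    # and every step yields a reward of 1.0. rtg is ignored by this toy.
    return [([s0[0] + k * a0[0]], 1.0) for k in range(horizon + 1)]


def toy_idm(s: State, s_next: State) -> Action:
    # Hypothetical inverse dynamics: the action is the state difference.
    return [s_next[0] - s[0]]


transitions = build_transitions([0.0], [0.5], 3.0, toy_world_model, toy_idm)
```

A rollout of H + 1 generated states yields H transitions, each of which has the (s, a, r, s') shape that a one-step TD learner expects.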

What carries the argument

The modular pairing of a diffusion model that synthesizes state-reward sequences conditioned on state, action and return-to-go with a separate inverse dynamics model that recovers the actions needed for one-step TD updates.

If this is right

Conservative offline RL algorithms gain access to larger effective datasets without changing their core one-step TD update rule.
The same synthetic transitions can be reused across multiple value-based offline methods rather than being tied to a single planning or model-based procedure.
Training remains computationally lighter than joint diffusion models that must generate states, rewards, and actions in one coupled process.
Performance improvements appear on standard D4RL tasks when the generated trajectories are mixed with real data.
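A minimal sketch of how real and generated transitions might be mixed and fed to a one-step TD update is given below. The names `sample_mixed_batch`, `td_targets`, and the `synthetic_ratio` parameter are assumptions made for illustration; the paper's actual mixing scheme is not described in this summary. The `next_value` callable abstracts over the bootstrapped term, which would be a target critic evaluated at the policy's next action in TD3BC and a state-value network in IQL.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (s, a, r, s_next, done)
Transition = Tuple[Sequence[float], Sequence[float], float, Sequence[float], bool]


def sample_mixed_batch(
    real: List[Transition],
    synthetic: List[Transition],
    batch_size: int,
    synthetic_ratio: float,
    rng: random.Random,
) -> List[Transition]:
    """Draw a batch with a fixed share of synthetic transitions."""
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must lie in [0, 1]")
    n_syn = round(batch_size * synthetic_ratio)
    n_real = batch_size - n_syn
    # Sampling with replacement keeps the sketch valid for small datasets.
    return rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_syn)


def td_targets(
    batch: List[Transition],
    next_value: Callable[[Sequence[float]], float],
    gamma: float = 0.99,
) -> List[float]:
    """One-step TD target y = r + gamma * (1 - done) * V(s_next)."""
    return [
        r + gamma * (0.0 if done else 1.0) * next_value(s_next)
        for (_, _, r, s_next, done) in batch
    ]
```

Because the update rule only sees (s, a, r, s', done) tuples, the same mixed batch could in principle be handed to either learner without modification.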

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modular split could be tested in online fine-tuning settings where a small amount of real interaction data is available to correct any residual errors in the inferred actions.
If the inverse dynamics model is itself learned from the same offline dataset, the overall method may still inherit biases present in the original data distribution.
Extending the conditioning to include predicted future actions rather than only the immediate action might reduce the number of steps needed before compounding errors appear.

Load-bearing premise

The synthetic transitions created by the diffusion model plus inverse dynamics model are accurate enough that one-step TD training on the mixture does not introduce harmful distribution shift or compounding prediction errors.
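One way to probe this premise is to compare the actions used for conditioning with the actions the inverse dynamics model recovers on the generated states. The sketch below computes a mean absolute error over all action dimensions; the function name `action_mae` and the list-of-lists input format are assumptions made for illustration, and no such metric is reported in the paper according to this review.

```python
from typing import Sequence


def action_mae(
    conditioning_actions: Sequence[Sequence[float]],
    inferred_actions: Sequence[Sequence[float]],
) -> float:
    """Mean absolute error between conditioning and IDM-inferred actions."""
    if len(conditioning_actions) != len(inferred_actions):
        raise ValueError("both inputs must contain the same number of actions")
    total, count = 0.0, 0
    for a, a_hat in zip(conditioning_actions, inferred_actions):
        if len(a) != len(a_hat):
            raise ValueError("action dimensions do not match")
        for x, y in zip(a, a_hat):
            total += abs(x - y)
            count += 1
    if count == 0:
        raise ValueError("no action components to compare")
    return total / count
```

A value near zero on in-distribution states that grows sharply on states far from the dataset would point to the distribution-shift risk the premise depends on.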

What would settle it

If TD3BC or IQL trained on the DAWM-augmented trajectories show no improvement or clear degradation relative to the same algorithms trained only on the original dataset or prior diffusion baselines across several D4RL environments, the central claim would be falsified.

Original abstract

Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose DAWM, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DAWM, a diffusion-based world model for offline RL. It generates future state-reward trajectories conditioned on current state, action, and return-to-go using a diffusion model, then applies an inverse dynamics model (IDM) to infer actions and produce complete synthetic transitions. These augmentations are used to train conservative offline RL algorithms such as TD3BC and IQL, with the central empirical claim being consistent outperformance over prior diffusion-based baselines on multiple D4RL tasks.

Significance. If the generated transitions maintain consistency with the conditioning actions, the modular design could provide an efficient alternative to joint state-action-reward modeling in diffusion world models, enabling better compatibility with one-step TD-based offline RL methods and improving performance in data-limited settings.

major comments (2)

[Method (diffusion model and IDM pairing)] The central claim that augmented trajectories improve TD3BC and IQL performance rests on the unverified assumption that IDM-inferred actions â closely match the diffusion conditioning actions a on generated states; no error analysis, consistency checks, or ablations on out-of-support diffusion states are provided, risking distribution shift in the one-step TD updates.
[Experiments and results] Experimental results claim consistent outperformance but report no variance across random seeds, no details on hyperparameter controls, and no ablation isolating the IDM component, which is required to substantiate that the modular separation produces high-quality transitions.

minor comments (2)

[Abstract] The abstract refers to 'multiple tasks in the D4RL benchmark' without naming the specific tasks or reporting quantitative deltas.
[Preliminaries or Method] Notation for return-to-go conditioning and the exact diffusion conditioning inputs should be formalized with equations for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the empirical support for the modular design of DAWM.

Referee: [Method (diffusion model and IDM pairing)] The central claim that augmented trajectories improve TD3BC and IQL performance rests on the unverified assumption that IDM-inferred actions â closely match the diffusion conditioning actions a on generated states; no error analysis, consistency checks, or ablations on out-of-support diffusion states are provided, risking distribution shift in the one-step TD updates.

Authors: We agree that explicit verification of action consistency is necessary to fully substantiate the central claim. The original manuscript does not include quantitative error analysis between conditioning actions and IDM-inferred actions, nor ablations focused on out-of-support states generated by the diffusion model. In the revised version we will add these analyses, including mean absolute error metrics on both in-distribution and out-of-distribution generated states, together with an ablation measuring the effect of action mismatch on downstream TD3BC and IQL performance. This will directly address concerns about potential distribution shift in the one-step TD updates. revision: yes
Referee: [Experiments and results] Experimental results claim consistent outperformance but report no variance across random seeds, no details on hyperparameter controls, and no ablation isolating the IDM component, which is required to substantiate that the modular separation produces high-quality transitions.

Authors: We acknowledge that the reported results lack multi-seed statistics, detailed hyperparameter controls, and an explicit ablation of the IDM. In the revision we will rerun all experiments with at least five random seeds, report means and standard deviations, expand the hyperparameter section with the full search ranges and selection procedure, and add an ablation that compares the full DAWM pipeline against a variant that replaces the IDM with direct action generation or random actions. These additions will isolate the contribution of the modular IDM component and provide stronger evidence for the quality of the synthetic transitions. revision: yes
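The multi-seed reporting promised in the response above could be produced with a helper along these lines. The function name `summarize_seeds` is hypothetical and the scores in the usage line are placeholders invented for this example, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_seeds(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a deviation")
    return mean(scores), stdev(scores)


# Placeholder scores for five seeds, for illustration only.
avg, std = summarize_seeds([1.0, 2.0, 3.0, 4.0, 5.0])
```

Reporting both numbers per task and per algorithm makes it possible to judge whether a gap between methods exceeds seed-to-seed noise.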

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks

The paper proposes DAWM as a modular diffusion world model that generates state-reward trajectories conditioned on state/action/return-to-go, paired with a separate IDM for action inference. The central claim is an empirical one: that conservative offline RL methods (TD3BC, IQL) trained on the resulting augmented trajectories outperform prior diffusion baselines on D4RL tasks. No derivation chain is presented that reduces a prediction or uniqueness result to a fitted parameter or self-citation by construction. The method is introduced as a practical engineering choice to enable one-step TD learning, with performance evaluated externally on standard benchmarks rather than by re-deriving quantities already implicit in the training data or loss.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method assumes standard diffusion training objectives and the existence of a well-behaved inverse dynamics model; no new physical entities or ad-hoc constants are introduced beyond typical RL hyperparameters.

free parameters (1)

diffusion steps and conditioning weights
Typical diffusion schedule and conditioning strength parameters that must be chosen or tuned for the state-reward generator.

axioms (1)

domain assumption: The inverse dynamics model can accurately recover actions from state transitions in the learned distribution.
Invoked when the paper states that the IDM produces complete synthetic transitions suitable for one-step TD learning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost J_uniquely_calibrated_via_higher_derivative unclear

DAWM, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

PhyWorld: Physics-Faithful World Model for Video Generation
cs.CV 2026-05 unverdicted novelty 5.0

PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-fai...

