Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

Guojun Xiong; Kaixuan Liu; Shengpu Tang; Weinan Zhang

arxiv: 2606.05558 · v1 · pith:6W2PCU2Rnew · submitted 2026-06-04 · 💻 cs.LG

Pith reviewed 2026-06-28 02:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords off-policy evaluation · LLM agents · diffusion models · world models · autoregressive models · offline RL · multi-turn interactions

The pith

ADWM enables accurate offline evaluation of LLM agents by simulating step-by-step trajectories with a policy-conditioned diffusion world model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ADWM as a way to estimate how well a new LLM agent policy would perform in interactive environments, using only previously collected trajectories instead of running the policy live. It does this by training a latent diffusion model that can generate what the environment would do next, one transition at a time. Unlike earlier diffusion approaches that diffuse entire trajectories at once, ADWM treats each step separately so the agent and world model can take turns in the correct order. The policy itself steers the denoising process at every step to make sure the simulated actions and states match what the agent would actually choose. This setup promises to give reliable value estimates for multi-turn tasks without the risks or costs of real interactions.

Core claim

The central discovery is that modeling each transition in an LLM agent trajectory as an independent denoising process in a latent diffusion world model, with direct conditioning from the evaluation policy's score function, allows the generation of simulated trajectories that accurately reflect the policy's behavior and yield precise value estimates.

What carries the argument

An autoregressive diffusion world model that denoises one transition at a time and alternates with the LLM agent, which steers each denoising process through a policy-conditioned score.
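To make the alternation concrete, here is a minimal sketch of a step-by-step rollout in which an agent and a per-transition denoiser take turns. Everything in it is an assumption made for illustration: the linear toy policy (`W_POL`), the toy dynamics (`W_DYN`), the Langevin-style denoising loop, the additive guidance term with scale `lam`, and the reward read off the first latent coordinate. None of these names or choices come from the paper, and the guidance term is only a placeholder for the policy-conditioned score.

```python
import numpy as np

D, K, N_ACTIONS = 4, 10, 3  # latent size, denoising steps, toy action count (all hypothetical)
_init = np.random.default_rng(0)
W_DYN = _init.normal(scale=0.3, size=(N_ACTIONS, D, D))  # stand-in for a learned world model
W_POL = _init.normal(scale=0.5, size=(N_ACTIONS, D))     # stand-in for the evaluation policy


def policy_probs(z):
    """Toy evaluation policy pi_e(a | z): softmax over linear logits."""
    logits = W_POL @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()


def policy_guidance(z, a):
    """Gradient of log pi_e(a | z) with respect to z for the softmax-linear toy policy."""
    return W_POL[a] - policy_probs(z) @ W_POL


def denoise_transition(z_prev, a, rng, lam=0.1, step=0.05, sigma=0.5):
    """Sample one next latent with K Langevin-style steps, starting from pure noise."""
    mu = np.tanh(W_DYN[a] @ z_prev)  # toy conditional mean of the next latent
    z = rng.normal(size=D)
    for k in range(K):
        score = -(z - mu) / sigma**2                 # score of N(mu, sigma^2 I)
        score = score + lam * policy_guidance(z, a)  # placeholder guidance term
        noise = rng.normal(size=D) if k < K - 1 else np.zeros(D)
        z = z + step * score + np.sqrt(2 * step) * noise
    return z


def rollout(z0, horizon, rng):
    """Alternate agent and world model in causal order; return a toy episode return."""
    z, ret = z0, 0.0
    for _ in range(horizon):
        a = rng.choice(N_ACTIONS, p=policy_probs(z))  # agent acts after seeing the state
        z = denoise_transition(z, a, rng)             # world model produces the next latent
        ret += float(z[0] > 0)                        # hypothetical reward
    return ret


def estimate_value(n_rollouts=200, horizon=5, seed=1):
    """Monte Carlo value estimate: average return over simulated rollouts."""
    rng = np.random.default_rng(seed)
    returns = [rollout(np.zeros(D), horizon, rng) for _ in range(n_rollouts)]
    return float(np.mean(returns))
```

Calling `estimate_value()` averages the toy returns over simulated rollouts. The point of the sketch is the ordering inside `rollout`: the action is sampled only after the current latent is available, and each transition runs its own short denoising loop rather than sharing one pass over the whole trajectory.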

Load-bearing premise

That independent per-transition denoising combined with policy guidance at each step generates rollouts whose statistics match those of real interactions with the evaluation policy.

What would settle it

Run the same policies both in the real environment and through ADWM simulations on identical tasks and compare the resulting value estimates; large differences would indicate the simulations do not accurately reflect policy behavior.
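A check of this kind could be scored as in the sketch below, which compares ground-truth values with simulated estimates using mean absolute error and a rank correlation. The function names and the four value pairs are hypothetical placeholders, not results from the paper, and the rank helper assumes there are no ties.

```python
import numpy as np


def ranks(x):
    """Ranks 0..n-1 of a 1-D array (assumes no ties)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r


def compare(true_values, est_values):
    """Return (mean absolute error, Spearman rank correlation) across policies."""
    true_values = np.asarray(true_values, dtype=float)
    est_values = np.asarray(est_values, dtype=float)
    mae = float(np.mean(np.abs(true_values - est_values)))
    spearman = float(np.corrcoef(ranks(true_values), ranks(est_values))[0, 1])
    return mae, spearman


# Placeholder numbers for four hypothetical policies, not results from the paper.
true_v = [0.20, 0.45, 0.60, 0.80]
est_v = [0.25, 0.40, 0.70, 0.75]
mae, rho = compare(true_v, est_v)
```

A small absolute error says the estimates are calibrated; a high rank correlation says the simulator would at least order candidate policies correctly, which is often what matters for model selection.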

Figures

Figures reproduced from arXiv: 2606.05558 by Guojun Xiong, Kaixuan Liu, Shengpu Tang, Weinan Zhang.

**Figure 1.** Comparison of evaluation paradigms for LLM agents. (Left) On-policy evaluation requires executing the agent in the real environment, which is expensive and potentially unsafe. (Middle) Traditional off-policy evaluation learns a model-based simulator from offline data, but suffers from two fundamental issues: distribution shift between the behavior and target policies, and compounding error accumulated over…

**Figure 2.** Architecture of ADWM. Offline trajectories are encoded by $E$ into latent states $z_t$, which are processed by a diffusion world model $p_\theta$ through $K$-step denoising. A projector $G_\psi$ maps each latent to soft tokens $\tilde{o}_t$ that the evaluation policy $\pi_e$ can read in its own embedding space. $\pi_e$ plays two complementary roles: it steers the denoising process via policy guidance (dashed arrow, Section 4.2), and samples actio…

**Figure 3.** Training loss curves across all four benchmarks. Total world-model loss…

**Figure 4.** Component ablation on three environments (avg-…

Original abstract

Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framework that estimates the performance of a new LLM agent policy purely from pre-collected trajectories. The core idea is to learn a latent diffusion world model that simulates how the environment responds to the evaluation policy, without ever executing it in the real environment. Existing diffusion-based OPE methods guide full trajectories in a single pass by jointly diffusing states and actions, an assumption that breaks down for LLM agents whose actions are discrete text that must be sampled from the policy after observing the environment. Unlike autoregressive world models that suffer from compounding errors, ADWM models each transition as an independent denoising process, enabling reliable step-by-step rollouts where the world model and agent alternate in causal order. Crucially, the LLM agent under evaluation directly guides the diffusion generation at each step via a policy-conditioned score function, ensuring that simulated trajectories accurately reflect its decision-making patterns. Empirically, ADWM achieves accurate value estimates and evaluation reliability across diverse multi-turn agent tasks, demonstrating its promise as a practical framework for offline LLM agent evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ADWM's per-transition independent diffusion plus policy-guided scoring is a direct response to why joint-diffusion OPE fails on discrete-text LLM agents.

The paper's central technical move is to replace joint trajectory diffusion with independent denoising per transition, while letting the evaluation LLM policy condition the score function at every step so the world model and agent alternate causally. This targets the exact breakdown the abstract identifies in prior diffusion OPE work: discrete text actions cannot be jointly diffused because they must be sampled from the policy after seeing the state.

It does a clean job of spelling out that distinction and showing how the independent-per-transition design plus direct policy guidance produces step-by-step rollouts that stay aligned with the target policy. The claim that this avoids both the joint-diffusion incompatibility and the compounding errors of standard autoregressive world models is stated plainly and follows from the setup.

The main soft spot is that independence across transitions still leaves open whether long-horizon dependencies are captured well enough for the value estimates to stay reliable; the abstract asserts accuracy on diverse tasks, but any referee would want to see whether the gains hold when the number of turns increases or when the collected data has limited coverage of the evaluation policy. The experiments will have to carry that load.

This is for groups already working on offline evaluation or world models for interactive LLM agents. A reader who needs a practical way to test new policies without online rollouts will find the concrete technical choice useful even if they end up modifying the diffusion details.

I would send it to peer review. The problem it attacks is real, the proposed distinction is explicit, and the framing is coherent enough that referees can usefully pressure-test the empirical side.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes ADWM (Autoregressive Diffusion World Model), a framework for off-policy evaluation of LLM agents from pre-collected trajectories. It learns a latent diffusion world model that simulates environment responses via per-transition independent denoising processes, with the evaluation policy providing direct guidance through a policy-conditioned score function at each step. This enables causal step-by-step rollouts alternating between the world model and agent, claimed to avoid compounding errors and yield accurate value estimates for multi-turn tasks without online interaction.

Significance. If the empirical claims hold, the work addresses a practically important problem in safe, low-cost evaluation of LLM agents. The combination of independent transition denoising with explicit policy guidance offers a targeted solution to limitations of prior diffusion-based OPE methods on discrete actions and multi-turn settings, and could serve as a reusable offline evaluation tool.

major comments (2)

[Abstract and §3 (method)] The claim that independent per-transition denoising plus policy guidance produces reliable multi-turn rollouts without compounding errors is load-bearing for the central contribution, yet the manuscript provides no quantitative analysis of rollout error accumulation as a function of horizon length or comparison against joint-trajectory diffusion baselines on the same metric.

[§4 (experiments)] The assertion of 'accurate value estimates and evaluation reliability across diverse multi-turn agent tasks' is presented without reported numerical values for value estimation error, confidence intervals, baseline comparisons (e.g., standard OPE or autoregressive world models), dataset sizes, or ablation results on the policy-guidance component.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify areas where additional empirical support would strengthen the central claims. We address each point below and will revise the manuscript to incorporate the requested analyses.

Referee: [Abstract and §3 (method)] The claim that independent per-transition denoising plus policy guidance produces reliable multi-turn rollouts without compounding errors is load-bearing for the central contribution, yet the manuscript provides no quantitative analysis of rollout error accumulation as a function of horizon length or comparison against joint-trajectory diffusion baselines on the same metric.

Authors: We agree that explicit quantitative analysis of rollout error accumulation would provide stronger validation of the independent-denoising design. Section 3 motivates the approach via the per-transition factorization and policy-conditioned guidance, but does not include horizon-dependent error curves or direct comparisons to joint-trajectory diffusion models. In the revised manuscript we will add these: (i) plots of state/action reconstruction error versus rollout horizon on held-out trajectories, and (ii) side-by-side evaluation against a joint-trajectory diffusion baseline using the same error metric and datasets.

Revision: yes

Referee: [§4 (experiments)] The assertion of 'accurate value estimates and evaluation reliability across diverse multi-turn agent tasks' is presented without reported numerical values for value estimation error, confidence intervals, baseline comparisons (e.g., standard OPE or autoregressive world models), dataset sizes, or ablation results on the policy-guidance component.

Authors: The current experimental section reports qualitative and aggregate performance but omits the detailed numerical reporting requested. We will expand §4 to include: tables with value-estimation error (e.g., MSE or absolute error) and 95% confidence intervals, explicit dataset sizes, comparisons against standard OPE estimators and autoregressive world-model baselines, and an ablation isolating the policy-guidance term. These additions will be placed in the main text or a dedicated appendix table.

Revision: yes
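The two analyses promised in the rebuttal, error versus rollout horizon and value-estimation error with 95% confidence intervals, could be computed along the lines of the sketch below. It is only an illustration under stated assumptions: the array layout `(n_trajectories, horizon, dim)`, the L2 error metric, the percentile bootstrap, and both function names are choices made here, not the authors' protocol.

```python
import numpy as np


def horizon_error_curve(real_latents, sim_latents):
    """Mean L2 error at each step; inputs have shape (n_trajectories, horizon, dim)."""
    diff = np.asarray(real_latents, dtype=float) - np.asarray(sim_latents, dtype=float)
    return np.linalg.norm(diff, axis=-1).mean(axis=0)


def bootstrap_ci(errors, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-policy absolute errors."""
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    # Resample policies with replacement and recompute the mean each time.
    idx = rng.integers(0, len(errors), size=(n_boot, len(errors)))
    means = errors[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(errors.mean()), float(lo), float(hi)
```

A curve from `horizon_error_curve` that stays flat as the step index grows would support the no-compounding-error claim, while a rising curve would undercut it. The comparison only makes sense when real and simulated trajectories are paired step by step, for example by replaying held-out action sequences.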

Circularity Check

0 steps flagged

No significant circularity

The provided abstract and description contain no equations, derivations, or self-citations that reduce any claimed result to a fitted input or prior result by construction. The method is presented as a novel modeling choice (independent per-transition denoising with policy guidance) whose performance is asserted via empirical evaluation on tasks, without any load-bearing mathematical step that equates output to input by definition. This is the common case of a self-contained empirical proposal with no detectable circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5751 in / 1002 out tokens · 36219 ms · 2026-06-28T02:50:34.601698+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 16 canonical work pages · 10 internal anchors

[1] Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657.

[2] Kavosh Asadi, Dipendra Misra, Seungchan Kim, and Michael L. Littman. Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320.

[3] Sanjiban Choudhury and Paloma Sodhi. Better than your teacher: LLM agents that learn from privileged AI feedback. arXiv preprint arXiv:2410.05434.

[4] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[5] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.

[6] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

[7] Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357.

[8] Matthew Thomas Jackson, Michael Tryfan Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, and Jakob Foerster. Policy-guided diffusion. arXiv preprint arXiv:2404.06356.

[9] Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991.

[10] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.

[11] Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588, 2022.

[12] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[13] Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055.

[14] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

[15] Leonid Kuvayev and Rich Sutton. Model-based reinforcement learning with an approximate, learned model. In Proceedings of the ninth Yale workshop on adaptive and learning systems, volume 1996, pages 101–105, 1996.

[16] Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854.

[17] Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022.

[18] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.

[19] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

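[20]

$$\cdots\ \prod_{t=1}^{T} \pi_b(a_t \mid h_t)\, P(o_{t+1} \mid h_t, a_t) \cdot \prod_{t=1}^{T} \frac{\pi_e(a_t \mid h_t)}{\pi_b(a_t \mid h_t)} = p_{\pi_b}(\tau) \cdot \prod_{t=1}^{T} \frac{\pi_e(a_t \mid h_t)}{\pi_b(a_t \mid h_t)}, \tag{25}$$

where the environment transition $P(o_{t+1} \mid h_t, a_t)$ cancels between numerator and denominator. Taking logarithms,

$$\log p_{\pi_e}(\tau) = \log p_{\pi_b}(\tau) + \sum_{t=1}^{T} \log \pi_e(a_t \mid h_t) - \sum_{t=1}^{T} \log \pi_b(a_t \mid h_t). \tag{26}$$

Equation (26) is the starting point shared by importa...
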
[21]

...and zero-initialised output projection; the IDM and BC heads are 2-layer MLPs over (z, h). All non-linearities are SiLU except the IDM / policy / projector heads, which use ReLU and GELU respectively.

Table 4: Per-module parameter counts in the ADWM world model (7.38M parameters total for $d_z = 64$). Columns: Module | Function | Params (M) | % of total. First row: Observation encoder (f...
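The log-ratio identity quoted in entry [20] (Equation 26) can be read as a recipe for computing a trajectory-level importance ratio from per-step action log-probabilities alone, since the environment transition terms cancel. The sketch below illustrates that reading; the function name and the two-step example probabilities are made up here and do not come from the paper.

```python
import numpy as np


def log_importance_ratio(logp_e, logp_b):
    """Sum over steps of log pi_e(a_t | h_t) - log pi_b(a_t | h_t)."""
    logp_e = np.asarray(logp_e, dtype=float)
    logp_b = np.asarray(logp_b, dtype=float)
    return float(np.sum(logp_e - logp_b))


# Hypothetical two-step trajectory: pi_e assigns twice the probability pi_b does at each step.
logp_e = np.log([0.5, 0.8])
logp_b = np.log([0.25, 0.4])
ratio = np.exp(log_importance_ratio(logp_e, logp_b))  # 2 * 2 = 4
```

Working in log space and exponentiating only at the end avoids underflow when trajectories are long, because the per-step probabilities of text actions can be very small.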
