Feedback World Model Enables Precise Guidance of Diffusion Policy

Bofan Lyu; Chuhao Zhou; Geng Li; Gen Li; Jianfei Yang; Jiaqi Bai; Jindou Jia; Jingliang Li; Pengfei Liu; Tuo An

arxiv: 2605.15705 · v1 · pith:EXIFTAO5new · submitted 2026-05-15 · 💻 cs.RO · cs.AI

Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, Jianfei Yang

Pith reviewed 2026-05-20 19:04 UTC · model grok-4.3


keywords: feedback world model, diffusion policy, robotic manipulation, distribution shift, online feedback, prediction correction, world models, action guidance


The pith

A feedback state updated with real observations lets world models correct their predictions at runtime for better robotic control outside training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

World models for robots often produce unreliable predictions when encountering new situations not seen in training. The authors propose updating a lightweight feedback state online using the actual next state observed after each action to iteratively fix future forecasts. This closes the prediction-observation loop at inference time without requiring additional training data or model updates. The method also uses action-aware guidance to focus on controllable elements in the predictions. If effective, it enables diffusion policies to achieve higher success rates in out-of-distribution scenarios by making world models more robust through simple runtime corrections.

Core claim

The paper establishes that a feedback world model maintains a lightweight state updated from real observations to correct prediction errors in latent space, functioning as an observer with convergence guarantees under mild conditions. This online correction compensates for model inaccuracies, and when paired with action-aware guidance that emphasizes controllable components, it leads to improved translation of predictions into control actions. Experiments on LIBERO-Plus, Robomimic, and real-world tasks show this reduces prediction error by up to 76.4% and boosts out-of-distribution success rates by 30%.

What carries the argument

the lightweight feedback state that is updated online from real observations to iteratively correct future predictions, interpreted as a latent-space observer
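The correction loop described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' update rule, which this page does not give. The functions `world_model` and `true_dynamics`, the constant `shift`, the additive feedback state `c`, and the scalar `gain` are all hypothetical. The sketch only shows the general observer idea: the frozen model's prediction is offset by a small state that is nudged toward the observed residual after every step.

```python
def world_model(z, a):
    """Hypothetical frozen latent predictor (its parameters never change)."""
    return [0.9 * zi + 0.1 * a for zi in z]


def true_dynamics(z, a, shift):
    """Hypothetical real dynamics: the model plus an unmodeled constant offset."""
    return [0.9 * zi + 0.1 * a + s for zi, s in zip(z, shift)]


def run(steps=50, gain=0.5, use_feedback=True):
    """Roll out the toy system and return the per-step prediction error norm."""
    shift = [0.3, -0.2]   # assumed distribution shift, unknown to the model
    z = [0.0, 0.0]        # latent state observed after each action
    c = [0.0, 0.0]        # lightweight feedback state (assumed additive)
    errors = []
    for t in range(steps):
        a = 1.0 if t % 2 == 0 else -1.0
        # Corrected prediction: frozen model output plus the feedback state.
        pred = [p + ci for p, ci in zip(world_model(z, a), c)]
        # Executing the action reveals the true next latent state.
        z_next = true_dynamics(z, a, shift)
        residual = [zn - p for zn, p in zip(z_next, pred)]
        errors.append(sum(r * r for r in residual) ** 0.5)
        if use_feedback:
            # Observer-style update: move the feedback state toward the residual.
            c = [ci + gain * r for ci, r in zip(c, residual)]
        z = z_next
    return errors


closed_loop = run(use_feedback=True)
open_loop = run(use_feedback=False)
print(f"open-loop final error:   {open_loop[-1]:.4f}")
print(f"closed-loop final error: {closed_loop[-1]:.2e}")
```

In this toy setting the open-loop error stays at the norm of the offset, while the closed-loop error shrinks by a factor of `1 - gain` per step. Real manipulation dynamics are nonlinear and stochastic, so this behaviour should not be read as evidence for the paper's reported numbers.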

If this is right

World model prediction error decreases substantially under distribution shift.
Out-of-distribution task success rates for diffusion policies increase by around 30%.
The approach applies across simulated benchmarks like LIBERO-Plus and Robomimic as well as physical robot setups.
Convergence of the feedback correction holds under mild conditions without parameter updates.
Action-aware guidance better maps corrected predictions to effective controls.
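One way to picture "emphasizing controllable components" is to weight each latent dimension by how strongly it responds to the action. The sketch below is purely illustrative: the page does not describe how the paper's guidance is computed, so the predictor `predict`, the finite-difference sensitivity estimate, and the weighted cost in `guidance_cost` are all assumptions made for this example.

```python
def predict(z, a):
    """Hypothetical latent predictor: dims 0 and 1 respond to the action, dim 2 does not."""
    return [z[0] + 0.5 * a[0], z[1] + 0.2 * a[1], 0.99 * z[2]]


def action_sensitivity(z, a, eps=1e-3):
    """Finite-difference estimate of how far each latent dim moves per unit of action."""
    base = predict(z, a)
    sens = [0.0] * len(base)
    for j in range(len(a)):
        bumped = list(a)
        bumped[j] += eps
        moved = predict(z, bumped)
        for i in range(len(base)):
            sens[i] += abs(moved[i] - base[i]) / eps
    return sens


def guidance_cost(z, a, goal, floor=1e-8):
    """Squared distance to the goal latent, weighted by action sensitivity."""
    sens = action_sensitivity(z, a)
    total = sum(sens) + floor  # floor avoids division by zero
    weights = [s / total for s in sens]
    pred = predict(z, a)
    cost = sum(w * (p - g) ** 2 for w, p, g in zip(weights, pred, goal))
    return cost, weights


cost, weights = guidance_cost([0.0, 0.0, 1.0], [1.0, 1.0], [1.0, 1.0, 0.0])
print("weights:", [round(w, 3) for w in weights])
print("cost:", round(cost, 4))
```

Because the third latent dimension does not react to the action in this toy predictor, it receives zero weight, so a mismatch there cannot pull on the guidance signal. Whether the paper uses sensitivities, learned masks, or something else is not stated on this page.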

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar online feedback could improve predictive models in other sequential decision domains facing uncertainty.
This suggests that many pre-trained world models might gain robustness by adding inference-time state updates rather than full retraining.
Testing the method on longer-horizon tasks could reveal how well the correction accumulates over extended sequences.

Load-bearing premise

The lightweight feedback state can be updated from real observations to correct model errors and converge to accurate predictions without any additional training or parameter changes.

What would settle it

If experiments on out-of-distribution tasks show that prediction errors remain unchanged or increase after multiple updates to the feedback state, the correction mechanism would be falsified.

Original abstract

World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The feedback state for online correction is a practical inference-time idea, but the convergence guarantees look shaky for contact-rich robot tasks.


The core contribution is a feedback world model that keeps a lightweight state, updated from real observations after each action, to correct future predictions on the fly. This closes the loop at inference time without retraining or extra data, and the authors pair it with action-aware guidance that focuses on the controllable parts of the prediction. Experiments report up to 76.4% lower prediction error and a 30% improvement in out-of-distribution success rate on LIBERO-Plus, Robomimic, and real-world manipulation tasks, which is a solid empirical signal for the deployment problems diffusion policies face.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes a feedback world model for diffusion policies in robotic manipulation. It maintains a lightweight feedback state updated online from real observations after each action to iteratively correct future world model predictions and compensate for errors under distribution shift, without retraining. This update process is interpreted as a latent-space observer that admits convergence guarantees under mild conditions. The method also introduces action-aware guidance to emphasize action-controllable components in the corrected predictions. Experiments on LIBERO-Plus, Robomimic, and real-world tasks report up to 76.4% reduction in world model prediction error and 30% improvement in out-of-distribution success rates.

Significance. If the convergence guarantees hold and the performance gains are attributable to the online feedback correction rather than auxiliary components, the approach offers a practical inference-time mechanism to improve robustness of world models in robotics. This could be valuable for deployment in contact-rich and partially observable environments where static predictors degrade, providing a training-free alternative to retraining or ensemble methods.

major comments (2)

[Abstract and convergence analysis section] The claim that the feedback state update can be interpreted as a latent-space observer admitting convergence guarantees under mild conditions (e.g., Lipschitz continuity or observability) is load-bearing for attributing the 76.4% prediction error reduction and 30% OOD success gains to this mechanism. The manuscript does not verify whether these conditions hold for the nonlinear, stochastic, high-dimensional, and contact-rich dynamics in LIBERO-Plus, Robomimic, and real-world tasks; if violated, the iterative updates could diverge or oscillate, making gains potentially attributable to the separate action-aware guidance instead.
[Experimental results section] The reported improvements should be supported by ablations that isolate the contribution of the online feedback state updates from the action-aware guidance. Without such controls, it remains unclear whether the iterative correction is the primary driver of the observed error reduction and success rate gains on the evaluated tasks.

minor comments (3)

The exact mathematical form of the feedback state update rule and its integration with the diffusion policy should be provided with explicit equations to enable reproduction and verification of the claimed convergence properties.
Clarify the definition of 'mild conditions' for convergence with a precise statement of assumptions (e.g., on the latent dynamics) rather than leaving them implicit.
Include details on how the lightweight feedback state is initialized and maintained across timesteps in the real-world experiments to address potential sensitivity to observation noise.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major concerns point by point below and have updated the manuscript accordingly to improve clarity and rigor.


Referee: [Abstract and convergence analysis section] The claim that the feedback state update can be interpreted as a latent-space observer admitting convergence guarantees under mild conditions (e.g., Lipschitz continuity or observability) is load-bearing for attributing the 76.4% prediction error reduction and 30% OOD success gains to this mechanism. The manuscript does not verify whether these conditions hold for the nonlinear, stochastic, high-dimensional, and contact-rich dynamics in LIBERO-Plus, Robomimic, and real-world tasks; if violated, the iterative updates could diverge or oscillate, making gains potentially attributable to the separate action-aware guidance instead.

Authors: We appreciate the referee highlighting the importance of the assumptions in the convergence analysis. The guarantees are derived under mild conditions such as Lipschitz continuity of the dynamics and latent observability, as stated in the relevant section. Explicit verification of these properties is challenging in high-dimensional stochastic robotic settings and was not included in the original manuscript. However, the observed stability and error reductions across benchmarks indicate that divergence or oscillation did not occur in practice. We have added a discussion of assumption applicability and potential limitations to the revised convergence analysis section, while clarifying that prediction error reductions stem primarily from the feedback mechanism rather than guidance alone.
revision: partial
Referee: The reported improvements should be supported by ablations that isolate the contribution of the online feedback state updates from the action-aware guidance. Without such controls, it remains unclear whether the iterative correction is the primary driver of the observed error reduction and success rate gains on the evaluated tasks.

Authors: We agree that isolating the contributions of the feedback updates versus action-aware guidance is essential. In the revised manuscript we include new ablation experiments that disable the online feedback state updates while retaining action-aware guidance, and compare against the full method. These results show that the feedback mechanism accounts for the majority of the world-model error reduction, with guidance providing additional policy-level benefits. The ablations will be presented in the updated experimental results section.
revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation uses external real observations as independent correction signal


The paper maintains a lightweight feedback state updated online from direct robot observations of true next states after each action. This external signal is used to iteratively correct predictions at inference time without additional training or parameter updates. The claimed error reductions and OOD gains are measured against these real observations rather than being forced by any fitted internal quantity or self-referential definition. The latent-space observer interpretation and convergence guarantees under mild conditions are presented as an analysis of the proposed update rule, not as a load-bearing premise that reduces to prior self-citation or ansatz. No step in the provided abstract or described method renames a known result or imports uniqueness via overlapping-author citation in a way that collapses the central claim to its inputs. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; limited visibility into internal assumptions or parameters. The paper invokes convergence under mild conditions for the latent-space observer interpretation.

axioms (1)

domain assumption: The feedback update process admits convergence guarantees under mild conditions when interpreted as a latent-space observer. (Stated in the abstract as supporting the online correction mechanism.)

pith-pipeline@v0.9.0 · 5812 in / 1148 out tokens · 52022 ms · 2026-05-20T19:04:40.792863+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear


the auxiliary feedback state satisfies $\lim_{t\to\infty} \|z_t - \bar{z}_t\| \le \gamma / \lambda_{\min}(L)$
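A bound of this shape is typical of linear observer arguments. The derivation below is a worked illustration under assumptions that are not taken from the paper: the error $e_t = z_t - \bar{z}_t$ is assumed to follow a linear recursion driven by a bounded disturbance $d_t$, and the gain $L$ is assumed symmetric with $0 \prec L \preceq I$. The symbols $e_t$, $d_t$, and $\rho$ are introduced here for illustration only.

```latex
% Assumed error dynamics (hypothetical, for illustration):
e_{t+1} = (I - L)\, e_t + d_t, \qquad \|d_t\| \le \gamma .

% For symmetric L with 0 \prec L \preceq I, the eigenvalues of I - L
% lie in [0, 1 - \lambda_{\min}(L)], so
\rho := \|I - L\|_2 = 1 - \lambda_{\min}(L) < 1 .

% Unrolling the recursion and summing the geometric series:
\|e_t\| \le \rho^{t}\,\|e_0\| + \gamma \sum_{k=0}^{t-1} \rho^{k}
        \le \rho^{t}\,\|e_0\| + \frac{\gamma}{1 - \rho} .

% Letting t \to \infty removes the transient term:
\limsup_{t\to\infty} \|e_t\| \le \frac{\gamma}{1 - \rho}
                              = \frac{\gamma}{\lambda_{\min}(L)} .
```

Under these assumptions a larger smallest eigenvalue of the gain gives a tighter asymptotic error ball, while $\gamma$ captures whatever part of the model mismatch the feedback cannot remove. Whether the paper's proof follows this route, and which of its "mild conditions" correspond to the assumptions here, cannot be determined from this page.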

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 6 internal anchors

[1] Maximilian Du and Shuran Song. Dynaguide: Steering diffusion polices with active dynamic guidance. arXiv preprint arXiv:2506.13922.

[2] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626.

[3] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a. Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In Proceedings of International Co...

[4] Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, and Jianfei Yang. World model for robot learning: A comprehensive survey. arXiv preprint arXiv:2605.00080.

[5] Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, and Jianfei Yang. Action-to-action flow matching. arXiv preprint arXiv:2602.07322.

[6] Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-vla-loop: Closed-loop learning of video world model and vla policy. arXiv preprint arXiv:2602.06508.

[7] Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, and Xihui Liu. World guidance: World modeling in condition space for action generation. arXiv preprint arXiv:2602.22010.

[8] Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026a. Xiaoquan Sun, Zetian Xu, Chen Cao, Zonghe Liu, Yihan Sun, Jingrui Pang, Ruijian Zhang, Zhen Yang, Kang Pang, Dingxin He, et al....

[9] Dexin Wang, Chunsheng Liu, Faliang Chang, and Yichen Xu. Hierarchical diffusion policy: Manipulation trajectory generation via contact guidance. IEEE Transactions on Robotics, 41:2086–2104.

[10] Hongyu Yan, Qiwei Li, Jiaolong Yang, and Yadong Mu. Progressvla: Progress-guided diffusion policy for vision-language robotic manipulation. arXiv preprint arXiv:2603.27670.

[11] Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922.

[12] Shicheng Yin, Kaixuan Yin, Weixing Chen, Yang Liu, Guanbin Li, and Liang Lin. Ddp-wm: Disentangled dynamics prediction for efficient world models. arXiv preprint arXiv:2602.01780.

[13] Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666.
