DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation
Pith reviewed 2026-07-03 21:52 UTC · model grok-4.3
The pith
Disentangling dynamics learning from visual synthesis produces faster and higher-quality video world models for robotic manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DVG-WM decomposes world modeling into two stages. Dynamics learning generates a sequence of intermediate visual states to preview physical interactions. Visual synthesis then refines those states into high-fidelity videos, using flow matching to map the dynamics directly to video latents and a latent degradation mechanism to regenerate contact-rich details. The reported experiments show improved video quality and up to 3.97 times faster inference on LIBERO and real-world platforms.
What carries the argument
The cascading mechanism that uses flow matching and latent degradation to map dynamics-generated intermediate states to refined video latents while preserving contact details.
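The material reviewed here gives no equations for this mechanism, so the following is only a minimal sketch of what a flow-matching map from a degraded coarse latent to a refined latent could look like, using a straight-line (rectified-flow style) path. The function names `degrade`, `interpolate`, and `euler_refine`, the Gaussian-noise form of the degradation, the `sigma` value, and the toy tensor shapes are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def degrade(z_coarse, sigma, rng):
    """Hypothetical latent degradation: blend the coarse latent with Gaussian noise."""
    noise = rng.standard_normal(z_coarse.shape)
    return (1.0 - sigma) * z_coarse + sigma * noise

def interpolate(z_src, z_tgt, t):
    """Point on the straight path from source to target, plus the target velocity."""
    z_t = (1.0 - t) * z_src + t * z_tgt
    velocity = z_tgt - z_src  # constant along a straight path
    return z_t, velocity

def euler_refine(z_src, velocity_fn, num_steps):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = z_src.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_fn(z, k * dt)
    return z

rng = np.random.default_rng(0)
z_coarse = rng.standard_normal((4, 8, 16, 16))  # toy (c, t, h, w) dynamics latent
z_target = rng.standard_normal((4, 8, 16, 16))  # stand-in for a high-fidelity latent
z_src = degrade(z_coarse, sigma=0.3, rng=rng)

# Training pair at a sampled time: a network would regress `v_target` from `z_t`.
z_t, v_target = interpolate(z_src, z_target, t=0.5)

# Oracle velocity field standing in for a trained network (assumption).
oracle = lambda z, t: z_target - z_src
z_refined = euler_refine(z_src, oracle, num_steps=4)
```

In this toy setting the oracle velocity is constant, so a handful of Euler steps lands on the target latent. A learned velocity field would only approximate this, which is where the question of restored versus hallucinated contact detail arises.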
If this is right
- Faster inference supports more iterations of model-based planning in robotic manipulation tasks.
- Language-conditioned intermediate states enable task-specific previewing of physical outcomes before full synthesis.
- The refinement step retains contact-rich details that would otherwise be lost in coarse predictions.
- The same decomposition applies across both simulated benchmarks like LIBERO and physical robot platforms.
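To make the first point concrete, a speedup converts into proportionally more rollouts under a fixed wall-clock planning budget. In the sketch below, only the 3.97 factor comes from the abstract, where it is stated as an upper bound ("up to"). The baseline latency and the budget are hypothetical numbers chosen for the arithmetic.

```python
def rollouts_within_budget(budget_s, latency_s):
    """Number of complete world-model rollouts that fit in a time budget."""
    return int(budget_s // latency_s)

baseline_latency_s = 2.0   # hypothetical latency of one baseline rollout
speedup = 3.97             # best-case factor reported in the abstract
fast_latency_s = baseline_latency_s / speedup

budget_s = 10.0            # hypothetical planning budget per decision
baseline_rollouts = rollouts_within_budget(budget_s, baseline_latency_s)
fast_rollouts = rollouts_within_budget(budget_s, fast_latency_s)
print(baseline_rollouts, fast_rollouts)  # 5 19
```

Under these assumed numbers the planner evaluates 19 candidate rollouts instead of 5. The realized gain depends on the reference method and inference configuration, which the abstract does not specify.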
Where Pith is reading between the lines
- Dynamics and synthesis modules could be updated independently, allowing targeted improvements in one without retraining the other.
- The intermediate-state preview might serve as a lightweight world model for quick feasibility checks before committing to full video generation.
- Speed gains could compound in closed-loop control by enabling higher-frequency replanning without sacrificing prediction detail.
Load-bearing premise
The intermediate visual states produced by the dynamics component can be refined without introducing artifacts or losing physically accurate temporal consistency in contact-rich scenes.
What would settle it
A test where the refined high-fidelity videos show measurable loss of contact accuracy or temporal consistency compared with entangled baselines, or where the reported inference speedup vanishes when detail levels are held constant.
Original abstract
Video-based embodied world models provide an appealing substrate for robotic manipulation by predicting future states, yet current approaches remain limited by a fundamental entanglement: accurately modeling dynamics typically requires low-level temporal reasoning, while producing high-resolution frames demands expansive visual synthesis according to high-level semantics. This entanglement results in slow inference speed for iterative planning or too coarse predictions to retain contact-rich details. To solve this dilemma, we present Disentangled Video Generation World Model (DVG-WM), an efficient framework that explicitly decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, our model first generates a plausible sequence of intermediate visual states to preview the physical interaction and refines them to obtain high-fidelity videos. Furthermore, an efficient cascading mechanism is proposed, where DVG-WM uses flow matching to directly map the dynamics to video latents, and introduces a latent degradation mechanism to regenerate contact-rich details. Experiments on LIBERO and real-world platforms demonstrate improved video quality with up to 3.97 times acceleration, validating that disentangled video generation can be an efficient embodied world model for robotic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DVG-WM, a framework that explicitly decomposes embodied world modeling into dynamics learning (generating a sequence of intermediate visual states from an initial observation and language instruction to preview physical interactions) and visual synthesis (refining those states into high-fidelity videos). The refinement uses flow matching to map dynamics outputs directly to video latents followed by a latent degradation mechanism to regenerate contact-rich details. Experiments on the LIBERO benchmark and real-world robotic platforms are claimed to demonstrate improved video quality alongside up to 3.97 times acceleration relative to prior approaches.
Significance. If the disentanglement and refinement steps can be shown to preserve physically accurate contact dynamics without introducing artifacts or breaking temporal consistency, the approach would provide a concrete route to faster iterative planning in robotic manipulation by separating low-level trajectory preview from high-resolution visual synthesis.
major comments (2)
- [Abstract] Abstract: the headline claim that latent degradation 'regenerates contact-rich details' while preserving physical accuracy rests on an unelaborated mechanism; no loss terms, constraints, or derivation are supplied to show that the step restores rather than hallucinates details, directly undermining the physical-fidelity guarantee required for the central disentanglement thesis.
- [Experiments] Experiments section (results on LIBERO and real-world): the reported 3.97 times acceleration and video-quality gains are presented without baselines, exact metrics (e.g., FVD, PSNR, or contact-specific measures), statistical significance, or ablations isolating the dynamics-to-latent mapping and degradation components; these omissions make it impossible to verify that the speed/quality improvements are attributable to the proposed decomposition rather than implementation details.
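Of the metrics the referee names, PSNR has a simple standard definition, 10 · log10(MAX² / MSE), and is sketched below for videos stored as float arrays in [0, 1]. The array shapes, the `max_value` default, and the toy inputs are assumptions. Nothing here reflects the paper's actual evaluation protocol, and FVD or contact-specific measures would need separate tooling.

```python
import numpy as np

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy videos shaped (frames, height, width, channels).
target = np.zeros((8, 64, 64, 3))
pred = np.full((8, 64, 64, 3), 0.1)   # uniform error of 0.1 per pixel
value = psnr(pred, target)            # MSE = 0.01, so roughly 20 dB
```

Because PSNR averages over all pixels, a small contact region with large error can barely move the score, which is one reason the referee also asks for contact-specific measures.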
minor comments (2)
- The cascading mechanism is described at a high level; adding explicit equations for the flow-matching map and the degradation operator would clarify how the intermediate states are transformed.
- The abstract states 'up to 3.97 times acceleration' without naming the reference method or the precise inference configuration (e.g., number of planning steps, hardware).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to improve elaboration and experimental reporting.
Point-by-point responses
- Referee: [Abstract] Abstract: the headline claim that latent degradation 'regenerates contact-rich details' while preserving physical accuracy rests on an unelaborated mechanism; no loss terms, constraints, or derivation are supplied to show that the step restores rather than hallucinates details, directly undermining the physical-fidelity guarantee required for the central disentanglement thesis.
  Authors: We agree that the abstract and current manuscript do not supply explicit loss terms, constraints, or a derivation for the latent degradation step to demonstrate restoration of details rather than hallucination. The paper introduces the mechanism via flow matching but lacks this supporting analysis. We will revise by adding a detailed derivation, associated loss functions, and constraints in the methods section to strengthen the physical-fidelity claim.
  Revision: yes
- Referee: [Experiments] Experiments section (results on LIBERO and real-world): the reported 3.97 times acceleration and video-quality gains are presented without baselines, exact metrics (e.g., FVD, PSNR, or contact-specific measures), statistical significance, or ablations isolating the dynamics-to-latent mapping and degradation components; these omissions make it impossible to verify that the speed/quality improvements are attributable to the proposed decomposition rather than implementation details.
  Authors: The experiments report results on LIBERO and real-world platforms with the claimed acceleration and quality gains, but we acknowledge the presentation omits explicit baselines, specific metrics such as FVD and PSNR, statistical significance, contact-specific measures, and component ablations. We will revise the experiments section to include these elements, ensuring the improvements can be attributed to the disentanglement.
  Revision: yes
Circularity Check
No circularity: architectural decomposition with no self-referential reductions
Full rationale
The paper describes a proposed framework that decomposes world modeling into separate dynamics learning and visual synthesis stages, using flow matching to map dynamics outputs to video latents followed by a latent degradation step. No equations, fitted parameters, or derivations appear in the provided text that reduce by construction to the inputs (e.g., no self-definitional ratios, no predictions that are statistically forced by prior fits, and no load-bearing self-citations or uniqueness theorems invoked). The central claims rest on empirical results from LIBERO and real-world experiments rather than any closed mathematical loop. This is a standard case of an independent architectural proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: accurate dynamics modeling and high-resolution visual synthesis are fundamentally entangled in current video world models.
Appendix A: Details of Latent Degradation
The refinement stage aims to reconstruct a high-resolution latent video $z_{hr} \in \mathbb{R}^{c \times t \times h_{hr} \times w_{hr}}$ from the low-resolution dynamics produced by the pre...