pith. sign in

arxiv: 2606.12072 · v1 · pith:ALOLLCXUnew · submitted 2026-06-10 · 💻 cs.CV

World Model Self-Distillation: Training World Models to Solve General Tasks

Pith reviewed 2026-06-27 10:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-distillationworld modelsvideo diffusionreinforcement learningvision-language modelstask solvingrobotics transfergenerative models
0
0 comments X

The pith

Self-distillation from a VLM lets a video world model solve tasks from an image and short prompt alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to unlock task-solving in pretrained video generators without paired execution videos or outsourcing all reasoning to language models. A vision-language model first creates a task and detailed solution from an unlabeled image; this conditions a Demonstrator video diffusion model. The Demonstrator's outputs are then distilled into an Executor that receives only the image and a brief task description. Reinforcement learning using the VLM to judge success further improves the Executor. A sympathetic reader would care because the method scales task learning from unlabeled scenes and makes world models directly usable for planning.

Core claim

The central claim is that combining self-distillation with reinforcement learning elicits task-solving ability in pretrained video diffusion models: the Demonstrator generates videos from VLM-provided detailed solutions, its behavior is transferred to an Executor conditioned only on the scene image and short task prompt, and RL from VLM feedback on whether sampled videos satisfy the task produces an Executor that surpasses the Demonstrator on the WorldTasks-Benchmark while transferring competitively to the DreamGen robotics benchmark.

What carries the argument

The self-distillation pipeline that transfers execution knowledge from the caption-conditioned Demonstrator to the image-and-short-prompt Executor, combined with reinforcement learning that exploits the VLM's greater reliability at judging success than at generating solutions.

If this is right

  • Task-solving training becomes possible from unlabeled scene images without collecting paired task-execution videos.
  • The Executor can perform planning and decision-making directly from visual input and a short prompt without detailed textual descriptions.
  • Reinforcement learning improves performance by leveraging the asymmetry between the VLM's judging and generating abilities.
  • The resulting model transfers to robotic control tasks without additional robotics-specific supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce dependence on separate language models for high-level reasoning in embodied agents.
  • Similar distillation might unlock decision-making capabilities in other classes of generative models.
  • Iterative self-improvement loops could be run online in deployed systems using only scene images as input.
  • The method points toward world models that learn general skills from passive visual data at scale.

Load-bearing premise

The vision-language model can generate accurate step-by-step solutions from images and give reliable feedback for reinforcement learning that improves the Executor without introducing systematic errors or bias in task judgment.

What would settle it

An experiment in which videos rated successful by the VLM actually fail to complete the stated task when measured by human raters or physical robot execution, or where RL training guided by VLM feedback lowers performance on a held-out set of tasks.

Figures

Figures reproduced from arXiv: 2606.12072 by Aram Davtyan, Pablo Acuaviva Huertos, Paolo Favaro, Sebastian Stapf.

Figure 1
Figure 1. Figure 1: Overview of WMSD. The method addresses general tasks via a two-stage pipeline. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: WorldTasks examples. Each panel shows an initial frame together with the addressed-agent task prompt and the original generated solution description. Examples cover human, first-person, and robot agents across interaction, manipulation, and navigation tasks. feedback not as a standalone ground-truth reward, but as a weak verification signal to be combined with distributional regularization from the teacher… view at source ↗
Figure 3
Figure 3. Figure 3: WorldTasks prompt composition for the training split and WorldTasks-Bench. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Two ablations on WorldTasks-Bench. Left: Ablation on self-distillation methods, showing average WorldTasks score and PickScore. Right: Ablation of average WorldTasks score vs. βd. 0 20 40 60 80 100 Training Step 0.4 0.5 0.6 Task Score 0 20 40 60 80 100 Training Step 0.4 0.5 0.6 0.7 Agent Score 0 20 40 60 80 100 Training Step 0.65 0.70 0.75 0.80 0.85 Realism Score Alternating + RL Off-Policy + Dem RL On-Pol… view at source ↗
Figure 5
Figure 5. Figure 5: Ablation across training settings on WorldTasks-Bench. We report the three evaluation dimensions. 4.2 On-policy vs. Off-policy Self-Distillation We begin by comparing the three self-distillation variants introduced in Sec. 3: off-policy self￾distillation, on-policy self-distillation using only the anchor loss between student and teacher, and the full on-policy self-distillation objective in Eq. (12). In [… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparisons between WMSD and the base model across LTX-2 and HunyuanVideo-1.5. Each subfigure shows six uniformly sampled frames from the generated videos. character prompts (36.0% to 76.0%). Vehicle prompts remain more challenging, reaching 50.0% Agent Score, but this slice contains only 12 examples and therefore should be interpreted as a small-support diagnostic rather than a primary trend. … view at source ↗
Figure 7
Figure 7. Figure 7: Performance breakdown on WorldTasks-Bench. Left: Task Score by task type. Right: Agent Score by addressed-agent type. We show all categories with more than 5% benchmark support. Values are VLM-judged success rates in percent, with subgroup sizes shown in parentheses [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Two examples: the first row uses the consistency reward, while the second row does not. [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example video generated with WMSD and LTX-2 on the DreamGen benchmark. Task: Use the right hand to pick up the pink bottle and pour water on the flower. 4.8 Discussion & Limitations Generalizability Training with WMSD leads to substantial improvements on WorldTasks-Bench as well as on robotics-related tasks (Sec. 4.7), achieving performance competitive with supervised fine-tuning. Furthermore, recent advan… view at source ↗
Figure 10
Figure 10. Figure 10: Prompt used for task reward during training. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt used for the consistency reward during training. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prompt used for VLM-based semantic filtering of dataset images. [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Two representative samples from WorldTasks. Each sample includes an initial frame, task prompts, and corresponding descriptive solutions. This formulation reinforces trajectories that outperform their peers on the same task while suppressing weaker ones. Unlike standard distillation, it enables improvements beyond the teacher whenever the reward function favors better solutions. FlowGRPO. Flow matching mo… view at source ↗
Figure 14
Figure 14. Figure 14: Two representative samples from WorldTasks. Each sample includes an initial frame, task prompts, and corresponding descriptive solutions. Flow-GRPO [40] extends this framework by casting denoising as a multi-step MDP. Here the subscript t − 1 denotes the next state in the discrete reverse sampler, not the continuous flow-time convention above. The state, action, and policy are defined as st = (c, t, xt), … view at source ↗
Figure 15
Figure 15. Figure 15: Prompt used to evaluate whether a generated video successfully completes the instructed [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Prompt used to verify that the correct agent performs the instructed action. [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Prompt used to evaluate physical realism and temporal consistency of generated videos. [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗
read the original abstract

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes World Model Self-Distillation: a VLM generates candidate tasks and step-by-step solutions from unlabeled scene images; these condition a pretrained video diffusion model (Demonstrator). Behavior is distilled into an Executor conditioned only on the image plus a short task prompt. The Executor is then refined via reinforcement learning that uses VLM feedback on whether sampled videos satisfy the task. Experiments on the introduced WorldTasks-Benchmark and the DreamGen robotics benchmark report that the Executor surpasses the Demonstrator under the VLM evaluation protocol and transfers competitively to robotic tasks.

Significance. If the reported gains can be shown to reflect genuine task execution rather than VLM preference alignment, the framework would supply a scalable route to task-solving capabilities in video world models without requiring curated task-video pairs, with direct relevance to generalist planning and robotics applications.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Evaluation): The central claim that the Executor surpasses the Demonstrator rests exclusively on a VLM-based scoring protocol in which the same VLM supplies task generation, solution generation, RL reward, and final success judgment. No independent human-annotated validation set, physical-robot ground-truth metric, or cross-model evaluator is reported. This directly affects the abstract claim and the DreamGen transfer results; without such controls it is impossible to rule out that improvements arise from stylistic or caption-matching artifacts rather than execution capability.
  2. [§3.3] §3.3 (RL from VLM Feedback): The asserted asymmetry between VLM generation and judgment is not quantified. No ablation isolates whether the VLM reward improves performance beyond the distillation baseline, nor is any analysis provided of reward noise, bias, or failure modes on the WorldTasks-Benchmark tasks.
  3. [Table 2 and §5.1] Table 2 and §5.1: Reported success rates for Executor vs. Demonstrator lack per-run standard deviations, confidence intervals, or statistical significance tests. The absence of these statistics makes it impossible to assess whether the claimed superiority is reliable or could be explained by variance.
minor comments (2)
  1. [Figure 1] Figure 1 caption and pipeline diagram: the distinction between the Demonstrator (caption-conditioned) and Executor (image+short-prompt) conditioning is visually clear but the exact conditioning tokens passed at inference time are not labeled.
  2. [Related Work] Related Work section: discussion of prior VLM-as-judge and self-distillation literature is present but omits several recent works on bias and calibration of VLM reward models.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, agreeing where revisions are needed to strengthen statistical reporting and ablations while defending the design choices around the VLM evaluation protocol. We outline specific changes for the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] The central claim that the Executor surpasses the Demonstrator rests exclusively on a VLM-based scoring protocol in which the same VLM supplies task generation, solution generation, RL reward, and final success judgment. No independent human-annotated validation set, physical-robot ground-truth metric, or cross-model evaluator is reported.

    Authors: We acknowledge the evaluation relies on a single consistent VLM protocol, which the abstract already qualifies as 'under our VLM-based evaluation protocol.' This enables scalable assessment without curated labels. We agree this cannot fully rule out VLM-specific artifacts. In revision we will expand §4 to discuss this limitation explicitly, add caveats on potential stylistic biases, and clarify DreamGen results use the same protocol but demonstrate competitive transfer. We do not have a human-annotated set available. revision: partial

  2. Referee: [§3.3] The asserted asymmetry between VLM generation and judgment is not quantified. No ablation isolates whether the VLM reward improves performance beyond the distillation baseline, nor is any analysis provided of reward noise, bias, or failure modes on the WorldTasks-Benchmark tasks.

    Authors: The asymmetry (judging success vs. generating detailed solutions) is foundational, but we agree it requires quantification. We will add an ablation in the revision comparing distillation-only vs. full RL Executor performance, plus analysis of reward consistency (e.g., inter-run agreement and example failure modes) on WorldTasks-Benchmark. These will be inserted in §3.3 and experiments. revision: yes

  3. Referee: [Table 2 and §5.1] Reported success rates for Executor vs. Demonstrator lack per-run standard deviations, confidence intervals, or statistical significance tests.

    Authors: We agree that variance measures are essential. The original runs support recomputation; we will add per-run standard deviations, 95% confidence intervals, and significance tests (e.g., paired t-tests) to Table 2 and §5.1 in the revision. revision: yes

standing simulated objections not resolved
  • Independent human-annotated validation set or cross-model evaluator, as none was collected and creating one would require substantial new resources beyond the current scope.

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external VLM asymmetry and benchmarks

full rationale

The paper's method generates tasks/solutions via VLM, conditions Demonstrator, distills to Executor, applies RL from VLM feedback, and evaluates via VLM protocol while claiming asymmetry between judgment and generation. No derivation step, equation, or prediction reduces by construction to its own inputs. No self-citation is load-bearing for uniqueness or ansatz. Results are presented as empirical comparisons on the authors' WorldTasks-Benchmark and external DreamGen benchmark. This is self-contained against external benchmarks with no exhibited reduction of the form Eq. X = Eq. Y or fitted parameter renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5745 in / 1179 out tokens · 36430 ms · 2026-06-27T10:14:09.556545+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 41 canonical work pages · 29 internal anchors

  1. [1]

    From generation to generalization: Emergent few-shot learning in video diffusion models, 2025

    Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, and Paolo Favaro. From generation to generalization: Emergent few-shot learning in video diffusion models, 2025. URLhttps://arxiv.org/abs/2506.07280

  2. [2]

    Rethinking visual intelligence: Insights from video pretraining, 2025

    Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, and Paolo Favaro. Rethinking visual intelligence: Insights from video pretraining, 2025. URLhttps://arxiv.org/abs/2510.24448

  3. [3]

    On-policy distillation of language models: Learning from self-generated mistakes

    R Agarwal, N Vieillard, Y Zhou, and P Stanczyk. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representa- tions, 2024

  4. [4]

    Concrete Problems in AI Safety

    D Amodei, C Olah, J Steinhardt, and P Christiano. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016. URLhttps://arxiv.org/abs/1606.06565

  5. [5]

    Videophy: Evaluating physical commonsense for video generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=9D2QvO1uWj

  6. [6]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J Bjorck, F Castañeda, N Cherniadev, X Da, and R Ding. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. URL https: //arxiv.org/abs/2503.14734

  7. [7]

    Training diffusion models with reinforce- ment learning

    K Black, M Janner, Y Du, I Kostrikov, and S Levine. Training diffusion models with reinforce- ment learning. InInternational Conference on Learning Representations, 2024

  8. [8]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    A Blattmann, T Dockhorn, S Kulal, and D Mendelevitch. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. URL https://arxiv.org/abs/2311.15127

  9. [9]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A Brohan, N Brown, J Carbajal, Y Chebotar, and J Dabis. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. URL https://arxiv. org/abs/2212.06817. 13

  10. [10]

    Genie: Generative interactive environments

    J Bruce, MD Dennis, A Edwards, J Parker-Holder, and Y Shi. Genie: Generative interactive environments. InInternational Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=bJbSbJskOS

  11. [11]

    Diffusion policy: Visuomotor policy learning via action diffusion, 2023

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2023

  12. [12]

    Deep reinforcement learning from human preferences

    PF Christiano, J Leike, T Brown, M Martic, and S Legg. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, 2017

  13. [13]

    Video language planning

    Y Du, S Yang, P Florence, F Xia, A Wahid, and P Sermanet. Video language planning. In International Conference on Learning Representations, 2024

  14. [14]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schu- urmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InNeurIPS, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/ 1d5b9233ad716a43be5c0d3023cb82d0-Abstract-Conference.html

  15. [15]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

    Y Fan, O Watkins, Y Du, H Liu, M Ryu, and C Boutilier. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. InAdvances in Neural Information Processing Systems, 2023

  16. [16]

    Born again neural networks

    T Furlanello, Z Lipton, M Tschannen, and L Itti. Born again neural networks. InProceedings of Machine Learning Research, 2018

  17. [17]

    Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark.ArXiv, abs/2510.26802, 2025

    Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark.ArXiv, abs/2510.26802, 2025

  18. [18]

    World Models

    D Ha and J Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. URL https://arxiv.org/abs/1803.10122

  19. [19]

    Ltx-2: Efficient joint audio-visual foundation model, 2026

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, V...

  20. [20]

    Dream to Control: Learning Behaviors by Latent Imagination

    D Hafner, T Lillicrap, J Ba, and M Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. URL https://arxiv.org/abs/1912. 01603

  21. [21]

    Learning latent dynamics for planning from pixels

    D Hafner, T Lillicrap, I Fischer, R Villegas, D Ha, and H Lee. Learning latent dynamics for planning from pixels. InProceedings of Machine Learning Research, 2019

  22. [22]

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M. B. Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, and Alexandre Alahi. Gem: A generalizable ego-vision multimodal wo...

  23. [23]

    Pre-trained video generative models as world simulators.CoRR, abs/2502.07825, February 2025

    Haoran He, Yang Zhang, Liang Lin, Zhongwen Xu, and Ling Pan. Pre-trained video generative models as world simulators.CoRR, abs/2502.07825, February 2025. URL https://doi.org/ 10.48550/arXiv.2502.07825

  24. [24]

    TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

    X He, S Fu, Y Zhao, W Li, J Yang, D Yin, and F Rao. Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025. URL https: //arxiv.org/abs/2508.04324. 14

  25. [25]

    Distilling the Knowledge in a Neural Network

    G Hinton, O Vinyals, and J Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. URLhttps://arxiv.org/abs/1503.02531

  26. [26]

    Denoising diffusion probabilistic models

    J Ho, A Jain, and P Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems, 2020

  27. [27]

    Imagen Video: High Definition Video Generation with Diffusion Models

    J Ho, W Chan, C Saharia, J Whang, R Gao, and A Gritsenko. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022. URL https: //arxiv.org/abs/2210.02303

  28. [28]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    W Hong, M Ding, W Zheng, X Liu, and J Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. URL https://arxiv.org/abs/2205.15868

  29. [29]

    Relic: Interactive video world model with long-horizon memory, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, and Hao Tan. Relic: Interactive video world model with long-horizon memory, 2025. URL https://arxiv.org/abs/2512.04040

  30. [30]

    Vbench: Comprehensive benchmark suite for video generative models

    Z Huang, Y He, J Yu, F Zhang, C Si, Y Jiang, and Y Zhang. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  31. [31]

    Reinforcement Learning via Self-Distillation

    J Hübotter, F Lübeck, L Behric, and A Baumann. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026. URLhttps://arxiv.org/abs/2601.20802

  32. [32]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...

  33. [33]

    Distribution matching distillation meets reinforcement learning, 2026

    D Jiang, D Liu, Z Wang, Q Wu, L Li, H Li, X Jin, and D Liu. Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025. URL https://arxiv. org/abs/2511.13649

  34. [34]

    Miradata: A large-scale video dataset with long durations and structured captions

    X Ju, Y Gao, Z Zhang, Z Yuan, X Wang, and A Zeng. Miradata: A large-scale video dataset with long durations and structured captions. InAdvances in Neural Information Processing Systems, 2024

  35. [35]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Y Kirstain, A Polyak, U Singer, S Matiana, and J Penna. Pick-a-pic: An open dataset of user preferences for text-to-image generation. InAdvances in Neural Information Processing Systems, 2023

  36. [36]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    W Kong, Q Tian, Z Zhang, R Min, Z Dai, J Zhou, and J Xiong. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. URL https://arxiv.org/abs/2412.03603

  37. [37]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    J Li, Y Cui, T Huang, Y Ma, C Fan, Y Cheng, and M Yang. Mixgrpo: Unlocking flow- based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025. URL https://arxiv.org/abs/2507.21802

  38. [38]

    Flow Matching for Generative Modeling

    Y Lipman, RTQ Chen, H Ben-Hamu, M Nickel, and M Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. URL https://arxiv.org/abs/2210. 02747

  39. [39]

    Flow Matching Guide and Code

    Y Lipman, M Havasi, P Holderrieth, N Shaul, and M Le. Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024. URLhttps://arxiv.org/abs/2412.06264

  40. [40]

    Flow-GRPO: Training flow matching models via online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di ZHANG, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=oCBKGw5HNf. 15

  41. [41]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    X Liu, C Gong, and Q Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. URL https://arxiv.org/abs/2209. 03003

  42. [42]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    S Luo, Y Tan, L Huang, J Li, and H Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. URL https://arxiv.org/abs/2310.04378

  43. [43]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Ji...

  44. [44]

    Open x-embodiment: Robotic learn- ing datasets and rt-x models: Open x-embodiment collaboration 0

    A O’Neill, A Rehman, A Maddukuri, and A Gupta. Open x-embodiment: Robotic learn- ing datasets and rt-x models: Open x-embodiment collaboration 0. InIEEE International Conference on Robotics and Automation, 2024

  45. [45]

    Training language models to follow instructions with human feedback

    L Ouyang, J Wu, X Jiang, D Almeida, and C Wainwright. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, 2022

  46. [46]

    Vlp: Vision language planning for autonomous driving

    C Pan, B Yaman, T Nesti, A Mallik, and AG Allievi. Vlp: Vision language planning for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  47. [47]

    arXiv preprint arXiv:2310.12921 (2023)

    Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision-language models are zero-shot reward models for reinforcement learning.CoRR, abs/2310.12921, 2023. URLhttps://doi.org/10.48550/arXiv.2310.12921

  48. [48]

    A reduction of imitation learning and structured prediction to no-regret online learning

    S Ross, G Gordon, and D Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of Machine Learning Research, 2011

  49. [49]

    Policy Distillation

    AA Rusu, SG Colmenarejo, C Gulcehre, and G Desjardins. Policy distillation.arXiv preprint arXiv:1511.06295, 2015. URLhttps://arxiv.org/abs/1511.06295

  50. [50]

    Progressive Distillation for Fast Sampling of Diffusion Models

    T Salimans and J Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. URLhttps://arxiv.org/abs/2202.00512

  51. [51]

    Mastering atari, go, chess and shogi by planning with a learned model.Nature, 2020

    J Schrittwieser, I Antonoglou, T Hubert, and K Simonyan. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 2020

  52. [52]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z Shao, P Wang, Q Zhu, R Xu, J Song, X Bi, and H Zhang. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300

  53. [53]

    Self-Distillation Enables Continual Learning

    I Shenfeld, M Damani, J Hübotter, and P Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026. URLhttps://arxiv.org/abs/2601.19897

  54. [54]

    Mind the gap: Examining the self- improvement capabilities of large language models

    Y Song, H Zhang, C Eisenach, S Kakade, and D Foster. Mind the gap: Examining the self- improvement capabilities of large language models. InInternational Conference on Learning Representations, 2025

  55. [55]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. InICML, pages 32211–32252, 2023. 16

  56. [56]

    Composition of Memory Experts for Diffusion World Models

    S Stapf, PA Huertos, A Davtyan, and P Favaro. Composition of memory experts for diffusion world models.arXiv preprint arXiv:2605.18813, 2026. URL https://arxiv.org/abs/ 2605.18813

  57. [57]

    Learning to summarize with human feedback

    N Stiennon, L Ouyang, J Wu, D Ziegler, and R Lowe. Learning to summarize with human feedback. InAdvances in Neural Information Processing Systems, 2020

  58. [58]

    Richard S. Sutton. First results with dyna, an integrated architecture for learning, planning and reacting. InNeural Networks for Control. MIT Press, 1991. URL https://doi.org/10. 7551/mitpress/4939.003.0012

  59. [59]

    Qwen3.5: Accelerating productivity with native multimodal agents, February

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

  60. [60]

    Technical report

    URLhttps://qwen.ai/blog?id=qwen3.5. Technical report

  61. [61]

    HunyuanVideo 1.5 Technical Report

    Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. URL https://arxiv.org/abs/2511.18870

  62. [62]

    Diffusion model alignment using direct preference optimization

    B Wallace, M Dang, R Rafailov, L Zhou, and A Lou. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  63. [63]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  64. [64]

    A very big video reasoning suite, 2026

    Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fan...

  65. [65]

    Rl-vlm-f: Reinforcement learning from vision language foundation model feedback.arXiv preprint arXiv:2402.03681, 2024

    Y Wang, Z Sun, J Zhang, Z Xian, E Biyik, and D Held. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback.arXiv preprint arXiv:2402.03681, 2024. URL https://arxiv.org/abs/2402.03681

  66. [66]

    Video models are zero-shot learners and reasoners,

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners,

  67. [67]

    URLhttps://arxiv.org/abs/2509.20328

  68. [68]

    Williams

    RJ Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine Learning, 1992. doi: 10.1007/BF00992696. URL https://link. springer.com/article/10.1007/bf00992696

  69. [69]

    Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

    S Xue, C Ge, S Zhang, Y Li, and ZM Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025. URL https: //arxiv.org/abs/2509.25050

  70. [70]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Z Xue, J Wu, Y Gao, F Kong, L Zhu, M Chen, and Z Liu. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025. URL https://arxiv.org/abs/ 2505.07818. 17

  71. [71]

    One-step diffusion with distribution matching distillation

    T Yin, M Gharbi, R Zhang, E Shechtman, and F Durand. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  72. [72]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  73. [73]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    K Zheng, H Chen, H Ye, H Wang, Q Zhang, and K Jiang. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025. URL https: //arxiv.org/abs/2509.16117

  74. [74]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    B Zitkovich, T Yu, S Xu, P Xu, T Xiao, F Xia, and J Wu. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InProceedings of Machine Learning Research, 2023. A Technical appendices and supplementary material A.1 Further Implementation Details In Tab. 3, we report the hyperparameters used for self-distilling LTX-2 and HunyuanVide...

  75. [75]

    Task 1:[Man in blue shirt]: Step onto the yellow lane marking and stop exactly at the white arrow’s tip. Description 1:The man in the blue shirt begins walking forward along the center of the road, his feet deliberately stepping onto the double yellow lane marking, and continues moving straight ahead until he reaches the tip of the white directional arrow...

  76. [76]

    Task 2:[Person in blue shirt]: Move for- ward to the nearest building. Description 2:The person in the blue shirt begins walking forward along the center of the road, maintaining a steady pace toward the building on the left side of the street, their body oriented directly ahead as they cross the yellow double lines; after a few steps, they continue movin...

  77. [77]

    Task 1:[Character with horned helmet]: Use the bow to aim at the tree trunk di- rectly ahead. Description 1:The character with the horned helmet slowly turns their upper body toward the tree trunk directly ahead, simultaneously drawing the bowstring back with their right hand while keeping their left hand steady on the bow’s grip, their gaze fixed on the ...

  78. [78]

    Task 2:[Character with horned helmet]: Move to the largest boulder and stop be- side its left edge. Description 2:The character with the horned helmet begins walking forward along the stone path, their body oriented toward the largest boulder visible to the left, and after a few steps, they decelerate, shifting their weight slightly as they turn their hea...

  79. [79]

    Task 1:[Driver in racing suit]: Press the red button on the steering wheel’s right side. Description 1:The driver’s right hand, clad in a black racing glove, moves slightly forward and inward, pressing the red button located on the right side of the steering wheel, while the left hand re- mains steady on the left side of the wheel, and the vehicle continu...

  80. [80]

    Task 2:[First-person view]: Align the car’s front bumper with the white track curb ahead. Description 2:The driver’s hands grip the steering wheel firmly, thumbs press- ing the paddle shifters while the left hand subtly adjusts its position to maintain con- trol; simultaneously, the right hand makes a slight inward rotation of the wheel to ini- tiate a ge...

Showing first 80 references.