arxiv: 2605.22818 · v1 · pith:NZ3ALFUGnew · submitted 2026-05-21 · 💻 cs.CV

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Pith reviewed 2026-05-22 05:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords motion-controlled video generation · visual reasoning · secondary motions · causal consistency · image-to-video · MotiBench · guidance modulation · object interactions

The pith

MotiMotion inserts a vision-language reasoning step before generation to refine trajectories and add causally grounded secondary motions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Motion-controlled video models often produce stiff or incomplete results because they follow sparse user paths without filling in the natural consequences that should follow. The paper claims that a separate reasoning stage can correct the main path and invent supporting motions that respect scene logic and cause-effect rules. This leads to videos where objects move and interact more believably. The approach also varies how tightly the generator follows the plan according to the reasoner's confidence level. A dedicated benchmark of scenes built around triggered events lets both automated evaluators and human viewers measure the gain over earlier rigid methods.

Core claim

MotiMotion reformulates motion control as a reasoning-then-generation problem. A training-free vision-language reasoner refines the image-space coordinates of primary trajectories and hallucinates plausible secondary motions that maintain causal consistency with the scene. A confidence-aware control scheme then modulates guidance strength so the model follows high-confidence plans closely while letting its internal priors correct artifacts in lower-confidence regions.
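
One way to picture confidence-aware control is as a mapping from the reasoner's per-trajectory confidence to a guidance weight. The sketch below is illustrative only: the `PlannedTrajectory` structure, the linear interpolation, and the `w_min`/`w_max` bounds are assumptions made for this example, since the page does not state the paper's actual mapping.

```python
from dataclasses import dataclass


@dataclass
class PlannedTrajectory:
    """One trajectory proposed or refined by the reasoner (hypothetical structure)."""
    points: list[tuple[float, float]]  # image-space (x, y) coordinates
    confidence: float                  # reasoner confidence, expected in [0, 1]


def guidance_strength(confidence: float, w_min: float = 0.2, w_max: float = 1.0) -> float:
    """Map confidence to a guidance weight by linear interpolation (assumed form)."""
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range confidences
    return w_min + c * (w_max - w_min)


# Toy plan: one high-confidence and one low-confidence trajectory.
plan = [
    PlannedTrajectory(points=[(10.0, 20.0), (40.0, 22.0)], confidence=0.9),
    PlannedTrajectory(points=[(55.0, 60.0), (58.0, 90.0)], confidence=0.3),
]
weights = [guidance_strength(t.confidence) for t in plan]
```

Under this assumed mapping the high-confidence trajectory gets a weight near `w_max`, so the generator would track it closely, while the low-confidence one gets a weight nearer `w_min`, leaving more room for the generator's own priors.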

What carries the argument

A training-free vision-language reasoner that refines primary trajectories and adds secondary motions, paired with a confidence-aware control scheme that adjusts guidance strength according to plan reliability.

If this is right

  • Primary trajectories become more accurate after coordinate refinement by the reasoner.
  • Secondary motions supply missing causal consequences that make object interactions look natural.
  • Confidence modulation lets the generator override low-certainty plans with its own learned priors.
  • Interaction-centric scenes in MotiBench expose the gap between rigid and reasoned motion control.
  • Both automated VLM checks and human studies show clear preference for the resulting videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reasoning-before-generation pattern could be tested on other sparse-control tasks such as text-to-video or pose-conditioned animation.
  • Longer video sequences might benefit if the reasoner is applied at multiple time points to prevent error buildup.
  • Allowing users to edit the reasoner's proposed motions could turn the system into an interactive planning tool.

Load-bearing premise

The vision-language reasoner can reliably produce refinements and secondary motions that improve the final video without creating new inconsistencies the generator cannot fix.

What would settle it

If MotiMotion videos on MotiBench receive lower VLM-based plausibility scores and lower human preference ratings than videos from standard rigid trajectory methods, the central claim would be falsified.
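
The falsification condition above can be written as a simple check. This is a sketch under stated assumptions: the function name, the use of per-video mean scores, and the idea that both metrics are "higher is better" are choices made for illustration, not details from the paper.

```python
from statistics import mean


def claim_falsified(
    moti_vlm: list[float],
    base_vlm: list[float],
    moti_pref: list[float],
    base_pref: list[float],
) -> bool:
    """Return True only if MotiMotion scores lower than the rigid baseline
    on BOTH VLM-based plausibility and human preference (assumed: higher is better)."""
    lower_vlm = mean(moti_vlm) < mean(base_vlm)
    lower_pref = mean(moti_pref) < mean(base_pref)
    return lower_vlm and lower_pref
```

The conjunction mirrors the wording of the condition: a lower score on only one of the two measures would not, by itself, meet it.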

Figures

Figures reproduced from arXiv: 2605.22818 by Hanwen Jiang, Jing Shi, Lee Hsin-Ying, Ming-Hsuan Yang, Yiqun Mei, Zhixin Shu.

Figure 1: MotiMotion. We devise a motion-controlled video generation model that enables intelligent and natural interaction. Given sparse, raw trajectories (visualized in green lines) and prompts from users, MotiMotion reasons about the intention of inputs and predicts subsequent events that align with world knowledge, common sense, and physics principles. We demonstrate MotiMotion’s capability to process diverse sc…
Figure 2: MotiMotion Pipeline. The core of MotiMotion is a reasoning-then-generation framework that transforms raw, sparse inputs into rich, detailed control that captures realistic world dynamics. (a) Given an input image, trajectory visualization, and textual prompt, a Visual Language Model (VLM) reveals user intention and models motion dynamics aligned with context by refining prompts and proposing trajectories f…
Figure 3: Qualitative Comparison. We compare MotiMotion against MagicMotion (Li et al., 2025a), Wan-Move (Chu et al., 2025), our approach without either reasoning, and without motion reasoning. MotiMotion demonstrates understanding of physics and tool mechanisms that align with the context
Figure 5: Iterative Correction. We demonstrate the improvement made by the iterative reasoning-generation loop. Iterative corrections are applied sequentially: (a) → (b) → (c) → (d). (a) The video generator fails to model the clock dynamics with the user trajectory (red). (b) The VLM first adds a new trajectory (blue) to control human hand movement, but (c) removes it later and draws a new one that brings the hour …
Figure 6: More Examples from MotiBench.
Figure 7: Trajectories Refined and Proposed by the VLM. Green: user input; red: refined user input; blue: model prediction
Figure 8: Examples from MotionEdit in which the VLM proposes all trajectories, modulated with confidence-aware control
Figure 9: Two-Stage and One-Stage Reasoning Welcome Selection
Figure 10: User Study Interface. To account for the varying degrees of perceptibility in physical improvements, we adopt a weighted scoring mechanism rather than a simple binary win rate. This ensures that strong improvements (e.g., fixing a major collision failure) contribute more to the final metric than slight improvements (e.g., minor texture sharpening). Let D be the dataset of pairwise comparisons. For each pa…
Figure 11: Additional Qualitative Comparison.
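
The Figure 10 caption contrasts a weighted score with a binary win rate, but its formula is cut off on this page. The sketch below only illustrates the general idea. The five judgment labels, the weights of ±2 and ±1, and the normalization to [-1, 1] are all hypothetical choices for this example, not the paper's definition.

```python
# Hypothetical weights: the caption only says strong improvements
# should count more than slight ones; these values are assumptions.
WEIGHTS = {
    "strong_win": 2,
    "slight_win": 1,
    "tie": 0,
    "slight_loss": -1,
    "strong_loss": -2,
}


def weighted_preference(judgments: list[str]) -> float:
    """Average signed weight over all pairwise comparisons, scaled to [-1, 1].
    Positive values favor the evaluated method."""
    if not judgments:
        raise ValueError("need at least one pairwise comparison")
    max_w = max(WEIGHTS.values())
    return sum(WEIGHTS[j] for j in judgments) / (max_w * len(judgments))


def binary_win_rate(judgments: list[str]) -> float:
    """Fraction of comparisons won, ignoring how strong each win was."""
    if not judgments:
        raise ValueError("need at least one pairwise comparison")
    return sum(1 for j in judgments if WEIGHTS[j] > 0) / len(judgments)
```

With one strong win and one slight loss, the binary win rate is 0.5 while this weighted score is 0.25, which shows how the strength of each judgment shifts the result.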
Original abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MotiMotion, a framework that reformulates motion-controlled image-to-video generation as a reasoning-then-generation problem. It employs a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and hallucinate plausible secondary motions, along with a confidence-aware control scheme to modulate guidance strength. A new benchmark, MotiBench, is curated for interaction-centric scenes, and evaluations via VLM and human studies claim that MotiMotion produces videos with more plausible object behaviors and interactions compared to existing approaches.

Significance. If the results hold, this approach could significantly improve the naturalness of generated videos by incorporating commonsense reasoning to complete incomplete user trajectories, addressing a key limitation in current motion-controlled generation models. The curation of MotiBench provides a valuable resource for evaluating causal interactions in video generation. The training-free nature of the reasoner is a strength, potentially making the method accessible without additional training data or compute.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): The central claims rest on preference in human and VLM studies on MotiBench, yet no quantitative metrics, ablation results, or error analysis (e.g., plan accuracy vs. human annotations, secondary-motion consistency rates, or artifact counts) are supplied. This absence directly undermines verification of the claim that the VLM reasoning step produces causally grounded refinements free of new artifacts the generator cannot correct.
  2. [§3] §3 (Method, VLM reasoner description): The assertion that the training-free VLM reliably refines coordinates and hallucinates secondary motions without introducing inconsistencies is presented without any supporting check such as comparison to physics simulation or human causal annotations; if this step fails, the reported gains on MotiBench could be attributable to the downstream generator or confidence scheme rather than reasoning.
minor comments (2)
  1. [Figure 2 and §3.2] Figure 2 and §3.2: The diagram of the confidence-aware modulation could include explicit equations for how VLM output confidence is mapped to guidance strength to improve reproducibility.
  2. [§4.1] §4.1 (MotiBench): Additional details on scene selection criteria and annotation protocol for triggered events would help readers assess benchmark difficulty and coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of MotiMotion's potential impact. We address each major comment below with clarifications and commitments to revisions where appropriate.

  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central claims rest on preference in human and VLM studies on MotiBench, yet no quantitative metrics, ablation results, or error analysis (e.g., plan accuracy vs. human annotations, secondary-motion consistency rates, or artifact counts) are supplied. This absence directly undermines verification of the claim that the VLM reasoning step produces causally grounded refinements free of new artifacts the generator cannot correct.

    Authors: We appreciate the referee's emphasis on rigorous verification. Our primary evaluations use human and VLM preference studies because these directly assess perceptual plausibility and causal consistency in generated videos, where standard automatic metrics often correlate poorly with human judgments on interaction naturalness. We agree that supplementary quantitative analyses would strengthen the manuscript. In the revised version, we will add an error analysis subsection reporting plan accuracy against human annotations, secondary-motion consistency rates, and artifact counts to better isolate the contribution of the VLM reasoning step. revision: yes

  2. Referee: [§3] §3 (Method, VLM reasoner description): The assertion that the training-free VLM reliably refines coordinates and hallucinates secondary motions without introducing inconsistencies is presented without any supporting check such as comparison to physics simulation or human causal annotations; if this step fails, the reported gains on MotiBench could be attributable to the downstream generator or confidence scheme rather than reasoning.

    Authors: We acknowledge that independent validation of the VLM reasoning step would help rule out alternative explanations for the gains. The observed improvements on MotiBench, particularly for interaction-centric scenes, indicate that the refinements contribute meaningfully beyond the base generator. To directly address this, we will incorporate a supporting analysis in the revised manuscript comparing VLM-generated plans to human causal annotations, thereby providing evidence that the reasoning step produces consistent outputs. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; framework components remain independent

The paper describes a two-stage process where a training-free VLM reasoner refines primary trajectories and hallucinates secondary motions as an upstream step, followed by a separate confidence-aware modulation applied to an image-to-video generator. Evaluation via VLM-based metrics and human study on MotiBench is presented as downstream validation rather than a definitional loop. No equations, fitted parameters, or self-citation chains are quoted that would reduce the output predictions or refinements to the inputs by construction. The claimed improvements rest on the separation between reasoning and generation, which the text treats as distinct modules without reducing one to the other.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested assumption that an off-the-shelf VLM can perform reliable causal reasoning over sparse trajectories without domain-specific fine-tuning or verification.

axioms (1)
  • Domain assumption: A training-free vision-language model can accurately refine primary trajectories and hallucinate plausible secondary motions that are causally consistent with the scene.
    Invoked when the method description states the VLM is used to refine coordinates and hallucinate motions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear
    Paper passage: "leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions... confidence-aware control scheme that modulates guidance strength"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · 11 internal anchors

  1. Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance. NeurIPS.
  2. Shi, Xiaoyu and Huang, Zhaoyang and Wang, Fu-Yun and Bian, Weikang and Li, Dasong and Zhang, Yi and Zhang, Manyuan and Cheung, Ka Chun and See, Simon and Qin, Hongwei and Dai, Jifeng and Li, Hongsheng. 2024.
  3. Geng, Daniel and Herrmann, Charles and Hur, Junhwa and Cole, Forrester and Zhang, Serena and Pfaff, Tobias and Lopez-Guevara, Tatiana and Aytar, Yusuf and Rubinstein, Michael and Sun, Chen and Wang, Oliver and Owens, Andrew and Sun, Deqing. CVPR.
  4. Zhang, Zhenghao and Liao, Junchao and Li, Menghao and Dai, ZuoZhuo and Qiu, Bingxue and Zhu, Siyu and Qin, Long and Wang, Weizhi. CVPR.
  5. Animateanything: Consistent and controllable animation for video generation. CVPR.
  6. Burgert, Ryan and Xu, Yuancheng and Xian, Wenqi and Pilarski, Oliver and Clausen, Pascal and He, Mingming and Ma, Li and Deng, Yitong and Li, Lingxiao and Mousavi, Mohsen and Ryoo, Michael and Debevec, Paul and Yu, Ning. CVPR.
  7. Trackgo: A flexible and efficient method for controllable video generation. AAAI.
  8. Image conductor: Precision control for interactive video synthesis. AAAI.
  9. Generative Video Motion Editing with 3D Point Tracks. arXiv preprint arXiv:2512.02015.
  10. Levitor: 3d trajectory oriented image-to-video synthesis. CVPR.
  11. Diffusion as shader: 3d-aware video diffusion for versatile video generation control. SIGGRAPH.
  12. Vidcraft3: Camera, object, and lighting control for image-to-video generation. arXiv preprint arXiv:2502.07531.
  13. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266.
  14. Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo
  15. VideoDirector. Han Lin and Abhay Zala and Jaemin Cho and Mohit Bansal.
  16. Videoagent: Self-improving video generation. arXiv preprint arXiv:2410.10076.
  17. Click to move: Controlling video generation with sparse motion. ICCV.
  18. Controllable video generation with sparse trajectories. CVPR.
  19. Denoising diffusion probabilistic models. NeurIPS.
  20. Deep unsupervised learning using nonequilibrium thermodynamics. ICML.
  21. Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314.
  22. OpenAI. 2024.
  23. Google DeepMind. 2025.
  24. Flow Matching for Generative Modeling. ICLR.
  25. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
  26. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
  27. Scalable diffusion models with transformers. ICCV.
  28. aigc-apps. 2025.
  29. Wang, Zhouxia and Yuan, Ziyang and Wang, Xintao and Li, Yaowei and Chen, Tianshui and Xia, Menghan and Luo, Ping and Shan, Ying. 2024.
  30. DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory. arXiv preprint arXiv:2308.08089.
  31. Koichi Namekata and Sherwin Bahmani and Ziyi Wu and Yash Kant and Igor Gilitschenski and David B. Lindell.
  32. Feng, Wanquan and Qi, Tianhao and Liu, Jiawei and Sun, Mingzhen and Tu, Pengqi and Ma, Tianxiang and Dai, Fei and Zhao, Songtao and Zhou, Siyu and He, Qian. ICCV.
  33. Trajectory attention for fine-grained video motion control. ICLR.
  34. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404.
  35. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. ECCV.
  36. Videocomposer: Compositional video synthesis with motion controllability. NeurIPS.
  37. Zhang, Zhongwei and Long, Fuchen and Qiu, Zhaofan and Pan, Yingwei and Liu, Wu and Yao, Ting and Mei, Tao. CVPR.
  38. Motioncanvas: Cinematic shot design with controllable image-to-video generation. SIGGRAPH.
  39. Boximator: Generating Rich and Controllable Motions for Video Synthesis. ICML.
  40. Motionbooth: Motion-aware customized text-to-video generation. NeurIPS.
  41. Draganything: Motion control for anything using entity representation. ECCV.
  42. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. CVPR.
  43. Magicanimate: Temporally consistent human image animation using diffusion model. CVPR.
  44. CameraCtrl: Enabling Camera Control for Video Diffusion Models. ICLR.
  45. Controlling Space and Time with Diffusion Models. ICLR.
  46. MotionV2V: Editing Motion in a Video. arXiv preprint arXiv:2511.20640.
  47. Trailblazer: Trajectory control for diffusion-based video generation. SIGGRAPH Asia.
  48. Peekaboo: Interactive video generation via masked-diffusion. CVPR.
  49. Attention is all you need. NeurIPS.
  50. Emerging Properties in Unified Multimodal Pretraining. arXiv preprint arXiv:2505.14683.
  51. OpenAI. 2025.
  52. Qwen-Image Technical Report. arXiv preprint arXiv:2508.02324.
  53. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv preprint arXiv:2501.17811.
  54. Tong, Shengbang and Fan, David and Li, Jiachen and Xiong, Yunyang and Chen, Xinlei and Sinha, Koustuv and Rabbat, Michael and LeCun, Yann and Xie, Saining and Liu, Zhuang. ICCV.
  55. OmniGen2: Exploration to Advanced Multimodal Generation. arXiv preprint arXiv:2506.18871.
  56. Layoutgpt: Compositional visual planning and generation with large language models. NeurIPS.
  57. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. ICML.
  58. Genartist: Multimodal llm as an agent for unified image generation and editing. NeurIPS.
  59. GoT: Unleashing Reasoning Capability of. Rongyao Fang and Chengqi Duan and Kun Wang and Linjiang Huang and Hao Li and Hao Tian and Shilin Yan and Weihao Yu and Xingyu Zeng and Jifeng Dai and Xihui Liu and Hongsheng Li.
  60. Liu, Zichen and Yu, Yue and Ouyang, Hao and Wang, Qiuyu and Cheng, Ka Leong and Wang, Wen and Liu, Zhiheng and Chen, Qifeng and Shen, Yujun. CVPR.
  61. Chen, Chieh-Yun and Shi, Min and Zhang, Gong and Shi, Humphrey. ICCV.
  62. The fabrication of reality and fantasy: Scene generation with llm-assisted prompt interpretation. ECCV.
  63. PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting. arXiv preprint arXiv:2509.04545.
  64. Guo, Jiayi and Yan, Chuanhao and Xu, Xingqian and Wang, Yulin and Wang, Kai and Huang, Gao and Shi, Humphrey. ICCV.
  65. Self-correcting llm-controlled diffusion models. CVPR.
  66. Huang, Ziqi and Yu, Ning and Chen, Gordon and Qiu, Haonan and Debevec, Paul and Liu, Ziwei.
  67. VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement. arXiv preprint arXiv:2411.15115.
  68. Plan-X: Instruct Video Generation via Semantic Planning. arXiv preprint arXiv:2511.17986.
  69. Videotetris: Towards compositional text-to-video generation. NeurIPS.
  70. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. CVPR.
  71. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. ICLR.
  72. Univideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377.
  73. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. NeurIPS.
  74. Yecheng Wu and Zhuoyang Zhang and Junyu Chen and Haotian Tang and Dacheng Li and Yunhao Fang and Ligeng Zhu and Enze Xie and Hongxu Yin and Li Yi and Song Han and Yao Lu.
  75. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. ICLR.
  76. WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions. ICCV.
  77. Force Prompting: Video Generation Models Can Learn And Generalize Physics-based Control Signals. NeurIPS.
  78. WonderWorld: Interactive 3D Scene Generation from a Single Image. CVPR.
  79. Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals. arXiv preprint arXiv:2601.05848.
  80. Physdreamer: Physics-based interaction with 3d objects via video generation. ECCV.

Showing first 80 references.