MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Hanwen Jiang; Jing Shi; Lee Hsin-Ying; Ming-Hsuan Yang; Yiqun Mei; Zhixin Shu

arxiv: 2605.22818 · v1 · pith:NZ3ALFUGnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 05:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords motion-controlled video generation · visual reasoning · secondary motions · causal consistency · image-to-video · MotiBench · guidance modulation · object interactions

The pith

MotiMotion inserts a vision-language reasoning step before generation to refine trajectories and add causally grounded secondary motions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Motion-controlled video models often produce stiff or incomplete results because they follow sparse user paths without filling in the natural consequences that should follow. The paper claims that a separate reasoning stage can correct the main path and invent supporting motions that respect scene logic and cause-effect rules. This leads to videos where objects move and interact more believably. The approach also varies how tightly the generator follows the plan according to the reasoner's confidence level. A dedicated benchmark of scenes built around triggered events lets both automated evaluators and human viewers measure the gain over earlier rigid methods.

Core claim

MotiMotion reformulates motion control as a reasoning-then-generation problem. A training-free vision-language reasoner refines the image-space coordinates of primary trajectories and hallucinates plausible secondary motions that maintain causal consistency with the scene. A confidence-aware control scheme then modulates guidance strength so the model follows high-confidence plans closely while letting its internal priors correct artifacts in lower-confidence regions.
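
The text summarized here does not state the rule that maps reasoner confidence to guidance strength, so the following is only a minimal sketch under stated assumptions: confidence is a scalar in [0, 1], and guidance strength is a linear interpolation between two hypothetical bounds, `g_min` and `g_max`. The function names, trajectory names, and default values are illustrative and do not come from the paper.

```python
from typing import Dict


def guidance_strength(confidence: float, g_min: float = 0.2, g_max: float = 1.0) -> float:
    """Map a confidence in [0, 1] to a guidance weight.

    ASSUMPTION: linear interpolation between g_min and g_max. The paper's
    actual mapping is not given in the text summarized here.
    """
    if g_min > g_max:
        raise ValueError("g_min must not exceed g_max")
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range confidences
    return g_min + (g_max - g_min) * c


def modulate_plan(confidences: Dict[str, float]) -> Dict[str, float]:
    """Assign one guidance weight per named trajectory (names are hypothetical)."""
    return {name: guidance_strength(c) for name, c in confidences.items()}


# Made-up confidences for one user path and one proposed secondary motion.
weights = modulate_plan({"user_path": 0.9, "secondary_motion": 0.3})
```

Under this sketch, a high-confidence trajectory receives a weight near `g_max` and is tracked closely, while a low-confidence one receives a weight near `g_min`, which leaves more room for the generator's own priors.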

What carries the argument

A training-free vision-language reasoner that refines primary trajectories and adds secondary motions, paired with a confidence-aware control scheme that adjusts guidance strength according to plan reliability.

If this is right

Primary trajectories become more accurate after coordinate refinement by the reasoner.
Secondary motions supply missing causal consequences that make object interactions look natural.
Confidence modulation lets the generator override low-certainty plans with its own learned priors.
Interaction-centric scenes in MotiBench expose the gap between rigid and reasoned motion control.
Both automated VLM checks and human studies show clear preference for the resulting videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reasoning-before-generation pattern could be tested on other sparse-control tasks such as text-to-video or pose-conditioned animation.
Longer video sequences might benefit if the reasoner is applied at multiple time points to prevent error buildup.
Allowing users to edit the reasoner's proposed motions could turn the system into an interactive planning tool.

Load-bearing premise

The vision-language reasoner can reliably produce refinements and secondary motions that improve the final video without creating new inconsistencies the generator cannot fix.

What would settle it

If MotiMotion videos on MotiBench receive lower VLM-based plausibility scores and lower human preference ratings than videos from standard rigid trajectory methods, the central claim would be falsified.
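
The criterion above is a conjunction: the claim fails only if MotiMotion scores lower on both signals. A small sketch makes that logic explicit. It assumes each signal is summarized by a plain mean over per-video scores, which is an assumption of this example and not a procedure described in the paper; the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def central_claim_falsified(
    moti_vlm: Sequence[float],
    rigid_vlm: Sequence[float],
    moti_human: Sequence[float],
    rigid_human: Sequence[float],
) -> bool:
    """Return True only if MotiMotion is lower on BOTH signals.

    ASSUMPTION: each signal is aggregated with a simple mean.
    """
    lower_on_vlm = mean(moti_vlm) < mean(rigid_vlm)
    lower_on_human = mean(moti_human) < mean(rigid_human)
    return lower_on_vlm and lower_on_human
```

A mixed outcome, for example lower VLM scores but higher human preference, would not meet this criterion and would need separate interpretation.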

Figures

Figures reproduced from arXiv: 2605.22818 by Hanwen Jiang, Jing Shi, Lee Hsin-Ying, Ming-Hsuan Yang, Yiqun Mei, Zhixin Shu.

**Figure 1.** MotiMotion. We devise a motion-controlled video generation model that enables intelligent and natural interaction. Given sparse, raw trajectories (visualized in green lines) and prompts from users, MotiMotion reasons about the intention of inputs and predicts subsequent events that align with world knowledge, common sense, and physics principles. We demonstrate MotiMotion’s capability to process diverse sc…

**Figure 2.** MotiMotion Pipeline. The core of MotiMotion is a reasoning-then-generation framework that transforms raw, sparse inputs into rich, detailed control that captures realistic world dynamics. (a) Given an input image, trajectory visualization, and textual prompt, a Visual Language Model (VLM) reveals user intention and models motion dynamics aligned with context by refining prompts and proposing trajectories f…

**Figure 3.** Qualitative Comparison. We compare MotiMotion against MagicMotion (Li et al., 2025a), Wan-Move (Chu et al., 2025), our approach without either reasoning, and without motion reasoning. MotiMotion demonstrates understanding of physics and tool mechanisms that align with the context.

**Figure 5.** Iterative Correction. We demonstrate the improvement made by the iterative reasoning-generation loop. Iterative corrections are applied sequentially: (a) → (b) → (c) → (d). (a) The video generator fails to model the clock dynamics with the user trajectory (red). (b) The VLM first adds a new trajectory (blue) to control human hand movement, but (c) removes it later and draws a new one that brings the hour …

**Figure 6.** More Examples from MotiBench.

**Figure 7.** Trajectories Refined and Proposed by the VLM. Green: user input; red: refined user input; blue: model prediction.

**Figure 8.** Examples from MotionEdit in which the VLM proposes all trajectories, modulated with confidence-aware control.

**Figure 9.** Two-Stage and One-Stage Reasoning Welcome Selection.

**Figure 10.** User Study Interface. To account for the varying degrees of perceptibility in physical improvements, we adopt a weighted scoring mechanism rather than a simple binary win rate. This ensures that strong improvements (e.g., fixing a major collision failure) contribute more to the final metric than slight improvements (e.g., minor texture sharpening). Let D be the dataset of pairwise comparisons. For each pa…

**Figure 11.** Additional Qualitative Comparison.
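
The Figure 10 caption describes a weighted scoring mechanism in place of a binary win rate, but the formula is cut off in the caption as reproduced here. The sketch below shows one way such a score could work. The five judgment labels and their weights are hypothetical choices made for illustration; they are not the paper's values.

```python
from typing import Dict, Iterable

# HYPOTHETICAL labels and weights: the paper's actual definition is truncated
# in the caption, so these values are placeholders for illustration only.
JUDGMENT_WEIGHTS: Dict[str, float] = {
    "strong_win": 1.0,
    "slight_win": 0.5,
    "tie": 0.0,
    "slight_loss": -0.5,
    "strong_loss": -1.0,
}


def weighted_preference(judgments: Iterable[str]) -> float:
    """Average signed weight over pairwise comparisons, in [-1, 1]."""
    values = [JUDGMENT_WEIGHTS[j] for j in judgments]
    if not values:
        raise ValueError("need at least one comparison")
    return sum(values) / len(values)


def binary_win_rate(judgments: Iterable[str]) -> float:
    """Baseline for contrast: fraction of comparisons won, ignoring margin."""
    labels = list(judgments)
    if not labels:
        raise ValueError("need at least one comparison")
    return sum(label.endswith("_win") for label in labels) / len(labels)


# Two made-up rating sets with the same win rate but different margins.
set_a = ["strong_win", "slight_loss"]
set_b = ["slight_win", "strong_loss"]
```

Both toy sets have a binary win rate of 0.5, yet the weighted score separates them (positive for `set_a`, negative for `set_b`), which is the property the caption attributes to weighted scoring.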

Original abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MotiMotion adds a training-free VLM step to refine trajectories and invent secondary motions in image-to-video generation, plus a new interaction benchmark, but the supporting results stay at the level of preference studies without numbers or ablations.

The main thing here is that the authors treat motion control as a reasoning-then-generation task. A VLM cleans up the user's primary trajectories in image space and adds plausible secondary motions that follow from the scene, then a confidence scheme dials the guidance strength so the generator can override weak plans with its own priors. They also release MotiBench, a set of scenes built around motion that triggers new events.
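
The two-stage flow described in this letter can be sketched as a small data-flow skeleton. Everything in it is an assumption made for illustration: the `Trajectory` and `MotionPlan` types, the function names, and the stub reasoner and generator, which stand in for the VLM and the video model and perform no real reasoning or generation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (x, y) in image-space pixels


@dataclass
class Trajectory:
    points: List[Point]
    kind: str          # "primary" (refined user input) or "secondary" (proposed)
    confidence: float  # reasoner confidence in [0, 1]


@dataclass
class MotionPlan:
    prompt: str
    trajectories: List[Trajectory]


def reason_then_generate(
    image_size: Tuple[int, int],
    user_paths: List[List[Point]],
    prompt: str,
    reasoner: Callable[[Tuple[int, int], List[List[Point]], str], MotionPlan],
    generator: Callable[[MotionPlan], str],
) -> str:
    """Stage 1 builds a plan; stage 2 consumes it."""
    plan = reasoner(image_size, user_paths, prompt)
    return generator(plan)


def stub_reasoner(image_size, user_paths, prompt) -> MotionPlan:
    """Placeholder for the VLM: only clamps points to the frame."""
    w, h = image_size
    refined = [
        Trajectory(
            [(min(max(x, 0), w - 1), min(max(y, 0), h - 1)) for x, y in path],
            kind="primary",
            confidence=0.9,
        )
        for path in user_paths
    ]
    # A fixed, made-up secondary path standing in for a VLM proposal.
    secondary = Trajectory([(10.0, 10.0), (12.0, 14.0)], kind="secondary", confidence=0.4)
    return MotionPlan(prompt=prompt, trajectories=refined + [secondary])


def stub_generator(plan: MotionPlan) -> str:
    """Placeholder for the video model: reports what it would be conditioned on."""
    kinds = [t.kind for t in plan.trajectories]
    return (
        f"video({plan.prompt!r}, primary={kinds.count('primary')}, "
        f"secondary={kinds.count('secondary')})"
    )


result = reason_then_generate(
    (64, 48), [[(-5.0, 3.0), (70.0, 20.0)]], "push the cup", stub_reasoner, stub_generator
)
```

Passing the reasoner and generator as callables keeps the two stages separable, which mirrors the training-free framing: either stage could be swapped without touching the other.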

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MotiMotion, a framework that reformulates motion-controlled image-to-video generation as a reasoning-then-generation problem. It employs a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and hallucinate plausible secondary motions, along with a confidence-aware control scheme to modulate guidance strength. A new benchmark, MotiBench, is curated for interaction-centric scenes, and evaluations via VLM and human studies claim that MotiMotion produces videos with more plausible object behaviors and interactions compared to existing approaches.

Significance. If the results hold, this approach could significantly improve the naturalness of generated videos by incorporating commonsense reasoning to complete incomplete user trajectories, addressing a key limitation in current motion-controlled generation models. The curation of MotiBench provides a valuable resource for evaluating causal interactions in video generation. The training-free nature of the reasoner is a strength, potentially making the method accessible without additional training data or compute.

major comments (2)

Abstract and §4 (Evaluation): The central claims rest on preference in human and VLM studies on MotiBench, yet no quantitative metrics, ablation results, or error analysis (e.g., plan accuracy vs. human annotations, secondary-motion consistency rates, or artifact counts) are supplied. This absence directly undermines verification of the claim that the VLM reasoning step produces causally grounded refinements free of new artifacts the generator cannot correct.
§3 (Method, VLM reasoner description): The assertion that the training-free VLM reliably refines coordinates and hallucinates secondary motions without introducing inconsistencies is presented without any supporting check such as comparison to physics simulation or human causal annotations; if this step fails, the reported gains on MotiBench could be attributable to the downstream generator or confidence scheme rather than reasoning.

minor comments (2)

Figure 2 and §3.2: The diagram of the confidence-aware modulation could include explicit equations for how VLM output confidence is mapped to guidance strength to improve reproducibility.
§4.1 (MotiBench): Additional details on scene selection criteria and annotation protocol for triggered events would help readers assess benchmark difficulty and coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of MotiMotion's potential impact. We address each major comment below with clarifications and commitments to revisions where appropriate.

Referee: Abstract and §4 (Evaluation): The central claims rest on preference in human and VLM studies on MotiBench, yet no quantitative metrics, ablation results, or error analysis (e.g., plan accuracy vs. human annotations, secondary-motion consistency rates, or artifact counts) are supplied. This absence directly undermines verification of the claim that the VLM reasoning step produces causally grounded refinements free of new artifacts the generator cannot correct.

Authors: We appreciate the referee's emphasis on rigorous verification. Our primary evaluations use human and VLM preference studies because these directly assess perceptual plausibility and causal consistency in generated videos, where standard automatic metrics often correlate poorly with human judgments on interaction naturalness. We agree that supplementary quantitative analyses would strengthen the manuscript. In the revised version, we will add an error analysis subsection reporting plan accuracy against human annotations, secondary-motion consistency rates, and artifact counts to better isolate the contribution of the VLM reasoning step.

Revision: yes

Referee: §3 (Method, VLM reasoner description): The assertion that the training-free VLM reliably refines coordinates and hallucinates secondary motions without introducing inconsistencies is presented without any supporting check such as comparison to physics simulation or human causal annotations; if this step fails, the reported gains on MotiBench could be attributable to the downstream generator or confidence scheme rather than reasoning.

Authors: We acknowledge that independent validation of the VLM reasoning step would help rule out alternative explanations for the gains. The observed improvements on MotiBench, particularly for interaction-centric scenes, indicate that the refinements contribute meaningfully beyond the base generator. To directly address this, we will incorporate a supporting analysis in the revised manuscript comparing VLM-generated plans to human causal annotations, thereby providing evidence that the reasoning step produces consistent outputs.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; framework components remain independent

The paper describes a two-stage process where a training-free VLM reasoner refines primary trajectories and hallucinates secondary motions as an upstream step, followed by a separate confidence-aware modulation applied to an image-to-video generator. Evaluation via VLM-based metrics and human study on MotiBench is presented as downstream validation rather than a definitional loop. No equations, fitted parameters, or self-citation chains are quoted that would reduce the output predictions or refinements to the inputs by construction. The claimed improvements rest on the separation between reasoning and generation, which the text treats as distinct modules without reducing one to the other.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested assumption that an off-the-shelf VLM can perform reliable causal reasoning over sparse trajectories without domain-specific fine-tuning or verification.

axioms (1)

Domain assumption: A training-free vision-language model can accurately refine primary trajectories and hallucinate plausible secondary motions that are causally consistent with the scene.
Invoked when the method description states the VLM is used to refine coordinates and hallucinate motions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions... confidence-aware control scheme that modulates guidance strength"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · 11 internal anchors

[1] Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance. NeurIPS.
[2] Shi, Xiaoyu and Huang, Zhaoyang and Wang, Fu-Yun and Bian, Weikang and Li, Dasong and Zhang, Yi and Zhang, Manyuan and Cheung, Ka Chun and See, Simon and Qin, Hongwei and Dai, Jifeng and Li, Hongsheng. 2024.
[3] Geng, Daniel and Herrmann, Charles and Hur, Junhwa and Cole, Forrester and Zhang, Serena and Pfaff, Tobias and Lopez-Guevara, Tatiana and Aytar, Yusuf and Rubinstein, Michael and Sun, Chen and Wang, Oliver and Owens, Andrew and Sun, Deqing. CVPR.
[4] Zhang, Zhenghao and Liao, Junchao and Li, Menghao and Dai, ZuoZhuo and Qiu, Bingxue and Zhu, Siyu and Qin, Long and Wang, Weizhi. CVPR.
[5] Animateanything: Consistent and controllable animation for video generation. CVPR.
[6] Burgert, Ryan and Xu, Yuancheng and Xian, Wenqi and Pilarski, Oliver and Clausen, Pascal and He, Mingming and Ma, Li and Deng, Yitong and Li, Lingxiao and Mousavi, Mohsen and Ryoo, Michael and Debevec, Paul and Yu, Ning. CVPR.
[7] Trackgo: A flexible and efficient method for controllable video generation. AAAI.
[8] Image conductor: Precision control for interactive video synthesis. AAAI.
[9] Generative Video Motion Editing with 3D Point Tracks. arXiv preprint arXiv:2512.02015.
[10] Levitor: 3d trajectory oriented image-to-video synthesis. CVPR.
[11] Diffusion as shader: 3d-aware video diffusion for versatile video generation control. SIGGRAPH.
[12] Vidcraft3: Camera, object, and lighting control for image-to-video generation. arXiv preprint arXiv:2502.07531.
[13] Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266.
[14] Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo
[15] Han Lin and Abhay Zala and Jaemin Cho and Mohit Bansal. VideoDirector
[16] Videoagent: Self-improving video generation. arXiv preprint arXiv:2410.10076.
[17] Click to move: Controlling video generation with sparse motion. ICCV.
[18] Controllable video generation with sparse trajectories. CVPR.
[19] Denoising diffusion probabilistic models. NeurIPS.
[20] Deep unsupervised learning using nonequilibrium thermodynamics. ICML.
[21] Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314.
[22] OpenAI. 2024.
[23] Google DeepMind. 2025.
[24] Flow Matching for Generative Modeling. ICLR.
[25] Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
[26] Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
[27] Scalable diffusion models with transformers. ICCV.
[28] aigc-apps. 2025.
[29] Wang, Zhouxia and Yuan, Ziyang and Wang, Xintao and Li, Yaowei and Chen, Tianshui and Xia, Menghan and Luo, Ping and Shan, Ying. 2024.
[30] DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory. arXiv preprint arXiv:2308.08089.
[31] Koichi Namekata and Sherwin Bahmani and Ziyi Wu and Yash Kant and Igor Gilitschenski and David B. Lindell.
[32] Feng, Wanquan and Qi, Tianhao and Liu, Jiawei and Sun, Mingzhen and Tu, Pengqi and Ma, Tianxiang and Dai, Fei and Zhao, Songtao and Zhou, Siyu and He, Qian. ICCV.
[33] Trajectory attention for fine-grained video motion control. ICLR.
[34] Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404.
[35] Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. ECCV.
[36] Videocomposer: Compositional video synthesis with motion controllability. NeurIPS.
[37] Zhang, Zhongwei and Long, Fuchen and Qiu, Zhaofan and Pan, Yingwei and Liu, Wu and Yao, Ting and Mei, Tao. CVPR.
[38] Motioncanvas: Cinematic shot design with controllable image-to-video generation. SIGGRAPH.
[39] Boximator: Generating Rich and Controllable Motions for Video Synthesis. ICML.
[40] Motionbooth: Motion-aware customized text-to-video generation. NeurIPS.
[41] Draganything: Motion control for anything using entity representation. ECCV.
[42] Animate anyone: Consistent and controllable image-to-video synthesis for character animation. CVPR.
[43] Magicanimate: Temporally consistent human image animation using diffusion model. CVPR.
[44] CameraCtrl: Enabling Camera Control for Video Diffusion Models. ICLR.
[45] Controlling Space and Time with Diffusion Models. ICLR.
[46] MotionV2V: Editing Motion in a Video. arXiv preprint arXiv:2511.20640.
[47] Trailblazer: Trajectory control for diffusion-based video generation. SIGGRAPH Asia.
[48] Peekaboo: Interactive video generation via masked-diffusion. CVPR.
[49] Attention is all you need. NeurIPS.
[50] Emerging Properties in Unified Multimodal Pretraining. arXiv preprint arXiv:2505.14683.
[51] OpenAI. 2025.
[52] Qwen-Image Technical Report. arXiv preprint arXiv:2508.02324.
[53] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv preprint arXiv:2501.17811.
[54] Tong, Shengbang and Fan, David and Li, Jiachen and Xiong, Yunyang and Chen, Xinlei and Sinha, Koustuv and Rabbat, Michael and LeCun, Yann and Xie, Saining and Liu, Zhuang. ICCV.
[55] OmniGen2: Exploration to Advanced Multimodal Generation. arXiv preprint arXiv:2506.18871.
[56] Layoutgpt: Compositional visual planning and generation with large language models. NeurIPS.
[57] Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. ICML.
[58] Genartist: Multimodal llm as an agent for unified image generation and editing. NeurIPS.
[59] Rongyao Fang and Chengqi Duan and Kun Wang and Linjiang Huang and Hao Li and Hao Tian and Shilin Yan and Weihao Yu and Xingyu Zeng and Jifeng Dai and Xihui Liu and Hongsheng Li. GoT: Unleashing Reasoning Capability of
[60] Liu, Zichen and Yu, Yue and Ouyang, Hao and Wang, Qiuyu and Cheng, Ka Leong and Wang, Wen and Liu, Zhiheng and Chen, Qifeng and Shen, Yujun. CVPR.
[61] Chen, Chieh-Yun and Shi, Min and Zhang, Gong and Shi, Humphrey. ICCV.
[62] The fabrication of reality and fantasy: Scene generation with llm-assisted prompt interpretation. ECCV.
[63] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting. arXiv preprint arXiv:2509.04545.
[64] Guo, Jiayi and Yan, Chuanhao and Xu, Xingqian and Wang, Yulin and Wang, Kai and Huang, Gao and Shi, Humphrey. ICCV.
[65] Self-correcting llm-controlled diffusion models. CVPR.
[66] Huang, Ziqi and Yu, Ning and Chen, Gordon and Qiu, Haonan and Debevec, Paul and Liu, Ziwei.
[67] VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement. arXiv preprint arXiv:2411.15115.
[68] Plan-X: Instruct Video Generation via Semantic Planning. arXiv preprint arXiv:2511.17986.
[69] Videotetris: Towards compositional text-to-video generation. NeurIPS.
[70] Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. CVPR.
[71] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. ICLR.
[72] Univideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377.
[73] Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. NeurIPS.
[74] Yecheng Wu and Zhuoyang Zhang and Junyu Chen and Haotian Tang and Dacheng Li and Yunhao Fang and Ligeng Zhu and Enze Xie and Hongxu Yin and Li Yi and Song Han and Yao Lu.
[75] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. ICLR.
[76] WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions. ICCV.
[77] Force Prompting: Video Generation Models Can Learn And Generalize Physics-based Control Signals. NeurIPS.
[78] WonderWorld: Interactive 3D Scene Generation from a Single Image. CVPR.
[79] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals. arXiv preprint arXiv:2601.05848.
[80] Physdreamer: Physics-based interaction with 3d objects via video generation. ECCV.

Showing first 80 references.

[1] [1]

NeurIPS , year=

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance , author=. NeurIPS , year=

work page

[2] [2]

2024 , booktitle =

Shi, Xiaoyu and Huang, Zhaoyang and Wang, Fu-Yun and Bian, Weikang and Li, Dasong and Zhang, Yi and Zhang, Manyuan and Cheung, Ka Chun and See, Simon and Qin, Hongwei and Dai, Jifeng and Li, Hongsheng , title =. 2024 , booktitle =

work page 2024

[3] [3]

CVPR , year =

Geng, Daniel and Herrmann, Charles and Hur, Junhwa and Cole, Forrester and Zhang, Serena and Pfaff, Tobias and Lopez-Guevara, Tatiana and Aytar, Yusuf and Rubinstein, Michael and Sun, Chen and Wang, Oliver and Owens, Andrew and Sun, Deqing , title =. CVPR , year =

work page

[4] [4]

CVPR , year =

Zhang, Zhenghao and Liao, Junchao and Li, Menghao and Dai, ZuoZhuo and Qiu, Bingxue and Zhu, Siyu and Qin, Long and Wang, Weizhi , title =. CVPR , year =

work page

[5] [5]

CVPR , year=

Animateanything: Consistent and controllable animation for video generation , author=. CVPR , year=

work page

[6] [6]

CVPR , year =

Burgert, Ryan and Xu, Yuancheng and Xian, Wenqi and Pilarski, Oliver and Clausen, Pascal and He, Mingming and Ma, Li and Deng, Yitong and Li, Lingxiao and Mousavi, Mohsen and Ryoo, Michael and Debevec, Paul and Yu, Ning , title =. CVPR , year =

work page

[7] [7]

AAAI , year=

Trackgo: A flexible and efficient method for controllable video generation , author=. AAAI , year=

work page

[8] [8]

AAAI , year=

Image conductor: Precision control for interactive video synthesis , author=. AAAI , year=

work page

[9] [9]

arXiv preprint arXiv:2512.02015 , year=

Generative Video Motion Editing with 3D Point Tracks , author=. arXiv preprint arXiv:2512.02015 , year=

work page arXiv

[10] [10]

CVPR , year=

Levitor: 3d trajectory oriented image-to-video synthesis , author=. CVPR , year=

work page

[11] [11]

SIGGRAPH , year=

Diffusion as shader: 3d-aware video diffusion for versatile video generation control , author=. SIGGRAPH , year=

work page

[12] Vidcraft3: Camera, object, and lighting control for image-to-video generation. arXiv preprint arXiv:2502.07531.

[13] Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266.

[14] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. Lo

[15] Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal. VideoDirector

[16] Videoagent: Self-improving video generation. arXiv preprint arXiv:2410.10076.

[17] Click to move: Controlling video generation with sparse motion. ICCV.

[18] Controllable video generation with sparse trajectories. CVPR.

[19] Denoising diffusion probabilistic models. NeurIPS.

[20] Deep unsupervised learning using nonequilibrium thermodynamics. ICML.

[21] Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314.

[22] OpenAI. 2024.

[23] Google DeepMind. 2025.

[24] Flow Matching for Generative Modeling. ICLR.

[25] Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.

[26] Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.

[27] Scalable diffusion models with transformers. ICCV.

[28] aigc-apps. 2025.

[29] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan. 2024.

[30] DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory. arXiv preprint arXiv:2308.08089.

[31] Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David B. Lindell.

[32] Wanquan Feng, Tianhao Qi, Jiawei Liu, Mingzhen Sun, Pengqi Tu, Tianxiang Ma, Fei Dai, Songtao Zhao, Siyu Zhou, Qian He. ICCV.

[33] Trajectory attention for fine-grained video motion control. ICLR.

[34] Motion-Conditioned Diffusion Model for Controllable Video Synthesis. arXiv preprint arXiv:2304.14404.

[35] Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. ECCV.

[36] Videocomposer: Compositional video synthesis with motion controllability. NeurIPS.

[37] Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, Tao Mei. CVPR.

[38] Motioncanvas: Cinematic shot design with controllable image-to-video generation. SIGGRAPH.

[39] Boximator: Generating Rich and Controllable Motions for Video Synthesis. ICML.

[40] Motionbooth: Motion-aware customized text-to-video generation. NeurIPS.

[41] Draganything: Motion control for anything using entity representation. ECCV.

[42] Animate anyone: Consistent and controllable image-to-video synthesis for character animation. CVPR.

[43] Magicanimate: Temporally consistent human image animation using diffusion model. CVPR.

[44] CameraCtrl: Enabling Camera Control for Video Diffusion Models. ICLR.

[45] Controlling Space and Time with Diffusion Models. ICLR.

[46] MotionV2V: Editing Motion in a Video. arXiv preprint arXiv:2511.20640.

[47] Trailblazer: Trajectory control for diffusion-based video generation. SIGGRAPH Asia.

[48] Peekaboo: Interactive video generation via masked-diffusion. CVPR.

[49] Attention is all you need. NeurIPS.

[50] Emerging Properties in Unified Multimodal Pretraining. arXiv preprint arXiv:2505.14683.

[51] OpenAI. 2025.

[52] Qwen-Image Technical Report. arXiv preprint arXiv:2508.02324.

[53] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv preprint arXiv:2501.17811.

[54] Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu. ICCV.

[55] OmniGen2: Exploration to Advanced Multimodal Generation. arXiv preprint arXiv:2506.18871.

[56] Layoutgpt: Compositional visual planning and generation with large language models. NeurIPS.

[57] Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. ICML.

[58] Genartist: Multimodal llm as an agent for unified image generation and editing. NeurIPS.

[59] Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Hao Tian, Shilin Yan, Weihao Yu, Xingyu Zeng, Jifeng Dai, Xihui Liu, Hongsheng Li. GoT: Unleashing Reasoning Capability of

[60] Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen. CVPR.

[61] Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi. ICCV.

[62] The fabrication of reality and fantasy: Scene generation with llm-assisted prompt interpretation. ECCV.

[63] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting. arXiv preprint arXiv:2509.04545.

[64] Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi. ICCV.

[65] Self-correcting llm-controlled diffusion models. CVPR.

[66] Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu.

[67] VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement. arXiv preprint arXiv:2411.15115.

[68] Plan-X: Instruct Video Generation via Semantic Planning. arXiv preprint arXiv:2511.17986.

[69] Videotetris: Towards compositional text-to-video generation. NeurIPS.

[70] Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. CVPR.

[71] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. ICLR.

[72] Univideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377.

[73] Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. NeurIPS.

[74] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu.

[75] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. ICLR.

[76] WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions. ICCV.

[77] Force Prompting: Video Generation Models Can Learn And Generalize Physics-based Control Signals. NeurIPS.

[78] WonderWorld: Interactive 3D Scene Generation from a Single Image. CVPR.

[79] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals. arXiv preprint arXiv:2601.05848.

[80] Physdreamer: Physics-based interaction with 3d objects via video generation. ECCV.