Pith · machine review for the scientific record

arxiv: 2604.17749 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric videos · object state transitions · video frame generation · visual state transition · human-object interaction · robot-object interaction · generative AI · action modeling

The pith

The EgoIn framework generates sequences of intermediate frames depicting object state transitions in egocentric videos from given initial and target states and an action instruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the Egocentric Instructed Visual State Transition task as generating intermediate frames that show how objects transform between two states under a brief instruction from an egocentric viewpoint. It proposes the EgoIn framework to address the two challenges this raises: reasoning about the transformation steps and maintaining object consistency in the generated video. The framework uses a fine-tuned TransitionVLM to infer the multi-step process with reduced hallucination, and employs a Transition Conditioning module along with Object-aware Auxiliary Supervision to create coherent frame sequences. If the approach works, it would improve AI's ability to model physical actions and transformations as humans experience them in first-person views, aiding fields such as robotics and interactive AI systems.

Core claim

We propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition.

What carries the argument

The EgoIn framework, consisting of the TransitionVLM for inferring multi-step transitions, the Transition Conditioning module for generating frames, and Object-aware Auxiliary Supervision for preserving object appearance.
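
The abstract names the three components but not how they are wired together. The sketch below is a minimal reading of that two-stage structure in Python; every class and function name (TransitionPlan, plan_transition, generate_inbetween, egoin_pipeline) is a hypothetical placeholder rather than the authors' API, and the bodies are stubs.

```python
# A minimal, hypothetical sketch of the two-stage pipeline described in the
# abstract. All names here are illustrative placeholders, not the authors' code.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TransitionPlan:
    """Ordered textual sub-steps describing how the object changes state."""
    steps: List[str]


def plan_transition(initial_frame: Any, target_frame: Any, instruction: str) -> TransitionPlan:
    """Stage 1: a vision-language model (TransitionVLM in the paper), fine-tuned
    on curated state-transition data, reasons over the two state images and the
    instruction, returning an ordered list of intermediate sub-steps."""
    # Placeholder: a real implementation would query the fine-tuned VLM here.
    return TransitionPlan(steps=[f"carry out: {instruction}"])


def generate_inbetween(initial_frame: Any, target_frame: Any,
                       plan: TransitionPlan, num_frames: int = 8) -> List[Any]:
    """Stage 2: a video generator conditioned on the plan (the paper's Transition
    Conditioning module) renders the intermediate frames; the object-aware
    auxiliary supervision applies only at training time."""
    # Placeholder: a real implementation would run a conditioned video generator.
    return [initial_frame] * (num_frames - 1) + [target_frame]


def egoin_pipeline(initial_frame: Any, target_frame: Any, instruction: str) -> List[Any]:
    plan = plan_transition(initial_frame, target_frame, instruction)
    return generate_inbetween(initial_frame, target_frame, plan)
```

The division of labor, plan first and render second, is what the referee report below probes: each stage needs its own evaluation.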

If this is right

  • Superior performance in generating semantically meaningful and visually coherent transformation sequences on human-object interaction datasets.
  • Superior performance on robot-object interaction datasets.
  • Enables accurate multi-step transition inference by fine-tuning on curated data.
  • Preserves object appearance across generated frames using auxiliary supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This method could be tested on datasets involving more complex actions or multiple objects to see if the inference scales.
  • The approach might inform better video prediction models for autonomous systems by incorporating egocentric reasoning.
  • Applications in augmented reality could use such transitions to simulate object changes in real-time user views.
  • Without the specific fine-tuning, similar tasks might suffer from more inconsistencies in object identity.

Load-bearing premise

Fine-tuning TransitionVLM on the curated dataset reduces hallucination enough for accurate multi-step inference, and the Object-aware Auxiliary Supervision preserves object appearance without introducing artifacts.
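
The page does not reproduce the supervision term itself. One common way such an object-aware auxiliary loss is realized, and the assumption behind the sketch below, is a reconstruction loss whose per-pixel weight is boosted inside the manipulated object's mask; the mask source, the L1 form, and the weighting are all assumptions, not the paper's formulation.

```python
# Hypothetical object-aware auxiliary loss: an L1 reconstruction term whose
# per-pixel weight is boosted inside the object mask. This is an assumed
# realization, not the paper's exact Object-aware Auxiliary Supervision.
import torch
import torch.nn.functional as F


def object_aware_aux_loss(pred_frames: torch.Tensor,
                          gt_frames: torch.Tensor,
                          object_masks: torch.Tensor,
                          object_weight: float = 5.0) -> torch.Tensor:
    """pred_frames, gt_frames: (B, T, C, H, W); object_masks: (B, T, 1, H, W) in [0, 1]."""
    per_pixel = F.l1_loss(pred_frames, gt_frames, reduction="none")
    # Background keeps weight 1; pixels inside the object mask are up-weighted,
    # so appearance drift on the manipulated object is penalized more heavily.
    weights = 1.0 + (object_weight - 1.0) * object_masks
    return (weights * per_pixel).mean()
```

Whether the actual supervision operates on pixels, features, or attention maps is exactly what the referee's second major comment asks an ablation to pin down.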

What would settle it

Evaluating the generated videos on the human-object and robot-object interaction datasets and finding that they do not outperform baselines in semantic meaningfulness or visual coherence, or that they exhibit object-appearance drift or mismatches with the instruction, would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.17749 by Dong Li, Dong Zhou, Emad Barsoum, Huchuan Lu, Mengmeng Ge, Takashi Isobe, Weinong Wang, Xu Jia, Yanan Sun, Zetong Yang.

Figure 1: Examples of diverse generated object state transition sequences under different textual and visual conditions: (a) different action …
Figure 2: Illustration of the proposed EgoIn framework. EgoIn works in two stages: (1) transition process modeling using the tuned …
Figure 3: Illustration of the tuning process for the proposed TransitionVLM: (a) shows the data curation process used to obtain state-aware …
Figure 4: Illustration of the Transition Conditioning (TC) module …
Figure 5: Qualitative comparison on Epic100 and Bridge. Intermediate frames …
Figure 7: Ablation on the effectiveness of TransitionVLM in video …
read the original abstract

Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper defines the new Egocentric Instructed Visual State Transition (EIVST) task of generating intermediate ego-centric video frames that depict object state changes between given initial and target states under a brief action instruction. It introduces the EgoIn framework, which first uses a TransitionVLM fine-tuned on a curated dataset to infer multi-step transition processes (addressing hallucination), then applies a Transition Conditioning module to generate frames while using Object-aware Auxiliary Supervision to maintain object appearance consistency. Extensive experiments on human-object and robot-object interaction datasets are claimed to demonstrate superior performance over baselines in producing semantically meaningful and visually coherent sequences.

Significance. If the empirical claims hold with stronger validation, the work would introduce a practically relevant task and modular approach for modeling physical transformations in egocentric video, potentially benefiting robotics, action understanding, and generative modeling of state changes. The combination of VLM-based reasoning with conditioning and auxiliary supervision targets concrete challenges in consistency and multi-step inference.

major comments (3)
  1. [Abstract, §4 Experiments] The central claim that fine-tuning TransitionVLM on the curated dataset reduces hallucination and enables accurate multi-step transition inference lacks any direct quantitative metric (e.g., step-wise factual consistency, hallucination rate, or a base-VLM comparison) on a held-out EIVST benchmark. End-to-end FID/LPIPS and user studies are reported instead, but these do not isolate the fine-tuning contribution and therefore do not fully support the framework's reliance on this component; an illustrative sketch of such a metric follows these comments.
  2. [§3.3 Object-aware Auxiliary Supervision] No ablation isolates the effect of the proposed Object-aware Auxiliary Supervision on artifact introduction versus a baseline conditioning module. Without this, it remains unclear whether the supervision reliably preserves appearance across states or introduces new inconsistencies, undermining the claim of visually coherent sequences.
  3. [§4 Experimental Setup] The reported superior performance lacks error bars, full baseline implementation details, and comprehensive ablations (including on the Transition Conditioning module). This makes it difficult to verify that the gains are attributable to the proposed components rather than to dataset curation or hyperparameter choices.
minor comments (2)
  1. [Introduction] Clarify the exact definition and scope of the invented EIVST task early in the introduction to avoid ambiguity with related video generation or state-transition benchmarks.
  2. [Throughout] Ensure all acronyms (EIVST, EgoIn, TransitionVLM) are expanded on first use and used consistently in figure captions and tables.
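
On the first major comment, one simple way to make a "hallucination rate" concrete on a held-out EIVST split is to count predicted sub-steps that cannot be matched to any annotated ground-truth step. The matching rule below (normalized token overlap with a fixed threshold) is an illustrative choice, not a metric the paper defines; comparing the score between the fine-tuned TransitionVLM and its base model would isolate the fine-tuning contribution the comment asks about.

```python
# Illustrative hallucination-rate metric for predicted transition steps: the
# fraction of predicted sub-steps with no sufficiently overlapping ground-truth
# step. The token-overlap rule and 0.5 threshold are arbitrary illustrative choices.
from typing import List, Set


def _tokens(step: str) -> Set[str]:
    return set(step.lower().split())


def hallucination_rate(predicted_steps: List[str],
                       ground_truth_steps: List[str],
                       min_overlap: float = 0.5) -> float:
    if not predicted_steps:
        return 0.0
    unmatched = 0
    for pred in predicted_steps:
        pred_tokens = _tokens(pred)
        best_overlap = max(
            (len(pred_tokens & _tokens(gt)) / max(len(pred_tokens), 1)
             for gt in ground_truth_steps),
            default=0.0,
        )
        if best_overlap < min_overlap:
            unmatched += 1
    return unmatched / len(predicted_steps)
```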

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that stronger isolation of component contributions and more rigorous experimental reporting would improve the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract, §4 Experiments] The central claim that fine-tuning TransitionVLM on the curated dataset reduces hallucination and enables accurate multi-step transition inference lacks any direct quantitative metric (e.g., step-wise factual consistency, hallucination rate, or a base-VLM comparison) on a held-out EIVST benchmark. End-to-end FID/LPIPS and user studies are reported instead, but these do not isolate the fine-tuning contribution and therefore do not fully support the framework's reliance on this component.

    Authors: We acknowledge that direct quantitative metrics isolating the TransitionVLM fine-tuning effect (such as step-wise factual consistency or hallucination rate versus the base VLM) would provide clearer support for this component. The current manuscript prioritizes end-to-end generation metrics and user studies to demonstrate practical utility of the full EgoIn pipeline. To address the gap, we will add a dedicated evaluation subsection with held-out benchmark comparisons, including hallucination rate and factual consistency metrics, in the revised manuscript. revision: yes

  2. Referee: [§3.3 Object-aware Auxiliary Supervision] No ablation isolates the effect of the proposed Object-aware Auxiliary Supervision on artifact introduction versus a baseline conditioning module. Without this, it remains unclear whether the supervision reliably preserves appearance across states or introduces new inconsistencies, undermining the claim of visually coherent sequences.

    Authors: We agree that an explicit ablation isolating the Object-aware Auxiliary Supervision is needed to confirm it improves consistency without introducing new artifacts. While the manuscript describes the module's design and its integration with the conditioning process, a standalone ablation table was not included. We will add this ablation in the revision, reporting quantitative comparisons (e.g., consistency metrics and visual artifact analysis) between the full model and a baseline without the auxiliary supervision. revision: yes

  3. Referee: [§4 Experimental Setup] The reported superior performance lacks error bars, full baseline implementation details, and comprehensive ablations (including on the Transition Conditioning module). This makes it difficult to verify that the gains are attributable to the proposed components rather than to dataset curation or hyperparameter choices.

    Authors: We will update the experimental section to include error bars on all reported metrics. We will also expand the implementation details for all baselines to support reproducibility. In addition, we will extend the ablation studies to explicitly cover the Transition Conditioning module and other key components, allowing clearer attribution of performance improvements to the proposed elements rather than external factors. revision: yes
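
On the commitment to error bars in point 3, one lightweight way to report them for per-video metrics such as LPIPS is a percentile bootstrap over the per-video scores. The resampling scheme below is an assumed reporting convention, not the authors' protocol.

```python
# Illustrative percentile-bootstrap confidence interval over per-video metric
# scores (e.g., per-video LPIPS). An assumed reporting scheme, not the paper's.
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(per_video_scores: List[float],
                 num_resamples: int = 10_000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for a (1 - alpha) percentile bootstrap CI."""
    rng = random.Random(seed)
    n = len(per_video_scores)
    resampled_means = sorted(
        mean(rng.choices(per_video_scores, k=n)) for _ in range(num_resamples)
    )
    lower = resampled_means[int((alpha / 2) * num_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * num_resamples) - 1]
    return mean(per_video_scores), lower, upper
```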

Circularity Check

0 steps flagged

No significant circularity; the fine-tuning and the proposed modules are justified empirically rather than defined in terms of their own inputs.

full rationale

The paper defines EIVST as a task, then describes an empirical pipeline: fine-tune TransitionVLM on a curated dataset to reduce hallucination, apply a Transition Conditioning module, and add Object-aware Auxiliary Supervision for appearance consistency. Performance is assessed via experiments on human-object and robot-object datasets using metrics and user studies. No equations or derivations are presented that reduce a claimed prediction to a fitted input by construction; no self-definitional loops, no renaming of known results as new unifications, and no load-bearing self-citations that substitute for independent justification. The central claims rest on the outcomes of fine-tuning and ablation-style experiments rather than tautological equivalence to the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the assumption that a curated dataset and fine-tuning can produce reliable transition inferences, plus standard generative model training assumptions. No explicit free parameters are named in the abstract, but model hyperparameters and dataset selection choices function as such.

free parameters (1)
  • TransitionVLM fine-tuning hyperparameters
    Chosen to adapt the VLM to the EIVST task and reduce hallucinations; central to the first stage of the framework.
axioms (2)
  • domain assumption Vision-language models can be fine-tuned to infer multi-step physical transitions from egocentric image pairs without introducing hallucinations
    Invoked in the design of the first stage of EgoIn.
  • domain assumption Object appearance can be preserved across generated frames via auxiliary supervision during training
    Basis for the Object-aware Auxiliary Supervision component.
invented entities (1)
  • EIVST task definition (no independent evidence)
    purpose: Formalizes the problem of generating instructed intermediate frames for object state transitions in egocentric video
    Newly introduced to frame the research problem

pith-pipeline@v0.9.0 · 5545 in / 1386 out tokens · 51934 ms · 2026-05-10T05:22:23.149284+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 26 canonical work pages · 7 internal anchors
