Recognition: unknown
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
Pith reviewed 2026-05-10 05:22 UTC · model grok-4.3
The pith
Given an initial state, a target state, and a brief action instruction, the EgoIn framework generates sequences of intermediate frames that depict object state transitions in egocentric videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition.
What carries the argument
The EgoIn framework, consisting of TransitionVLM for inferring multi-step transitions, a Transition Conditioning module for generating frames, and Object-aware Auxiliary Supervision for preserving object appearance.
If this is right
- Superior performance in generating semantically meaningful and visually coherent transformation sequences on both human-object and robot-object interaction datasets.
- Enables accurate multi-step transition inference by fine-tuning on curated data.
- Preserves object appearance across generated frames using auxiliary supervision.
Where Pith is reading between the lines
- This method could be tested on datasets involving more complex actions or multiple objects to see if the inference scales.
- The approach might inform better video prediction models for autonomous systems by incorporating egocentric reasoning.
- Applications in augmented reality could use such transitions to simulate object changes in real-time user views.
- Without the specific fine-tuning, similar tasks might suffer from more inconsistencies in object identity.
Load-bearing premise
Fine-tuning TransitionVLM on the curated dataset reduces hallucination for accurate multi-step inference and the Object-aware Auxiliary Supervision preserves object appearance without artifacts.
What would settle it
Evaluating the generated videos on these datasets and finding that they fail to outperform baselines in semantic meaning or visual coherence, or that they exhibit object appearance drift or mismatches with the instruction, would falsify the claim.
Original abstract
Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines the new Egocentric Instructed Visual State Transition (EIVST) task of generating intermediate ego-centric video frames that depict object state changes between given initial and target states under a brief action instruction. It introduces the EgoIn framework, which first uses a TransitionVLM fine-tuned on a curated dataset to infer multi-step transition processes (addressing hallucination), then applies a Transition Conditioning module to generate frames while using Object-aware Auxiliary Supervision to maintain object appearance consistency. Extensive experiments on human-object and robot-object interaction datasets are claimed to demonstrate superior performance over baselines in producing semantically meaningful and visually coherent sequences.
Significance. If the empirical claims hold with stronger validation, the work would introduce a practically relevant task and modular approach for modeling physical transformations in egocentric video, potentially benefiting robotics, action understanding, and generative modeling of state changes. The combination of VLM-based reasoning with conditioning and auxiliary supervision targets concrete challenges in consistency and multi-step inference.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The central claim that fine-tuning TransitionVLM on the curated dataset reduces hallucination and enables accurate multi-step transition inference lacks any direct quantitative metric (e.g., step-wise factual consistency, hallucination rate, or base-VLM comparison) on a held-out EIVST benchmark. End-to-end FID/LPIPS and user studies are reported instead, but these do not isolate the fine-tuning contribution and therefore do not fully support the framework's reliance on this component.
- [§3.3] §3.3 (Object-aware Auxiliary Supervision): No ablation isolates the effect of the proposed Object-aware Auxiliary Supervision on artifact introduction versus a baseline conditioning module. Without this, it remains unclear whether the supervision reliably preserves appearance across states or introduces new inconsistencies, undermining the claim of visually coherent sequences.
- [§4] §4 (Experimental Setup): The reported superior performance lacks error bars, full baseline implementation details, and comprehensive ablations (including on the Transition Conditioning module). This makes it difficult to verify that the gains are attributable to the proposed components rather than dataset curation or hyperparameter choices.
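The error bars requested in the last major comment are cheap to obtain once per-video metric scores are available. A minimal percentile-bootstrap sketch (illustrative only; the paper's evaluation protocol is not known, and the function name and defaults are hypothetical):

```python
import random

def bootstrap_ci(scores, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean metric score.

    `scores` is a list of per-video metric values (e.g. LPIPS); the
    resulting interval quantifies the variability that plain point
    estimates hide. Deterministic given `seed`.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean each time.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Reporting the interval alongside each mean would address the comment directly without rerunning training.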
minor comments (2)
- [Introduction] Clarify the exact definition and scope of the invented EIVST task early in the introduction to avoid ambiguity with related video generation or state-transition benchmarks.
- [Throughout] Ensure all acronyms (EIVST, EgoIn, TransitionVLM) are expanded on first use and used consistently in figure captions and tables.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that stronger isolation of component contributions and more rigorous experimental reporting would improve the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that fine-tuning TransitionVLM on the curated dataset reduces hallucination and enables accurate multi-step transition inference lacks any direct quantitative metric (e.g., step-wise factual consistency, hallucination rate, or base-VLM comparison) on a held-out EIVST benchmark. End-to-end FID/LPIPS and user studies are reported instead, but these do not isolate the fine-tuning contribution and therefore do not fully support the framework's reliance on this component.
Authors: We acknowledge that direct quantitative metrics isolating the TransitionVLM fine-tuning effect (such as step-wise factual consistency or hallucination rate versus the base VLM) would provide clearer support for this component. The current manuscript prioritizes end-to-end generation metrics and user studies to demonstrate practical utility of the full EgoIn pipeline. To address the gap, we will add a dedicated evaluation subsection with held-out benchmark comparisons, including hallucination rate and factual consistency metrics, in the revised manuscript. revision: yes
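The promised hallucination-rate metric could take the following shape: count a predicted transition step as hallucinated if no reference step supports it. This sketch uses token-overlap (Jaccard) similarity as a stand-in matcher; a real evaluation would likely use embedding similarity or an LLM judge, and all names and thresholds here are illustrative, not from the paper.

```python
def step_hallucination_rate(predicted_steps, reference_steps, threshold=0.5):
    """Fraction of predicted transition steps unsupported by any reference step."""
    def jaccard(a, b):
        # Simple token-overlap similarity between two step descriptions.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    unsupported = [
        step for step in predicted_steps
        if all(jaccard(step, ref) < threshold for ref in reference_steps)
    ]
    return len(unsupported) / len(predicted_steps) if predicted_steps else 0.0
```

Comparing this rate for the fine-tuned TransitionVLM against its base model on a held-out split would isolate the fine-tuning contribution.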
Referee: [§3.3] §3.3 (Object-aware Auxiliary Supervision): No ablation isolates the effect of the proposed Object-aware Auxiliary Supervision on artifact introduction versus a baseline conditioning module. Without this, it remains unclear whether the supervision reliably preserves appearance across states or introduces new inconsistencies, undermining the claim of visually coherent sequences.
Authors: We agree that an explicit ablation isolating the Object-aware Auxiliary Supervision is needed to confirm it improves consistency without introducing new artifacts. While the manuscript describes the module's design and its integration with the conditioning process, a standalone ablation table was not included. We will add this ablation in the revision, reporting quantitative comparisons (e.g., consistency metrics and visual artifact analysis) between the full model and a baseline without the auxiliary supervision. revision: yes
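One plausible form for the promised consistency metric: embed the segmented object region of each generated frame and measure drift against the first frame. The feature extractor and setup are assumptions, not the paper's stated method; this is a minimal sketch of the scoring step only.

```python
import math

def appearance_consistency(frame_embeddings):
    """Mean cosine similarity between each frame's object embedding and the
    first frame's embedding; values near 1.0 indicate stable appearance.

    `frame_embeddings` would come from a feature extractor applied to the
    segmented object region in each generated frame (hypothetical setup).
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    ref = frame_embeddings[0]
    sims = [cosine(ref, emb) for emb in frame_embeddings[1:]]
    return sum(sims) / len(sims) if sims else 1.0
```

Reporting this score with and without the auxiliary supervision would give the standalone ablation the referee asks for.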
Referee: [§4] §4 (Experimental Setup): The reported superior performance lacks error bars, full baseline implementation details, and comprehensive ablations (including on the Transition Conditioning module). This makes it difficult to verify that the gains are attributable to the proposed components rather than dataset curation or hyperparameter choices.
Authors: We will update the experimental section to include error bars on all reported metrics. We will also expand the implementation details for all baselines to support reproducibility. In addition, we will extend the ablation studies to explicitly cover the Transition Conditioning module and other key components, allowing clearer attribution of performance improvements to the proposed elements rather than external factors. revision: yes
Circularity Check
No significant circularity; the empirical fine-tuning and proposed modules are justified independently of the inputs they are evaluated on.
Full rationale
The paper defines EIVST as a task, then describes an empirical pipeline: fine-tune TransitionVLM on a curated dataset to reduce hallucination, apply a Transition Conditioning module, and add Object-aware Auxiliary Supervision for appearance consistency. Performance is assessed via experiments on human-object and robot-object datasets using metrics and user studies. No equations or derivations are presented that reduce a claimed prediction to a fitted input by construction; no self-definitional loops, no renaming of known results as new unifications, and no load-bearing self-citations that substitute for independent justification. The central claims rest on the outcomes of fine-tuning and ablation-style experiments rather than tautological equivalence to the inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- TransitionVLM fine-tuning hyperparameters
axioms (2)
- domain assumption: Vision-language models can be fine-tuned to infer multi-step physical transitions from egocentric image pairs without introducing hallucinations.
- domain assumption: Object appearance can be preserved across generated frames via auxiliary supervision during training.
invented entities (1)
- EIVST task definition (no independent evidence)
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. CoRR, abs/2502.13923, 2025.
- [3] Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, and Bin Xiao. Mobius: Text to seamless looping video generation via latent shift. In SIGGRAPH, 2025.
- [4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. CoRR, abs/2311.15127, 2023.
- [5] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. CoRR, abs/2406.04325.
- [6] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. SEINE: Short-to-long video diffusion model for generative transition and prediction. In ICLR.
- [7] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.
- [8] Duolikun Danier, Fan Zhang, and David Bull. LDMVFI: Video frame interpolation with latent diffusion models. In AAAI.
- [9] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting generalization of robotic skills with cross-domain datasets. CoRR, abs/2109.13396, 2021.
- [10] Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J. Black, and Xuaner Zhang. Explorative inbetweening of time and space. In ECCV, 2025.
- [11] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- [12] Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, and Yang Gao. Seer: Language instructed video prediction with latent diffusion models. In ICLR, 2024.
- [13] Zujin Guo, Size Wu, Zhongang Cai, Wei Li, and Chen Change Loy. Controllable human-centric keyframe interpolation with generative prior. CoRR, abs/2506.03119.
- [14] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. CoRR, abs/2211.13221, 2022.
- [15] Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, and Sibei Yang. Free-Bloom: Zero-shot text-to-video generator with LLM director and LDM animator. In NIPS, 2024.
- [16] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. CoRR, abs/2411.13503, 2024.
- [17] Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, and Emad Barsoum. AMD-Hummingbird: Towards an efficient text-to-video model. CoRR, abs/2503.18559, 2025.
- [18] Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. In CVPR, 2024.
- [19] Xu Jia, Takashi Isobe, Xiaomin Li, Qinghe Wang, Jing Mu, Dong Zhou, Huchuan Lu, Lu Tian, Ashish Sirasao, Emad Barsoum, et al. Customizing text-to-image generation with inverted interaction. In ACM Multimedia, 2024.
- [20] Mia Kan, Yilin Liu, and Niloy Mitra. SAGE: Structure-aware generative video transitions between diverse clips. CoRR, abs/2510.24667, 2025.
- [21] Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, and Miao Liu. LEGO: Learning egocentric action frame generation via visual instruction tuning. In ECCV, 2024.
- [22] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML.
- [23] Xiaomin Li, Tala Wang, Zichen Zhong, Ying Zhang, Zirui Zheng, Takashi Isobe, Dezhuang Li, Huchuan Lu, You He, and Xu Jia. Seek-and-Solve: Benchmarking MLLMs for visual clue-driven reasoning in daily scenarios. CoRR, abs/2604.14041, 2026.
- [24] Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, and Anna Khoreva. VSTAR: Generative temporal nursing for longer dynamic video synthesis. CoRR, abs/2403.13501.
- [25] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. CoRR, abs/2305.13655, 2023.
- [26] Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning. CoRR, abs/2309.15091, 2023.
- [27] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NIPS, 2023.
- [28] Tomohiro Motoda, Masaki Murooka, Ryoichi Nakajo, Muhammad A. Muttaqien, Koshi Makihara, Hanbit Oh, Keisuke Shirai, Floris Erich, Ryo Hanai, and Yukiyasu Domae. AIST bimanual manipulation, 2025.
- [29] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. CoRR, abs/2408.00714.
- [30] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. FILM: Frame interpolation for large motion. In ECCV, 2022.
- [31] Liao Shen, Tianqi Liu, Huiqiang Sun, Xinyi Ye, Baopu Li, Jianming Zhang, and Zhiguo Cao. DreamMover: Leveraging the prior of diffusion models for image interpolation with large motion. CoRR, abs/2409.09605, 2024.
- [32] Modi Shi, Yuxiang Lu, Huijie Wang, Chengen Xie, and Qingwen Bu. Introducing AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. https://opendrivelab.com/AgiBot-World/, 2025. Blog post.
- [33] Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, and Josef Sivic. GenHowTo: Learning to generate actions and state transformations from instructional videos. In CVPR.
- [34] Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, and Josef Sivic. ShowHowTo: Generating scene-conditioned step-by-step visual instructions. In CVPR.
- [35] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In ICLRW, 2019.
- [36] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. MCVD: Masked conditional video diffusion for prediction, generation, and interpolation. NIPS, 35, 2022.
- [37] Clinton J. Wang and Polina Golland. Interpolating between images with diffusion models. CoRR, abs/2307.12560, 2023.
- [38] Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, and Steven M. Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation. CoRR, abs/2408.15239, 2024.
- [39] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. LaVie: High-quality video generation with cascaded latent diffusion models. CoRR, abs/2309.15103, 2023.
- [40] Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. In RSS, 2025.
- [41] Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, and Trevor Darrell. Self-correcting LLM-controlled diffusion models. CoRR, abs/2311.16090, 2023.
- [42] Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, et al. Pandora: Towards general world model with natural language actions and video states. CoRR, abs/2406.09455, 2024.
- [43] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. DynamiCrafter: Animating open-domain images with video diffusion priors. In ECCV.
- [44] Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, and Yu-Gang Jiang. AID: Adapting image2video diffusion models for instruction-guided video prediction. CoRR, abs/2406.06465.
- [45] Serin Yang, Taesung Kwon, and Jong Chul Ye. ViBiDSampler: Enhancing video interpolation using bidirectional diffusion sampler. In ICLR, 2025.
- [46] Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In CVPR, 2025.
- [47] Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, and Hang Li. Make Pixels Dance: High-dynamic video generation. In CVPR, 2024.
- [48] Guozhen Zhang, Yuhan Zhu, Yutao Cui, Xiaotong Zhao, Kai Ma, and Limin Wang. Motion-aware generative frame interpolation. CoRR, abs/2501.03699, 2025.
- [49] Kaiwen Zhang, Yifan Zhou, Xudong Xu, Bo Dai, and Xingang Pan. DiffMorpher: Unleashing the capability of diffusion models for image morphing. In CVPR, 2024.
- [50] Rui Zhang, Yaosen Chen, Yuegen Liu, Wei Wang, Xuming Wen, and Hongxia Wang. TVG: A training-free transition video generation method with diffusion models. CoRR, abs/2408.13413, 2024.
- [51] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models. CoRR, abs/2311.04145, 2023.
- [52] Ziran Zhang, Xiaohui Li, Yihao Liu, Yujin Wang, Yueting Chen, Tianfan Xue, and Shi Guo. EGVD: Event-guided video diffusion model for physically realistic large-motion frame interpolation. CoRR, abs/2503.20268, 2025.
- [53] Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, and Wangmeng Zuo. Generative inbetweening through frame-wise conditions-driven video generation. In CVPR, 2025.