VideoCoF: Unified Video Editing with Temporal Reasoner
Pith reviewed 2026-05-17 00:05 UTC · model grok-4.3
The pith
Forcing a video diffusion model to predict edit-region latents first produces precise mask-free video editing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoCoF enforces a see-reason-then-edit procedure inside a video diffusion model by compelling it to predict reasoning tokens (edit-region latents) before generating target video tokens, thereby achieving precise instruction-to-region alignment without user-provided masks while supporting unified editing across tasks.
What carries the argument
The Chain-of-Frames procedure that inserts a reasoning step to predict edit-region latents prior to target video token generation.
If this is right
- Video editing becomes unified across tasks without separate models or user masks.
- State-of-the-art results appear on VideoCoF-Bench after training on only 50k video pairs.
- Motion alignment and length extrapolation improve through RoPE applied to the reasoning tokens.
- The same explicit reasoning step can be added to other diffusion pipelines for finer spatial control.
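The RoPE bullet above can be made concrete with a small sketch. The function below is a hypothetical illustration (the index-sharing scheme and all names are assumptions, not VideoCoF's published implementation): temporal RoPE indices are assigned over a [source | reasoning | target] token sequence so that each reasoning token reuses the temporal index of the target frame it guides, which is one way alignment and length extrapolation could fall out of the positional scheme.

```python
from typing import List

def temporal_positions(n_src: int, n_reason: int, n_tgt: int) -> List[int]:
    """Assign a temporal RoPE index to every frame token in a
    [source | reasoning | target] sequence.

    Hypothetical scheme: reasoning frames share the temporal indices of
    the leading target frames they localize, so a reasoning token's
    rotary phase matches the frame it describes. A sketch of the
    alignment idea only, not the paper's exact layout.
    """
    src = list(range(n_src))                 # source clip: 0 .. n_src-1
    tgt = list(range(n_src, n_src + n_tgt))  # target clip continues the timeline
    reason = tgt[:n_reason]                  # reasoning tokens mirror leading target indices
    return src + reason + tgt

# Extending n_tgt at inference only appends new indices; all existing
# source/reasoning indices are unchanged, which is what would make
# length extrapolation beyond the training duration plausible.
pos = temporal_positions(n_src=4, n_reason=2, n_tgt=4)
print(pos)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```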
Where Pith is reading between the lines
- The reasoning-token stage could be reused in image or 3D editing diffusion models to reduce reliance on masks.
- Training cost drops further if the reasoning tokens are distilled from a smaller teacher model rather than learned from scratch.
- Iterative editing becomes feasible by feeding the first round's reasoning tokens back as additional context.
Load-bearing premise
Predicting reasoning tokens for edit regions before target video tokens will produce accurate instruction-to-region alignment without masks or extra supervision.
What would settle it
Running the model on videos whose instructions point to visually ambiguous regions and measuring whether edit localization matches ground-truth region accuracy would directly test the claimed alignment gain.
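One concrete form of that probe, sketched below under assumed inputs (per-frame binary edit masks, predicted versus ground truth), is mean mask IoU over a clip. This is a sketch of the suggested test, not a metric reported in the paper.

```python
import numpy as np

def edit_localization_iou(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
    """Mean per-frame IoU between predicted edit regions and ground truth.

    pred_masks, gt_masks: boolean arrays of shape (frames, H, W).
    """
    assert pred_masks.shape == gt_masks.shape
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(1.0 if union == 0 else inter / union)  # empty frames count as perfect
    return float(np.mean(ious))

# Toy check: a prediction covering half of a 2x2 ground-truth region.
gt = np.zeros((1, 4, 4), dtype=bool); gt[0, 1:3, 1:3] = True
pred = np.zeros_like(gt); pred[0, 1:3, 1] = True
print(edit_localization_iou(pred, gt))  # 0.5
```

Running this over instructions that point at visually ambiguous regions, and comparing against a mask-conditioned expert baseline, would directly quantify the claimed alignment gain.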
Original abstract
Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VideoCoF, a Chain-of-Frames (CoF) method for unified video editing in diffusion models. It enforces a 'see, reason, then edit' procedure by requiring the model to first predict reasoning tokens (edit-region latents) before target video tokens, eliminating the need for user-provided masks while achieving precise instruction-to-region alignment. The approach uses only 50k video pairs for training, introduces a RoPE alignment strategy for motion consistency and length extrapolation, and reports state-of-the-art results on the introduced VideoCoF-Bench.
Significance. If the reasoning-token mechanism demonstrably enforces spatial localization from instructions alone without auxiliary supervision or masks, the work would meaningfully bridge the precision of mask-based expert models with the unification of in-context learning approaches, while highlighting data-efficient training for temporal video tasks.
major comments (2)
- [Method / Training Procedure] The central claim that predicting reasoning tokens (edit-region latents) before target tokens produces precise instruction-to-region alignment without masks or additional supervision is not supported by any described auxiliary loss, reconstruction objective, or contrastive term in the method. Without such a mechanism, the tokens risk functioning as generic conditioning rather than explicit spatial reasoning, undermining the 'see, reason, then edit' unification benefit.
- [Experiments / Ablations] No details are provided on how the reasoning tokens are trained or supervised (e.g., loss formulation, initialization, or gradient flow from the edit-region latents), nor are ablation studies shown isolating their contribution versus standard diffusion conditioning. This leaves the SOTA claim on VideoCoF-Bench unverifiable from the reported evidence.
minor comments (2)
- [Introduction / Experiments] The abstract and method description introduce VideoCoF-Bench without clarifying its construction, diversity, or comparison to existing benchmarks (e.g., how edit instructions and ground-truth regions are sourced).
- [Method] Notation for 'reasoning tokens' and 'edit-region latents' is used interchangeably without an explicit definition or diagram showing their dimensionality and integration into the diffusion U-Net.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on VideoCoF. The comments highlight the need for greater clarity on how the reasoning tokens enforce spatial alignment. We address each point below with explanations drawn directly from the method and commit to revisions that add the requested details without altering the core claims.
Point-by-point responses
- Referee: [Method / Training Procedure] The central claim that predicting reasoning tokens (edit-region latents) before target tokens produces precise instruction-to-region alignment without masks or additional supervision is not supported by any described auxiliary loss, reconstruction objective, or contrastive term in the method. Without such a mechanism, the tokens risk functioning as generic conditioning rather than explicit spatial reasoning, undermining the 'see, reason, then edit' unification benefit.
Authors: The 'see, reason, then edit' procedure is enforced architecturally by structuring the token sequence so that edit-region latents must be predicted first, before target video tokens, within the diffusion denoising steps. This ordering, combined with the RoPE alignment strategy that conditions motion on the reasoning tokens, encourages the latents to capture instruction-specific regions. No separate auxiliary loss is used; supervision occurs via the standard diffusion denoising objective applied to the full sequence, with gradients flowing from target-token prediction back through the reasoning tokens. We acknowledge that the manuscript does not explicitly detail this gradient flow or sequence construction and will add a dedicated subsection in the revised Method section describing the token ordering, loss application, and how it produces instruction-to-region alignment without masks. Revision: yes.
- Referee: [Experiments / Ablations] No details are provided on how the reasoning tokens are trained or supervised (e.g., loss formulation, initialization, or gradient flow from the edit-region latents), nor are ablation studies shown isolating their contribution versus standard diffusion conditioning. This leaves the SOTA claim on VideoCoF-Bench unverifiable from the reported evidence.
Authors: The reasoning tokens are trained jointly with the target tokens under the standard DDPM denoising loss; they are initialized from a lightweight encoder applied to the instruction and noisy latents. We will revise the Experiments section to explicitly state the loss formulation, initialization procedure, and gradient flow. We will also add ablation studies that compare the full VideoCoF model against a baseline using only standard diffusion conditioning (i.e., removing the explicit reasoning-token prediction step) while keeping all other components fixed, reporting quantitative results on VideoCoF-Bench to isolate the contribution. These additions will make the SOTA claims directly verifiable. Revision: yes.
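The first response's claim, that supervision comes only from a single denoising objective over the concatenated [reasoning | target] sequence, can be sketched in miniature. The toy below is an assumption-laden illustration, not VideoCoF's actual objective: a token-mixing matrix `A` stands in for self-attention, so target-position predictions depend on reasoning-position latents, and one shared MSE supervises both.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(noisy_seq: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy joint denoiser: A mixes tokens (a stand-in for self-attention),
    W mixes channels. Through A, every output token sees every input token."""
    return A @ noisy_seq @ W

def cof_loss(reason_lat, target_lat, A, W, noise):
    """Single DDPM-style MSE over the concatenated [reasoning | target]
    sequence; no auxiliary mask loss. The rebuttal's claim in miniature."""
    clean = np.concatenate([reason_lat, target_lat], axis=0)  # (T, d)
    pred_noise = denoise_step(clean + noise, A, W)
    return float(np.mean((pred_noise - noise) ** 2))

d, n_reason, n_target = 8, 2, 4
T = n_reason + n_target
A = rng.normal(size=(T, T)) / T
W = rng.normal(size=(d, d)) / d
reason = rng.normal(size=(n_reason, d))
target = rng.normal(size=(n_target, d))
noise = rng.normal(size=(T, d))

loss = cof_loss(reason, target, A, W, noise)  # one scalar for the whole sequence

# Because A mixes tokens, perturbing the reasoning latents changes the
# predictions at target positions: gradients from target error reach
# the reasoning tokens through the joint predictor.
pred_a = denoise_step(np.concatenate([reason, target]) + noise, A, W)
pred_b = denoise_step(np.concatenate([reason + 1.0, target]) + noise, A, W)
print(np.allclose(pred_a[n_reason:], pred_b[n_reason:]))  # False
```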
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents VideoCoF as a procedural modification to video diffusion models, where reasoning tokens (edit-region latents) are predicted before target video tokens to enforce a 'see, reason, then edit' sequence. This is described as an explicit ordering constraint on standard diffusion conditioning, with RoPE alignment added for motion consistency. No equations, derivations, or fitted parameters are shown that reduce the claimed instruction-to-region alignment or SOTA results on VideoCoF-Bench to tautological inputs or self-citations. The 50k video pair training cost and mask-free unification are presented as empirical outcomes rather than mathematical necessities derived from prior self-work. The approach builds on external diffusion and Chain-of-Thought concepts without load-bearing self-citation chains or ansatzes that collapse the central claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Video diffusion models can be extended to first predict auxiliary reasoning tokens that encode edit regions.
invented entities (1)
- Reasoning tokens (edit-region latents): no independent evidence.
Forward citations
Cited by 3 Pith papers
- MiVE: Multiscale Vision-language Features for Reference-guided Video Editing. MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
- LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing. LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.
- Measuring AI Reasoning: A Guide for Researchers. Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
Reference graph
Works this paper leans on
- [1] Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, and Qiang Xu. VideoPainter: Any-length video inpainting and editing with plug-and-play context control. arXiv preprint arXiv:2503.05639, 2025.
- [2] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- [4] Lan Chen, Yuchao Gu, and Qi Mao. UniVid: Unifying vision tasks with pre-trained video generation models. arXiv preprint arXiv:2509.21760, 2025.
- [5] Jiaxin Cheng, Tianjun Xiao, and Tong He. Consistent video-to-video transfer using synthetic dataset. arXiv preprint arXiv:2311.00213, 2023.
- [6] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: Optical flow-guided attention for consistent text-to-video editing. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.
- [7] Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, and Kevin Tang. VideoSwap: Customized video subject swapping with interactive semantic point correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7621–7630, 2024.
- [8] Nicholas Guttenberg. Diffusion with offset noise. https://www.crosslabs.org/blog/diffusion-with-offset-noise, 2023.
- [9] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context LoRA for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024.
- [10] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. VACE: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025.
- [11] Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, et al. EditVerse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360, 2025.
- [12] Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Qiang Xu. FullDiT: Multi-task video generative foundation model with full attention. arXiv preprint arXiv:2503.19907, 2025.
- [13] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [14] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [15] Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation, 2023.
- [16] Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, and Aleksandr Gordeev. NoHumansRequired: Autonomous high-quality image editing triplet mining. arXiv preprint arXiv:2507.14119, 2025.
- [17] Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, and Mengyu Wang. FiVE: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models. arXiv preprint arXiv:2503.13684, 2025.
- [18] Xiaowen Li, Haolan Xue, Peiran Ren, and Liefeng Bo. DiffuEraser: A diffusion model for video inpainting. arXiv preprint arXiv:2501.10018, 2025.
- [19] Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin. In-context learning with unpaired clips for instruction-based video editing. arXiv preprint arXiv:2510.14648, 2025.
- [20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [21] Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, and Qian He. InstructX: Towards unified visual editing with MLLM guidance. arXiv preprint arXiv:2510.08485, 2025.
- [22]
- [23] Pexels. Pexels: Free stock photos, royalty free stock images & videos. https://www.pexels.com/, 2025. Accessed: 2025-11-06.
- [24] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023.
- [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [26] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling open-world models for diverse visual tasks.
- [27] Chaehun Shin, Jooyoung Choi, Heeseung Kim, and Sungroh Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator. 2024.
- [28] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
- [29] Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, and Dacheng Tao. Diffusion model-based video editing: A survey. arXiv preprint arXiv:2407.07111, 2024.
- [30] DecartAI Team. Lucy Edit: Open-weight text-guided video editing. 2025.
- [31] Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. VideoAnydoor: High-fidelity video object insertion with precise motion control. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–11, 2025.
- [32] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [33] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [34] Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. UniVideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377, 2025.
- [35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [36] Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025.
- [37] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In International Conference on Computer Vision (ICCV).
- [38] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
- [39] Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuanxi Sun, et al. CVPR 2023 text guided video editing competition. arXiv preprint arXiv:2310.16003, 2023.
- [40] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.
- [41] Xiangpeng Yang, Linchao Zhu, Hehe Fan, and Yi Yang. VideoGrain: Modulating space-time attention for multi-grained video editing. In The Thirteenth International Conference on Learning Representations, 2025.
- [42] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [43] Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, and Wenhan Luo. UNIC: Unified in-context video editing. arXiv preprint arXiv:2506.04216, 2025.
- [44] Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, and Wenhan Luo. StyleMaster: Stylize your video with artistic generation and translation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2630–2640, 2025.
- [45] Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. VEGGIE: Instructional editing and reasoning video concepts with grounded generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15147–15158, 2025.
- [46] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023.
- [47] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [49] Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-Context Edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690, 2025.
- [50] Bojia Zi, Weixuan Peng, Xianbiao Qi, Jianan Wang, Shihao Zhao, Rong Xiao, and Kam-Fai Wong. MiniMax-Remover: Taming bad noise helps video object removal. arXiv preprint arXiv:2505.24873, 2025.
- [51] Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, and Kam-Fai Wong. Señorita-2M: A high-quality instruction-based dataset for general video editing by video specialists. arXiv preprint arXiv:2502.06734, 2025.
- [52] Full Comparison (supplementary excerpt): As shown in Tab. 7, we provide a detailed breakdown of the results across four distinct tasks: Object Removal, Object Addition, Object Swap, and Local Style Transfer. Our VideoCoF consistently achieves the highest scores in instruction following and success ratio across all tasks, demonstrating superior capability in understanding and execu...
- [53] More Ablation Studies (supplementary excerpt): In this section, we validate key design choices of VideoCoF: the length of reasoning frames and the dispatch prompt. Ablation on Reasoning Frames: Tab. 4 investigates the optimal number of reasoning frames (F) for spatial guidance. Considering the VideoVAE temporal compression formula L = (F−1)//4 + 1, frames 1∼4 map to a single ...
- [54] Implementation Details (supplementary excerpt): Training Dataset: To equip our model with robust instruction-following capabilities, we constructed a unified chain-of-frames video editing dataset comprising 50k video pairs. As detailed in Table 6, the dataset is strategically balanced across four core editing tasks: object addition, removal, swapping, and local stylization. ...
- [55] Metrics (supplementary excerpt): GPT Evaluation: To comprehensively assess the editing performance, we employ the state-of-the-art Vision-Language Model, GPT-4o [22], serving as an automated judge. Following the protocol of InstructX [21], we sample three frames from each video pair and utilize structured prompts in [21] to evaluate the results across the following dimensio...
- [56] Discussion (supplementary excerpt): Scaling up Chain-of-Frames: Currently, VideoCoF achieves SOTA performance in instruction following and success rate using only 50k source-reasoning-editing pairs. This demonstrates remarkable data efficiency compared to existing large-scale baselines. For instance, EditVerse [11] utilizes 4M videos and 8M images, ICVE [19] leverages 2M pre-tra...
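The VideoVAE temporal compression arithmetic quoted in the ablation excerpt above, L = (F−1)//4 + 1, is easy to sanity-check in code; the function below is a direct transcription of the stated formula, nothing more.

```python
def latent_length(frames: int) -> int:
    """Latent temporal length for F input frames: L = (F - 1) // 4 + 1."""
    return (frames - 1) // 4 + 1

# Frames 1..4 all map to a single latent, matching the ablation's note
# that reasoning-frame counts in that range cost the same latent budget.
print([latent_length(f) for f in (1, 2, 3, 4, 5, 8, 9)])  # [1, 1, 1, 1, 2, 2, 3]
```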