Geometry-Instructed Video Editing

Chirui Chang; Haoru Tan; Jianmin Bao; Pengfei Wan; Shizhen Zhao; Xiaojuan Qi; Xiaoyang Lyu; Xin Tao; Yi-Hua Huang; Yikang Ding

arxiv: 2606.24225 · v1 · pith:O2ISWF5Unew · submitted 2026-06-23 · 💻 cs.CV

Pith reviewed 2026-06-26 00:41 UTC · model grok-4.3

keywords video editing · geometric edits · object manipulation · depth box · orientation box · synthetic data · generative models · computer vision

The pith

GIVE uses depth-box and orientation-box streams to specify 3D object state changes for reliable video geometric edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GIVE, a framework for object-level geometric edits in videos: translating, rotating, scaling, duplicating, or removing objects. It tackles the difficulty of specifying unambiguous 3D state changes across time and viewpoints while consistently updating geometry-dependent secondary effects such as shadows and reflections. Two geometry streams provide the specification: a depth-box for coarse 3D placement and extent, and an orientation-box for an appearance-agnostic orientation cue. A graphics-engine pipeline renders before/after pairs as training data. The reported results cover all edit types and show promising transfer to real videos.

Core claim

GIVE represents edits through a unified object-state formulation. Two video-aligned geometry streams describe the target object before and after editing: a depth-box encoding coarse 3D placement and extent, and an orientation-box providing an appearance-agnostic orientation cue. A scalable graphics-engine pipeline executes object-level edit programs and renders controlled before/after pairs to provide paired supervision.
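To make the pre/post object-state idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's encoding: `ObjectState`, its fields, and the operator functions are hypothetical names, and orientation is reduced to a single yaw angle for brevity. In GIVE the states are rendered as video-aligned depth-box and orientation-box streams rather than stored as numbers.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectState:
    """Hypothetical per-frame object state; not the paper's actual encoding."""
    center: Vec3    # coarse 3D placement (the role the depth-box plays)
    extent: Vec3    # coarse 3D size (also carried by the depth-box)
    yaw_deg: float  # orientation cue, simplified here to one angle


def translate(s: ObjectState, delta: Vec3) -> ObjectState:
    cx, cy, cz = s.center
    dx, dy, dz = delta
    return replace(s, center=(cx + dx, cy + dy, cz + dz))


def rotate(s: ObjectState, deg: float) -> ObjectState:
    # Wrap the angle into [0, 360).
    return replace(s, yaw_deg=(s.yaw_deg + deg) % 360.0)


def scale(s: ObjectState, factor: float) -> ObjectState:
    ex, ey, ez = s.extent
    return replace(s, extent=(ex * factor, ey * factor, ez * factor))


def duplicate(s: ObjectState, offset: Vec3) -> Tuple[ObjectState, ObjectState]:
    # The post-edit description holds the original plus a shifted copy.
    return (s, translate(s, offset))


def remove(s: ObjectState) -> Optional[ObjectState]:
    # Removal is modeled as "no post-edit state".
    return None


# One edit is a (pre, post) pair of states for the same object.
pre = ObjectState(center=(0.0, 0.0, 5.0), extent=(1.0, 2.0, 1.0), yaw_deg=0.0)
post = rotate(translate(pre, (1.0, 0.0, 0.0)), 90.0)
```

The point of the sketch is that every operator reduces to the same thing, a pair of states, which is what lets a single model consume all of them through one interface.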

What carries the argument

A unified object-state formulation that uses depth-box and orientation-box geometry streams to specify the pre-edit and post-edit states.

Load-bearing premise

The graphics-engine pipeline generates paired videos that isolate the intended geometric edit and keep secondary effects consistent, and that this synthetic data transfers to real videos.
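One way to read "isolating the intended geometric edit" is that the before and after renders share every sampled scene parameter and differ only in the edit applied to the target object. The Python sketch below illustrates that reading. All names and value lists (`SCENES`, `ACTORS`, `CAMERA_FAMILIES`, `sample_pair_spec`) are hypothetical placeholders, and the rendering step itself is left out; the paper's actual sampling scheme is not described in the text available here.

```python
import random
from typing import Dict, Tuple

# Hypothetical asset and parameter pools, for illustration only.
SCENES = ["scene_a", "scene_b", "scene_c"]
ACTORS = ["actor_1", "actor_2"]
CAMERA_FAMILIES = ["static", "orbit", "dolly"]
OPERATORS = ["translate", "rotate", "scale", "duplicate", "remove"]


def sample_pair_spec(rng: random.Random) -> Tuple[Dict, Dict]:
    """Sample one before/after render specification.

    Everything except the edit is sampled once and shared, so any
    difference between the two renders is attributable to the edit.
    """
    shared = {
        "scene": rng.choice(SCENES),
        "actor": rng.choice(ACTORS),
        "camera_family": rng.choice(CAMERA_FAMILIES),
        "camera_speed": rng.uniform(0.1, 1.0),
        "light_intensity": rng.uniform(0.5, 2.0),
    }
    operator = rng.choice(OPERATORS)
    before = dict(shared, edit=None)
    after = dict(shared, edit=operator)
    return before, after


rng = random.Random(0)  # seeded so the sampled pairs are reproducible
before, after = sample_pair_spec(rng)
```

A renderer would then consume `before` and `after` and let the engine recompute shadows and reflections for the edited object, which is where the paired supervision for secondary effects would come from.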

What would settle it

Editing a real video with a known object movement such as a ball rolling and casting a shadow, then checking if the output matches the expected new position, orientation, and shadow placement without introducing artifacts.
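The position part of such a check could be scored with a per-frame overlap between the expected and observed object boxes. The sketch below is one assumed way to do it in Python, using 2D axis-aligned boxes and intersection-over-union. The function names are hypothetical, the observed boxes are assumed to come from some external tracker or detector, and the sketch does not cover orientation or shadow placement.

```python
from typing import Sequence, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(expected: Sequence[Box], observed: Sequence[Box]) -> float:
    """Average per-frame IoU between expected and observed boxes."""
    if not expected or len(expected) != len(observed):
        raise ValueError("need two non-empty box sequences of equal length")
    total = sum(box_iou(e, o) for e, o in zip(expected, observed))
    return total / len(expected)
```

A mean IoU near 1 would indicate the edited object landed where the instruction asked; a low value would flag a mis-executed edit before any judgment about shadows or artifacts is made.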

Figures

Figures reproduced from arXiv: 2606.24225 by Chirui Chang, Haoru Tan, Jianmin Bao, Pengfei Wan, Shizhen Zhao, Xiaojuan Qi, Xiaoyang Lyu, Xin Tao, Yi-Hua Huang, Yikang Ding.

**Figure 1.** GIVE performs object-level geometric video editing using unified pre and post box-based geometry instructions derived from lightweight user interaction and off-the-shelf tools. A single Video DiT editor supports multiple DCC-style operators, including rotation, scaling, removal, translation, duplication, and trajectory editing, while maintaining temporal coherence and producing consistent geometry-dependen…

**Figure 2.** Limitation of reconstruction-based methods. Shape-for-Motion fails to recover a coherent proxy with complex motion, leading to unreliable edits; GIVE remains stable without per-video reconstruction/optimization.

**Figure 3.** Architecture of GIVE. The input RGB video and four geometry-instruction streams are encoded by a shared video VAE into spatiotemporal tokens. These tokens are concatenated with the noisy edited-video latent and denoised by a video diffusion transformer with RoPE state alignment. The predicted edited latent is decoded to produce the edited video.

**Figure 4.** Procedural Construction Pipeline. From left to right, (i) Asset Sampling: selection of a scene and a target actor; (ii) Attributes Sampling: assignment of actor attributes independent of the edit (e.g., facial expressions and body-motion clips); (iii) Camera Sampling: sampling of a camera-motion family and continuous parameters to define the camera trajectory; (iv) Operator Sampling: sampling of a DCC-styl…

**Figure 5.** Qualitative comparison on real videos (I). We show results for geometric operators including removal, rotation and translation on the real videos. Across cases, GIVE better follows the intended geometric manipulation while preserving object identity and non-edited regions, and produces more coherent geometry-dependent secondary effects (e.g., shadows and reflections). Baselines often exhibit operator mis-e…

**Figure 6.** Qualitative comparison on real videos (II). We show results for geometric operators including duplication, scaling and trajectory editing on the real videos. GIVE consistently achieves more faithful geometric edits with stronger temporal coherence, while baselines may introduce identity/background drift, operator mis-execution, or temporally inconsistent artifacts. For trajectory editing, arrows in the inp…

Original abstract

Object-level geometric edits, including translating, rotating, scaling, duplicating, or removing an object, are routine operations in digital content creation (DCC) workflows, yet they remain unreliable in generative video editing. The key challenge lies in specifying the target object's 3D state change unambiguously across viewpoint and time, while consistently updating geometry-dependent secondary effects such as shadows and reflections. We introduce GIVE, a geometry-instructed video editing framework that represents edits through a unified object-state formulation. Two video-aligned geometry streams describe the target object before and after editing: a depth-box encoding coarse 3D placement and extent, and an orientation-box providing an appearance-agnostic orientation cue. Together, these streams provide a compact pre/post geometric specification for object-state transitions. To provide paired supervision for learning these edits, we build a scalable graphics-engine pipeline that executes object-level edit programs and renders controlled before/after pairs, isolating the intended geometric edit while keeping secondary effects consistent with the transformation. Experimental results demonstrate that GIVE produces faithful geometric edits with temporal coherence and consistent secondary effects across operators in a unified framework, and shows promising transfer to in-the-wild videos. Project page: https://geometry-instructed-video-editing.github.io/give/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GIVE offers a workable geometric specification for video object edits via its depth-box and orientation-box streams, but the transfer of the synthetic training pipeline to real video is the part that needs evidence.

The core contribution is a compact way to tell a video model how an object's 3D state should change: a depth-box for placement and size plus an orientation-box for rotation, applied before and after the edit. This is paired with a graphics-engine pipeline that renders controlled before/after video pairs so the model can learn to keep shadows and reflections consistent with the geometry change.

What stands out is that the same two streams handle translate, rotate, scale, duplicate, and remove in one framework. That unification is useful for DCC tools where these operations are routine. The synthetic data route is a reasonable way to get the paired supervision that real video lacks.

The main uncertainty is whether the rendered pairs actually isolate the intended edit without introducing lighting or material correlations that do not exist in real footage. The abstract says the pipeline keeps secondary effects consistent, but supplies no description of randomization or any measure of the domain gap. If those correlations are present, the learned mapping may not hold up on in-the-wild video.

No equations, ablations, or numbers appear in the supplied text, so it is impossible to judge how large the coherence gains actually are or whether the transfer claim is supported. The stress-test concern about untested domain gap therefore lands.

This is for people building controllable video editors or generative tools that need geometric operators. A reader who wants a practical handle on object state in video will find the formulation worth looking at.

The work is grounded enough to go to referees; the idea is clear and the problem matters, even if the experiments will need close checking on the synthetic-to-real step.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GIVE, a geometry-instructed video editing framework that specifies object-level edits (translate, rotate, scale, duplicate, remove) via a unified object-state formulation. Two video-aligned geometry streams—a depth-box for coarse 3D placement/extent and an orientation-box for appearance-agnostic orientation—provide compact pre/post specifications. A graphics-engine pipeline generates paired before/after videos by executing edit programs while aiming to keep secondary effects (shadows, reflections) consistent. The central claim is that this yields faithful geometric edits with temporal coherence and consistent secondary effects in a single framework, plus promising transfer to in-the-wild videos.

Significance. If the synthetic-to-real transfer and coherence claims hold with supporting evidence, the work would address a practical gap in generative video editing by enabling explicit, controllable 3D geometric manipulations without per-operator retraining. The scalable synthetic supervision pipeline is a potential strength for avoiding the need for real paired data.

major comments (2)

[Abstract] Abstract (pipeline paragraph): the assertion that the graphics-engine pipeline 'isolates the intended geometric edit while keeping secondary effects consistent' is load-bearing for the transfer claim, yet no details are given on randomization of lighting, materials, or camera parameters, nor any measure of resulting domain gap; this leaves the weakest assumption (synthetic pairs generalize without lighting correlations or artifacts) untested.
[Experimental results] Experimental results section: the abstract states that results 'demonstrate' faithful edits, temporal coherence, and consistent secondary effects, but supplies no quantitative metrics, ablation tables, or comparisons; without these it is impossible to verify whether the unified framework actually outperforms operator-specific baselines or preserves secondary effects on real inputs.

minor comments (2)

[Method] Clarify the precise encoding and alignment procedure for the depth-box and orientation-box streams when they are introduced in the method section.
[Related work] Add a reference to prior synthetic-to-real video editing works that address domain gap (e.g., via domain randomization or style transfer) to contextualize the pipeline design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below.

Referee: [Abstract] Abstract (pipeline paragraph): the assertion that the graphics-engine pipeline 'isolates the intended geometric edit while keeping secondary effects consistent' is load-bearing for the transfer claim, yet no details are given on randomization of lighting, materials, or camera parameters, nor any measure of resulting domain gap; this leaves the weakest assumption (synthetic pairs generalize without lighting correlations or artifacts) untested.

Authors: The abstract summarizes the pipeline at a high level due to length constraints. The full manuscript (Section 3) details the graphics-engine pipeline, including randomization over lighting, materials, and camera parameters to isolate geometric edits and reduce domain gap. We will revise the abstract to briefly note the domain randomization strategy and add a short domain-gap analysis or reference in the revised version. revision: yes

Referee: [Experimental results] Experimental results section: the abstract states that results 'demonstrate' faithful edits, temporal coherence, and consistent secondary effects, but supplies no quantitative metrics, ablation tables, or comparisons; without these it is impossible to verify whether the unified framework actually outperforms operator-specific baselines or preserves secondary effects on real inputs.

Authors: The experimental section presents qualitative results and visual comparisons across edit operators and in-the-wild videos. We agree that quantitative metrics and ablations would strengthen verification. We will add ablation tables, temporal coherence metrics, and a user study on secondary effects in the revision; direct baseline comparisons will also be included where feasible. revision: partial

Circularity Check

0 steps flagged

No circularity: explicit geometry streams and separate rendering pipeline are independent of the learned editing model.

The derivation introduces depth-box and orientation-box as a compact pre/post geometric specification, then builds an independent graphics-engine pipeline to generate paired supervision. The central claim (faithful geometric edits with temporal coherence) is evaluated on outputs of this pipeline and on in-the-wild transfer; nothing in the abstract or described chain reduces the model output to a quantity defined by the same inputs or by self-citation. The pipeline is presented as an external data-generation step rather than a fitted component renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5774 in / 1130 out tokens · 20325 ms · 2026-06-26T00:41:09.493702+00:00 · methodology
