pith. machine review for the scientific record.

arxiv: 2604.17021 · v1 · submitted 2026-04-18 · 💻 cs.CV

Recognition: unknown

LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing


Pith reviewed 2026-05-10 07:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords video editing · instruction-based editing · image priors · joint training · frame-wise noise · generative models · evaluation benchmark · temporal transformation

The pith

LIVE jointly trains video editors on image and video data using frame-wise token noise to bridge domain gaps and reach state-of-the-art results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LIVE, a joint training framework that combines abundant high-quality image editing data with scarcer video datasets to strengthen instruction-based video editing. Video data creation is expensive, so the approach borrows priors from image editing to scale up training while maintaining task diversity. A frame-wise token noise strategy treats latents from selected frames as reasoning tokens inside pretrained video models, allowing plausible motion to emerge from static image knowledge. Cleaning existing datasets and applying two-stage training further anneals the model toward video-specific edits. Experiments on a new benchmark of over 60 tasks show clear gains over prior methods.

Core claim

LIVE is a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, it introduces a frame-wise token noise strategy which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Through cleaning public datasets and constructing an automated data pipeline, a two-stage training strategy anneals video editing capabilities. A comprehensive evaluation benchmark encompassing over 60 challenging tasks is curated, and extensive comparative and ablation experiments demonstrate that the method achieves state-of-the-art performance.

What carries the argument

The frame-wise token noise strategy, which selects latents of specific frames as reasoning tokens inside pretrained video generative models to produce temporal transformations from image priors.
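The page gives no formal notation for this mechanism, so here is a minimal PyTorch-style sketch of one plausible reading: latents of selected anchor frames are kept at a near-zero noise level so they act as clean "reasoning tokens," while the remaining frames follow the usual noise schedule, forcing the pretrained video model to synthesize the motion that connects the anchors. The function name, the `keep_idx` argument, and the 0.05 anchor level are illustrative assumptions, not details taken from the paper.

```python
import torch

def frame_wise_token_noise(latents, keep_idx, t):
    """Noise a video latent sequence frame by frame (assumed mechanism).

    latents:  (B, T, C, H, W) clean latents: a pseudo-video built from an
              image pair, or latents extracted from a real video.
    keep_idx: indices of frames whose latents serve as low-noise
              "reasoning tokens."
    t:        flow-matching time in [0, 1], shape (B,); t = 1 is pure noise.
    """
    B, T = latents.shape[:2]
    noise = torch.randn_like(latents)

    # Per-frame schedule: the global time t everywhere, except a small
    # fixed level on the anchor frames (0.05 is an assumed value).
    t_frame = t.view(B, 1).expand(B, T).clone()
    t_frame[:, keep_idx] = 0.05
    t_frame = t_frame.view(B, T, 1, 1, 1)

    # Linear interpolation between data and noise, as in flow matching.
    noisy = (1.0 - t_frame) * latents + t_frame * noise
    return noisy, noise
```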

If this is right

  • Video editing can scale without proportional increases in costly video-specific annotations by drawing on image datasets.
  • Models gain the ability to perform editing tasks common in images but rare in existing video collections.
  • Two-stage training allows gradual transfer of capabilities from image to video domains.
  • Ablation results confirm that removing the noise strategy or image data degrades performance on the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principle of mixing high-quality static priors with dynamic data could extend to other modalities such as audio or 3D scene editing.
  • The new 60-task benchmark could serve as a shared testbed that pushes future video editing methods toward greater task coverage.
  • Further scaling the image data component might yield additional gains on edits involving complex or rare motions.

Load-bearing premise

The frame-wise token noise strategy can effectively mitigate the domain discrepancy between static images and dynamic videos by enabling plausible temporal transformations.

What would settle it

Train an otherwise identical video editing model on video data alone without the image data or frame-wise noise component, then compare its performance directly against LIVE on the same 60-task benchmark; equal or better results would refute the benefit of the image priors.
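Once per-task scores exist for both models, the check itself is mechanical. A small sketch of the refutation criterion, with hypothetical inputs (per-task Task SR arrays aligned by task):

```python
import numpy as np

def ablation_verdict(sr_live, sr_video_only):
    """Refutation check for the experiment described above.

    sr_live, sr_video_only: per-task Task SR on the same benchmark for
    LIVE and for the video-only ablation. Returns the mean gap and the
    fraction of tasks LIVE wins; a mean gap at or below zero would
    refute the claimed benefit of the image priors.
    """
    gap = np.asarray(sr_live, float) - np.asarray(sr_video_only, float)
    return gap.mean(), float((gap > 0).mean())
```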

Figures

Figures reproduced from arXiv: 2604.17021 by Jufeng Yang, Juncheng Zhou, Meng Wang, Pengfei Wan, Weicheng Wang, Wenyu Qin, Yongjie Zhu, Zhicheng Zhang, Zhongqi Zhang.

Figure 1
Figure 1: Performance of LIVE in various video editing tasks. In addition to common video editing capabilities, our method enables creative video editing tasks. Furthermore, we support video input at different resolution ratios.
Figure 2
Figure 2: (a) Comparison of task coverage and training data scale among representative instruction-based image and video editing methods. Our method achieves more editing tasks with less video training data. (b) With advanced editing tools, the construction of image editing datasets achieves higher feasibility and quality compared to video counterparts.
Figure 3
Figure 3: Training data statistics and distribution. Left: distribution of task categories. Right: statistics of collected datasets. The sampling number for each dataset is displayed.
Figure 4
Figure 4: Overview of the multi-reference action editing data pipeline. Instructions are extracted by an LLM. The action in Reference Video 1 is edited to match that of Reference Video 2.
Figure 5
Figure 5: Overview of the LIVE pipeline. For image data, the input and target images are replicated N times to construct pseudo-video sequences, whereas for video data, the latents are extracted directly. The input latents and noise latents are then concatenated along the temporal dimension; the accompanying text adopts Flow Matching as the training framework. (A minimal sketch of this construction follows the figure list.)
Figure 6
Figure 6: Qualitative comparison with open-source instruction-based video editing models on LIVE-Bench. LIVE demonstrates superior capabilities in instruction following, background preservation, and temporal consistency, and maintains the best temporal consistency even though it is trained on a large corpus of static image data, which is attributed to the frame-wise token noise strategy.
Figure 7
Figure 7: Per-task performance on LIVE-Bench with Task SR. Tasks are arranged by the average performance of all comparison methods and a baseline trained only on video data; the minimum, average, and maximum scores for each task are shown against our method, which achieves the highest performance on 52 of 64 tasks.
Figure 8
Figure 8: The dynamic training process of LIVE, counting the number of tasks whose average Editing Quality score exceeds 6 at different training steps. The first stage trains on 2M image and 380K video samples; the second stage reduces the proportions to 116K image and 260K video samples and introduces more challenging tasks.
Figure 9
Figure 9: Ablation study on the video-image mixing ratio, with experiments under the 1:1, 1:2, 1:3, and 1:4 settings.
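The Figure 5 caption preserves the pipeline's main data detail: image-editing pairs are replicated N times into pseudo-videos, input latents are concatenated with noise latents along the temporal axis, and training follows Flow Matching. Below is a minimal sketch under those assumptions; the model signature, the replication scheme, and the velocity convention are guesses, not the paper's API.

```python
import torch
import torch.nn.functional as F

def build_pseudo_video(src_latent, tgt_latent, n_frames):
    """Replicate an image-editing latent pair N times (per Figure 5).

    src_latent, tgt_latent: (B, C, H, W) latents of the input image and
    the edited target. Returns two (B, n_frames, C, H, W) sequences.
    """
    rep = lambda z: z.unsqueeze(1).expand(-1, n_frames, -1, -1, -1).contiguous()
    return rep(src_latent), rep(tgt_latent)

def flow_matching_loss(model, cond_latents, tgt_latents, t):
    """Plain conditional flow-matching objective (conventions assumed).

    t: shape (B, 1, 1, 1, 1), broadcast over the latent sequence.
    """
    noise = torch.randn_like(tgt_latents)
    x_t = (1.0 - t) * tgt_latents + t * noise   # linear path from data to noise
    velocity = noise - tgt_latents              # time derivative of that path
    # Hypothetical signature: conditioning latents concatenated with the
    # noised latents along the temporal dimension, as the caption states;
    # the model is assumed to predict velocity for the noised frames only.
    pred = model(torch.cat([cond_latents, x_t], dim=1), t)
    return F.mse_loss(pred, velocity)
```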
read the original abstract

Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Moreover, through cleaning public datasets and constructing an automated data pipeline, we adopt a two-stage training strategy to anneal video editing capabilities. Furthermore, we curate a comprehensive evaluation benchmark encompassing over 60 challenging tasks that are prevalent in image editing but scarce in existing video datasets. Extensive comparative and ablation experiments demonstrate that our method achieves state-of-the-art performance. The source code will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LIVE, a joint training framework for instruction-based video editing that leverages large-scale image editing data alongside video datasets. It proposes a frame-wise token noise strategy to mitigate domain discrepancies by treating latents of specific frames as reasoning tokens within pretrained video generative models, uses a two-stage training pipeline after dataset cleaning and curation, and presents a new benchmark covering over 60 challenging tasks. The authors report state-of-the-art performance via comparative experiments and ablations, with plans to release source code.

Significance. If the central claims hold, the work could meaningfully advance video editing by reducing reliance on scarce, high-cost video annotations through effective transfer from image priors, while the curated benchmark and public code would provide valuable resources for the community. The two-stage annealing strategy and emphasis on task diversity represent practical contributions to scaling editing capabilities.

major comments (3)
  1. [§3.2] The frame-wise token noise strategy is presented as the key mechanism for bridging static image and dynamic video domains by converting selected frame latents into reasoning tokens; however, the ablations do not isolate its contribution (e.g., via controlled comparisons of noise scheduling versus joint training alone), leaving the load-bearing assumption that it induces plausible temporal transformations unverified by quantitative evidence.
  2. [Table 3, §5.2] The SOTA performance claims rest on comparative results against prior video editing methods, but the reported metrics lack error bars, multiple random seeds, or statistical tests, making it difficult to determine whether observed gains are robust or attributable to the proposed components rather than implementation details.
  3. [§4.3] The automated data pipeline and cleaning process for public datasets are described at a high level, but without explicit criteria for task selection, exclusion rules, or quality metrics, it is unclear how the 60-task benchmark ensures coverage of image-editing tasks that are scarce in video data or avoids introducing biases.
minor comments (2)
  1. [Abstract] The abstract and §2 could benefit from a brief equation or pseudocode snippet formalizing the frame-wise token noise application, to improve clarity for readers unfamiliar with the latent-space operations.
  2. [Figure 5] The method-overview figure (Figure 5 in the paper) would be strengthened by annotating the exact frames selected for noise injection and the resulting temporal flow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each of the major comments below and have made revisions to the manuscript to incorporate the feedback where appropriate.

read point-by-point responses
  1. Referee: [§3.2] The frame-wise token noise strategy is presented as the key mechanism for bridging static image and dynamic video domains by converting selected frame latents into reasoning tokens; however, the ablations do not isolate its contribution (e.g., via controlled comparisons of noise scheduling versus joint training alone), leaving the load-bearing assumption that it induces plausible temporal transformations unverified by quantitative evidence.

    Authors: We appreciate this observation. The ablations in the original manuscript focused on the overall joint training framework and its benefits over video-only training. To better isolate the frame-wise token noise strategy, we have added new controlled experiments in the revised version of §3.2. Specifically, we compare the full LIVE model against a baseline that performs joint training without the frame-wise token noise (using standard noise scheduling instead). The results, reported in a new table, show that the token noise strategy contributes to improved temporal coherence, providing quantitative support for its role in creating plausible temporal transformations. These additional ablations are also detailed in the supplementary material. revision: yes

  2. Referee: [Table 3, §5.2] The SOTA performance claims rest on comparative results against prior video editing methods, but the reported metrics lack error bars, multiple random seeds, or statistical tests, making it difficult to determine whether observed gains are robust or attributable to the proposed components rather than implementation details.

    Authors: We agree that reporting variability would enhance the reliability of the SOTA claims. Due to the high computational demands of training on large-scale datasets, we initially reported results from a single run with a fixed seed. In the revised manuscript, we have included error bars based on three independent runs for the main metrics in Table 3. Additionally, we have added a discussion in §5.2 on the observed variance and performed paired t-tests where applicable to assess statistical significance of the improvements over baselines (a minimal sketch of such a paired test follows this response list). We note that the gains remain consistent across runs. revision: yes

  3. Referee: [§4.3] The automated data pipeline and cleaning process for public datasets are described at a high level, but without explicit criteria for task selection, exclusion rules, or quality metrics, it is unclear how the 60-task benchmark ensures coverage of image-editing tasks that are scarce in video data or avoids introducing biases.

    Authors: We thank the referee for highlighting the need for greater transparency in the data curation process. In the revised §4.3, we have provided detailed descriptions of the automated pipeline, including: explicit task selection criteria (focusing on 60 tasks such as style transfer, object manipulation, and attribute editing that are common in image editing but rare in video datasets), exclusion rules (e.g., discarding samples with low motion coherence or poor text-video alignment), and quality metrics (including FID scores for image fidelity and temporal consistency measures; a representative temporal-consistency measure is sketched after this list). We have also included an analysis showing the distribution of tasks to confirm broad coverage and minimal bias introduction. A new supplementary section elaborates on the pipeline implementation. revision: yes
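Responses 2 and 3 lean on paired t-tests and temporal-consistency measures without spelling them out. Minimal sketches of both follow; all names and inputs are hypothetical, and the paired test assumes the two methods are scored on the same tasks (or seeds) so the pairing is valid.

```python
import numpy as np
from scipy import stats

def paired_significance(scores_ours, scores_baseline):
    """Paired t-test over matched per-task (or per-seed) benchmark scores."""
    ours = np.asarray(scores_ours, dtype=float)
    base = np.asarray(scores_baseline, dtype=float)
    return stats.ttest_rel(ours, base)  # (statistic, p-value)
```

The rebuttal does not name its temporal-consistency measure; one common stand-in is the mean cosine similarity between embeddings of consecutive frames from a frozen image encoder:

```python
import torch.nn.functional as F

def temporal_consistency(frame_feats):
    """Mean cosine similarity of consecutive frame embeddings.

    frame_feats: (T, D) per-frame features, e.g., from a frozen CLIP image
    encoder (the encoder choice is an assumption). Higher means smoother.
    """
    a = F.normalize(frame_feats[:-1], dim=-1)
    b = F.normalize(frame_feats[1:], dim=-1)
    return (a * b).sum(dim=-1).mean()
```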

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical comparisons and new benchmark

full rationale

The paper presents an empirical method (joint image-video training with a frame-wise token noise strategy and two-stage pipeline) whose central claims are justified by comparative experiments, ablations, and a curated 60-task benchmark rather than any derivation chain. No equations, fitted parameters renamed as predictions, or self-citations invoked as uniqueness theorems appear in the abstract or described approach. The frame-wise strategy is introduced as a design choice to address the domain gap and is evaluated externally via performance metrics, so the contribution stays self-contained rather than true by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that image editing priors transfer to video via noise injection and that pretrained video models can generate plausible motion from noisy frame tokens. No explicit free parameters or invented entities are detailed in the abstract.

free parameters (1)
  • frame-wise token noise levels
    Specific noise parameters applied to frame latents to create temporal transformations; values chosen to balance image fidelity and video plausibility.
axioms (1)
  • domain assumption: Large pretrained video generative models can create plausible temporal transformations from noisy image latents.
    Invoked to bridge the static image to dynamic video domain gap in the frame-wise token noise strategy.

pith-pipeline@v0.9.0 · 5528 in / 1359 out tokens · 36629 ms · 2026-05-10T07:17:26.681282+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 24 canonical work pages · 7 internal anchors

  1. [1] Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: ICCV (2025)
  2. [2] Bai, Q., Wang, Q., Ouyang, H., Yu, Y., Wang, H., Wang, W., Cheng, K.L., Ma, S., Zeng, Y., Liu, Z., et al.: Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742 (2025)
  3. [3] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: CVPR (2023)
  4. [4] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam3: Segment anything with concepts. In: ICLR (2026)
  5. [5] Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2video: Video editing using image diffusion. In: ICCV (2023)
  6. [6] Chang, D., Cao, M., Shi, Y., Liu, B., Cai, S., Zhou, S., Huang, W., Wetzstein, G., Soleymani, M., Wang, P.: Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions. arXiv preprint arXiv:2506.03107 (2025)
  7. [7] Chen, Y., Zhang, J., Hu, T., Zeng, Y., Xue, Z., He, Q., Wang, C., Liu, Y., Hu, X., Yan, S.: Ivebench: Modern benchmark suite for instruction-guided video editing assessment. In: ICLR (2026)
  8. [8] Cheng, J., Xiao, T., He, T.: Consistent video-to-video transfer using synthetic dataset. In: ICLR (2024)
  9. [9] Cong, Y., Xu, M., Simon, C., Chen, S., Ren, J., Xie, Y., Perez-Rua, J.M., Rosenhahn, B., Xiang, T., He, S.: Flatten: Optical flow-guided attention for consistent text-to-video editing. In: ICLR (2024)
  10. [10] Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. In: ICLR (2024)
  11. [11] He, H., Wang, J., Zhang, J., Xue, Z., Bu, X., Yang, Q., Wen, S., Xie, L.: Openve-3m: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826 (2025)
  12. [12] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. In: ICLR (2023)
  13. [13] Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In: ICLR (2023)
  14. [14] Jiang, H., Fang, J., Zhang, N., Ma, G., Wan, M., Wang, X., He, X., Chua, T.S.: Anyedit: Edit any knowledge encoded in language models. In: ICML (2025)
  15. [15] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: ICCV (2025)
  16. [16] Ju, X., Wang, T., Zhou, Y., Zhang, H., Liu, Q., Zhao, N., Zhang, Z., Li, Y., Cai, Y., Liu, S., et al.: Editverse: Unifying image and video editing and generation with in-context learning. In: ICLR (2026)
  17. [17] Kara, O., Kurtkaya, B., Yesiltepe, H., Rehg, J.M., Yanardag, P.: Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models. In: CVPR (2024)
  18. [18] Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. Transactions on Machine Learning Research (2024)
  19. [19] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
  20. [20] Li, G., Yang, Y., Song, C., Zhang, C.: Flowdirector: Training-free flow steering for precise text-to-video editing. arXiv preprint arXiv:2506.05046 (2025)
  21. [21] Li, M., Liu, L., Wang, H., Chen, H., Gu, X., Liu, S., Gong, D., Zhao, J., Lan, Z., Li, J.: Multiedit: Advancing instruction-based image editing on diverse and challenging tasks. arXiv preprint arXiv:2509.14638 (2025)
  22. [22] Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)
  23. [23] Liu, J., Ma, Y., Cao, X., Li, T., Shang, G., Huang, H., Zhang, C., Li, X., Liu, C., Liu, J., et al.: Tele-omni: A unified multimodal framework for video generation and editing. arXiv preprint arXiv:2602.09609 (2026)
  24. [24] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)
  25. [25] Ma, J., Zhu, X., Pan, Z., Peng, Q., Guo, X., Chen, C., Lu, H.: X2edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning. arXiv preprint arXiv:2508.07607 (2025)
  26. [26] Miao, C., Feng, Y., Zeng, J., Gao, Z., Liu, H., Yan, Y., Qi, D., Chen, X., Wang, B., Zhao, H.: Rose: Remove objects with side effects in videos. In: NeurIPS (2025)
  27. [27] PicoTrex: Picotrex/awesome-nano-banana-images (2025)
  28. [28] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: ICLR (2024)
  29. [29] Qi, C., Cun, X., Zhang, Y., Lei, C., Wang, X., Shan, Y., Chen, Q.: Fatezero: Fusing attentions for zero-shot text-based video editing. In: ICCV (2023)
  30. [30] Qian, Y., Bocek-Rivele, E., Song, L., Tong, J., Yang, Y., Lu, J., Hu, W., Gan, Z.: Pico-banana-400k: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808 (2025)
  31. [31] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020)
  32. [32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  33. [33] Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu edit: Precise image editing via recognition and generation tasks. In: CVPR (2024)
  34. [34] Singer, U., Zohar, A., Kirstain, Y., Sheynin, S., Polyak, A., Parikh, D., Taigman, Y.: Video editing via factorized diffusion distillation. In: ECCV (2024)
  35. [35] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
  36. [36] Sun, W., Tu, R.C., Liao, J., Tao, D.: Diffusion model-based video editing: A survey. arXiv preprint arXiv:2407.07111 (2024)
  37. [37] Sushko, P., Bharadwaj, A., Lim, Z.Y., Ilin, V., Caffee, B., Chen, D., Salehi, M., Hsieh, C.Y., Krishna, R.: Realedit: Reddit edits as a large-scale empirical dataset for image transformations. In: CVPR (2025)
  38. [38] Team, D.: Lucy edit: Open-weight text-guided video editing (2025)
  39. [39] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
  40. [40] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)
  41. [41] Wang, W., Jiang, Y., Xie, K., Liu, Z., Chen, H., Cao, Y., Wang, X., Shen, C.: Zero-shot video editing using off-the-shelf image diffusion models. Transactions on Machine Learning Research (2024)
  42. [42] Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt-image-edit-1.5m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025)
  43. [43] Wei, C., Liu, Q., Ye, Z., Wang, Q., Wang, X., Wan, P., Gai, K., Chen, W.: Univideo: Unified understanding, generation, and editing for videos. In: ICLR (2026)
  44. [44] Wei, H., Liu, H., Wang, Z., Peng, Y., Xu, B., Wu, S., Zhang, X., He, X., Liu, Z., Wang, P., et al.: Skywork unipic 3.0: Unified multi-image composition via sequence modeling. arXiv preprint arXiv:2601.15664 (2026)
  45. [45] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)
  46. [46] Wu, J.Z., Ren, X., Shen, T., Cao, T., He, K., Lu, Y., Gao, R., Xie, E., Lan, S., Alvarez, J.M., et al.: Chronoedit: Towards temporal reasoning for image editing and world simulation. In: ICLR (2026)
  47. [47] Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: ICCV (2025)
  48. [48] Xia, B., Liu, J., Zhang, Y., Peng, B., Chu, R., Wang, Y., Wu, X., Yu, B., Jia, J.: Dreamve: Unified instruction-based image and video editing. arXiv preprint arXiv:2508.06080 (2025)
  49. [49] Xing, J., Xia, M., Liu, Y., Zhang, Y., Zhang, Y., He, Y., Liu, H., Chen, H., Cun, X., Wang, X., et al.: Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics 31(2), 1526–1541 (2024)
  50. [50] Xu, Z., Huang, Z., Cao, J., Zhang, Y., Cun, X., Shuai, Q., Wang, Y., Bao, L., Tang, F.: Anchorcrafter: Animate cyber-anchors selling your products via human-object interacting video generation. IEEE Transactions on Visualization and Computer Graphics (2026)
  51. [51] Yang, X., Xie, J., Yang, Y., Huang, Y., Xu, M., Wu, Q.: Videocof: Unified video editing with temporal reasoner. arXiv preprint arXiv:2512.07469 (2025)
  52. [52] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. In: ICLR (2025)
  53. [53] Ye, J., Jiang, D., Wang, Z., Zhu, L., Hu, Z., Huang, Z., He, J., Yan, Z., Yu, J., Li, H., et al.: Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation. arXiv preprint arXiv:2508.09987 (2025)
  54. [54] Ye, K., Huang, Z., Fu, C., Liu, Q., Cai, J., Lv, Z., Li, C., Lyu, J., Zhao, Z., Zhang, S.: Unicedit-10m: A dataset and benchmark breaking the scale-quality barrier via unified verification for reasoning-enriched edits. arXiv preprint arXiv:2512.02790 (2025)
  55. [55] Yuan, S., He, X., Deng, Y., Ye, Y., Huang, J., Lin, B., Luo, J., Yuan, L.: Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292 (2025)
  56. [56] Zhang, C., Feng, C., Yan, F., Zhang, Q., Zhang, M., Zhong, Y., Zhang, J., Ma, L.: Instructvedit: A holistic approach for instructional video editing. arXiv preprint arXiv:2503.17641 (2025)
  57. [57] Zhang, Z., Long, F., Li, W., Qiu, Z., Liu, W., Yao, T., Mei, T.: Region-constraint in-context generation for instructional video editing. arXiv preprint arXiv:2512.17650 (2025)
  58. [58] Zi, B., Ruan, P., Chen, M., Qi, X., Hao, S., Zhao, S., Huang, Y., Liang, B., Xiao, R., Wong, K.F.: Señorita-2m: A high-quality instruction-based dataset for general video editing by video specialists. In: NeurIPS (2025)