Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

Cong Wang; Fengbin Guan; Qinglin Lu; Sen Liang; Teng Hu; Xin Li; Youliang Zhang; Yuan Zhou; Zhengguang Zhou; Zhentao Yu

arxiv: 2606.30599 · v1 · pith:3ZCJVPWMnew · submitted 2026-06-29 · 💻 cs.CV

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

Sen Liang , Cong Wang , Zhentao Yu , Fengbin Guan , Zhengguang Zhou , Teng Hu , Youliang Zhang , Yuan Zhou

show 3 more authors

Xin Li Qinglin Lu Zhibo Chen

This is my paper

Pith reviewed 2026-06-30 05:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords video editinginstruction-based editingdatasetbenchmarkstructural manipulationdata synthesisdual-branch architecture

0 comments

The pith

A 2-million-pair dataset extends video editing to multi-task and structural manipulations like subject movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Goku as the first large-scale collection of instruction-aligned video editing pairs that moves beyond single-task appearance changes to include complex multi-task and structural edits. It introduces an efficient synthesis pipeline that breaks down difficult edits into controllable sub-problems and applies progressive filtering to maintain quality and alignment. Goku-Edit is built on this data using an MLLM text encoder and a decoupled dual-branch architecture that separates structural mask control from appearance rendering. Goku-Bench supplies 1,000 human-verified test cases plus seven editing-specific metrics, and evaluation shows the model achieves up to 8 percent better instruction following than prior open-source approaches.

Core claim

Goku supplies two million high-quality instruction-aligned video editing pairs that cover multi-task and structural manipulations, created through a decomposition pipeline with progressive filtering; Goku-Edit uses this data with an MLLM encoder and decoupled dual-branch design to achieve stronger instruction following on the accompanying Goku-Bench benchmark.

What carries the argument

The efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems combined with progressive filtering, paired with Goku-Edit's decoupled dual-branch architecture using an MLLM as text encoder.

If this is right

Models can perform precise structural edits such as controlled subject movement in response to natural language instructions.
Instruction following accuracy in video editing rises by measurable margins across diverse tasks when trained on the new scale of data.
Standardized evaluation becomes possible through Goku-Bench's human-verified cases and dedicated editing metrics.
Data creation for complex editing tasks scales to millions of examples without requiring full manual annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-branch separation of structure and appearance may apply to other generative video tasks that need independent control dimensions.
Datasets built this way could support downstream creative tools that let users specify both motion and appearance changes in one instruction.
If the decomposition approach generalizes, similar pipelines might accelerate data collection for related domains such as 3D scene editing.

Load-bearing premise

The synthesis pipeline that decomposes edits into sub-problems and applies progressive filtering produces reliable, high-quality instruction-aligned pairs without systematic artifacts or biases.

What would settle it

A random sample of Goku pairs examined by independent human reviewers showing frequent misalignment between instructions and edited video content would falsify the reliability of the generated dataset.

Figures

Figures reproduced from arXiv: 2606.30599 by Cong Wang, Fengbin Guan, Qinglin Lu, Sen Liang, Teng Hu, Xin Li, Youliang Zhang, Yuan Zhou, Zhengguang Zhou, Zhentao Yu, Zhibo Chen.

**Figure 1.** Figure 1: Goku covers 10 core video editing task classes across basic and complex edits. The word cloud illustrates the instruction vocabulary distribution, while the two charts show the distributions of instruction length and frame count. they remain largely confined to single-task and appearance-level modifications, such as object removal and single-attribute alteration. One of the primary factors contributing to … view at source ↗

**Figure 2.** Figure 2: The illustration of our automated video editing pipeline. (a) Video PreProcessing. (b) Data Generation for Different Tasks. (c) Progressive Filtering System. MLLM-Powered Instruction Generation. We leverage the multimodal understanding capabilities of Gemini2.5-Pro to generate natural and diverse editing instructions for each task category. For Add, Remove, Swap, and Subject Movement tasks, the model fi… view at source ↗

**Figure 3.** Figure 3: Overview of Goku-Edit, featuring a dual-branch architecture with RoPEaligned spatial cross-attention and inference-time SpatialCFG. To validate our progressive filtering system, we include a human evaluation (100 samples per task, 3 annotators) and precision/recall analysis in our supplementary material. An alternative filtering pipeline based on open-source models (Qwen3VL-30B [3]) is also provided for … view at source ↗

**Figure 4.** Figure 4: Statistical distributions of Goku-Bench. complete selection pipeline and criteria are detailed in the supplementary material. The final test set covers multi-person scenarios, full and half body human subjects, animals (dogs, cats, sharks, birds, etc.), common objects (clothing, vehicles, buildings, etc.), and natural landscapes (mountains, rivers, deserts, etc.). Furthermore, we specifically include cha… view at source ↗

**Figure 5.** Figure 5: Ablation study on the spatial downsampling factor n [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison with state-of-the-art methods [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

read the original abstract

Existing instruction-based video editing datasets commonly focus on single-task appearance editing, failing to meet the complex creative demands of real-world scenarios. To bridge this gap, we present Goku, a large-scale dataset featuring 2 million high-quality, instruction-aligned video editing pairs, which is the first to extend task boundaries from basic appearance editing to multi-task and structural manipulations(e.g., precise control of subject movement). To tackle the data synthesis challenges inherent in these complex tasks, we design an efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems and introduce a progressive filtering system for data reliability throughout the whole process. Furthermore, we explore the optimal network structures on Goku, and propose Goku-Edit. To deeply comprehend complex editing instructions, Goku-Edit leverages an MLLM as its text encoder and adopts a decoupled dual-branch design: a dedicated mask branch handles structural control, freeing the main branch for appearance rendering. A comprehensive video editing benchmark, Goku-Bench, is also proposed with 1,000 human-verified test cases and 7 novel editing-specific metrics. Evaluated on Goku-Bench, Goku-Edit obtains up to +8% improvement on other open-source models in terms of instruction following.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Goku adds a large dataset and benchmark for structural video editing but provides little evidence that the generated pairs are clean enough to support the claimed gains.

read the letter

The main takeaway is that this paper ships a 2 million pair dataset aimed at instruction-based video editing that includes structural changes such as subject movement, along with a dual-branch model and a 1000-case benchmark. That scale and task expansion is the concrete new element compared with earlier appearance-only collections.

The synthesis approach of decomposing edits into sub-problems and applying progressive filtering is a practical engineering step for generating complex pairs at volume. The model choice to route structural control through a dedicated mask branch while using an MLLM for text also follows logically from the problem. The benchmark metrics focused on editing-specific behavior are a reasonable addition.

The soft spot is the missing quantitative check on data quality. The abstract notes human verification but gives no agreement rates, rejection statistics, or breakdown of failure modes on the structural cases that matter most. Without those numbers it is hard to tell whether the reported +8% instruction-following lift comes from genuinely better pairs or from residual artifacts that the pipeline did not catch. The stress-test concern about possible biases in multi-task structural pairs therefore lands on the evidence that is shown.

This work is aimed at groups training generative video models who need larger and more varied editing data. A reader in that area could extract the dataset and benchmark for their own experiments even if they treat the model results as preliminary.

It should go to peer review. The scale and direction are substantial enough to warrant referee time, provided the authors supply the missing validation numbers on the training pairs.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Goku, a dataset of 2 million instruction-aligned video editing pairs that extends beyond single-task appearance editing to multi-task and structural manipulations such as subject movement control. It describes an efficient synthesis pipeline that decomposes complex edits into sub-problems with progressive filtering, proposes the Goku-Edit architecture using an MLLM text encoder and a decoupled dual-branch design (mask branch for structural control), and presents Goku-Bench containing 1,000 human-verified test cases together with 7 novel editing-specific metrics. The central empirical claim is that models trained on Goku achieve up to +8% improvement in instruction following over other open-source baselines when evaluated on Goku-Bench.

Significance. If the synthesis pipeline demonstrably yields instruction-aligned pairs without systematic artifacts on structural and multi-task cases, the work would supply a substantial public resource that expands the scope of instruction-based video editing research. The benchmark and dual-branch architecture could also become reference points for future model development in this area.

major comments (3)

[Abstract / data synthesis pipeline] Abstract and data synthesis pipeline section: the assertion that the decomposition-plus-progressive-filtering pipeline produces reliable, high-quality pairs for structural manipulations is load-bearing for the entire contribution, yet the manuscript provides no quantitative validation (human agreement rates, artifact statistics, or failure-mode breakdown on subject-movement cases).
[Abstract / experimental results] Abstract and results section: the reported '+8% improvement' on instruction following is central to the performance claim, but the manuscript does not specify how the figure was computed, which exact metric(s) it aggregates, or whether baseline models were trained or evaluated under identical conditions and data scales.
[Goku-Bench section] Goku-Bench description: the claim of 1,000 human-verified test cases is used to establish benchmark reliability, yet the manuscript supplies no details on verification protocol, inter-annotator agreement, or criteria applied to structural-edit cases.

minor comments (2)

[Abstract / Introduction] The abstract and introduction use the phrase 'first to extend task boundaries' without citing prior multi-task video editing datasets; a brief related-work comparison would clarify novelty.
[Goku-Edit architecture] Notation for the decoupled dual-branch architecture (mask branch vs. main branch) is introduced without an accompanying diagram or equation defining the information flow between branches.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate the requested clarifications and additional analyses in the revised manuscript.

read point-by-point responses

Referee: [Abstract / data synthesis pipeline] Abstract and data synthesis pipeline section: the assertion that the decomposition-plus-progressive-filtering pipeline produces reliable, high-quality pairs for structural manipulations is load-bearing for the entire contribution, yet the manuscript provides no quantitative validation (human agreement rates, artifact statistics, or failure-mode breakdown on subject-movement cases).

Authors: We agree that explicit quantitative validation for structural manipulations is essential. The current manuscript describes the pipeline but does not report the requested statistics. In revision we will add human agreement rates (from our internal annotation rounds), artifact statistics, and a failure-mode breakdown focused on subject-movement cases, drawn from the progressive filtering logs and spot-checks performed during dataset construction. revision: yes
Referee: [Abstract / experimental results] Abstract and results section: the reported '+8% improvement' on instruction following is central to the performance claim, but the manuscript does not specify how the figure was computed, which exact metric(s) it aggregates, or whether baseline models were trained or evaluated under identical conditions and data scales.

Authors: The '+8%' figure is the maximum observed gain in instruction-following scores (averaged over the seven editing-specific metrics) when comparing Goku-Edit to the strongest open-source baseline on Goku-Bench. We will revise the abstract and results section to state the exact aggregation method, list the contributing metrics, and clarify that all models were evaluated under identical inference settings on the same 1,000-case test set; training-scale differences will also be explicitly noted. revision: yes
Referee: [Goku-Bench section] Goku-Bench description: the claim of 1,000 human-verified test cases is used to establish benchmark reliability, yet the manuscript supplies no details on verification protocol, inter-annotator agreement, or criteria applied to structural-edit cases.

Authors: We will expand the Goku-Bench section with a dedicated subsection describing the verification protocol, the inter-annotator agreement (Cohen's kappa or equivalent), and the specific criteria used to accept or reject structural-edit cases. These details were collected during benchmark curation but were omitted from the initial submission for brevity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmark evaluation

full rationale

The paper describes an empirical pipeline for synthesizing 2M video editing pairs via decomposition and progressive filtering, followed by training Goku-Edit (MLLM text encoder + dual-branch architecture) and evaluating on the human-verified Goku-Bench with 7 metrics. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. Performance gains (+8% instruction following) are reported as direct empirical comparisons against external open-source models on an independently human-verified test set, with no reduction to inputs by construction. This is a standard dataset+model paper whose central claims rest on external validation rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that complex edits can be reliably decomposed and filtered into high-quality training pairs; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Complex video edits can be decomposed into controllable sub-problems that a progressive filtering system can reliably validate.
Invoked to justify the data synthesis pipeline for multi-task and structural manipulations.

pith-pipeline@v0.9.1-grok · 5776 in / 1275 out tokens · 31187 ms · 2026-06-30T05:58:20.269465+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 16 canonical work pages · 9 internal anchors

[1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Bai,J., Xia, M., Fu, X.,Wang, X.,Mu, L., Cao, J.,Liu, Z., Hu, H.,Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

2025
[2]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bai, Q., Wang, Q., Ouyang, H., Yu, Y., Wang, H., Wang, W., Cheng, K.L., Ma, S., Zeng, Y., Liu, Z., et al.: Scaling instruction-based video editing with a high-quality synthetic dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37971–37981 (2026)

2026
[3]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

In: SIGGRAPH Asia 2024 Conference Papers

Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024
[5]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion mod- els. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

2023
[7]

Brooks, T., Hellsten, J., Aittala, M., Wang, T.C., Aila, T., Lehtinen, J., Liu, M.Y., Efros,A.,Karras,T.:Generatinglongvideosofdynamicscenes.AdvancesinNeural Information Processing Systems35, 31769–31781 (2022)

2022
[8]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021
[9]

In: International Conference on Learning Representations

Cheng, J., Xiao, T., He, T.: Consistent video-to-video transfer using synthetic dataset. In: International Conference on Learning Representations. vol. 2024, pp. 16867–16879 (2024)

2024
[10]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

In: European Conference on Computer Vision

Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: European Conference on Computer Vision. pp. 393–411. Springer (2024)

2024
[12]

arXiv preprint arXiv:2512.07826 (2025)

He, H., Wang, J., Zhang, J., Xue, Z., Bu, X., Yang, Q., Wen, S., Xie, L.: Openve- 3m: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826 (2025)

work page arXiv 2025
[13]

Advances in neural information processing systems35, 8633– 8646 (2022)

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems35, 8633– 8646 (2022)

2022
[14]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024) Goku 17

2024
[15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

2025
[16]

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Ju, X., Wang, T., Zhou, Y., Zhang, H., Liu, Q., Zhao, N., Zhang, Z., Li, Y., Cai, Y., Liu, S., et al.: Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

AnyV2V: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

work page arXiv 2024
[18]

In: Proceedings of the AAAI conference on artificial intelligence

Li, Y., Min, M., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

2018
[19]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liang, S., Guan, F., Zhang, Y., Li, X., Chen, Z.: Cot-edit: Let cot guide instruction video editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37960–37970 (2026)

2026
[20]

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

Liang, S., Wang, C., Guan, F., Yu, Z., Lu, Y., Wang, Y., Zhou, Y., Li, X., Chen, Z.: Spongebob: Sync-aware harmonious audio-visual generative editing. arXiv preprint arXiv:2605.25193 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

arXiv preprint arXiv:2506.01801 (2025)

Liang, S., Yu, Z., Zhou, Z., Hu, T., Wang, H., Chen, Y., Lin, Q., Zhou, Y., Li, X., Lu, Q., et al.: Omniv2v: Versatile video generation and editing via dynamic content manipulation. arXiv preprint arXiv:2506.01801 (2025)

work page arXiv 2025
[22]

arXiv preprint arXiv:2501.08316 (2025)

Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post- training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)

work page arXiv 2025
[23]

IEEE Transactions on Circuits and Systems for Video Technology (2025)

Liu,C.,Li,R.,Zhang,K.,Lan,Y.,Liu,D.:Stablev2v:Stabilizingshapeconsistency in video-to-video editing. IEEE Transactions on Circuits and Systems for Video Technology (2025)

2025
[24]

Follow-your-creation: Empowering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

Ma, Y., Feng, K., Zhang, X., Liu, H., Zhang, D.J., Xing, J., Zhang, Y., Yang, A., Wang, Z., Chen, Q.: Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590 (2025)

work page arXiv 2025
[25]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024)

2024
[26]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025)

2025
[27]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text- video data. arXiv preprint arXiv:2209.14792 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

arXiv preprint arXiv:2507.06119 (2025)

Tan, Z., Yang, H., Qin, L., Gong, J., Yang, M., Li, H.: Omni-video: Democratiz- ing unified video understanding and generation. arXiv preprint arXiv:2507.06119 (2025)

work page arXiv 2025
[30]

Team, D.: Lucy edit: Open-weight text-guided video editing (2025)

2025
[31]

ModelScope Text-to-Video Technical Report

Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text- to-video technical report. arXiv preprint arXiv:2308.06571 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

Wang, Q., Shi, Y., Ou, J., Chen, R., Lin, K., Wang, J., Jiang, B., Yang, H., Zheng, M., Tao, X., et al.: Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 8428–8437 (2025) 18 S. Liang et al

2025
[33]

In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

Wang, Y., Bilinski, P., Bremond, F., Dantcheva, A.: Imaginator: Conditional spatio-temporal gan for video generation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 1160–1169 (2020)

2020
[34]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 16692– 16701 (2025)

2025
[35]

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., Feng, J.: Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

arXiv preprint arXiv:2410.15458 (2024)

Zhou, Y., Wang, Q., Cai, Y., Yang, H.: Allegro: Open the black box of commercial- level video generation model. arXiv preprint arXiv:2410.15458 (2024)

work page arXiv 2024
[37]

Advances in Neural Infor- mation Processing Systems38, 75518–75547 (2026)

Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: Minimax- remover: Taming bad noise helps video object removal. Advances in Neural Infor- mation Processing Systems38, 75518–75547 (2026)

2026
[38]

Advances in Neural Information Processing Systems38(2026)

Zi, B., Ruan, P., Chen, M., Qi, X., Hao, S., Zhao, S., Huang, Y., Liang, B., Xiao, R., Wong, K.F.: Señorita-2m: A high-quality instruction-based dataset for gen- eral video editing by video specialists. Advances in Neural Information Processing Systems38(2026)

2026

[1] [1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Bai,J., Xia, M., Fu, X.,Wang, X.,Mu, L., Cao, J.,Liu, Z., Hu, H.,Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

2025

[2] [2]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bai, Q., Wang, Q., Ouyang, H., Yu, Y., Wang, H., Wang, W., Cheng, K.L., Ma, S., Zeng, Y., Liu, Z., et al.: Scaling instruction-based video editing with a high-quality synthetic dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37971–37981 (2026)

2026

[3] [3]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

In: SIGGRAPH Asia 2024 Conference Papers

Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024

[5] [5]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion mod- els. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

2023

[7] [7]

Brooks, T., Hellsten, J., Aittala, M., Wang, T.C., Aila, T., Lehtinen, J., Liu, M.Y., Efros,A.,Karras,T.:Generatinglongvideosofdynamicscenes.AdvancesinNeural Information Processing Systems35, 31769–31781 (2022)

2022

[8] [8]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021

[9] [9]

In: International Conference on Learning Representations

Cheng, J., Xiao, T., He, T.: Consistent video-to-video transfer using synthetic dataset. In: International Conference on Learning Representations. vol. 2024, pp. 16867–16879 (2024)

2024

[10] [10]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

In: European Conference on Computer Vision

Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: European Conference on Computer Vision. pp. 393–411. Springer (2024)

2024

[12] [12]

arXiv preprint arXiv:2512.07826 (2025)

He, H., Wang, J., Zhang, J., Xue, Z., Bu, X., Yang, Q., Wen, S., Xie, L.: Openve- 3m: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826 (2025)

work page arXiv 2025

[13] [13]

Advances in neural information processing systems35, 8633– 8646 (2022)

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems35, 8633– 8646 (2022)

2022

[14] [14]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024) Goku 17

2024

[15] [15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

2025

[16] [16]

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Ju, X., Wang, T., Zhou, Y., Zhang, H., Liu, Q., Zhao, N., Zhang, Z., Li, Y., Cai, Y., Liu, S., et al.: Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

AnyV2V: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

work page arXiv 2024

[18] [18]

In: Proceedings of the AAAI conference on artificial intelligence

Li, Y., Min, M., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

2018

[19] [19]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liang, S., Guan, F., Zhang, Y., Li, X., Chen, Z.: Cot-edit: Let cot guide instruction video editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37960–37970 (2026)

2026

[20] [20]

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

Liang, S., Wang, C., Guan, F., Yu, Z., Lu, Y., Wang, Y., Zhou, Y., Li, X., Chen, Z.: Spongebob: Sync-aware harmonious audio-visual generative editing. arXiv preprint arXiv:2605.25193 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

arXiv preprint arXiv:2506.01801 (2025)

Liang, S., Yu, Z., Zhou, Z., Hu, T., Wang, H., Chen, Y., Lin, Q., Zhou, Y., Li, X., Lu, Q., et al.: Omniv2v: Versatile video generation and editing via dynamic content manipulation. arXiv preprint arXiv:2506.01801 (2025)

work page arXiv 2025

[22] [22]

arXiv preprint arXiv:2501.08316 (2025)

Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post- training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)

work page arXiv 2025

[23] [23]

IEEE Transactions on Circuits and Systems for Video Technology (2025)

Liu,C.,Li,R.,Zhang,K.,Lan,Y.,Liu,D.:Stablev2v:Stabilizingshapeconsistency in video-to-video editing. IEEE Transactions on Circuits and Systems for Video Technology (2025)

2025

[24] [24]

Follow-your-creation: Empowering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

Ma, Y., Feng, K., Zhang, X., Liu, H., Zhang, D.J., Xing, J., Zhang, Y., Yang, A., Wang, Z., Chen, Q.: Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590 (2025)

work page arXiv 2025

[25] [25]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024)

2024

[26] [26]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025)

2025

[27] [27]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text- video data. arXiv preprint arXiv:2209.14792 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

arXiv preprint arXiv:2507.06119 (2025)

Tan, Z., Yang, H., Qin, L., Gong, J., Yang, M., Li, H.: Omni-video: Democratiz- ing unified video understanding and generation. arXiv preprint arXiv:2507.06119 (2025)

work page arXiv 2025

[30] [30]

Team, D.: Lucy edit: Open-weight text-guided video editing (2025)

2025

[31] [31]

ModelScope Text-to-Video Technical Report

Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text- to-video technical report. arXiv preprint arXiv:2308.06571 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

Wang, Q., Shi, Y., Ou, J., Chen, R., Lin, K., Wang, J., Jiang, B., Yang, H., Zheng, M., Tao, X., et al.: Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 8428–8437 (2025) 18 S. Liang et al

2025

[33] [33]

In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

Wang, Y., Bilinski, P., Bremond, F., Dantcheva, A.: Imaginator: Conditional spatio-temporal gan for video generation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 1160–1169 (2020)

2020

[34] [34]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 16692– 16701 (2025)

2025

[35] [35]

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., Feng, J.: Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[36] [36]

arXiv preprint arXiv:2410.15458 (2024)

Zhou, Y., Wang, Q., Cai, Y., Yang, H.: Allegro: Open the black box of commercial- level video generation model. arXiv preprint arXiv:2410.15458 (2024)

work page arXiv 2024

[37] [37]

Advances in Neural Infor- mation Processing Systems38, 75518–75547 (2026)

Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: Minimax- remover: Taming bad noise helps video object removal. Advances in Neural Infor- mation Processing Systems38, 75518–75547 (2026)

2026

[38] [38]

Advances in Neural Information Processing Systems38(2026)

Zi, B., Ruan, P., Chen, M., Qi, X., Hao, S., Zhao, S., Huang, Y., Liang, B., Xiao, R., Wong, K.F.: Señorita-2m: A high-quality instruction-based dataset for gen- eral video editing by video specialists. Advances in Neural Information Processing Systems38(2026)

2026