Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

Cong Wang; Fengbin Guan; Qinglin Lu; Sen Liang; Teng Hu; Xin Li; Youliang Zhang; Yuan Zhou; Zhengguang Zhou; Zhentao Yu; Zhibo Chen

arxiv: 2606.30599 · v2 · pith:3ZCJVPWMnew · submitted 2026-06-29 · 💻 cs.CV

Sen Liang, Cong Wang, Zhentao Yu, Fengbin Guan, Zhengguang Zhou, Teng Hu, Youliang Zhang, Yuan Zhou, Xin Li, Qinglin Lu, Zhibo Chen

Pith reviewed 2026-07-01 06:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords video editing · instruction-based editing · dataset · benchmark · structural manipulation · MLLM · dual-branch architecture · data synthesis pipeline

The pith

A 2-million-pair dataset extends instruction-based video editing to structural manipulations like subject movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing datasets limit video editing to single-task appearance changes and fall short of real-world creative needs. Goku supplies 2 million instruction-aligned pairs by decomposing complex edits into sub-problems and applying progressive filtering for quality. Goku-Edit processes instructions with an MLLM text encoder and a decoupled dual-branch design that isolates structural control in a mask branch. A new benchmark, Goku-Bench, supplies 1000 human-verified cases and seven editing-specific metrics. On this benchmark Goku-Edit improves instruction following by up to 8 percent over other open-source models.

Core claim

Goku supplies the first million-scale dataset of instruction-aligned video editing pairs that reaches beyond appearance editing into multi-task and structural manipulations, produced by a synthesis pipeline that decomposes edits into controllable sub-problems together with progressive filtering; the accompanying Goku-Edit model, built with an MLLM text encoder and a dedicated mask branch for structural control, reaches up to 8 percent higher instruction-following scores on the 1000-case Goku-Bench benchmark.

What carries the argument

Data synthesis pipeline that decomposes complex edits into controllable sub-problems plus progressive filtering system, paired with Goku-Edit's decoupled dual-branch architecture that routes structural control to a mask branch while the main branch handles appearance.
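
As a rough illustration of what a progressive filtering system might look like, the sketch below chains filter stages so that each candidate pair must pass every stage in order, and it records how many pairs survive each stage. The `EditPair` fields, stage names, and thresholds are hypothetical placeholders chosen for the example; the abstract does not specify the paper's actual filters.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EditPair:
    """One candidate training sample (all fields are illustrative)."""
    instruction: str
    visual_quality: float         # hypothetical score in [0, 1]
    instruction_alignment: float  # hypothetical score in [0, 1]


# A stage is a name plus a predicate that returns True to keep the pair.
Stage = tuple[str, Callable[[EditPair], bool]]


def progressive_filter(pairs: Iterable[EditPair], stages: list[Stage]):
    """Apply stages in order; return survivors and per-stage survivor counts."""
    survivors = list(pairs)
    counts = {"input": len(survivors)}
    for name, keep in stages:
        survivors = [p for p in survivors if keep(p)]
        counts[name] = len(survivors)
    return survivors, counts


# Hypothetical stages, ordered from the cheapest check to the most expensive.
STAGES: list[Stage] = [
    ("non_empty_instruction", lambda p: bool(p.instruction.strip())),
    ("visual_quality", lambda p: p.visual_quality >= 0.5),
    ("instruction_alignment", lambda p: p.instruction_alignment >= 0.7),
]

candidates = [
    EditPair("move the dog to the left", 0.9, 0.8),
    EditPair("", 0.9, 0.9),
    EditPair("remove the car", 0.3, 0.9),
    EditPair("swap the hat for a helmet", 0.8, 0.4),
]
kept, counts = progressive_filter(candidates, STAGES)
print(counts)
```

Keeping the per-stage counts makes it easy to see which stage discards the most data, which is the kind of evidence a reader would want when judging the filtering claims.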

If this is right

Models trained on Goku can execute precise subject-movement edits in addition to appearance changes.
Goku-Bench supplies a standardized testbed using seven metrics that directly measure instruction alignment and structural fidelity.
The dual-branch separation allows the main network to focus on rendering while the mask branch enforces spatial constraints.
The decomposition approach scales data creation for other multi-step generative video tasks.
Human verification of the 1000 test cases provides a reproducible reference for future instruction-based editing research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decomposition-plus-filtering method could be reused to generate training data for other forms of controllable video generation such as camera-path editing.
A dataset at this scale may support training of general-purpose video editors that accept free-form natural-language instructions without task-specific fine-tuning.
The emphasis on structural control suggests future benchmarks could add metrics for temporal consistency across longer clips.

Load-bearing premise

The data synthesis pipeline that decomposes complex edits into controllable sub-problems and the progressive filtering system produce high-quality, instruction-aligned pairs that support reliable model training and evaluation.

What would settle it

Manual review of a random sample of the 2 million pairs revealing frequent mismatches between the written instruction and the actual edit performed, or side-by-side evaluation on the 1000 test cases showing Goku-Edit no better than prior open-source models on the seven new metrics.
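
One way to make the first check concrete is to audit a random sample and put a confidence interval on the observed mismatch rate. The sketch below uses the standard Wilson score interval for a binomial proportion; the sample size and mismatch count are made-up numbers for illustration, not results from the paper.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = failures / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: 500 randomly sampled pairs, 35 judged mismatched.
low, high = wilson_interval(failures=35, n=500)
print(f"observed mismatch rate: {35 / 500:.3f}")
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

Whether an interval like this counts as "frequent mismatches" depends on a threshold the reviewer would have to fix in advance.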

Figures

Figures reproduced from arXiv: 2606.30599 by Cong Wang, Fengbin Guan, Qinglin Lu, Sen Liang, Teng Hu, Xin Li, Youliang Zhang, Yuan Zhou, Zhengguang Zhou, Zhentao Yu, Zhibo Chen.

**Figure 1.** Goku covers 10 core video editing task classes across basic and complex edits. The word cloud illustrates the instruction vocabulary distribution, while the two charts show the distributions of instruction length and frame count. … they remain largely confined to single-task and appearance-level modifications, such as object removal and single-attribute alteration. One of the primary factors contributing to …

**Figure 2.** The illustration of our automated video editing pipeline. (a) Video PreProcessing. (b) Data Generation for Different Tasks. (c) Progressive Filtering System. … MLLM-Powered Instruction Generation. We leverage the multimodal understanding capabilities of Gemini2.5-Pro to generate natural and diverse editing instructions for each task category. For Add, Remove, Swap, and Subject Movement tasks, the model fi…

**Figure 3.** Overview of Goku-Edit, featuring a dual-branch architecture with RoPE-aligned spatial cross-attention and inference-time SpatialCFG. … To validate our progressive filtering system, we include a human evaluation (100 samples per task, 3 annotators) and precision/recall analysis in our supplementary material. An alternative filtering pipeline based on open-source models (Qwen3VL-30B [3]) is also provided for …

**Figure 4.** Statistical distributions of Goku-Bench. … complete selection pipeline and criteria are detailed in the supplementary material. The final test set covers multi-person scenarios, full and half body human subjects, animals (dogs, cats, sharks, birds, etc.), common objects (clothing, vehicles, buildings, etc.), and natural landscapes (mountains, rivers, deserts, etc.). Furthermore, we specifically include cha…

**Figure 5.** Ablation study on the spatial downsampling factor n.

**Figure 6.** Comparison with state-of-the-art methods.

Original abstract

Existing instruction-based video editing datasets commonly focus on single-task appearance editing, failing to meet the complex creative demands of real-world scenarios. To bridge this gap, we present Goku, a large-scale dataset featuring 2 million high-quality, instruction-aligned video editing pairs, which is the first to extend task boundaries from basic appearance editing to multi-task and structural manipulations (e.g., precise control of subject movement). To tackle the data synthesis challenges inherent in these complex tasks, we design an efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems and introduce a progressive filtering system for data reliability throughout the whole process. Furthermore, we explore the optimal network structures on Goku, and propose Goku-Edit. To deeply comprehend complex editing instructions, Goku-Edit leverages an MLLM as its text encoder and adopts a decoupled dual-branch design: a dedicated mask branch handles structural control, freeing the main branch for appearance rendering. A comprehensive video editing benchmark, Goku-Bench, is also proposed with 1,000 human-verified test cases and 7 novel editing-specific metrics. Evaluated on Goku-Bench, Goku-Edit obtains up to +8% improvement on other open-source models in terms of instruction following.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Goku scales instruction-based video editing data to 2M pairs with structural tasks and a dual-branch model, but the gains stay modest and the quality claims need checking.

The main point is a 2-million-pair dataset that moves instruction-based video editing past single-task appearance tweaks into multi-task and structural changes such as precise control of subject movement. They also propose Goku-Edit, which uses an MLLM text encoder plus a decoupled dual-branch setup in which one branch handles masks for structure and the other handles appearance. On their new benchmark they report up to 8% better instruction following than other open-source models.

They handle the hard part of data creation by breaking complex edits into smaller controllable pieces and running progressive filters, then human-verifying 1,000 test cases and defining seven editing-specific metrics. That pipeline approach and the benchmark are the practical contributions here.

The lift is small enough that it could come from better data, the architecture split, or both, and the abstract gives no variance numbers or detailed ablations to separate those. The synthesis method sounds workable but still risks subtle misalignments between instructions and outputs that filtering might miss, especially on structural edits. A 1,000-case benchmark is useful for a start but limited for broad claims.
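
To illustrate the variance concern, the sketch below runs a paired bootstrap over per-case scores from two models evaluated on the same benchmark cases and reports a percentile interval for the mean difference. The per-case scores are synthetic stand-ins generated with a fixed seed, and the pass rates of 0.70 and 0.62 are arbitrary assumptions; nothing here reflects actual Goku-Bench results.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b) over paired cases."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample cases with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Synthetic per-case pass/fail scores for 1,000 cases (illustration only).
rng = random.Random(42)
model_a = [1 if rng.random() < 0.70 else 0 for _ in range(1000)]
model_b = [1 if rng.random() < 0.62 else 0 for _ in range(1000)]

mean_diff, lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"mean difference: {mean_diff:.3f}, 95% interval: [{lo:.3f}, {hi:.3f}]")
```

Resampling whole cases keeps each case's two scores together, so the interval reflects case-to-case variation rather than treating the two models' results as independent.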

This is for people training or evaluating generative video models who need more varied editing data. Dataset users and MLLM-for-editing folks will get the most out of it. The scale and new benchmark are substantial enough that it deserves a serious referee, even if the review will focus on data validation and metric robustness.

I would send it to review.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the Goku dataset of 2 million instruction-aligned video editing pairs that extends beyond single-task appearance editing to multi-task and structural manipulations. It describes an efficient synthesis pipeline that decomposes complex edits into sub-problems with progressive filtering, proposes the Goku-Edit model that uses an MLLM text encoder and a decoupled dual-branch architecture (mask branch for structural control, main branch for appearance), and presents the Goku-Bench benchmark consisting of 1,000 human-verified cases together with 7 novel editing-specific metrics. On this benchmark Goku-Edit is reported to achieve up to +8% improvement in instruction following relative to other open-source models.

Significance. If the data quality and benchmark reliability claims hold, the work supplies a large-scale, diverse resource that directly addresses the current limitation of existing datasets to simple appearance edits. The dual-branch design that isolates structural control is a concrete architectural response to a recognized challenge in video editing. The human verification step on the 1,000-case benchmark and the introduction of task-specific metrics constitute clear strengths that would support reproducible progress in the area.

minor comments (2)

[Abstract] The seven novel metrics are referenced but not named or briefly defined; adding their names and one-sentence characterizations would improve immediate comprehension of the evaluation protocol.
[Abstract] The interaction between the dedicated mask branch and the main appearance branch is described at a high level; a short statement on how the branches are fused or conditioned would clarify the decoupled design without requiring the reader to reach the methods section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the Goku dataset, Goku-Edit model, and Goku-Bench benchmark, as well as the recommendation for minor revision. No major comments were listed in the report, so we have no specific points to address point-by-point. We will make minor revisions to improve clarity and presentation as appropriate.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims consist of an empirical dataset construction pipeline, a new benchmark (Goku-Bench) with human-verified cases, and reported performance gains (+8% instruction following) for Goku-Edit on that benchmark. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The derivation chain is self-contained as standard ML dataset+model+eval work without reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the synthesis pipeline and filtering are referenced at a conceptual level without implementation specifics or external benchmarks.

pith-pipeline@v0.9.1-grok · 5776 in / 1133 out tokens · 48749 ms · 2026-07-01T06:33:00.587723+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 15 canonical work pages · 9 internal anchors

[1] Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)
[2] Bai, Q., Wang, Q., Ouyang, H., Yu, Y., Wang, H., Wang, W., Cheng, K.L., Ma, S., Zeng, Y., Liu, Z., et al.: Scaling instruction-based video editing with a high-quality synthetic dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37971–37981 (2026)
[3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
[4] Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
[5] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
[6] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)
[7] Brooks, T., Hellsten, J., Aittala, M., Wang, T.C., Aila, T., Lehtinen, J., Liu, M.Y., Efros, A., Karras, T.: Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems 35, 31769–31781 (2022)
[8] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
[9] Cheng, J., Xiao, T., He, T.: Consistent video-to-video transfer using synthetic dataset. In: International Conference on Learning Representations. vol. 2024, pp. 16867–16879 (2024)
[10] Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)
[11] Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: European Conference on Computer Vision. pp. 393–411. Springer (2024)
[12] He, H., Wang, J., Zhang, J., Xue, Z., Bu, X., Yang, Q., Wen, S., Xie, L.: Openve-3m: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826 (2025)
[13] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems 35, 8633–8646 (2022)
[14] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
[15] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)
[16] Ju, X., Wang, T., Zhou, Y., Zhang, H., Liu, Q., Zhao, N., Zhang, Z., Li, Y., Cai, Y., Liu, S., et al.: Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360 (2025)
[17] Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)
[18] Li, Y., Min, M., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
[19] Liang, S., Guan, F., Zhang, Y., Li, X., Chen, Z.: Cot-edit: Let cot guide instruction video editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37960–37970 (2026)
[20] Liang, S., Wang, C., Guan, F., Yu, Z., Lu, Y., Wang, Y., Zhou, Y., Li, X., Chen, Z.: Spongebob: Sync-aware harmonious audio-visual generative editing. arXiv preprint arXiv:2605.25193 (2026)
[21] Liang, S., Yu, Z., Zhou, Z., Hu, T., Wang, H., Chen, Y., Lin, Q., Zhou, Y., Li, X., Lu, Q., et al.: Omniv2v: Versatile video generation and editing via dynamic content manipulation. arXiv preprint arXiv:2506.01801 (2025)
[22] Liang, S., Zhu, K., Zhai, W., Liu, Z., Cao, Y.: Hypercorrelation evolution for video class-incremental learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 3315–3323 (2024)
[23] Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)
[24] Liu, C., Li, R., Zhang, K., Lan, Y., Liu, D.: Stablev2v: Stabilizing shape consistency in video-to-video editing. IEEE Transactions on Circuits and Systems for Video Technology (2025)
[25] Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024)
[26] Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025)
[27] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)
[28] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
[29] Tan, Z., Yang, H., Qin, L., Gong, J., Yang, M., Li, H.: Omni-video: Democratizing unified video understanding and generation. arXiv preprint arXiv:2507.06119 (2025)
[30] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)
[31] Wang, Q., Shi, Y., Ou, J., Chen, R., Lin, K., Wang, J., Jiang, B., Yang, H., Zheng, M., Tao, X., et al.: Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8428–8437 (2025)
[32] Wang, Y., Bilinski, P., Bremond, F., Dantcheva, A.: Imaginator: Conditional spatio-temporal gan for video generation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 1160–1169 (2020)
[33] Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16692–16701 (2025)
[34] Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., Feng, J.: Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018 (2022)
[35] Zhou, Y., Wang, Q., Cai, Y., Yang, H.: Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458 (2024)
[36] Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: Minimax-remover: Taming bad noise helps video object removal. Advances in Neural Information Processing Systems 38, 75518–75547 (2026)
[37] Zi, B., Ruan, P., Chen, M., Qi, X., Hao, S., Zhao, S., Huang, Y., Liang, B., Xiao, R., Wong, K.F.: Señorita-2m: A high-quality instruction-based dataset for general video editing by video specialists. Advances in Neural Information Processing Systems 38 (2026)

[1] [1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Bai,J., Xia, M., Fu, X.,Wang, X.,Mu, L., Cao, J.,Liu, Z., Hu, H.,Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

2025

[2] [2]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bai, Q., Wang, Q., Ouyang, H., Yu, Y., Wang, H., Wang, W., Cheng, K.L., Ma, S., Zeng, Y., Liu, Z., et al.: Scaling instruction-based video editing with a high-quality synthetic dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37971–37981 (2026)

2026

[3] [3]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

In: SIGGRAPH Asia 2024 Conference Papers

Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024

[5] [5]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion mod- els. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

2023

[7] [7]

Brooks, T., Hellsten, J., Aittala, M., Wang, T.C., Aila, T., Lehtinen, J., Liu, M.Y., Efros,A.,Karras,T.:Generatinglongvideosofdynamicscenes.AdvancesinNeural Information Processing Systems35, 31769–31781 (2022)

2022

[8] [8]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021

[9] [9]

In: International Conference on Learning Representations

Cheng, J., Xiao, T., He, T.: Consistent video-to-video transfer using synthetic dataset. In: International Conference on Learning Representations. vol. 2024, pp. 16867–16879 (2024)

2024

[10] [10]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

In: European Conference on Computer Vision

Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: European Conference on Computer Vision. pp. 393–411. Springer (2024)

2024

[12] [12]

arXiv preprint arXiv:2512.07826 (2025)

He, H., Wang, J., Zhang, J., Xue, Z., Bu, X., Yang, Q., Wen, S., Xie, L.: Openve- 3m: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826 (2025)

work page arXiv 2025

[13] [13]

Advances in neural information processing systems35, 8633– 8646 (2022)

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems35, 8633– 8646 (2022)

2022

[14] [14]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024) Goku 17

2024

[15] [15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

2025

[16] [16]

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Ju, X., Wang, T., Zhou, Y., Zhang, H., Liu, Q., Zhao, N., Zhang, Z., Li, Y., Cai, Y., Liu, S., et al.: Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Anyv2v: A tuning-free framework for any video-to- video editing tasks.arXiv preprint arXiv:2403.14468, 2024

Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

work page arXiv 2024

[18] [18]

In: Proceedings of the AAAI conference on artificial intelligence

Li, Y., Min, M., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

2018

[19] [19]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liang, S., Guan, F., Zhang, Y., Li, X., Chen, Z.: Cot-edit: Let cot guide instruction video editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37960–37970 (2026)

2026

[20] [20]

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

Liang, S., Wang, C., Guan, F., Yu, Z., Lu, Y., Wang, Y., Zhou, Y., Li, X., Chen, Z.: Spongebob: Sync-aware harmonious audio-visual generative editing. arXiv preprint arXiv:2605.25193 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

arXiv preprint arXiv:2506.01801 (2025)

Liang, S., Yu, Z., Zhou, Z., Hu, T., Wang, H., Chen, Y., Lin, Q., Zhou, Y., Li, X., Lu, Q., et al.: Omniv2v: Versatile video generation and editing via dynamic content manipulation. arXiv preprint arXiv:2506.01801 (2025)

work page arXiv 2025

[22] [22]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Liang, S., Zhu, K., Zhai, W., Liu, Z., Cao, Y.: Hypercorrelation evolution for video class-incremental learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 3315–3323 (2024)

2024

[23] Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)

[24] Liu, C., Li, R., Zhang, K., Lan, Y., Liu, D.: Stablev2v: Stabilizing shape consistency in video-to-video editing. IEEE Transactions on Circuits and Systems for Video Technology (2025)

[25] Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024)

[26] Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025)

[27] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)

[28] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)

[29] Tan, Z., Yang, H., Qin, L., Gong, J., Yang, M., Li, H.: Omni-video: Democratizing unified video understanding and generation. arXiv preprint arXiv:2507.06119 (2025)

[30] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)

[31] Wang, Q., Shi, Y., Ou, J., Chen, R., Lin, K., Wang, J., Jiang, B., Yang, H., Zheng, M., Tao, X., et al.: Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8428–8437 (2025)

[32] Wang, Y., Bilinski, P., Bremond, F., Dantcheva, A.: Imaginator: Conditional spatio-temporal gan for video generation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 1160–1169 (2020)

[33] Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16692–16701 (2025)

[34] Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., Feng, J.: Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018 (2022)

[35] Zhou, Y., Wang, Q., Cai, Y., Yang, H.: Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458 (2024)

[36] Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: Minimax-remover: Taming bad noise helps video object removal. Advances in Neural Information Processing Systems 38, 75518–75547 (2026)

[37] Zi, B., Ruan, P., Chen, M., Qi, X., Hao, S., Zhao, S., Huang, Y., Liang, B., Xiao, R., Wong, K.F.: Señorita-2m: A high-quality instruction-based dataset for general video editing by video specialists. Advances in Neural Information Processing Systems 38 (2026)