Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing
Pith reviewed 2026-07-01 06:33 UTC · model grok-4.3
The pith
A 2-million-pair dataset extends instruction-based video editing to structural manipulations like subject movement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Goku supplies the first million-scale dataset of instruction-aligned video editing pairs that reaches beyond appearance editing into multi-task and structural manipulations, produced by a synthesis pipeline that decomposes edits into controllable sub-problems together with progressive filtering; the accompanying Goku-Edit model, built with an MLLM text encoder and a dedicated mask branch for structural control, reaches up to 8 percent higher instruction-following scores on the 1000-case Goku-Bench benchmark.
What carries the argument
A data synthesis pipeline that decomposes complex edits into controllable sub-problems, backed by a progressive filtering system, and paired with Goku-Edit's decoupled dual-branch architecture, which routes structural control to a mask branch while the main branch handles appearance.
If this is right
- Models trained on Goku can execute precise subject-movement edits in addition to appearance changes.
- Goku-Bench supplies a standardized testbed using seven metrics that directly measure instruction alignment and structural fidelity.
- The dual-branch separation allows the main network to focus on rendering while the mask branch enforces spatial constraints.
- The decomposition approach scales data creation for other multi-step generative video tasks.
- Human verification of the 1000 test cases provides a reproducible reference for future instruction-based editing research.
Where Pith is reading between the lines
- The decomposition-plus-filtering method could be reused to generate training data for other forms of controllable video generation such as camera-path editing.
- A dataset at this scale may support training of general-purpose video editors that accept free-form natural-language instructions without task-specific fine-tuning.
- The emphasis on structural control suggests future benchmarks could add metrics for temporal consistency across longer clips.
Load-bearing premise
The data synthesis pipeline that decomposes complex edits into controllable sub-problems and the progressive filtering system produce high-quality, instruction-aligned pairs that support reliable model training and evaluation.
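To make the premise concrete, here is a minimal sketch of what a progressive filtering cascade could look like. It is not the authors' pipeline: the abstract does not name the filtering stages, so the `EditPair` fields, the stage names, and the 0.5 thresholds below are all hypothetical and chosen only for illustration. The point it shows is that each stage sees only the survivors of the previous one, and that per-stage survivor counts make the attrition auditable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class EditPair:
    """One candidate training pair. All score fields are hypothetical."""
    instruction: str
    source_quality: float         # assumed 0..1 score for the source clip
    instruction_alignment: float  # assumed 0..1 agreement between text and edit
    temporal_consistency: float   # assumed 0..1 score for the edited clip


# A stage is a (name, keep-predicate) pair.
Stage = Tuple[str, Callable[[EditPair], bool]]

# Illustrative stages and thresholds; the paper's actual criteria are not given
# in the abstract.
EXAMPLE_STAGES: List[Stage] = [
    ("source_quality", lambda p: p.source_quality >= 0.5),
    ("instruction_alignment", lambda p: p.instruction_alignment >= 0.5),
    ("temporal_consistency", lambda p: p.temporal_consistency >= 0.5),
]


def progressive_filter(
    pairs: Iterable[EditPair], stages: List[Stage]
) -> Tuple[List[EditPair], List[Tuple[str, int]]]:
    """Apply stages in order; return survivors and a per-stage count report."""
    survivors = list(pairs)
    report = [("input", len(survivors))]
    for name, keep in stages:
        # Each stage filters only what the previous stage let through.
        survivors = [p for p in survivors if keep(p)]
        report.append((name, len(survivors)))
    return survivors, report
```

The per-stage report is the part a reader of the paper would want to see published: a stage that discards almost nothing is not doing much work, and one that discards almost everything points to an upstream synthesis problem.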
What would settle it
Manual review of a random sample of the 2 million pairs revealing frequent mismatches between the written instruction and the actual edit performed, or side-by-side evaluation on the 1000 test cases showing Goku-Edit no better than prior open-source models on the seven new metrics.
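The manual-review half of this test can be framed as estimating a mismatch rate from a random sample. The sketch below uses a standard Wilson score interval for a binomial proportion; the function name, the sample sizes in the usage note, and the choice of a 95 percent level (z = 1.96) are assumptions for illustration, not anything specified by the paper.

```python
import math
from typing import Tuple


def wilson_interval(
    mismatches: int, sample_size: int, z: float = 1.96
) -> Tuple[float, float]:
    """Wilson score interval for the true mismatch rate.

    mismatches:  audited pairs whose edit does not match the instruction
    sample_size: number of randomly sampled pairs that were audited
    z:           normal quantile (1.96 is roughly a 95 percent interval)
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= mismatches <= sample_size:
        raise ValueError("mismatches must lie between 0 and sample_size")

    p_hat = mismatches / sample_size
    z2 = z * z
    denom = 1.0 + z2 / sample_size
    centre = (p_hat + z2 / (2.0 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1.0 - p_hat) / sample_size + z2 / (4.0 * sample_size**2))
        / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

For example, `wilson_interval(50, 1000)` bounds the dataset-wide mismatch rate implied by finding 50 bad pairs in a random audit of 1,000. The interval is only meaningful if the audited pairs are drawn uniformly at random from the full 2 million, not from a hand-picked subset.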
Original abstract
Existing instruction-based video editing datasets commonly focus on single-task appearance editing, failing to meet the complex creative demands of real-world scenarios. To bridge this gap, we present Goku, a large-scale dataset featuring 2 million high-quality, instruction-aligned video editing pairs, which is the first to extend task boundaries from basic appearance editing to multi-task and structural manipulations (e.g., precise control of subject movement). To tackle the data synthesis challenges inherent in these complex tasks, we design an efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems and introduce a progressive filtering system for data reliability throughout the whole process. Furthermore, we explore the optimal network structures on Goku, and propose Goku-Edit. To deeply comprehend complex editing instructions, Goku-Edit leverages an MLLM as its text encoder and adopts a decoupled dual-branch design: a dedicated mask branch handles structural control, freeing the main branch for appearance rendering. A comprehensive video editing benchmark, Goku-Bench, is also proposed with 1,000 human-verified test cases and 7 novel editing-specific metrics. Evaluated on Goku-Bench, Goku-Edit obtains up to +8% improvement over other open-source models in terms of instruction following.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Goku dataset of 2 million instruction-aligned video editing pairs that extends beyond single-task appearance editing to multi-task and structural manipulations. It describes an efficient synthesis pipeline that decomposes complex edits into sub-problems with progressive filtering, proposes the Goku-Edit model that uses an MLLM text encoder and a decoupled dual-branch architecture (mask branch for structural control, main branch for appearance), and presents the Goku-Bench benchmark consisting of 1,000 human-verified cases together with 7 novel editing-specific metrics. On this benchmark Goku-Edit is reported to achieve up to +8% improvement in instruction following relative to other open-source models.
Significance. If the data quality and benchmark reliability claims hold, the work supplies a large-scale, diverse resource that directly addresses the current limitation of existing datasets to simple appearance edits. The dual-branch design that isolates structural control is a concrete architectural response to a recognized challenge in video editing. The human verification step on the 1,000-case benchmark and the introduction of task-specific metrics constitute clear strengths that would support reproducible progress in the area.
minor comments (2)
- [Abstract] The seven novel metrics are referenced but not named or briefly defined; adding their names and one-sentence characterizations would improve immediate comprehension of the evaluation protocol.
- [Abstract] The interaction between the dedicated mask branch and the main appearance branch is described at a high level; a short statement on how the branches are fused or conditioned would clarify the decoupled design without requiring the reader to reach the methods section.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the Goku dataset, Goku-Edit model, and Goku-Bench benchmark, as well as the recommendation for minor revision. No major comments were listed in the report, so we have no specific points to address point-by-point. We will make minor revisions to improve clarity and presentation as appropriate.
Circularity Check
No significant circularity detected
full rationale
The paper's central claims consist of an empirical dataset construction pipeline, a new benchmark (Goku-Bench) with human-verified cases, and reported performance gains (+8% instruction following) for Goku-Edit on that benchmark. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The derivation chain is self-contained as standard ML dataset+model+eval work without reduction of results to inputs by construction.