arxiv: 2606.30599 · v2 · pith:3ZCJVPWMnew · submitted 2026-06-29 · 💻 cs.CV

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

Pith reviewed 2026-07-01 06:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords video editing · instruction-based editing · dataset · benchmark · structural manipulation · MLLM · dual-branch architecture · data synthesis pipeline

The pith

A 2-million-pair dataset extends instruction-based video editing to structural manipulations like subject movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing datasets limit video editing to single-task appearance changes and fall short of real-world creative needs. Goku supplies 2 million instruction-aligned pairs by decomposing complex edits into sub-problems and applying progressive filtering for quality. Goku-Edit processes instructions with an MLLM text encoder and a decoupled dual-branch design that isolates structural control in a mask branch. A new benchmark, Goku-Bench, supplies 1000 human-verified cases and seven editing-specific metrics. On this benchmark Goku-Edit improves instruction following by up to 8 percent over other open-source models.

Core claim

Goku supplies the first million-scale dataset of instruction-aligned video editing pairs that reaches beyond appearance editing into multi-task and structural manipulations, produced by a synthesis pipeline that decomposes edits into controllable sub-problems together with progressive filtering; the accompanying Goku-Edit model, built with an MLLM text encoder and a dedicated mask branch for structural control, reaches up to 8 percent higher instruction-following scores on the 1000-case Goku-Bench benchmark.

What carries the argument

A data synthesis pipeline that decomposes complex edits into controllable sub-problems, plus a progressive filtering system, paired with Goku-Edit's decoupled dual-branch architecture, which routes structural control to a mask branch while the main branch handles appearance.
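
To make the filtering half of this concrete, here is a minimal sketch of a progressive (cascaded) filter: cheap checks run first and a pair is dropped at its first failure. The abstract does not describe the actual stages, so `EditPair`, the three stage names, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EditPair:
    """One candidate training example. All fields are illustrative assumptions."""
    instruction: str
    source_frames: int
    edited_frames: int
    alignment_score: float  # hypothetical score in [0, 1] from some judge model


# A stage is a name plus a predicate; cheaper checks are listed first.
Stage = tuple[str, Callable[[EditPair], bool]]


def progressive_filter(pairs: Iterable[EditPair], stages: list[Stage]):
    """Run every pair through the stages in order and drop it at its first failure."""
    kept = []
    rejected = {name: 0 for name, _ in stages}
    for pair in pairs:
        for name, check in stages:
            if not check(pair):
                rejected[name] += 1
                break
        else:  # no stage rejected the pair
            kept.append(pair)
    return kept, rejected


# Hypothetical stages, not the paper's: ordered from cheapest to most expensive.
STAGES: list[Stage] = [
    ("non_empty_instruction", lambda p: bool(p.instruction.strip())),
    ("frame_count_match", lambda p: p.source_frames == p.edited_frames),
    ("alignment_threshold", lambda p: p.alignment_score >= 0.8),
]

candidates = [
    EditPair("Move the dog to the left side", 49, 49, 0.93),
    EditPair("", 49, 49, 0.95),
    EditPair("Remove the car", 49, 33, 0.90),
    EditPair("Swap the hat for a helmet", 49, 49, 0.41),
]
kept, rejected = progressive_filter(candidates, STAGES)
print(len(kept), rejected)
```

Recording which stage rejected each pair is a design choice worth keeping in any real pipeline, since it shows where most of the synthetic data is being lost.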

If this is right

  • Models trained on Goku can execute precise subject-movement edits in addition to appearance changes.
  • Goku-Bench supplies a standardized testbed using seven metrics that directly measure instruction alignment and structural fidelity.
  • The dual-branch separation allows the main network to focus on rendering while the mask branch enforces spatial constraints.
  • The decomposition approach scales data creation for other multi-step generative video tasks.
  • Human verification of the 1000 test cases provides a reproducible reference for future instruction-based editing research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The decomposition-plus-filtering method could be reused to generate training data for other forms of controllable video generation such as camera-path editing.
  • A dataset at this scale may support training of general-purpose video editors that accept free-form natural-language instructions without task-specific fine-tuning.
  • The emphasis on structural control suggests future benchmarks could add metrics for temporal consistency across longer clips.

Load-bearing premise

The data synthesis pipeline that decomposes complex edits into controllable sub-problems and the progressive filtering system produce high-quality, instruction-aligned pairs that support reliable model training and evaluation.

What would settle it

Manual review of a random sample of the 2 million pairs revealing frequent mismatches between the written instruction and the actual edit performed, or side-by-side evaluation on the 1000 test cases showing Goku-Edit no better than prior open-source models on the seven new metrics.
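
The first test above can be sized with a standard binomial confidence interval. The sketch below draws a reproducible random sample of indices from a 2-million-item dataset and computes a Wilson score interval for the mismatch rate. The sample size of 400 and the count of 20 mismatches are hypothetical inputs, not results from the paper or from Pith.

```python
import math
import random


def wilson_interval(mismatches: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = mismatches / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def draw_audit_sample(dataset_size: int, sample_size: int, seed: int = 0):
    """Pick distinct item indices uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), sample_size)


# Hypothetical audit: 400 pairs reviewed by hand, 20 judged mismatched.
indices = draw_audit_sample(2_000_000, 400)
low, high = wilson_interval(mismatches=20, n=400)
print(f"mismatch rate is plausibly between {low:.3f} and {high:.3f}")
```

A few hundred reviewed pairs already bound the mismatch rate to within a few percentage points, so the audit is cheap relative to the size of the dataset.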

Figures

Figures reproduced from arXiv: 2606.30599 by Cong Wang, Fengbin Guan, Qinglin Lu, Sen Liang, Teng Hu, Xin Li, Youliang Zhang, Yuan Zhou, Zhengguang Zhou, Zhentao Yu, Zhibo Chen.

Figure 1: Goku covers 10 core video editing task classes across basic and complex edits. The word cloud illustrates the instruction vocabulary distribution, while the two charts show the distributions of instruction length and frame count.
… they remain largely confined to single-task and appearance-level modifications, such as object removal and single-attribute alteration. One of the primary factors contributing to …
Figure 2: The illustration of our automated video editing pipeline. (a) Video Pre-Processing. (b) Data Generation for Different Tasks. (c) Progressive Filtering System.
MLLM-Powered Instruction Generation. We leverage the multimodal understanding capabilities of Gemini2.5-Pro to generate natural and diverse editing instructions for each task category. For Add, Remove, Swap, and Subject Movement tasks, the model fi…
Figure 3: Overview of Goku-Edit, featuring a dual-branch architecture with RoPE-aligned spatial cross-attention and inference-time SpatialCFG.
To validate our progressive filtering system, we include a human evaluation (100 samples per task, 3 annotators) and precision/recall analysis in our supplementary material. An alternative filtering pipeline based on open-source models (Qwen3VL-30B [3]) is also provided for …
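
The Figure 3 excerpt mentions a human evaluation with 3 annotators and a precision/recall analysis of the filtering system. One plausible way to compute such numbers, sketched here under stated assumptions, is to take the annotators' majority vote as ground truth and score the filter's keep decisions against it. The excerpt does not say how the paper aggregates annotations, and the toy labels below are invented for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by most annotators (an odd count avoids ties)."""
    return Counter(labels).most_common(1)[0][0]


def precision_recall(filter_keeps, human_keeps):
    """Precision and recall of the filter's 'keep' decisions against human labels."""
    tp = sum(f and h for f, h in zip(filter_keeps, human_keeps))
    fp = sum(f and not h for f, h in zip(filter_keeps, human_keeps))
    fn = sum(h and not f for f, h in zip(filter_keeps, human_keeps))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: each inner list holds 3 annotators' "is this pair acceptable?" votes.
annotations = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, True],
]
filter_keeps = [True, True, True, False]  # hypothetical filter decisions

human_keeps = [majority_vote(votes) for votes in annotations]
precision, recall = precision_recall(filter_keeps, human_keeps)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here precision measures how many kept pairs the annotators also accept, and recall measures how many acceptable pairs the filter keeps rather than discards.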
Figure 4: Statistical distributions of Goku-Bench.
… complete selection pipeline and criteria are detailed in the supplementary material. The final test set covers multi-person scenarios, full and half body human subjects, animals (dogs, cats, sharks, birds, etc.), common objects (clothing, vehicles, buildings, etc.), and natural landscapes (mountains, rivers, deserts, etc.). Furthermore, we specifically include cha…
Figure 5: Ablation study on the spatial downsampling factor n
Figure 6: Comparison with state-of-the-art methods
Original abstract

Existing instruction-based video editing datasets commonly focus on single-task appearance editing, failing to meet the complex creative demands of real-world scenarios. To bridge this gap, we present Goku, a large-scale dataset featuring 2 million high-quality, instruction-aligned video editing pairs, which is the first to extend task boundaries from basic appearance editing to multi-task and structural manipulations (e.g., precise control of subject movement). To tackle the data synthesis challenges inherent in these complex tasks, we design an efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems and introduce a progressive filtering system for data reliability throughout the whole process. Furthermore, we explore the optimal network structures on Goku, and propose Goku-Edit. To deeply comprehend complex editing instructions, Goku-Edit leverages an MLLM as its text encoder and adopts a decoupled dual-branch design: a dedicated mask branch handles structural control, freeing the main branch for appearance rendering. A comprehensive video editing benchmark, Goku-Bench, is also proposed with 1,000 human-verified test cases and 7 novel editing-specific metrics. Evaluated on Goku-Bench, Goku-Edit obtains up to +8% improvement over other open-source models in terms of instruction following.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the Goku dataset of 2 million instruction-aligned video editing pairs that extends beyond single-task appearance editing to multi-task and structural manipulations. It describes an efficient synthesis pipeline that decomposes complex edits into sub-problems with progressive filtering, proposes the Goku-Edit model that uses an MLLM text encoder and a decoupled dual-branch architecture (mask branch for structural control, main branch for appearance), and presents the Goku-Bench benchmark consisting of 1,000 human-verified cases together with 7 novel editing-specific metrics. On this benchmark Goku-Edit is reported to achieve up to +8% improvement in instruction following relative to other open-source models.

Significance. If the data quality and benchmark reliability claims hold, the work supplies a large-scale, diverse resource that directly addresses the current limitation of existing datasets to simple appearance edits. The dual-branch design that isolates structural control is a concrete architectural response to a recognized challenge in video editing. The human verification step on the 1,000-case benchmark and the introduction of task-specific metrics constitute clear strengths that would support reproducible progress in the area.

minor comments (2)
  1. [Abstract] The seven novel metrics are referenced but not named or briefly defined; adding their names and one-sentence characterizations would improve immediate comprehension of the evaluation protocol.
  2. [Abstract] The interaction between the dedicated mask branch and the main appearance branch is described at a high level; a short statement on how the branches are fused or conditioned would clarify the decoupled design without requiring the reader to reach the methods section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the Goku dataset, Goku-Edit model, and Goku-Bench benchmark, as well as the recommendation for minor revision. No major comments were listed in the report, so there is nothing to address point by point. We will make minor revisions to improve clarity and presentation as appropriate.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims consist of an empirical dataset construction pipeline, a new benchmark (Goku-Bench) with human-verified cases, and reported performance gains (+8% instruction following) for Goku-Edit on that benchmark. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. This is standard ML dataset, model, and evaluation work, and none of its results reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the synthesis pipeline and filtering are referenced at a conceptual level without implementation specifics or external benchmarks.

pith-pipeline@v0.9.1-grok · 5776 in / 1133 out tokens · 48749 ms · 2026-07-01T06:33:00.587723+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 15 canonical work pages · 9 internal anchors

  [1] Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)
  [2] Bai, Q., Wang, Q., Ouyang, H., Yu, Y., Wang, H., Wang, W., Cheng, K.L., Ma, S., Zeng, Y., Liu, Z., et al.: Scaling instruction-based video editing with a high-quality synthetic dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37971–37981 (2026)
  [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
  [4] Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
  [5] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  [6] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22563–22575 (2023)
  [7] Brooks, T., Hellsten, J., Aittala, M., Wang, T.C., Aila, T., Lehtinen, J., Liu, M.Y., Efros, A., Karras, T.: Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems 35, 31769–31781 (2022)
  [8] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
  [9] Cheng, J., Xiao, T., He, T.: Consistent video-to-video transfer using synthetic dataset. In: International Conference on Learning Representations. vol. 2024, pp. 16867–16879 (2024)
  [10] Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)
  [11] Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: European Conference on Computer Vision. pp. 393–411. Springer (2024)
  [12] He, H., Wang, J., Zhang, J., Xue, Z., Bu, X., Yang, Q., Wen, S., Xie, L.: Openve-3m: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826 (2025)
  [13] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in Neural Information Processing Systems 35, 8633–8646 (2022)
  [14] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
  [15] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)
  [16] Ju, X., Wang, T., Zhou, Y., Zhang, H., Liu, Q., Zhao, N., Zhang, Z., Li, Y., Cai, Y., Liu, S., et al.: Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360 (2025)
  [17] Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)
  [18] Li, Y., Min, M., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
  [19] Liang, S., Guan, F., Zhang, Y., Li, X., Chen, Z.: Cot-edit: Let cot guide instruction video editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37960–37970 (2026)
  [20] Liang, S., Wang, C., Guan, F., Yu, Z., Lu, Y., Wang, Y., Zhou, Y., Li, X., Chen, Z.: Spongebob: Sync-aware harmonious audio-visual generative editing. arXiv preprint arXiv:2605.25193 (2026)
  [21] Liang, S., Yu, Z., Zhou, Z., Hu, T., Wang, H., Chen, Y., Lin, Q., Zhou, Y., Li, X., Lu, Q., et al.: Omniv2v: Versatile video generation and editing via dynamic content manipulation. arXiv preprint arXiv:2506.01801 (2025)
  [22] Liang, S., Zhu, K., Zhai, W., Liu, Z., Cao, Y.: Hypercorrelation evolution for video class-incremental learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 3315–3323 (2024)
  [23] Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)
  [24] Liu, C., Li, R., Zhang, K., Lan, Y., Liu, D.: Stablev2v: Stabilizing shape consistency in video-to-video editing. IEEE Transactions on Circuits and Systems for Video Technology (2025)
  [25] Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024)
  [26] Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025)
  [27] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)
  [28] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
  [29] Tan, Z., Yang, H., Qin, L., Gong, J., Yang, M., Li, H.: Omni-video: Democratizing unified video understanding and generation. arXiv preprint arXiv:2507.06119 (2025)
  [30] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)
  [31] Wang, Q., Shi, Y., Ou, J., Chen, R., Lin, K., Wang, J., Jiang, B., Yang, H., Zheng, M., Tao, X., et al.: Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8428–8437 (2025)
  [32] Wang, Y., Bilinski, P., Bremond, F., Dantcheva, A.: Imaginator: Conditional spatio-temporal gan for video generation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1160–1169 (2020)
  [33] Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16692–16701 (2025)
  [34] Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., Feng, J.: Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018 (2022)
  [35] Zhou, Y., Wang, Q., Cai, Y., Yang, H.: Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458 (2024)
  [36] Zi, B., Peng, W., Qi, X., Wang, J., Zhao, S., Xiao, R., Wong, K.F.: Minimax-remover: Taming bad noise helps video object removal. Advances in Neural Information Processing Systems 38, 75518–75547 (2026)
  [37] Zi, B., Ruan, P., Chen, M., Qi, X., Hao, S., Zhao, S., Huang, Y., Liang, B., Xiao, R., Wong, K.F.: Señorita-2m: A high-quality instruction-based dataset for general video editing by video specialists. Advances in Neural Information Processing Systems 38 (2026)