
arXiv: 2603.05539 · v2 · submitted 2026-03-04 · cs.LG · cs.AI · cs.IR · cs.MM

VDCook: DIY video data cook your MLLMs

Pith reviewed 2026-05-15 16:49 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.IR · cs.MM
keywords VDCook · video data construction · MLLMs · self-evolving datasets · automated data ingestion · data provenance · multimodal training data · MCP

The pith

VDCook automatically builds and continuously updates specialized video datasets for multimodal models from natural language queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

VDCook is a configurable platform that accepts natural language requests along with parameters for scale, retrieval-synthesis balance, and quality thresholds. It then runs concurrent real-video retrieval and controlled synthesis to produce in-domain data packages that include full provenance, multi-dimensional metadata such as scene segmentation and motion scores, and reproducible notebooks. The system relies on MCP-based automated ingestion to convert static one-time datasets into self-updating open ecosystems that support ongoing domain expansion without repeated manual rebuilding. This infrastructure-level approach targets the high cost and effort of creating vertical-domain video training data for MLLMs.
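
As a rough illustration of the request described above, the sketch below models a query plus its three adjustable parameters and splits the requested scale between retrieval and synthesis. The class name `DataRequest`, its field names, and the `split` helper are assumptions made for this example; the paper does not publish a request schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequest:
    """Hypothetical request object; field names are illustrative, not VDCook's API."""

    query: str                # natural language description of the target domain
    scale: int                # total number of clips requested
    synthesis_ratio: float    # share of clips to synthesize, in [0, 1]
    quality_threshold: float  # minimum quality score a clip must reach, in [0, 1]

    def __post_init__(self) -> None:
        # Reject parameter values that have no sensible interpretation.
        if self.scale < 0:
            raise ValueError("scale must be non-negative")
        if not 0.0 <= self.synthesis_ratio <= 1.0:
            raise ValueError("synthesis_ratio must be in [0, 1]")
        if not 0.0 <= self.quality_threshold <= 1.0:
            raise ValueError("quality_threshold must be in [0, 1]")

    def split(self) -> tuple[int, int]:
        """Return (retrieved, synthesized) clip counts that sum to scale."""
        synthesized = round(self.scale * self.synthesis_ratio)
        return self.scale - synthesized, synthesized


request = DataRequest(
    query="road waterlogging after heavy rain",
    scale=1000,
    synthesis_ratio=0.25,
    quality_threshold=0.6,
)
retrieved, synthesized = request.split()
```

Validating the parameters at construction time keeps a malformed request from reaching the retrieval or synthesis stages, which is one plausible way to make a package reproducible from its request alone.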

Core claim

VDCook establishes a self-evolving video data operating system. Users issue natural language queries with adjustable parameters; the platform then optimizes the query, runs real-video retrieval and controlled synthesis concurrently, and outputs complete in-domain data packages with provenance, metadata annotations, and reproducible notebooks. MCP-driven automated ingestion turns static datasets into dynamically evolving ecosystems.

What carries the argument

MCP-based automated data ingestion mechanism that orchestrates concurrent real-video retrieval and controlled synthesis modules.
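
The concurrency in this mechanism can be pictured with a minimal sketch, assuming the two modules are independent and can be awaited side by side. The functions `retrieve_real_clips`, `synthesize_clips`, and `build_package` are hypothetical stand-ins that return placeholder records; nothing here implements MCP or reflects VDCook's actual code.

```python
import asyncio


async def retrieve_real_clips(query: str, count: int) -> list[dict]:
    """Stand-in for a real-video retrieval module (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return [
        {"id": f"real-{i}", "source": "retrieval", "query": query}
        for i in range(count)
    ]


async def synthesize_clips(query: str, count: int) -> list[dict]:
    """Stand-in for a controlled synthesis module (hypothetical)."""
    await asyncio.sleep(0)
    return [
        {"id": f"synth-{i}", "source": "synthesis", "query": query}
        for i in range(count)
    ]


async def build_package(query: str, n_real: int, n_synth: int) -> list[dict]:
    """Run both modules concurrently and merge their outputs."""
    real, synth = await asyncio.gather(
        retrieve_real_clips(query, n_real),
        synthesize_clips(query, n_synth),
    )
    # Each record keeps a "source" tag so provenance survives the merge.
    return real + synth


package = asyncio.run(build_package("dump trucks on construction sites", 3, 1))
```

`asyncio.gather` returns results in the order the awaitables were passed, so the merged package lists retrieved clips first regardless of which module finishes earlier.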

If this is right

  • Datasets receive continuous updates and domain expansion through automated ingestion rather than periodic manual reconstruction.
  • Multi-dimensional metadata annotations support flexible later-stage data cooking and indexing.
  • Each generated package includes reproducible notebooks that enable exact recreation of the data construction process.
  • The platform supports community contributions under a governance model for shared ecosystem growth.
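
To make the metadata-driven "cooking" in the list above concrete, here is a small sketch of filtering clips by their annotations. The `ClipMetadata` fields, the `cook` function, the example URLs, and the scores are all invented for illustration; the paper names motion scoring, OCR ratio, and captioning as annotation types but does not specify a record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipMetadata:
    """Illustrative per-clip annotations; field names are assumptions."""

    clip_id: str
    source_url: str      # provenance: where the clip came from
    motion_score: float  # higher means more motion in the clip
    ocr_ratio: float     # fraction of frame area covered by text
    caption: str


def cook(
    clips: list[ClipMetadata], min_motion: float, max_ocr_ratio: float
) -> list[ClipMetadata]:
    """Keep clips with enough motion and little on-screen text."""
    return [
        c
        for c in clips
        if c.motion_score >= min_motion and c.ocr_ratio <= max_ocr_ratio
    ]


# Made-up records, used only to exercise the filter.
clips = [
    ClipMetadata("a", "https://example.org/a.mp4", 0.8, 0.02, "koi swimming in a pond"),
    ClipMetadata("b", "https://example.org/b.mp4", 0.1, 0.01, "static shot of a wall"),
    ClipMetadata("c", "https://example.org/c.mp4", 0.7, 0.40, "slide with dense text"),
]
selected = cook(clips, min_motion=0.5, max_ocr_ratio=0.1)
selected_ids = [c.clip_id for c in selected]
```

Because the filter reads only stored annotations, the same package can be re-cooked with different thresholds without touching the underlying video files.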

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration of VDCook outputs into existing MLLM fine-tuning pipelines could shorten the time from domain identification to usable training data.
  • Similar automated ingestion patterns might extend to other modalities such as audio or sensor streams where provenance tracking is required.
  • Long-term maintenance costs would shift from data collection labor to oversight of the ingestion parameters and quality thresholds.
  • Open-ecosystem growth depends on adoption incentives that encourage users to contribute new ingestion rules or verified data packages.

Load-bearing premise

The automated retrieval and synthesis modules can reliably generate high-quality, in-domain video data without substantial manual curation or quality loss.

What would settle it

A controlled experiment measuring downstream MLLM accuracy on domain-specific video tasks when trained on VDCook-generated packages versus matched manually curated datasets of identical scale.
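
One way such a comparison could be analyzed is a paired bootstrap over per-example correctness on a shared test set. The sketch below is an assumption about the analysis, not something the paper proposes, and the two result lists are toy values rather than measurements.

```python
import random


def accuracy(correct: list[bool]) -> float:
    """Fraction of test examples answered correctly."""
    return sum(correct) / len(correct)


def paired_bootstrap_diff(
    a: list[bool], b: list[bool], n_resamples: int = 2000, seed: int = 0
) -> tuple[float, float, float]:
    """Return the accuracy difference (a - b) and a 95% percentile interval.

    Both lists must score the same test examples in the same order.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return accuracy(a) - accuracy(b), low, high


# Toy per-example outcomes (illustrative only, not experimental results).
generated_data_model = [True] * 8 + [False] * 2
curated_data_model = [True] * 6 + [False] * 4
observed, low, high = paired_bootstrap_diff(generated_data_model, curated_data_model)
```

If the interval excludes zero on a realistically sized test set, the two training sources differ measurably; an interval straddling zero would mean the experiment cannot separate them at that scale.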

Figures

Figures reproduced from arXiv: 2603.05539 by Chengwei Wu.

Figure 1. High-level architecture of VDCook. The system comprises automated ingestion …
Figure 2. Analysis of our corpus: (a) Resolution distribution showing high-fidelity content; (b) …
Figure 3. Representative samples of road waterlogging events. Such scenarios are rare in generic …
Figure 4. Examples of dump trucks in construction environments. These scenes involve heavy …
Figure 5. Clips of road snow accumulation under various lighting and weather conditions. Such …
Figure 6. Examples of fallen trees or urban greenery after storms or accidents. These events …
Figure 7. Representative pulmonary CT angiography video sequences. Medical imaging scenar…
Figure 8. Examples of embodied sequential manipulation tasks. The clips highlight multi-step …
Figure 9. Multimodal digital human samples with speech, facial expression, and subtitle align…
Figure 10. Chinese ink-wash style video samples. The dataset supports stylized generative …
Figure 11. Examples of physics-driven interactions and temporally consistent object dynamics. …
Figure 12. Left: Wan-1.3B base model output. Right: Fine-tuned model on our ink-wash …
Figure 13. Left: Wan-1.3B base model output. Right: Fine-tuned model. Prompt: “Delicate …
Figure 14. Left: Wan-1.3B base model output. Right: Fine-tuned model. Prompt: “Elegant koi …
Original abstract

We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol) [1], transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data 'cooking' and indexing [28]. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: https://screenapp.io/app/v/WP0SvffgsH

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VDCook, a self-evolving video data operating system and configurable platform for constructing video datasets for MLLMs. Users initiate requests via natural language queries with adjustable parameters for scale, retrieval-synthesis ratio, and quality threshold. The system performs automated query optimization, concurrent real video retrieval and controlled synthesis, generating in-domain data packages with complete provenance, multi-dimensional metadata (such as scene segmentation, motion scoring, OCR ratio, and automatic captioning), and reproducible notebooks. It claims to enable continuous updates and domain expansion through MCP-based automated data ingestion, transforming static datasets into dynamically evolving open ecosystems while supporting community contributions and governance.

Significance. If the system's reliability is demonstrated, VDCook could substantially lower the barrier for researchers and vertical domain teams to build and maintain specialized video training datasets for multimodal models. The emphasis on provenance, metadata, and reproducibility could enhance data quality and facilitate ongoing community-driven expansion, representing a potentially useful infrastructure contribution in the field of data-centric AI.

major comments (2)
  1. [Abstract] Abstract: The assertion that VDCook 'enables continuous updates and domain expansion' and transforms datasets into 'dynamically evolving open ecosystems' is not supported by any quantitative results, ablation studies, quality metrics, or comparisons to existing data construction pipelines.
  2. [Abstract] Abstract: The description of the concurrent retrieval and controlled synthesis modules combined with MCP-based ingestion does not include any evidence or evaluation showing that they reliably produce high-quality, in-domain video data without substantial manual curation or quality degradation.
minor comments (2)
  1. The citations for MCP and vlogger are referenced but no reference list or full bibliographic details are provided in the manuscript.
  2. The project demo link is mentioned, but the paper does not elaborate on the specific demonstrations or results shown in the demo.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current manuscript would benefit from additional empirical support and will revise to address the points raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that VDCook 'enables continuous updates and domain expansion' and transforms datasets into 'dynamically evolving open ecosystems' is not supported by any quantitative results, ablation studies, quality metrics, or comparisons to existing data construction pipelines.

    Authors: We acknowledge that the manuscript currently supports these claims primarily through the system design, the MCP-based ingestion mechanism, and the live demo rather than through quantitative benchmarks. In the revision we will add a new evaluation section reporting preliminary metrics from the deployed system, including data freshness over multiple ingestion cycles, measured domain expansion (e.g., new scene categories added), and direct comparisons against static dataset construction baselines. Ablation results on the contribution of the automated ingestion component will also be included.
    Revision: yes

  2. Referee: [Abstract] Abstract: The description of the concurrent retrieval and controlled synthesis modules combined with MCP-based ingestion does not include any evidence or evaluation showing that they reliably produce high-quality, in-domain video data without substantial manual curation or quality degradation.

    Authors: The referee correctly notes the absence of explicit quality evaluations. We will revise the manuscript to report concrete metrics collected from the VDCook demo, such as in-domain relevance scores, motion and scene quality distributions, and the fraction of outputs requiring manual review. We will also document the effect of the configurable quality threshold on curation effort and any observed cases of quality degradation, thereby providing the requested evidence while transparently discussing remaining limitations.
    Revision: yes
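
Two of the metrics promised in these responses can be sketched in a few lines, under stated assumptions. Here domain expansion is taken to mean the number of scene categories first seen in each ingestion cycle, and the manual review fraction is taken to mean the share of clips whose quality score lies within a margin of the threshold. Both definitions, the function names, and the sample values are assumptions for illustration; the rebuttal does not define the metrics.

```python
def new_categories_per_cycle(cycles: list[set[str]]) -> list[int]:
    """Count scene categories first seen in each ingestion cycle."""
    seen: set[str] = set()
    counts = []
    for categories in cycles:
        fresh = categories - seen  # categories not present in any earlier cycle
        counts.append(len(fresh))
        seen |= fresh
    return counts


def review_fraction(scores: list[float], threshold: float, margin: float) -> float:
    """Fraction of clips scoring within `margin` of the quality threshold.

    Borderline clips are the ones assumed to need a human decision.
    """
    if not scores:
        return 0.0
    flagged = sum(1 for s in scores if abs(s - threshold) <= margin)
    return flagged / len(scores)


# Illustrative inputs only; category names echo the figure captions.
cycles = [
    {"road waterlogging", "road snow"},
    {"road snow", "fallen trees"},
    {"fallen trees"},
]
expansion = new_categories_per_cycle(cycles)

scores = [0.2, 0.55, 0.6, 0.68, 0.9]
fraction = review_fraction(scores, threshold=0.6, margin=0.1)
```

A run of zeros at the end of the expansion series would indicate that ingestion has stopped discovering new domains, which is the kind of signal the promised evaluation section would need to report.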

Circularity Check

0 steps flagged

No significant circularity: the architectural description relies on an external MCP citation, with no self-referential reductions or fitted predictions.

Full rationale

The paper presents VDCook as a configurable platform for video data construction using natural-language queries, concurrent retrieval/synthesis modules, and MCP-based ingestion for continuous updates. No equations, parameters, or predictions appear that reduce by construction to inputs. Citations to MCP (Anthropic) and Vlogger are external and not self-citations by the single author. The central claim of transforming static datasets into evolving ecosystems is an architectural assertion, not a mathematical derivation that loops back on itself. The system is self-contained against external benchmarks with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The description rests on domain assumptions about the reliability of automated retrieval-synthesis pipelines and MCP for continuous data evolution, with no free parameters or invented entities quantified.

axioms (2)
  • Domain assumption: automated retrieval combined with controlled synthesis produces usable in-domain video data at scale.
    Invoked when claiming generation of high-quality data packages without manual intervention.
  • Domain assumption: MCP enables reliable automated data ingestion for self-evolving datasets.
    Central to the continuous update and open ecosystem claim.

pith-pipeline@v0.9.0 · 5497 in / 1225 out tokens · 39773 ms · 2026-05-15T16:49:09.802295+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

  [1] Anthropic. Model context protocol (MCP). https://modelcontextprotocol.io, 2024. Accessed: 2024-12-20.
  [2] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision (ICCV), 2021.
  [3] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
  [4] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
  [5] Brandon Castellano. Pyscenedetect. Last accessed, 2020.
  [6] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 1105–1114, 2017.
  [7] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024.
  [8] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis, pages 363–370. Springer, 2003.
  [9] Zanyi Huang et al. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.
  [10] Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems, 37:48955–48970, 2024.
  [11] Will Kay et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  [12] Kunchang Li et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024.
  [13] Bruce D Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679, 1981.
  [14] Muhammad Maaz et al. Video-chatgpt: Towards detailed video understanding via large language models. In ACL, 2024.
  [15] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.
  [16] Clemens Neudecker, Konstantin Baierer, Mike Gerber, Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. A survey of OCR evaluation tools and metrics. In Proceedings of the 6th International Workshop on Historical Document Imaging and Processing, pages 13–18, 2021.
  [17] Ariel Shaulov, Itay Hazan, Lior Wolf, and Hila Chefer. Flowmo: Variance-based flow guidance for coherent motion in video generation. arXiv preprint arXiv:2506.01144, 2025.
  [18] Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. VidGen-1M: A large-scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629, 2024.
  [19] Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In European Conference on Computer Vision, pages 244–260. Springer, 2024.
  [20] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
  [21] Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, and Wei Yang. V-express: Conditional dropout for progressive training of portrait video generation. arXiv preprint arXiv:2406.02511, 2024.
  [22] Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8428–8437, 2025.
  [23] Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. In European Conference on Computer Vision, pages 85–101. Springer, 2022.
  [24] Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu. Celebv-text: A large-scale facial text-video dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14805–14814, 2023.
  [25] Jiajing Zhang, Yongwei Miao, Junsong Zhang, and Jinhui Yu. Inkthetics: a comprehensive computational model for aesthetic evaluation of chinese ink paintings. IEEE Access, 8:225857–225871, 2020.
  [26] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
  [27] Aven-Le Zhou. Negative shanshui: Real-time interactive ink painting synthesis. arXiv preprint arXiv:2508.16612, 2025.
  [28] Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. Vlogger: Make your dream a vlog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8806–8817, 2024.