pith. machine review for the scientific record.

arxiv: 2604.15127 · v2 · submitted 2026-04-16 · 💻 cs.MM

Recognition: unknown

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3

classification 💻 cs.MM
keywords multimodal · video production · script generation · benchmark dataset · multimodal LLMs · narrative planning · material selection · video generation

The pith

A new benchmark of 11K+ annotated videos trains and evaluates models that create full production scripts from noisy materials and instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the Multimodal Context-to-Script Creation task as the complete workflow of selecting relevant shots from redundant multimodal inputs, planning additional shots to fill narrative gaps, and organizing everything into executable scripts with voiceovers. Existing benchmarks only test isolated pieces of this process, leaving the integrated reasoning unmeasured. MCSC-Bench supplies over 11,000 annotated samples that include both in-domain and out-of-domain test sets to evaluate material selection, narrative planning, and conditioned script generation together. Experiments show current multimodal large language models perform poorly on long-context structure-aware reasoning, yet models trained on the new dataset reach state-of-the-art results, with an 8B model surpassing Gemini-2.5-Pro and maintaining performance on unseen scenarios. Scripts produced by these models also improve downstream video generation quality.

Core claim

MCSC-Bench is the first large-scale dataset for the full video production reasoning process; each of its 11K+ samples pairs redundant multimodal materials and user instructions with a coherent script that mixes material-based shots, newly planned shots carrying explicit shooting instructions, and shot-aligned voiceovers. The benchmark measures performance across material selection, narrative planning, and conditioned script generation, and includes separate in-domain and out-of-domain splits. Training on the dataset yields models that achieve state-of-the-art results, including an 8B-parameter model that outperforms Gemini-2.5-Pro, while also generalizing to out-of-domain cases and producing scripts that improve downstream video generation quality.

What carries the argument

The MCSC-Bench dataset, which pairs noisy multimodal inputs with structured scripts containing material-based shots, planned shots, and aligned voiceovers, testing selection, planning, and generation as one integrated workflow.
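To make that structure concrete, here is a minimal sketch of what one MCSC-Bench sample and its target script could look like. The layout and field names (shot_id, source, shooting_instruction, voiceover, and so on) are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Hypothetical schema for one MCSC-Bench sample; field names are
# illustrative assumptions, not the benchmark's actual format.

@dataclass
class ScriptShot:
    shot_id: int
    source: Literal["material", "planned"]      # reuse a provided clip, or plan a new shot
    material_ref: Optional[str] = None          # id of the selected clip when source == "material"
    shooting_instruction: Optional[str] = None  # how to film the shot when source == "planned"
    voiceover: str = ""                         # narration aligned to this shot

@dataclass
class MCSCSample:
    materials: list[str]        # candidate video clips, both relevant and distractor
    text_materials: list[str]   # accompanying text such as product notes
    user_instruction: str       # what kind of video the user wants
    script: list[ScriptShot] = field(default_factory=list)  # target: ordered, executable script

# A toy sample: two kept material shots and one newly planned shot.
sample = MCSCSample(
    materials=["clip_001.mp4", "clip_002.mp4", "clip_003.mp4"],
    text_materials=["Product page: lightweight hiking backpack, 30L."],
    user_instruction="Make a 30-second ad emphasizing durability.",
    script=[
        ScriptShot(1, "material", material_ref="clip_002.mp4",
                   voiceover="Built for the trails you haven't found yet."),
        ScriptShot(2, "planned",
                   shooting_instruction="Close-up of the stitching while rain falls.",
                   voiceover="Every seam is sealed against the weather."),
        ScriptShot(3, "material", material_ref="clip_001.mp4",
                   voiceover="Thirty liters. Six hundred grams. Zero excuses."),
    ],
)
```

Read this way, the benchmark asks a model to do three things at once: drop clip_003 as a distractor, notice the missing durability shot and plan it, and keep the voiceover aligned with the final shot order.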

If this is right

  • Current multimodal LLMs struggle with structure-aware reasoning when given long, noisy contexts.
  • Fine-tuned models reach state-of-the-art on material selection, narrative planning, and script generation.
  • An 8B model trained on the benchmark surpasses Gemini-2.5-Pro and generalizes to out-of-domain scenarios.
  • Scripts generated by the trained models improve quality in downstream video generation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same annotation style could be applied to other creative pipelines such as audio editing or interactive storytelling.
  • Smaller open-weight models becoming competitive suggests accessible tools for independent video creators.
  • The benchmark exposes a gap in long-context multimodal reasoning that future model architectures must address.
  • Integration of the generated scripts into end-to-end AI video systems could reduce manual pre-production effort.

Load-bearing premise

Human-annotated scripts in the 11K videos correctly represent real-world video production reasoning workflows and the chosen metrics accurately measure structure-aware multimodal reasoning under long contexts.

What would settle it

An independent set of real video-production tasks, executed by human crews, in which scripts from models trained on MCSC-Bench fail to yield coherent final videos or do not outperform untrained baselines.

Figures

Figures reproduced from arXiv: 2604.15127 by Dingyi Yang, Huanran Hu, Liangyu Chen, Qin Jin, Qixiang Gao, Tiezheng Ge, Zihui Ren.

Figure 1. An overview of our Multimodal Context-to-Script Creation (MCSC) task. Models should comprehend the multimodal long contexts, create the plot, and output the structured script, which includes material-based shots and newly planned shots. The task involves (i) video shots including relevant and distractor materials; (ii) text materials; (iii) user instructions; and (iv) structured output scripts containing shooting instructions …
Figure 2. Overview of the MCSC-Bench dataset construction. Video materials are drawn from a large video pool.
Figure 3. Statistics of MCSC-Bench. (a): Distribution …
Figure 4. Multi-dimensional evaluation on MCSC …
Figure 5. Ablation on Long-Context Stress Test. Base …
Figure 7. Script-Driven (Ours) and Instruction-Driven …
Figure 8. Qualitative comparison between our Script-Driven approach and the Instruction-Driven baseline.
Figure 9. Qualitative comparison between scripts generated by Qwen3-VL-8B and Gemini-2.5-Pro.
Figure 10. Script generated by Qwen2.5-VL-7B.
Figure 11. A Skit Script generated by MCSC-8B.
Figure 12. Script generated by Gemini-2.5-Pro.
Figure 13. Script generated by Qwen3-VL-8B.
Figure 14. Script generated by MCSC-8B.
Figure 15. Script generated by Qwen2.5-VL-72B with our agent method.
Figure 16. Script generated by Gemini-2.5-Pro.
Figure 17. Script generated by InternVL3-8B.
Figure 18. Prompt for script creation in the MCSC task.
Figure 19. Prompt for script evaluation in the Multi-dimensional metrics (Section …).
Figure 20. Prompt for script generation in the first phase of Section …
Original abstract

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. Datasets will be public soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Multimodal Context-to-Script Creation (MCSC) task for generating structured, production-ready video scripts from noisy multimodal materials and user instructions. It presents MCSC-Bench, a new dataset of 11K+ human-annotated videos that include redundant materials, instructions, material-based shots, newly planned shots with shooting instructions, and shot-aligned voiceovers. The work evaluates current multimodal LLMs, reports that they struggle with structure-aware reasoning under long contexts, shows that models fine-tuned on MCSC-Bench achieve SOTA results (including an 8B model surpassing Gemini-2.5-Pro), demonstrates out-of-domain generalization, and validates the scripts via downstream video generation.

Significance. If the annotations prove reliable and the empirical results are reproducible with full metrics and protocols, the benchmark would meaningfully advance evaluation of end-to-end video production reasoning beyond isolated subtasks. It supplies a large-scale training resource and highlights practical challenges in long-context multimodal planning, with potential downstream utility for realistic video pipelines.

major comments (3)
  1. [Abstract and Experiments] The central SOTA claim that an 8B model surpasses Gemini-2.5-Pro (plus OOD generalization) is stated without any specific metrics, baseline details, error bars, or experimental protocol. This prevents assessment of whether the reported superiority is load-bearing or artifactual.
  2. [Dataset construction, likely §3] The human-annotated 'coherent, production-ready scripts' are presented as faithful targets for material selection, narrative planning, and script generation, yet no inter-annotator agreement statistics, expert validation against real production workflows, or ablation on alternative valid scripts are provided. This directly undermines the supervised training results and automatic metric rankings.
  3. [Evaluation metrics and OOD split] No concrete definitions or breakdowns are given for the metrics used to score material selection, narrative planning, and conditioned script generation, nor for how the out-of-domain test set differs from in-domain data. These details are required to support the generalization claim.
minor comments (2)
  1. [Abstract] The abstract states that 'Datasets will be public soon' but provides no timeline, access mechanism, or licensing details.
  2. [Task definition] Notation for the three evaluation axes (material selection, narrative planning, conditioned script generation) should be defined consistently when first introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and outlining revisions where appropriate to strengthen the presentation of results, dataset quality, and evaluation details.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The central SOTA claim that an 8B model surpasses Gemini-2.5-Pro (plus OOD generalization) is stated without any specific metrics, baseline details, error bars, or experimental protocol. This prevents assessment of whether the reported superiority is load-bearing or artifactual.

    Authors: We appreciate the referee's concern regarding the self-contained nature of the abstract. The specific metrics (F1 for material selection, planning accuracy, and generation quality scores), full baseline comparisons including Gemini-2.5-Pro, error bars from multiple runs, and the complete experimental protocol are reported in Section 4.2, Table 2, and Section 4.1. The OOD results appear in Section 4.3 and Table 5. To address the comment directly, we will revise the abstract to include key quantitative highlights of the SOTA and generalization results while keeping it concise. revision: partial

  2. Referee: [Dataset construction] The human-annotated 'coherent, production-ready scripts' are presented as faithful targets for material selection, narrative planning, and script generation, yet no inter-annotator agreement statistics, expert validation against real production workflows, or ablation on alternative valid scripts are provided. This directly undermines the supervised training results and automatic metric rankings.

    Authors: We agree that demonstrating annotation reliability is essential for a supervised benchmark. Section 3.2 describes the annotation pipeline, which involved professional video editors following detailed guidelines for material selection, shot planning, and voiceover alignment, with multiple review rounds. We will add inter-annotator agreement statistics (e.g., Fleiss' kappa for shot-level decisions) in the revised Section 3. We will also expand the description of expert consultations with industry professionals to validate against real production workflows. Regarding alternative valid scripts, we will include a new analysis in the experiments section examining script variability and its effect on automatic metrics, as the task inherently permits some valid alternatives while prioritizing coverage and coherence. revision: yes

  3. Referee: [Evaluation and OOD] No concrete definitions or breakdowns are given for the metrics used to score material selection, narrative planning, and conditioned script generation, nor for how the out-of-domain test set differs from in-domain data. These details are required to support the generalization claim.

    Authors: The metrics and OOD construction are defined in the manuscript, but we acknowledge they could be presented more explicitly. Section 4.1 provides the definitions: material selection uses precision/recall/F1 on shot overlap; narrative planning is scored via event coverage and logical sequence metrics; conditioned script generation uses BLEU-4, ROUGE-L, METEOR, plus human evaluation on coherence and relevance. The OOD test set (Section 3.4) comprises videos from unseen genres, sources, and styles not present in the training or in-domain test data, with per-domain breakdowns in Table 5. We will revise Section 4 to include more granular metric formulas, component-wise breakdowns, and an expanded explanation of the OOD split construction to fully support the generalization claims. revision: yes
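Because the rebuttal above is simulated, the metric definitions it cites are themselves assumptions rather than confirmed details of the paper. As a hedged illustration only, the sketch below shows how a precision/recall/F1 score over selected material shots could be computed; the function name and the shot-id inputs are hypothetical.

```python
def selection_prf1(predicted_ids: set[str], gold_ids: set[str]) -> tuple[float, float, float]:
    """Precision/recall/F1 over material-shot selection.

    A sketch of the shot-overlap metric described in the simulated
    rebuttal; the benchmark's real scoring protocol may differ.
    """
    if not predicted_ids:
        return 0.0, 0.0, 0.0
    true_pos = len(predicted_ids & gold_ids)
    precision = true_pos / len(predicted_ids)
    recall = true_pos / len(gold_ids) if gold_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example: the model keeps three shots, two of which the annotators also kept.
p, r, f1 = selection_prf1({"clip_001", "clip_002", "clip_007"},
                          {"clip_001", "clip_002", "clip_004"})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.67 recall=0.67 f1=0.67
```

Set-overlap scoring of this kind is only meaningful if the annotated selections are treated as the single correct answer, which is exactly the validity concern raised in major comment 2.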

Circularity Check

0 steps flagged

No circularity: new benchmark construction and empirical model evaluation are independent of self-referential fits or definitions.

full rationale

The paper introduces a new task (MCSC) and dataset (MCSC-Bench) via human annotation of 11K+ videos, then reports empirical results of training and evaluating multimodal LLMs on held-out in-domain and out-of-domain splits using standard metrics. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The SOTA claim (8B model > Gemini-2.5-Pro) and OOD generalization rest on direct comparison against external models on the new test sets, not on any reduction to the training inputs by construction. Annotation quality concerns are validity issues, not circularity per the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper with no mathematical derivations, fitted parameters, or new theoretical entities; it relies on standard assumptions about multimodal LLM capabilities and human annotation quality for video production tasks.

pith-pipeline@v0.9.0 · 5559 in / 1110 out tokens · 38050 ms · 2026-05-10T09:07:15.494797+00:00 · methodology

