pith. machine review for the scientific record.

arxiv: 2604.15127 · v2 · submitted 2026-04-16 · 💻 cs.MM

Recognition: unknown

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3

classification 💻 cs.MM
keywords multimodal · video production · script generation · benchmark dataset · multimodal LLMs · narrative planning · material selection · video generation

The pith

A new benchmark of 11K+ annotated videos trains and evaluates models that create full production scripts from noisy materials and instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the Multimodal Context-to-Script Creation task as the complete workflow of selecting relevant shots from redundant multimodal inputs, planning additional shots to fill narrative gaps, and organizing everything into executable scripts with voiceovers. Existing benchmarks only test isolated pieces of this process, leaving the integrated reasoning unmeasured. MCSC-Bench supplies over 11,000 annotated samples that include both in-domain and out-of-domain test sets to evaluate material selection, narrative planning, and conditioned script generation together. Experiments show current multimodal large language models perform poorly on long-context structure-aware reasoning, yet models trained on the new dataset reach state-of-the-art results, with an 8B model surpassing Gemini-2.5-Pro and maintaining performance on unseen scenarios. Scripts produced by these models also improve downstream video generation quality.

Core claim

MCSC-Bench is the first large-scale dataset for the full video production reasoning process; each of its 11K+ samples pairs redundant multimodal materials and user instructions with a coherent script that mixes material-based shots, newly planned shots carrying explicit shooting instructions, and shot-aligned voiceovers. The benchmark measures performance across material selection, narrative planning, and conditioned script generation, and includes separate in-domain and out-of-domain splits. Training on the dataset yields models that achieve state-of-the-art results, including an 8B-parameter model that outperforms Gemini-2.5-Pro, while also generalizing to out-of-domain cases and producing scripts that improve downstream video generation quality.

What carries the argument

The MCSC-Bench dataset, which pairs noisy multimodal inputs with structured scripts containing material-based shots, planned shots, and aligned voiceovers, testing selection, planning, and generation as one integrated workflow.
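To make that structure concrete, here is a minimal sketch of what one MCSC-Bench sample and its target script could look like. The layout and field names (shot_id, source, shooting_instruction, voiceover, and so on) are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Hypothetical schema for one MCSC-Bench sample; field names are
# illustrative assumptions, not the benchmark's actual format.

@dataclass
class ScriptShot:
    shot_id: int
    source: Literal["material", "planned"]      # reuse a provided clip, or plan a new shot
    material_ref: Optional[str] = None          # id of the selected clip when source == "material"
    shooting_instruction: Optional[str] = None  # how to film the shot when source == "planned"
    voiceover: str = ""                         # narration aligned to this shot

@dataclass
class MCSCSample:
    materials: list[str]        # candidate video clips, both relevant and distractor
    text_materials: list[str]   # accompanying text such as product notes
    user_instruction: str       # what kind of video the user wants
    script: list[ScriptShot] = field(default_factory=list)  # target: ordered, executable script

# A toy sample: two kept material shots and one newly planned shot.
sample = MCSCSample(
    materials=["clip_001.mp4", "clip_002.mp4", "clip_003.mp4"],
    text_materials=["Product page: lightweight hiking backpack, 30L."],
    user_instruction="Make a 30-second ad emphasizing durability.",
    script=[
        ScriptShot(1, "material", material_ref="clip_002.mp4",
                   voiceover="Built for the trails you haven't found yet."),
        ScriptShot(2, "planned",
                   shooting_instruction="Close-up of the stitching while rain falls.",
                   voiceover="Every seam is sealed against the weather."),
        ScriptShot(3, "material", material_ref="clip_001.mp4",
                   voiceover="Thirty liters. Six hundred grams. Zero excuses."),
    ],
)
```

Read this way, the benchmark asks a model to do three things at once: drop clip_003 as a distractor, notice the missing durability shot and plan it, and keep the voiceover aligned with the final shot order.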

If this is right

  • Current multimodal LLMs struggle with structure-aware reasoning when given long, noisy contexts.
  • Fine-tuned models reach state-of-the-art on material selection, narrative planning, and script generation.
  • An 8B model trained on the benchmark surpasses Gemini-2.5-Pro and generalizes to out-of-domain scenarios.
  • Scripts generated by the trained models improve quality in downstream video generation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same annotation style could be applied to other creative pipelines such as audio editing or interactive storytelling.
  • Smaller open-weight models becoming competitive suggests accessible tools for independent video creators.
  • The benchmark exposes a gap in long-context multimodal reasoning that future model architectures must address.
  • Integration of the generated scripts into end-to-end AI video systems could reduce manual pre-production effort.

Load-bearing premise

Human-annotated scripts in the 11K videos correctly represent real-world video production reasoning workflows and the chosen metrics accurately measure structure-aware multimodal reasoning under long contexts.

What would settle it

An independent set of real video-production tasks, executed by human crews, in which scripts from models trained on MCSC-Bench fail to yield coherent final videos or do not outperform untrained baselines.

Figures

Figures reproduced from arXiv: 2604.15127 by Dingyi Yang, Huanran Hu, Liangyu Chen, Qin Jin, Qixiang Gao, Tiezheng Ge, Zihui Ren.

Figure 1. An overview of our Multimodal Context-to-Script Creation (MCSC) task. Models should comprehend the multimodal long contexts, create the plot, and output the structured script, which includes material-based shots and newly planned shots. The task involves (i) video shots including relevant and distractor materials; (ii) text materials; (iii) user instructions; and (iv) structured output scripts containing shooting instructions …
Figure 2. Overview of the MCSC-Bench dataset construction. Video materials are drawn from a large video pool.
Figure 3. Statistics of MCSC-Bench. (a): Distribution …
Figure 4. Multi-dimensional evaluation on MCSC …
Figure 5. Ablation on Long-Context Stress Test. Base …
Figure 7. Script-Driven (Ours) and Instruction-Driven …
Figure 8. Qualitative comparison between our Script-Driven approach and the Instruction-Driven baseline.
Figure 9. Qualitative comparison between scripts generated by Qwen3-VL-8B and Gemini-2.5-Pro.
Figure 10. Script generated by Qwen2.5-VL-7B.
Figure 11. A Skit Script generated by MCSC-8B.
Figure 12. Script generated by Gemini-2.5-Pro.
Figure 13. Script generated by Qwen3-VL-8B.
Figure 14. Script generated by MCSC-8B.
Figure 15. Script generated by Qwen2.5-VL-72B with our agent method.
Figure 16. Script generated by Gemini-2.5-Pro.
Figure 17. Script generated by InternVL3-8B.
Figure 18. Prompt for script creation in the MCSC task.
Figure 19. Prompt for script evaluation in the Multi-dimensional metrics (Section …).
Figure 20. Prompt for script generation in the first phase of Section …
Original abstract

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. Datasets will be public soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Multimodal Context-to-Script Creation (MCSC) task for generating structured, production-ready video scripts from noisy multimodal materials and user instructions. It presents MCSC-Bench, a new dataset of 11K+ human-annotated videos that include redundant materials, instructions, material-based shots, newly planned shots with shooting instructions, and shot-aligned voiceovers. The work evaluates current multimodal LLMs, reports that they struggle with structure-aware reasoning under long contexts, shows that models fine-tuned on MCSC-Bench achieve SOTA results (including an 8B model surpassing Gemini-2.5-Pro), demonstrates out-of-domain generalization, and validates the scripts via downstream video generation.

Significance. If the annotations prove reliable and the empirical results are reproducible with full metrics and protocols, the benchmark would meaningfully advance evaluation of end-to-end video production reasoning beyond isolated subtasks. It supplies a large-scale training resource and highlights practical challenges in long-context multimodal planning, with potential downstream utility for realistic video pipelines.

major comments (3)
  1. [Abstract and Experiments] The central SOTA claim that an 8B model surpasses Gemini-2.5-Pro (plus OOD generalization) is stated without any specific metrics, baseline details, error bars, or experimental protocol. This prevents assessment of whether the reported superiority is load-bearing or artifactual.
  2. [Dataset construction, likely §3] The human-annotated 'coherent, production-ready scripts' are presented as faithful targets for material selection, narrative planning, and script generation, yet no inter-annotator agreement statistics, expert validation against real production workflows, or ablation on alternative valid scripts are provided. This directly undermines the supervised training results and automatic metric rankings.
  3. [Evaluation metrics and OOD split] No concrete definitions or breakdowns are given for the metrics used to score material selection, narrative planning, and conditioned script generation, nor for how the out-of-domain test set differs from in-domain data. These details are required to support the generalization claim.
minor comments (2)
  1. [Abstract] The abstract states that 'Datasets will be public soon' but provides no timeline, access mechanism, or licensing details.
  2. [Task definition] Notation for the three evaluation axes (material selection, narrative planning, conditioned script generation) should be defined consistently when first introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and outlining revisions where appropriate to strengthen the presentation of results, dataset quality, and evaluation details.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The central SOTA claim that an 8B model surpasses Gemini-2.5-Pro (plus OOD generalization) is stated without any specific metrics, baseline details, error bars, or experimental protocol. This prevents assessment of whether the reported superiority is load-bearing or artifactual.

    Authors: We appreciate the referee's concern regarding the self-contained nature of the abstract. The specific metrics (F1 for material selection, planning accuracy, and generation quality scores), full baseline comparisons including Gemini-2.5-Pro, error bars from multiple runs, and the complete experimental protocol are reported in Section 4.2, Table 2, and Section 4.1. The OOD results appear in Section 4.3 and Table 5. To address the comment directly, we will revise the abstract to include key quantitative highlights of the SOTA and generalization results while keeping it concise. revision: partial

  2. Referee: [Dataset construction] The human-annotated 'coherent, production-ready scripts' are presented as faithful targets for material selection, narrative planning, and script generation, yet no inter-annotator agreement statistics, expert validation against real production workflows, or ablation on alternative valid scripts are provided. This directly undermines the supervised training results and automatic metric rankings.

    Authors: We agree that demonstrating annotation reliability is essential for a supervised benchmark. Section 3.2 describes the annotation pipeline, which involved professional video editors following detailed guidelines for material selection, shot planning, and voiceover alignment, with multiple review rounds. We will add inter-annotator agreement statistics (e.g., Fleiss' kappa for shot-level decisions) in the revised Section 3. We will also expand the description of expert consultations with industry professionals to validate against real production workflows. Regarding alternative valid scripts, we will include a new analysis in the experiments section examining script variability and its effect on automatic metrics, as the task inherently permits some valid alternatives while prioritizing coverage and coherence. revision: yes

  3. Referee: [Evaluation and OOD] No concrete definitions or breakdowns are given for the metrics used to score material selection, narrative planning, and conditioned script generation, nor for how the out-of-domain test set differs from in-domain data. These details are required to support the generalization claim.

    Authors: The metrics and OOD construction are defined in the manuscript, but we acknowledge they could be presented more explicitly. Section 4.1 provides the definitions: material selection uses precision/recall/F1 on shot overlap; narrative planning is scored via event coverage and logical sequence metrics; conditioned script generation uses BLEU-4, ROUGE-L, METEOR, plus human evaluation on coherence and relevance. The OOD test set (Section 3.4) comprises videos from unseen genres, sources, and styles not present in the training or in-domain test data, with per-domain breakdowns in Table 5. We will revise Section 4 to include more granular metric formulas, component-wise breakdowns, and an expanded explanation of the OOD split construction to fully support the generalization claims. revision: yes
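Because the rebuttal above is simulated, the metric definitions it cites are themselves assumptions rather than confirmed details of the paper. As a hedged illustration only, the sketch below shows how a precision/recall/F1 score over selected material shots could be computed; the function name and the shot-id inputs are hypothetical.

```python
def selection_prf1(predicted_ids: set[str], gold_ids: set[str]) -> tuple[float, float, float]:
    """Precision/recall/F1 over material-shot selection.

    A sketch of the shot-overlap metric described in the simulated
    rebuttal; the benchmark's real scoring protocol may differ.
    """
    if not predicted_ids:
        return 0.0, 0.0, 0.0
    true_pos = len(predicted_ids & gold_ids)
    precision = true_pos / len(predicted_ids)
    recall = true_pos / len(gold_ids) if gold_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example: the model keeps three shots, two of which the annotators also kept.
p, r, f1 = selection_prf1({"clip_001", "clip_002", "clip_007"},
                          {"clip_001", "clip_002", "clip_004"})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.67 recall=0.67 f1=0.67
```

Set-overlap scoring of this kind is only meaningful if the annotated selections are treated as the single correct answer, which is exactly the validity concern raised in major comment 2.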

Circularity Check

0 steps flagged

No circularity: new benchmark construction and empirical model evaluation are independent of self-referential fits or definitions.

full rationale

The paper introduces a new task (MCSC) and dataset (MCSC-Bench) via human annotation of 11K+ videos, then reports empirical results of training and evaluating multimodal LLMs on held-out in-domain and out-of-domain splits using standard metrics. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The SOTA claim (8B model > Gemini-2.5-Pro) and OOD generalization rest on direct comparison against external models on the new test sets, not on any reduction to the training inputs by construction. Annotation quality concerns are validity issues, not circularity per the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper with no mathematical derivations, fitted parameters, or new theoretical entities; it relies on standard assumptions about multimodal LLM capabilities and human annotation quality for video production tasks.

pith-pipeline@v0.9.0 · 5559 in / 1110 out tokens · 38050 ms · 2026-05-10T09:07:15.494797+00:00 · methodology

