MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3
The pith
A new dataset built from over 3,000 movies uses progressive captions and cross-shot matching to let AI models generate coherent multi-shot videos without copy-paste shortcuts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MuSS is constructed via a progressive captioning pipeline that secures shot-level accuracy first and then global narrative coherence, together with a cross-shot matching mechanism that removes the copy-paste shortcut; when used to augment training, this produces models that reach state-of-the-art performance on continuous storytelling and cross-shot subject identity, as measured by the new Cinematic Narrative Benchmark and its Anti-Copy-Paste Variance metric.
What carries the argument
The progressive captioning pipeline plus cross-shot matching mechanism: the pipeline builds accurate local captions before enforcing story-wide consistency, while matching prevents models from simply reusing the input subject image across shots.
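The abstract does not say how the cross-shot matching is implemented. As a rough illustration of the idea, a matching filter of this kind can be sketched with embedding similarity: pick a reference shot that shows the same subject as the target shot but is not a near-duplicate of it, so the model cannot succeed by pasting the input image. The `embed`-space thresholds and the pairing logic below are assumptions, not the paper's method:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_reference_shot(subject_embeds, target_idx, lo=0.45, hi=0.92):
    """Pick a reference shot whose subject embedding matches the target
    shot's subject (similarity above `lo`) without being a near-duplicate
    frame (similarity below `hi`), blocking the copy-paste shortcut.
    `subject_embeds` holds one embedding per shot; thresholds are
    illustrative placeholders, not values from the paper."""
    target = subject_embeds[target_idx]
    best, best_sim = None, -1.0
    for i, emb in enumerate(subject_embeds):
        if i == target_idx:
            continue
        sim = cosine(emb, target)
        # Same identity, but a genuinely different view of the subject.
        if lo <= sim <= hi and sim > best_sim:
            best, best_sim = i, sim
    return best  # None if no valid cross-shot pair exists
```

With thresholds like these, a shot that is an almost identical frame is rejected just as firmly as a shot showing a different subject; only cross-shot pairs that force re-rendering survive.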
If this is right
- Video foundation models trained on MuSS produce multi-shot sequences with measurably better narrative continuity than single-shot or un-augmented baselines.
- The ACP-Var metric provides a quantitative way to detect when a generator has collapsed into trivial 2D sticker behavior instead of 3D-consistent storytelling.
- Baselines without MuSS either lose subject identity across shots or fail to maintain story logic, confirming the three core dataset challenges listed in the paper.
- The dual-track design of MuSS supports both complex montage transitions and subject-centric narratives, allowing the same data to serve multiple generation modes.
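The abstract only names ACP-Var; its exact definition is not given here. A toy variance-style score in the same spirit might look like the following, where the choice of per-shot subject features and the normalization are assumptions, not the paper's formula:

```python
import numpy as np

def acp_var(shot_features):
    """Toy anti-copy-paste score: variance of per-shot subject features.

    A generator that pastes the same 2D sticker into every shot produces
    near-identical features across shots and a score near zero; genuine
    3D-consistent re-rendering (new poses, angles, lighting) produces
    higher variance. `shot_features` has shape (num_shots, feature_dim);
    what the features are (e.g. identity or structure embeddings) is an
    assumption of this sketch."""
    feats = np.asarray(shot_features, dtype=float)
    centered = feats - feats.mean(axis=0, keepdims=True)
    return float((centered ** 2).mean())
```

Under this reading, a low score flags exactly the "trivial 2D sticker" collapse the benchmark is designed to detect, while identity preservation would still need a separate similarity check so that high variance is not rewarded for simply losing the subject.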
Where Pith is reading between the lines
- If the matching mechanism generalizes, similar conflict-removal pipelines could be applied to other video tasks such as text-to-video or image-to-video with long sequences.
- The benchmark's focus on visual-logic-driven evaluation suggests future work could add metrics for dialogue consistency or emotional arc across shots.
- Extending MuSS with more diverse film sources or synthetic augmentations might further reduce any remaining domain biases from the original 3000-movie corpus.
Load-bearing premise
The progressive captioning pipeline and cross-shot matching mechanism actually remove spatiotemporal conflicts and the copy-paste shortcut without introducing new biases or artifacts into the dataset.
What would settle it
Train a model on MuSS and test it on the Cinematic Narrative Benchmark; if the generated videos still show high copy-paste rates, broken narrative logic, or low ACP-Var scores that match non-MuSS baselines, the central claim fails.
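That falsification test can be phrased as a simple decision rule. The metric names and the margin below are placeholders for illustration, not scores or thresholds from the paper:

```python
def central_claim_holds(muss_scores, baseline_scores, margin=0.0):
    """Hypothetical check for the settle-it experiment: the central claim
    survives only if the MuSS-augmented model beats the non-MuSS baseline
    on every evaluation axis (e.g. ACP-Var, narrative logic, identity
    preservation) by at least `margin`. Both inputs are dicts mapping
    metric name -> score, with higher taken to be better."""
    return all(
        muss_scores[name] > baseline_scores[name] + margin
        for name in baseline_scores
    )
```

A single axis where the MuSS model matches the baseline, such as ACP-Var scores indistinguishable from copy-paste behavior, is enough to return False under this rule.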
Original abstract
While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MuSS, a large-scale dual-track dataset sourced from over 3,000 movies for multi-shot subject-to-video (S2V) generation. It addresses narrative logic, spatiotemporal alignment conflicts, and the copy-paste shortcut via a progressive captioning pipeline (local shot accuracy followed by global coherence) and a cross-shot matching mechanism. The work also proposes the Cinematic Narrative Benchmark with a visual-logic paradigm and the novel Anti-Copy-Paste Variance (ACP-Var) metric. Experiments claim that MuSS-augmented models achieve SOTA narrative effectiveness and cross-shot identity preservation compared to baselines that struggle with continuous logic or degenerate to 2D stickers.
Significance. If the dataset construction pipeline demonstrably resolves the three core challenges without introducing new biases or artifacts, MuSS and its benchmark would provide a valuable, large-scale resource for advancing cinematic multi-shot video generation. The movie-sourced scale and explicit support for montage transitions represent a concrete empirical contribution that could enable more rigorous evaluation of subject consistency and storytelling in video models.
Major comments (3)
- §3.2 (Progressive Captioning Pipeline): The two-stage process is described at a high level, but the manuscript provides no quantitative before/after statistics on spatiotemporal conflict rates, no ablation on caption accuracy, and no human or automated verification scores confirming that local-to-global coherence eliminates conflicts rather than trading one set of inconsistencies for another. This directly underpins the claim that the dataset solves the alignment problem and supports the later ACP-Var results.
- §3.3 (Cross-Shot Matching Mechanism): No analysis or ablation is reported on whether the matching step introduces systematic biases in subject pose, lighting, camera angle, or motion statistics. Without such checks, it is unclear whether the mechanism truly eradicates the copy-paste shortcut or merely masks it in ways that the ACP-Var metric (defined in §4) may not detect, weakening the attribution of SOTA gains to the dataset.
- §5 (Experiments and Benchmark Results): The SOTA claims for narrative effectiveness and identity preservation rest on comparisons with baselines, yet the text lacks details on baseline re-implementations, statistical significance tests, or controls for dataset size effects. If the reported improvements are driven by unverified pipeline artifacts, the central empirical conclusion does not hold.
Minor comments (3)
- Figure 3 (dataset examples) would benefit from explicit annotations highlighting the montage transitions and cross-shot subject consistency that the benchmark is designed to test.
- The definition of ACP-Var in §4.2 uses variance over 3D structural features; clarify whether these features are extracted from ground-truth 3D reconstructions or estimated via off-the-shelf models, as this affects reproducibility.
- Related Work: A few citations to prior multi-shot video datasets appear incomplete; ensure all referenced works have full bibliographic details.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions that will strengthen the empirical support for our claims without misrepresenting the current work.
Point-by-point responses
-
Referee: §3.2 (Progressive Captioning Pipeline): The two-stage process is described at a high level, but the manuscript provides no quantitative before/after statistics on spatiotemporal conflict rates, no ablation on caption accuracy, and no human or automated verification scores confirming that local-to-global coherence eliminates conflicts rather than trading one set of inconsistencies for another. This directly underpins the claim that the dataset solves the alignment problem and supports the later ACP-Var results.
Authors: We agree that the current description of the progressive captioning pipeline would benefit from quantitative validation. In the revised manuscript, we will add before/after statistics on spatiotemporal conflict rates, an ablation study on caption accuracy, and both human and automated verification scores. These additions will demonstrate that the local-to-global coherence step resolves conflicts without introducing new inconsistencies, thereby providing stronger support for the dataset's role in addressing alignment issues and the subsequent ACP-Var results. revision: yes
-
Referee: §3.3 (Cross-Shot Matching Mechanism): No analysis or ablation is reported on whether the matching step introduces systematic biases in subject pose, lighting, camera angle, or motion statistics. Without such checks, it is unclear whether the mechanism truly eradicates the copy-paste shortcut or merely masks it in ways that the ACP-Var metric (defined in §4) may not detect, weakening the attribution of SOTA gains to the dataset.
Authors: We acknowledge that additional analysis of the cross-shot matching mechanism is warranted to rule out systematic biases. In the revision, we will include ablations examining effects on subject pose, lighting, camera angle, and motion statistics. We will also expand the discussion of the ACP-Var metric to show how it is designed to detect residual copy-paste artifacts, providing evidence that the mechanism eradicates rather than masks the shortcut. This will strengthen the link between the dataset construction and the reported SOTA gains. revision: yes
-
Referee: §5 (Experiments and Benchmark Results): The SOTA claims for narrative effectiveness and identity preservation rest on comparisons with baselines, yet the text lacks details on baseline re-implementations, statistical significance tests, or controls for dataset size effects. If the reported improvements are driven by unverified pipeline artifacts, the central empirical conclusion does not hold.
Authors: We agree that greater transparency and rigor in the experimental section are needed to substantiate the SOTA claims. In the revised manuscript, we will provide detailed descriptions of baseline re-implementations, report statistical significance tests, and include controls for dataset size effects by evaluating models trained on MuSS subsets of varying scales. These additions will confirm that the observed improvements in narrative effectiveness and identity preservation are robust and not attributable to unverified artifacts. revision: yes
Circularity Check
No circularity: empirical dataset construction with independent experimental validation
Full rationale
The paper presents MuSS as a dataset sourced from more than 3,000 movies, built with a progressive captioning pipeline and a cross-shot matching mechanism, plus a new benchmark and the ACP-Var metric. No mathematical derivations, equations, or predictions are described that reduce by construction to fitted inputs or self-citations. The SOTA claims rest on external experiments rather than self-referential definitions, making the contribution a self-contained empirical release with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Movie clips from over 3,000 films provide representative examples of complex multi-shot narrative logic and spatiotemporal alignments suitable for training subject-to-video models.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
- [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. 2025. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.
- [3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1728–1738.
- [4] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. 2024. Video Generation Models as World Simulators. OpenAI Blog 1, 8 (2024), 1.
- [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9650–9660.
- [7] Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Peter Sushko, Gaoyang Jiang, Yao Wan, and Ranjay Krishna. 2025. MultiRef: Controllable Image Generation with Multiple Visual References. In Proceedings of the 33rd ACM International Conference on Multimedia. 13325–13331.
- [8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.
- [9] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
- [10] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. 2023. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models Without Specific Tuning. arXiv preprint arXiv:2307.04725.
- [11] Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. 2025. Long Context Tuning for Video Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17281–17291.
- [13] Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Qiao Yu, Wanli Ouyang, and Ziwei Liu. 2025. Cut2Next: Generating Next Shot via In-Context Tuning. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–11.
- [14] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. 2024. VBench: Comprehensive Benchmark Suite for Video Generative Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21807–21818.
- [16] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. 2025. VACE: All-in-One Video Creation and Editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17191–17202.
- [17] Glenn Jocher and Jing Qiu. 2024. Ultralytics YOLO11. https://github.com/ultralytics/ultralytics
- [18] Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M Rehg, and Tobias Hinz. 2025. ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 28405–28415.
- [19] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. 2024. HunyuanVideo: A Systematic Framework for Large Video Generative Models. arXiv preprint arXiv:2412.03603.
- [20] Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. 2025. OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 7752–7762.
- [21] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
- [22] Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. 2025. Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14951–14961.
- [23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. 2024. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In European Conference on Computer Vision. Springer, 38–55.
- [24] Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. 2024. EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22139–22149.
- [25] Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, et al. 2025. OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-Driven Image Generation and Manipulation. arXiv preprint arXiv:2512.08294.
- [28] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. 2024. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-Video Generation. arXiv preprint arXiv:2407.02371.
- [29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. 2023. DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
- [31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
- [33] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. 2024. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.
- [34] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22500–22510.
- [36] Tomás Soucek and Jakub Lokoc. 2024. TransNet V2: An Effective Deep Network Architecture for Fast Shot Transition Detection. In Proceedings of the 32nd ACM International Conference on Multimedia. 11218–11221.
- [37] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. arXiv preprint arXiv:2403.05530.
- [38] Zachary Teed and Jia Deng. 2020. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In European Conference on Computer Vision. Springer, 402–419.
- [39] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. 2025. Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314.
- [40] Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan, Yachuang Feng, Bing Deng, and Jieping Ye. 2025. EchoShot: Multi-Shot Portrait Video Generation. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems.
- [42] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. 2023. InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation. arXiv preprint arXiv:2307.06942.
- [44] Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang. 2025. Captain Cinema: Towards Short Movie Generation. In The Fourteenth International Conference on Learning Representations.
- [45] Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, and Xihui Liu. 2024. LVD-2M: A Long-take Video Dataset with Temporally Dense Captions. Advances in Neural Information Processing Systems 37 (2024), 16623–16644.
- [47] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. VideoCLIP: Contrastive Pre-training for Zero-Shot Video-Text Understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 6787–6800.
- [48] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
- [49] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. 2023. Effective Whole-Body Pose Estimation with Two-Stages Distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4210–4220.
- [50] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv preprint arXiv:2308.06721.
- [52] Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. 2025. Identity-Preserving Text-to-Video Generation by Frequency Decomposition. In Proceedings of the Computer Vision and Pattern Recognition Conference. 12978–12988.
- [53] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11975–11986.
- [54] Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. 2025. MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14464–14474.
- [55] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. 2024. LLaVA-Video: Video Instruction Tuning with Synthetic Data. arXiv preprint arXiv:2410.02713.
- [58] Jiayi Zheng and Xiaodong Cun. 2025. FairyGen: Storied Cartoon Video from a Single Child-Drawn Character. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–11.
- [59] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. 2024. StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation. Advances in Neural Information Processing Systems 37 (2024), 110315–110340.
- [61] Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, et al. 2025. ViStoryBench: Comprehensive Benchmark Suite for Story Visualization. arXiv preprint arXiv:2505.24862.