{"total":10,"items":[{"citing_arxiv_id":"2604.15127","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production","primary_cat":"cs.MM","submitted_at":"2026-04-16T15:13:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts with voiceovers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06339","ref_index":184,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evolution of Video Generative Foundations","primary_cat":"cs.CV","submitted_at":"2026-04-07T18:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a subject's identity from reference images. Methods fall into two categories:instance-specific, which requires per-subject model fine-tuning, often with multiple images; andend-to- end, where a pre-trained model generalizes to new subjects from a single image without optimization. 5.1.1 Instance-specific video customization. Instance-specific single-subject customization.Animate- A-Story [184] proposes 'TimeInv', dynamically adjusting concept embeddings across denoising steps for shape con- trol in early stages and texture refinement later. 
Custom- Crafter [185] learns identity information via LoRA [186] adaptation, with weighted sampling to balance identity fidelity and motion. Magic-Me [187] uses a 3D Gaussian Noise Prior and a three-stage refinement, acquiring identity"},{"citing_arxiv_id":"2603.28489","ref_index":245,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"priors embedded in video generation models to synthesize more coherent and realistic 3D/4D environments. Some approaches decompose the pipeline into two stages: video generation and 3D optimization. They first use a video model to synthesize a reference video or multi-view sequence, and then recover scene structure via techniques such as 4D Gaussians. VividDream [245] introduces a novel pipeline that first constructs and expands a static 3D scene according to an input image, then generates dynamic multi-view videos with a video diffusion model, and finally makes use of them to optimize an explorable 4D scene. 
Similarly, 4Real [246] and Free4D [247] first generate a temporally consistent reference video and then expand the viewpoint range through frame-"},{"citing_arxiv_id":"2510.20206","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2025-10-23T04:45:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16819","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Character-Centered Dialogue Generation from Scene-Level Prompts","primary_cat":"cs.CV","submitted_at":"2025-05-22T15:54:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21755","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness","primary_cat":"cs.CV","submitted_at":"2025-03-27T17:57:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VBench-2.0 is a benchmark suite that automatically evaluates video generative models on 
five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness. Index Terms -Video Generative Models, Evaluation Bench- mark. I. I NTRODUCTION V IDEO generation aims to create realistic and temporally coherent video sequences, with a wide range of applica- tions in video editing [1]-[14], customization [15]-[17], image animation [18], [19], and world models [20]. Earlier video generative models [21]-[24] primarily focused on generating short video clips of around two seconds, empha- sizing fundamental capabilities like per-frame aesthetics and temporal consistency. To systematically evaluate these capabil- ities, benchmarks [25]-[27] such as VBench [25], [26] have"},{"citing_arxiv_id":"2503.06310","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling","primary_cat":"cs.CV","submitted_at":"2025-03-08T19:04:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.04001","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and 
Videos","primary_cat":"cs.CV","submitted_at":"2025-01-07T18:58:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Correspondence:Xiangtai Li atxiangtai.li@bytedance.com 1 Introduction Multi-modal Large Language Models (MLLMs) have made significant progress, fueled by the rapid develop- ment of Large Language Models (LLMs) [39, 83, 101]. Numerous MLLMs have been applied to image- and video-level tasks such as visual question answering (VQA) [1, 80], narrative story generation [33, 102, 122], and interactive editing [36, 45, 85]. One impor- tant direction is to understand video content in a fine-grained manner, including segmenting and track- ing pixels with language descriptions, and performing VQA on visual prompts in the video. In particular, we aim to achieve promptable fine-grained analysis of video, enabling the user to be in the loop when"},{"citing_arxiv_id":"2402.19473","ref_index":174,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieval-Augmented Generation for AI-Generated Content: A Survey","primary_cat":"cs.CV","submitted_at":"2024-02-29T18:59:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"finetunes the generator CODEGEN-MONO 350M [173] with a shuffled new file combined with API information and code blocks. 
CARE [117] trains encoders with image, audio, and video-text pairs, then fine-tunes the decoder (generator) to simultaneously reduce caption and concept detection loss, while keeping the encoders and retriever fixed. Animate-A- Story [174] optimizes the video generator with image data, and then finetunes a LoRA [175] adapter to capture the appearance details of the given character. RetDream [50] finetunes a LoRA adapter [175] with the rendered images. 4) Result Enhancement : In many scenarios, the result of RAG may not achieve the expected effect, and some tech- niques of Result Enhancement can help alleviate this problem."},{"citing_arxiv_id":"2402.17177","ref_index":150,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models","primary_cat":"cs.CV","submitted_at":"2024-02-27T03:30:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Cun, X. Wang,et al., \"Make- your-video: Customized video generation using textual and structural guidance,\" arXiv preprint arXiv:2306.00943, 2023. [149] Y . Guo, C. Yang, A. Rao, Y . Wang, Y . Qiao, D. Lin, and B. Dai, \"Animatediff: Animate your per- sonalized text-to-image diffusion models without specific tuning,\"arXiv preprint arXiv:2307.04725, 2023. [150] Y . He, M. Xia, H. Chen, X. Cun, Y . Gong, J. Xing, Y . Zhang, X. Wang, C. Weng, Y . Shan, et al. , \"Animate-a-story: Storytelling with retrieval-augmented video generation,\" arXiv preprint arXiv:2307.06940, 2023. [151] H. Ni, C. Shi, K. Li, S. X. Huang, and M. R. 
Min, \"Conditional image-to-video generation with latent flow diffusion models,\" in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern"}],"limit":50,"offset":0}