{"total":16,"items":[{"citing_arxiv_id":"2605.11869","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity","primary_cat":"cs.CV","submitted_at":"2026-05-12T09:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17749","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos","primary_cat":"cs.CV","submitted_at":"2026-04-20T03:07:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[37] Clinton J. Wang and Polina Golland. Interpolating between images with diffusion models.CoRR, abs/2307.12560, 2023. 3 [38] Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, and Steven M Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation.CoRR, abs/2408.15239, 2024. 3, 7 [39] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video gen- eration with cascaded latent diffusion models.CoRR, abs/2309.15103, 2023. 2 [40] Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xi- aozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al."},{"citing_arxiv_id":"2603.13419","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion Models Memorize in Training -- and Generalize in Inference","primary_cat":"cs.LG","submitted_at":"2026-03-12T21:02:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diffusion models overfit denoising loss at intermediate noise but generalize in inference as model error smooths the flow field and sampling paths avoid memorized noisy training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.20206","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2025-10-23T04:45:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.19840","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GenHSI: Controllable Generation of Human-Scene Interaction Videos","primary_cat":"cs.CV","submitted_at":"2025-06-24T17:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21755","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness","primary_cat":"cs.CV","submitted_at":"2025-03-27T17:57:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"in pursuit of intrinsic faithfulness. Index Terms -Video Generative Models, Evaluation Bench- mark. I. I NTRODUCTION V IDEO generation aims to create realistic and temporally coherent video sequences, with a wide range of applica- tions in video editing [1]-[14], customization [15]-[17], image animation [18], [19], and world models [20]. Earlier video generative models [21]-[24] primarily focused on generating short video clips of around two seconds, empha- sizing fundamental capabilities like per-frame aesthetics and temporal consistency. To systematically evaluate these capabil- ities, benchmarks [25]-[27] such as VBench [25], [26] have * equal contribution. B corresponding authors. Email: Dian Zheng zd1423606603@gmail."},{"citing_arxiv_id":"2501.03575","ref_index":212,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cosmos World Foundation Model Platform for Physical AI","primary_cat":"cs.CV","submitted_at":"2025-01-07T06:55:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.15689","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization","primary_cat":"cs.CV","submitted_at":"2024-12-20T09:07:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DOLLAR combines variational score and consistency distillation for few-step video generation plus latent reward optimization, reporting 82.57 VBench score and up to 278x speedup over the teacher diffusion model for 128-frame 10-second videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.13720","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Movie Gen: A Cast of Media Foundation Models","primary_cat":"cs.CV","submitted_at":"2024-10-17T16:22:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.05363","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation","primary_cat":"cs.CV","submitted_at":"2024-10-07T17:56:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.18869","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emu3: Next-Token Prediction is All You Need","primary_cat":"cs.CV","submitted_at":"2024-09-27T16:06:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"\" Figure 7: Qualitative results of Emu3 text-to-video generation. 9 Models Type TotalscoreMotionsmoothnessDynamicdegreeAestheticqualityObjectclassMultipleobjectsHumanactionSpatialrelationshipSceneAppearancestyle SubjectconsistencyBackgroundconsistency ModelScope [87] Diff 75.75 95.79 66.39 56.39 82.25 38.98 92.4 33.68 39.26 25.67 89.87 95.29LaVie [88] Diff 77.08 96.38 49.72 54.94 91.82 33.32 96.8 34.09 52.69 23.56 91.41 97.47OpenSoraPlan V1.1 [41] Diff 78.00 98.28 47.72 56.85 76.3 40.35 86.80 53.11 27.17 22.90 95.73 96.73Show-1 [102] Diff 78.93 98.24 44.44 57.35 93.07 45.47 95.60 53.50 47.03 23.06 95.53 98.02OpenSora V1.2 [107] Diff 79.76 98.50 42.39 56.85 82.22 51.83 91.20 68.56 42.44 23.95 96."},{"citing_arxiv_id":"2408.06072","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","primary_cat":"cs.CV","submitted_at":"2024-08-12T11:47:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.03520","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoPhy: Evaluating Physical Commonsense for Video Generation","primary_cat":"cs.CV","submitted_at":"2024-06-05T17:53:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vision [84] baselines in Appendix J.3 We assess auto-rater effectiveness by computing the ROC-AUC between human annotations and model judgments for videos generated from testing prompts. 4 Setup Video Generative Models. We evaluate a diverse range oftwelve closed and open text-to-video generative models on VIDEO PHY dataset. The list of the models includes ZeroScope [20], LaVIE [105], VideoCrafter2[21], OpenSora [75], CogVideoX-2B and 5B [113], StableVideoDiffusion (SVD)- T2I2V [12], Gen-2 (Runway) [27], Lumiere-T2V, Lumiere-T2I2V (Google) [7], Dream Machine (Luma AI) [1], and Pika [78]. We provide more model and inference details in Appendix C and K. 4 Dataset setup. As described earlier, we train VIDEO CON-PHYSICS to enable cheaper and scalable"},{"citing_arxiv_id":"2311.15127","ref_index":99,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","primary_cat":"cs.CV","submitted_at":"2023-11-25T22:28:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"[97] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 , 2023. 3, 15 [98] Yaohui Wang, Piotr Bilinski, Francois Bremond, and An- titza Dantcheva. G3an: Disentangling appearance and mo- tion for video generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2020. 15 [99] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video genera- tion with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023. 15 [100] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei"},{"citing_arxiv_id":"2311.04145","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models","primary_cat":"cs.CV","submitted_at":"2023-11-07T17:16:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.19512","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoCrafter1: Open Diffusion Models for High-Quality Video Generation","primary_cat":"cs.CV","submitted_at":"2023-10-30T13:12:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}