{"total":20,"items":[{"citing_arxiv_id":"2605.17248","ref_index":150,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Image-to-Video Diffusion: From Foundations to Open Frontiers","primary_cat":"cs.CV","submitted_at":"2026-05-17T04:10:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Semantically-guided sparse generation (SVG2[154]), Physics-grounded (PhyRPR[38]), Controllable generation (LMP[70]) Pose-guided 4D human video generation (AnimeAgent[138]), System-level efficient generation (UniCP[155]), Sparse generation (SVG[153]) Infinite-length films (SkyReels-V2[169]), 4D novel view synthesis (4DiM[159]), High-consistency and dynamic generation (Vidu[150]) Cross-frame consistency enhancement (FrameBridge[133]), Personalized animation (PersonalVideo[151]), Camera control (SRENDER[84]) Fig. 4: A taxonomy of diffusion I2V schemes, organized by model architecture and then subdivided by training paradigm. which builds on the large-scale HunyuanVideo line and equips the multimodal DiT backbone with character image injection,"},{"citing_arxiv_id":"2605.14382","ref_index":8,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T05:06:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Delta Forcing improves temporal coherence in interactive autoregressive video generation by estimating transition consistency from teacher-generator latent deltas and balancing it against a monotonic continuity objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07503","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"More recently, transformer-based diffusion architectures have been widely applied to large-scale video generation. Diffusion Transformers (DiT) [ 27] replace convolutional backbones with trans- former blocks, enabling improved scalability and representation capacity for large generative models. Inspired by the breakthrough work SORA [ 6], a series of DiT-based large-scale video diffusion models including ViDu [ 2], CogVideo [ 15, 46], Wan [ 39], Kling [ 36], and Seedance [ 32] have demonstrated remarkable capabilities in generating long, high-resolution, and temporally coherent videos given various references. Despite these rapid technical advances, effectively aligning such massive generative models with human intent remains an open challenge. The high dimensionality of"},{"citing_arxiv_id":"2605.06912","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge","primary_cat":"cs.CV","submitted_at":"2026-05-07T20:22:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The SAFE challenge shows measurable progress in detecting synthetic videos across different generators but persistent weaknesses against post-processing operations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06509","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:21:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03652","ref_index":28,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics","primary_cat":"cs.CV","submitted_at":"2026-05-05T11:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Sora [2] demonstrated the effectiveness of large-scale DiT training with spatiotemporal attention for generating coherent, temporally extended videos. In the open-source domain, models such as HunyuanVideo [ 3], Wan 2.2 [4], CogVideoX [6], Open-Sora [27], and SkyReels [8, 9] have rapidly narrowed the gap with proprietary systems such as Kling [5], Seedance [1, 7], and Vidu [28] through scaling data curation [29] and architectural improvements including 3D RoPE [30], Mixture-of-Experts [31], and efficient attention [32], with standard benchmarks such as FVD [ 33], FID [34], and VBench [35] tracking this progress. The effectiveness of this paradigm rests on a single, often unstated premise: natural video implicitly encodes a universal physical prior that diffusion models absorb automatically during"},{"citing_arxiv_id":"2605.02641","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Technical Report Table 3Text-to-Video evaluation results on VBench 2.0 benchmark. Higher is better. Best results are inboldand second best are underlined. Name Date Creativity Common. Control. Human Physics Total↑ Proprietary Sora-480p [7] 2025-03 60.57 64.32 22.09 87.72 57.18 58.38 Kling1.6 [23] 2025-03 48.58 65.45 33.05 83.56 64.35 59.00 Vidu Q1 [48] 2025-04 56.54 65.98 38.13 81.24 71.63 62.70 Seedance 1.0 Pro [49] 2025-06 53.04 64.31 39.84 77.06 64.81 59.81 Veo3 [50] 2025-09 60.85 69.48 47.04 86.88 69.35 66.72 HunyuanVideo [3] 2025-03 41.84 63.44 28.60 82.41 60.20 55.30 Wan2.1 [4] 2025-03 55.2563.98 37.32 81.6062.84 60.20 LongCat-Video [22] 2025-10 54.73 70.94 44.7980.20 59.92 62.11 Mamoda2.5 2026-02 53."},{"citing_arxiv_id":"2604.27505","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Image editing has evolved from earlier task-specific systems for photo adjustment and exemplar-based stylization [23, 45, 67, 69, 71] to modern diffusion-based editors. With the rapid progress of diffusion and flow- based generative models, text-to-image (T2I) generation [2,6,10-13,15,16,18,25,28,32,40,42,44,46,60,70], image editing [5, 8, 35, 43, 50, 51, 56, 63], and video generation [3, 4, 7, 9, 17, 24, 29, 38, 47-49, 52, 54, 68] have advanced dramatically. In T2I generation, Reinforcement Learning from Human Feedback (RLHF) has become a core post-training step [16, 18, 60], driven by powerful reward models (RMs) [39, 57, 64] and optimization algorithms [53, 64, 66]. By contrast, the application of RLHF to image editing has remained"},{"citing_arxiv_id":"2604.17887","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement","primary_cat":"cs.RO","submitted_at":"2026-04-20T06:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 3 [2] Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 3 [3] Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233, 2024. 3 [4] Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao,"},{"citing_arxiv_id":"2604.12255","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception","primary_cat":"cs.CV","submitted_at":"2026-04-14T04:05:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11283","ref_index":140,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey","primary_cat":"cs.CV","submitted_at":"2026-04-13T10:42:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MakeVideo [120], MagicVideo [121], AlignLatent [122], Dysen-VDM [123], VideoCrafter [124], Latent-VDM [125], Latent-Shift [126], LVDM [127], Tune-A-Video [128], AnimateDiff [129], Stable Video Diffusion [130], Kling-Avatar [131], Empatheia [132],Wav2Lip [133], TalkLip [134],VideoReTalking [135], IP-LAP [136], Diff2lip [137], MuseTalk [138], LatentSync [139] DiT-based Vidu [140], Phantom [141], OmniHuman-1 [142], CogVideoX [143], VideoAuteur [144], Prompt-A-Video [145], HunyuanVideo [146], Wan [147], MIDAS [148],OmniSync [149] Fig. 1. Taxonomy of MLLMs-based video translation, encompassing three primary dimensions: The Semantic Reasoner, Expressive Performer, and Visual Synthesizer, each with distinct sub-categories reflecting the diverse strategies employed in this role."},{"citing_arxiv_id":"2604.03819","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos","primary_cat":"cs.CV","submitted_at":"2026-04-04T18:00:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"advancing fine-grained video forensics research and foster- ing digital integrity infrastructures. 2. Related Works 2.1. Video Manipulation Methods Recent advances in video manipulation are largely driven by conditioned video generation and masked video editing. For conditioned generation methods, models such as Wan [33], FCVG [44], Scifi [9], and Vidu [1] synthesize temporally coherent sequences under text, pose, or key-frame condi- tioning, enabling controllable and high-fidelity creation of new actions. For masked video editing, approaches includ- ing the V ACE framework [15] and LTX [11] perform local- ized modifications guided by prompts, masks, and frame constraints while preserving the surrounding appearance"},{"citing_arxiv_id":"2602.13669","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation","primary_cat":"cs.CV","submitted_at":"2026-02-14T08:32:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.02214","ref_index":1,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-02-02T15:19:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16163","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning","primary_cat":"cs.AI","submitted_at":"2026-01-22T18:09:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Markov decision processes (MDPs) defined by the tuple⟨S, A, T, R, H⟩, whereSis a set of states, Ais a set of actions,T:S×A→Π(S)is the state transition function,R:S×A→Ris the reward function, andH∈Nis the time horizon, with time stepst∈ {1,2, . . . , H}. We train a policyπ:S→Π(A)to maximize rewards, using sparse rewards whereR(s t, at) = 0for t < Hand terminal rewardsR(s H , aH)∈[0,1]. We train policies via imitation learning on expert demonstrations containing state-action pairs. Following Zhao et al. (2023), all policies predict action chunks-sequences of actions for multiple timesteps-to improve motion smoothness and success rates. 3 Figure 2: The latent diffusion sequence of Cosmos Policy.We illustratelatent frame injection-the primary"},{"citing_arxiv_id":"2510.08431","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency","primary_cat":"cs.CV","submitted_at":"2025-10-09T16:45:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The work introduces rCM, a score-regularized continuous-time consistency model that matches DMD2 quality on large models up to 14B parameters while improving diversity and enabling 1-4 step sampling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.12898","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vidar: Embodied Video Diffusion Model for Generalist Manipulation","primary_cat":"cs.LG","submitted_at":"2025-07-17T08:31:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vidar shows that a video diffusion prior continuously pre-trained on 750K multi-view robot trajectories plus a label-free masked inverse dynamics adapter can generalize manipulation to new robot embodiments with 1% of typical demonstration data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.12768","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation","primary_cat":"cs.CV","submitted_at":"2025-07-17T03:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnyPos automates task-agnostic action collection and inverse-dynamics modeling with arm/end-effector decoupling plus a direction-aware decoder, delivering 51% higher test accuracy and 30-40% better success rates on bimanual tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.20314","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Wan: Open and Advanced Large-Scale Video Generative Models","primary_cat":"cs.CV","submitted_at":"2025-03-26T08:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.03736","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data","primary_cat":"cs.LG","submitted_at":"2024-06-06T04:22:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Absorbing discrete diffusion models the conditional distributions of clean data; reparameterizing yields a time-independent RADD that unifies with AO-ARMs and reaches SOTA perplexity among diffusion models on zero-shot language benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}