{"total":59,"items":[{"citing_arxiv_id":"2606.28128","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation","primary_cat":"cs.CV","submitted_at":"2026-06-26T14:30:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates from 16% to 24%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28026","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EMOSH: Expressive Motion and Shape Disentanglement for Human Animation","primary_cat":"cs.CV","submitted_at":"2026-06-26T12:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EMOSH proposes an Expressive Human Model with disentangled parameters, coarse-to-fine motion injection, and spatially-aligned conditioning to generate high-fidelity expressive human videos without driving-subject shape leakage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25465","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EchoStyle: Unlocking High-Fidelity Video Stylization with Reverse Data Synthesis","primary_cat":"cs.CV","submitted_at":"2026-06-24T06:45:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EchoStyle is a text-driven framework for arbitrary-length video stylization that creates the V-Style20k dataset through reverse synthesis and adds init-follow-mode with sliding windows to 
reduce style drift and motion issues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30774","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping","primary_cat":"cs.CV","submitted_at":"2026-05-29T03:02:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CameraNoise embeds camera motion into the noise space of video diffusion via Geometry-guided Reprojection Flow and noise warping to achieve faithful trajectory control while preserving the diffusion prior.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30409","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28816","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players","primary_cat":"cs.CV","submitted_at":"2026-05-27T17:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, 
controllability, and consistency than baselines while generalizing from 2 to 4 players.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28035","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation","primary_cat":"cs.AI","submitted_at":"2026-05-27T06:38:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MTAVG-Bench 2.0 is a new benchmark that evaluates omni LLMs on diagnosing high-level cinematic failures in multi-talker audio-video generation using a taxonomy of acting, narrative, atmosphere, and audio-visual language.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23891","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:54:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Smart-Insertion-V is a dual-stream closed-loop framework with Dual-World-View RoPE and a Decoupled Guidance Module that inserts reference objects into videos while achieving stylistic harmony despite domain gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23878","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video 
Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:34:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23610","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T13:20:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23522","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T11:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19391","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tweedie's Formulae and Diffusion Generative Models Beyond 
Gaussian","primary_cat":"stat.ML","submitted_at":"2026-05-19T05:36:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Extends Tweedie's formulae to GBM, BESQ, and CIR processes to enable non-Gaussian diffusion generative models and empirical Bayes applications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17248","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Image-to-Video Diffusion: From Foundations to Open Frontiers","primary_cat":"cs.CV","submitted_at":"2026-05-17T04:10:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"I2V diffusion generation as a first-class topic and provides a more complete taxonomy and systematic analysis of its technical foundations and future directions. In response, as seen in Fig. 4 we build a two-level tax- onomy that organizes first by model architecture,e.g., U- Net [35], [36], DiT [8], [37], and then by training paradigm, e.g., training-free [38], [39], multi-stage [40], [41], and fine- tuning [42], [43] schemes. Simultaneously, we provide a detailed review of the definition, architectures, datasets, and evaluation metrics of I2V diffusion, offering a comprehensive understanding of this field. 
Furthermore, to study the foun- dation techniques in I2V , we analyze four key designs,i.e., condition encoding, temporal modeling, noise prior designs,"},{"citing_arxiv_id":"2605.15980","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-15T14:13:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14269","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:12:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than prior 2D rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15237","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A3D: Agentic AI flow for autonomous Accelerator Design","primary_cat":"cs.AR","submitted_at":"2026-05-14T01:28:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A3D is an agentic AI system that automates end-to-end hardware accelerator design for complex applications like LAMMPS and QMCPACK with no human 
intervention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14136","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-13T21:39:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10434","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors","primary_cat":"cs.CV","submitted_at":"2026-05-11T12:06:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation athttps://github.com/UniX-AI-Lab/WorldReasonBench/. 1 Introduction The rapid advance of large-scale video generation models [17, 9, 23, 28, 27] has shifted the central question in video generation. 
Frontier systems in the Seedance, Veo, and Sora families [3, 26, 1] now produce longer, cleaner, and more controllable videos, while recent studies suggest that video models may already exhibit zero-shot learning and reasoning-like behavior in selected settings [26]. These advances make it increasingly plausible to ask whether modern video generators are beginning to act as world models rather than only powerful pixel synthesizers."},{"citing_arxiv_id":"2605.10079","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:01:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos. 1 Introduction Diffusion-based video generation has advanced rapidly in recent years. Large-scale models such as Seedance [15], Wan [60], and LTX [18] can now synthesize photorealistic characters, motions, and camera movements from text or image prompts. Building on this progress, studies have started extending to multi-person scenarios [28, 40, 20, 81, 61], motivated by applications in film production [53, 71] and social robotics [7, 4]. 
Among these, social interactions, where individuals engage in"},{"citing_arxiv_id":"2605.06912","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge","primary_cat":"cs.CV","submitted_at":"2026-05-07T20:22:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The SAFE challenge shows measurable progress in detecting synthetic videos across different generators but persistent weaknesses against post-processing operations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04702","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-06T09:54:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FaithfulFaces introduces a pose-faithful identity aligner with a shared dictionary and invariance constraint to maintain facial identity in text-to-video generation under large pose changes and occlusions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03652","ref_index":1,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics","primary_cat":"cs.CV","submitted_at":"2026-05-05T11:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators 
on prompt understanding and artistic motion.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"near-physical motion to expressive anime motion. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional ani- mators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 [ 1] on Prompt Understanding (+0.70, +22.4%) and Artistic Motion (+0.55, +16.9%). We are preparing accompanying resources for public release to support reproducibility and follow-up research. 1 Introduction Video generation has advanced rapidly, with models such as Sora [ 2], HunyuanVideo [3], Wan 2.2 [4], Kling [ 5], CogVideoX [6], Seedance [7], and SkyReels [8, 9] producing coherent and visually rich natural video."},{"citing_arxiv_id":"2605.02641","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 3Text-to-Video evaluation results on VBench 2.0 benchmark. Higher is better. Best results are inboldand second best are underlined. Name Date Creativity Common. Control. 
Human Physics Total↑ Proprietary Sora-480p [7] 2025-03 60.57 64.32 22.09 87.72 57.18 58.38 Kling1.6 [23] 2025-03 48.58 65.45 33.05 83.56 64.35 59.00 Vidu Q1 [48] 2025-04 56.54 65.98 38.13 81.24 71.63 62.70 Seedance 1.0 Pro [49] 2025-06 53.04 64.31 39.84 77.06 64.81 59.81 Veo3 [50] 2025-09 60.85 69.48 47.04 86.88 69.35 66.72 HunyuanVideo [3] 2025-03 41.84 63.44 28.60 82.41 60.20 55.30 Wan2.1 [4] 2025-03 55.2563.98 37.32 81.6062.84 60.20 LongCat-Video [22] 2025-10 54.73 70.94 44.7980.20 59.92 62.11 Mamoda2.5 2026-02 53.81 69.19 38.61 84.56 62.05 61.64 VACE-14B [38], InsViE [52], Lucy-Edit [53], ICVE [54], Ditto [35], OpenVE-Edit [37], and VInO [17], while"},{"citing_arxiv_id":"2605.02134","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video Generation with Predictive Latents","primary_cat":"cs.CV","submitted_at":"2026-05-04T01:30:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors. Date:May 5, 2026 Correspondence:Feng Wang atwangfeng.eve@bytedance.com Project Page:https://zhao-yian.github.io/PVVAE 1 Introduction Video generation has achieved extraordinary breakthroughs [16, 22, 37, 50, 58], with contemporary models producing content of cinematic brilliance that often surpasses professional-grade cinematography and pro- duction standards. 
This rapid progress stems from the ability to represent the visual world within compact latent spaces, largely driven by advances in Latent Video Diffusion Models (LVDMs) [5] and Video Variational"},{"citing_arxiv_id":"2605.01761","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks","primary_cat":"cs.CV","submitted_at":"2026-05-03T07:49:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01725","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Motion-Aware Caching for Efficient Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-03T05:49:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of 6.28× and 1.64× respectively, while effectively preserving generation quality (VBench:1%↓ and 0.01%↓respectively). The code is available at https://github.com/ywlq/MotionCache. 
1 Introduction Video generation models [12, 18, 24, 26, 34, 41, 47] have achieved remarkable success, facilitating applications ranging from autonomous driving [10, 11, 37] and cinematic creation [6, 38] to social media [3]. While architectures have evolved from U-Nets [2, 27, 29] to scalable Diffusion Transformers (DiTs) [25], practical deployment is hindered by the prohibitive costs of iterative denoising."},{"citing_arxiv_id":"2605.01517","ref_index":251,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation","primary_cat":"cs.CV","submitted_at":"2026-05-02T16:10:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27505","ref_index":17,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-firstinternational conference on machine learning, 2024. 
[16] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. [17] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025. 11 [18] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al."},{"citing_arxiv_id":"2604.25427","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"while adhering to strict sampling cost constraints. 1 Introduction Recent years have seen rapid progress in large-scale diffusion models and diffusion-transformer models [1, 2, 3, 4, 5, 6]. These models have advanced from generating short, low-resolution clips to producing longer, higher-resolution videos with more complex motion and richer semantics [7, 8, 9, 10]. Despite these improvements, pretrained video generation models still fall short of real-world deployment requirements [11, 12]. 
In practice, they are often sensitive to prompt wording, unstable over long time horizons, prone to local artifacts, such as errors in hands, text, and fast motion, and limited in instruction-following and controllable editing."},{"citing_arxiv_id":"2604.21776","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting","primary_cat":"cs.CV","submitted_at":"2026-04-23T15:32:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• Third, we demonstrate through extensive experiments that our approach achieves state-of-the-art temporal con- sistency and camera control across a diverse set of dy- namic videos, significantly advancing the capabilities of video reshooting. 2. Related Work 2.1. Camera Controls for Video Generation Models The introduction of transformer-based diffusion models [8, 26, 27, 34, 46] has catalyzed an explosion in video gen- eration research and the development of methods that en- able precise video control [6, 10, 42, 47]. 
These addi- tional control signals encompass user interactions like drag- ging [9, 17, 18, 51], explicit camera coordinates [12, 38, 41, 43, 45, 48, 50], and human pose estimation [4, 7, 28, 36]."},{"citing_arxiv_id":"2604.20157","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HumanScore: Benchmarking Human Motions in Generated Videos","primary_cat":"cs.CV","submitted_at":"2026-04-22T03:51:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"tion details for each of the metrics are provided in the supplementary materials. 4 Experiments and Main Results 4.1 Evaluate Video Generation Models We benchmark thirteen video generation models, including four open-source sys- tems-Wan 2.2 [68], CogVideoX-5B [74], HunyuanVideo 1.5 [72], and Kandinsky 5.0 pro [5]-and nine proprietary systems: Sora-2 [46], Veo 3.1 fast [38], KlingAI 2.5 Turbo Pro [2], Seedance 1.0 Pro fast [16], Hailuo 02 [45], Pika v2.2 [49], 10 Y. Fang, T. Xiang et al. Table 1: HumanScore Leaderboard.Higher scores indicate better performance. The best score in each dimension is highlighted in cell colors. 
Models Anatomy Correctness Kinematic Correctness Kinetic Correctness Overall (I) (II) Avg (III) (IV) Avg (V) (VI) Avg Proprietary models Seedance 1.0 Pro fast [16] 94."},{"citing_arxiv_id":"2604.19741","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CityRAG: Stepping Into a City via Spatially-Grounded Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:59:03+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We show that our approach demonstrates strong 3D understanding of the underlying scene, disentangles dynamic and static elements without any additional heuristics, and generates realistic sequences across diverse settings. 2 Related Works 2.1 Video Generative Models Advances in video generative models [21,25,28,31,40,42,45,57] have unlocked a wide range of applications, such as content generation [14,49], novel view synthesis [33,69], and simulations for autonomous driving and robotics [19,61]. 
Most popular formulations include text-to-video (T2V) [50,67] and image-to-video (I2V) [3,5,6] generation due to their scalability, and they can then be finetuned based on the requirements of downstream applications. Our application requires long-term consistency, pose control, and integration of external context."},{"citing_arxiv_id":"2604.19092","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-21T05:09:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"uate the embodied executability of state-of-the-art video world models, revealing their challenges and potential for physically grounded robot learning. 2 Related Work 2.1 World Models for Robotics Recent advances in large-scale video generation have renewed interest in world models as predictive models of physical dynamics [1,2,19,20,47,50,53]. Models such as Sora [8], Veo [17], Wan [49], and Seedance [16] demonstrate strong visual realism and temporal coherence, suggesting that internet-scale training can capture rich spatiotemporal priors. 
However, while these models can generate visually plausible manipulation videos, it remains unclear whether they preserve physical consistency and interaction dynamics in a manner that supports executable control."},{"citing_arxiv_id":"2604.18215","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-20T13:00:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"strate that our approach achieves state-of-the-art performance in terms of both visual quality and spatial consistency. The code is available at https://github.com/iguoyanjun/Memorize-When-Needed. Keywords: Long-horizon Video Generation · Spatial Consistency · Decoupled Memory Modeling · Camera-aware Gating 1 Introduction Recent state-of-the-art video generation models [4,13,26-28,48,54] have achieved impressive spatio-temporal coherence within short-term sequences. However, extending such fidelity to long-horizon synthesis remains a challenge [12,26,44]. ∗ Equal contribution. † Corresponding author. arXiv:2604.18215v2 [cs.CV] 21 Apr 2026 2 Y. Guo et al. 
Fig. 1: Comparison of model architecture of long-horizon video generation"},{"citing_arxiv_id":"2604.17195","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior","primary_cat":"cs.CV","submitted_at":"2026-04-19T01:51:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Storyboard data are scarce and need to be extracted from cinematic videos. Existing open-source datasets [28, 45] rarely contain storyboard-style structures, while high-quality narrative videos are difficult to access. To mitigate this, we collected 40K high-quality videos from web and supplemented them with 50K AIGC-generated videos created by advanced video tools [11] and diverse prompts. These synthetic videos are easier to obtain and cover a wide range of styles, scenes, and narratives, providing rich material for storyboard generation. 
Narrative Scene Structuring.Next, unlike method [43] that uniformly sample frames as storyboards, we extract repre- sentative, narratively coherent, and scene-consistent shots"},{"citing_arxiv_id":"2604.15911","ref_index":276,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Efficient Video Diffusion Models: Advancements and Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-17T10:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15299","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AnimationBench: Are Video Models Good at Character-Centric Animation?","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14148","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedance 2.0: Advancing Video Generation for World Complexity","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for 
up to 3 video, 9 image, and 3 audio references.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16503","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Motif-Video 2B: Technical Report","primary_cat":"cs.CV","submitted_at":"2026-04-14T15:09:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model with 7x fewer parameters and far less training data through shared cross-attention and a three-part backbone.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models. 1 Introduction Video generation has entered a scaling regime. The most capable open models, Wan2.1 [36], Hunyuan- Video [18], and Seedance [11], are trained on hundreds of millions of curated clips, with parameter counts ranging from 5B to 14B. This concentration of resources has produced impressive results, but it has also narrowed participation: in practice, training a competitive video generation model is accessible to very few groups. 
The image generation domain has begun to challenge this assumption."},{"citing_arxiv_id":"2604.11521","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Continuous Adversarial Flow Models","primary_cat":"cs.LG","submitted_at":"2026-04-13T14:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Our method yields better generalization. 1 Introduction Flow matching [46] has achieved significant success in recent years, yet a critical problem remains. The issue is particularly evident in the generation of visual 1 Correspondence to: Shanchuan Lin <peterlin@bytedance.com> arXiv:2604.11521v1 [cs.LG] 13 Apr 2026 2 S. Lin et al. modalities, such as image [6,67] and video [15,65,66] synthesis, where models often produce out-of-distribution samples unless guidance is applied [14,21,33]. While guidance improves sample quality, it alters the sampling distribution. How to more faithfully model the underlying distribution of the original data remains an open problem. 
One reason flow matching generates out-of-distribution samples is that it uses"},{"citing_arxiv_id":"2604.10980","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tracking High-order Evolutions via Cascading Low-rank Fitting","primary_cat":"cs.LG","submitted_at":"2026-04-13T04:39:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cascading low-rank fitting approximates successive high-order derivatives in diffusion models via a shared base function with sequentially added low-rank components, accompanied by theorems proving monotonic non-increasing ranks under linear decomposability and the possibility of arbitrary rank perm","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16479","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Latent-Compressed Variational Autoencoder for Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-04-12T04:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"training [38]. Latent Video Diffusion Models.While autoregressive video generation models [24, 50] synthesize videos within discrete token spaces, latent diffusion models [3, 18] operate in continuous latent spaces and form the backbone of several state-of-the-art text-to-video models [11], including Sora [4], Hunyuan Video [25], Wan [44], and Seedance 1.0 [12]. 
Beyond the design of the video VAEs [50], these models often introduce specific diffusion model architectures [25], employ distinct training strategies [16] or perform specific data curation [56]. In this work, however, we focus on developing a diffusion-agnostic video VAE, rather than pursuing state-of-the-art text-to-video generation, which typically re-"},{"citing_arxiv_id":"2604.10103","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation","primary_cat":"cs.CV","submitted_at":"2026-04-11T08:54:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing. Keywords: Streaming Video Generation·Hybrid Attention·Sparse Attention·Linear Attention 1 Introduction The rapid scaling of video diffusion models (VDMs) has substantially improved video generation quality, enabling high-fidelity synthesis from text, image, or video inputs [11,12,21,32,33,60] to multimodal outputs [2,11,19,23,28,36,42]. However, most large-scale VDMs rely on bidirectional attention, restricting arXiv:2604.10103v2 [cs.CV] 28 Apr 2026 2 R. Li et al. Fig. 1: Illustration of our hybrid attention paradigm for SVG. 
(a) The standard SWA approach only caches the most recent frames, leading to significant error accumulation"},{"citing_arxiv_id":"2604.07958","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"computational overhead, ImVideoEdit achieves editing fi- delity and temporal consistency comparable to larger mod- els trained on extensive video datasets. 1. Introduction Diffusion models, particularly 3D Diffusion Transformers (3D DiTs), have achieved revolutionary breakthroughs in video generation, as demonstrated by cutting-edge models like Seedance [7, 27] and Veo [5]. However, generating high-quality videos is merely the first step. Real-world con- tent creation demands not only superior generation but also robust editing capabilities that strike a balance between se- mantic manipulation and structural preservation. 
Specifi- cally, this requires the ability to execute precise modifica- *Equal Contribution."},{"citing_arxiv_id":"2604.07026","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Not all tokens contribute equally to diffusion learning","primary_cat":"cs.CV","submitted_at":"2026-04-08T12:45:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06339","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evolution of Video Generative Foundations","primary_cat":"cs.CV","submitted_at":"2026-04-07T18:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Index Terms-Video Generation, Generative Adversarial Networks, Diffusion models, Auto-regressive Models, MultiModal Generation ✦ 1 INTRODUCTION The rapid advancement and widespread popularity of Artificial Intelligence Generated Content (AIGC) have sig- nificantly transformed the landscape of video generation, primarily driven by the emergence of diffusion models [1]- [3]. Modern proprietary systems like OpenAI's Sora [4], Google's Veo3 [5], and ByteDance's Seedance [6], along- side influential open-source models such as Wan [7] and HunyuanVideo [8], demonstrate unprecedented capabilities in synthesizing temporally coherent and semantically rich videos. 
These diverse advancements herald a promising path toward building actionable \"world models\", which are comprehensive representations of the environment that enable machines to understand, predict, and interact with"},{"citing_arxiv_id":"2604.06010","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control","primary_cat":"cs.CV","submitted_at":"2026-04-07T16:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mechanism, we propose to conceptually decouple video generation into two in- dependent control dimensions: scene content and camera pose. While simulating realistic cinematographic operations is crucial for professional applications, cur- rent approaches typically restrict control to a single modality or focus on narrow tasks as shown in Table. 1. For instance, existing methods separately utilize tex- tual descriptions [13,15,34], 3D trajectories [1,2,16,22], or reference videos [26] for camera motion. They often struggle with the inherent limitations of each modality (e.g., text is too coarse, trajectories are hard to acquire) and fail to support the free combination of diverse content sources and camera conditions. 
To address these limitations and theoretically encompass all conceivable"},{"citing_arxiv_id":"2604.01621","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72","primary_cat":"cs.DC","submitted_at":"2026-04-02T05:00:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8 percent on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. [2] R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths. When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1.arXiv preprint arXiv:2410.01792, 2024. [3] Yu Gao, Haoyuan Guo, Tuyen Hoang, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025. [4] Audrey Cheng, Shu Liu, Melissa Pan, et al. Barbarians at the gate: How AI is upending systems research. arXiv preprint arXiv:2510.06189, 2025. [5] DeepSeek-AI, Aixin Liu, Bei Feng, et al. 
DeepSeek-V3 technical report."},{"citing_arxiv_id":"2603.28489","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"conditions are injected into the generative backbone through cross-attention, adaptive normalization, or token merging. For example, autoregressive frameworks such as iVideoGPT [50] serialize heterogeneous conditions into a unified sequence, whereas diffusion-based models more often fuse them through cross-attention layers or a token merging mechanism [36], [37]. Overall, the conditioning module determines not only what should be generated, but also how the generated world should evolve under external instructions or interactions. III. 
EFFICIENT MODELING Efficient modeling is central to scaling video generation from short clips to long-horizon, high-resolution sequences under practical latency and memory constraints."},{"citing_arxiv_id":"2601.20540","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Open-source World Models","primary_cat":"cs.CV","submitted_at":"2026-01-28T12:37:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}