{"total":27,"items":[{"citing_arxiv_id":"2605.22344","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bernini: Latent Semantic Planning for Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-21T11:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22051","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation","primary_cat":"cs.CV","submitted_at":"2026-05-21T06:38:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17923","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training","primary_cat":"cs.DC","submitted_at":"2026-05-18T06:30:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AdaptiveLoad cuts computational imbalance in video DiT training from 39% to 18.9% and raises throughput 27.2% via memory-compute constraints and a custom LayerNorm-Modulate kernel.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17423","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration","primary_cat":"cs.CV","submitted_at":"2026-05-17T12:38:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17312","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:03:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17019","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StreamingEffect: Real-Time Human-Centric Video Effect Generation","primary_cat":"cs.CV","submitted_at":"2026-05-16T14:45:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16003","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-15T14:33:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14664","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiVE: Multiscale Vision-language features for reference-guided video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-14T10:19:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14382","ref_index":5,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T05:06:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Delta Forcing improves temporal coherence in interactive autoregressive video generation by estimating transition consistency from teacher-generator latent deltas and balancing it against a monotonic continuity objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15237","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A3D: Agentic AI flow for autonomous Accelerator Design","primary_cat":"cs.AR","submitted_at":"2026-05-14T01:28:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A3D is an agentic AI system that automates end-to-end hardware accelerator design for complex applications like LAMMPS and QMCPACK with no human intervention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12038","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-12T12:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Training is performed for 10000 optimization steps on 2 NVIDIA H200 GPUs, with a batch size of 1 per GPU. We adopt the AdamW optimizer with a learning rate of2e−4 . More implementation details are provided in the supplementary material. Baselines.We compare with state-of-the-art V2V generation and editing systems, including com- mercial APIs such as Runway Gen-4 [26], Kling O1 [33], and Kling O3 [13], as well as open-source methods such as Wan2.1-V ACE [34] and the closely related X-Humanoid [37]. Commercial APIs are evaluated as zero-shot reference-guided models without embodiment-specific adaptation, while trainable open-source baselines, including Wan2.1-V ACE [34] and X-Humanoid [37], are fine-tuned on the same synthetic dataset following their original training strategies."},{"citing_arxiv_id":"2605.11723","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:08:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To enable the model to detect subtle hallucinations in generated videos, we construct a large-scale dataset with fine-grained spatiotemporal anomaly annotations, specifically targeting sparse video distortions. The overall construction pipeline is illustrated in Fig. 2. Data Collection and Taxonomy.We collect images and prompts describing complex scenes and motions, and synthesize videos using SoTA video generation models, including Kling [ 40], Sora [5], and Wan [42]. We manually screen the outputs and curate ∼25K videos covering broad anomaly types. As shown in the top-right of Fig. 2, we build a new anomaly taxonomy de- signed to help VLMs identify subtle anomalies and support the subsequent three-stage training. Single-frame recognizable anomalies (Stage 1) include object distortion and human distortion, with"},{"citing_arxiv_id":"2605.07061","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Do Joint Audio-Video Generation Models Understand Physics?","primary_cat":"cs.SD","submitted_at":"2026-05-08T00:14:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Li, Min Yang, Dawei Zhu, Wei Zhang, et al. Sonicbench: Dissecting the physical perception bottleneck in large audio language models.arXiv preprint arXiv:2601.11039, 2026. [35] Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025. [36] Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026. [37] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. [38] Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili"},{"citing_arxiv_id":"2605.04515","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Priors to Perception: Grounding Video-LLMs in Physical Reality","primary_cat":"cs.CV","submitted_at":"2026-05-06T05:48:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Reasoning Chain (V ARC) combined with standard LoRA [16] provides an architecture-agnostic, lightweight solution that mechanistically enforces visual grounding to dismantle prior interference. 2.3 Physics-Aware Video Datasets Recent evaluation datasets [1, 24, 18] construct negative samples by capturing natural failures from text-to-video models (e.g., SVD [ 3], Sora [ 4], Kling [ 35], Veo [ 13]). However, this paradigm inherently conflates visual generative artifacts (e.g., texture flickering, object melting) with genuine physical fallacies. Consequently, evaluated models regress into shortcut-driven artifact detectors rather than rigorous physical reasoners. To overcome this, our Programmatic Adversarial Curriculum (PACC) synthesizes visually pristine yet physically invalid video pairs."},{"citing_arxiv_id":"2605.03652","ref_index":5,"ref_count":3,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics","primary_cat":"cs.CV","submitted_at":"2026-05-05T11:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 [ 1] on Prompt Understanding (+0.70, +22.4%) and Artistic Motion (+0.55, +16.9%). We are preparing accompanying resources for public release to support reproducibility and follow-up research. 1 Introduction Video generation has advanced rapidly, with models such as Sora [ 2], HunyuanVideo [3], Wan 2.2 [4], Kling [ 5], CogVideoX [6], Seedance [7], and SkyReels [8, 9] producing coherent and visually rich natural video. A key reason for this progress is that natural video obeys a single, universal set of physical laws: gravity pulls objects downward, rigid bodies conserve momentum, and light scatters according to well-defined optics. Every frame of natural video data"},{"citing_arxiv_id":"2605.01720","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages","primary_cat":"cs.CV","submitted_at":"2026-05-03T05:26:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27711","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control","primary_cat":"cs.RO","submitted_at":"2026-04-30T10:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25427","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025. [8] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. [9] Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025. [10] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative"},{"citing_arxiv_id":"2604.19193","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Far Are Video Models from True Multimodal Reasoning?","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CLVG-Bench and code are released here. Keywords:Video generation·Multimodal reasoning·Video evaluation 1 Introduction Driven by large-scale training on web-scale data with generative objectives [6, 31,74], video models have demonstrated groundbreaking zero-shot capabilities, evolving from mere instruction following to complex understanding and reason- ing [57,62,71,76]. Specifically, traditional reference-based video generation has arXiv:2604.19193v1 [cs.CV] 21 Apr 2026 2 X. Zhang et al. <image2><image1> Multiple Video Task Categories Multimodal Reasoning Reasoning-Oriented Metadata Text Image Audio Video Perception Element Editing PhysicalSimulation Script Continuation Logical Reasoning <context> PartialReference"},{"citing_arxiv_id":"2604.16592","ref_index":168,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Human Cognition in Machines: A Unified Perspective of World Models","primary_cat":"cs.RO","submitted_at":"2026-04-17T17:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"geometries [11, 39, 23, 50], as seen in our Table 2 and outlined in recent com- prehensive roadmaps [211]. The defining characteristic of modern video World Models is their ability to generate future states conditioned on current obser- vations and, crucially, latent or explicit actions. Contemporary systems like the open-source Wan [175], alongside models such as Sora [125], and Kling [168], do not merely synthesize pixels; with proper latent representations and training techniques they can learn the underlying physical, causal, and temporal laws governing their simulated environments. In Sec. 4.1 we will review works ac- cording to their contribution to World Model representation (utilizing memory, language, or perception). Then in Sec."},{"citing_arxiv_id":"2604.11804","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-13T17:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025. [32] Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Qiang Xu. Fulldit: Video generative foundation models with multimodal control via full attention. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15737-15747, 2025. [33] Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025. [34] Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, and Wenhan Luo. Let them talk: Audio-driven multi-person conversational video generation."},{"citing_arxiv_id":"2604.11789","ref_index":157,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-13T17:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Sora[ 101] and Google'sVeo, which produce long-duration videos with consistent object identity, texture, and interaction under complex motion and occlusion. The latest progress focuses on fine-grained controllability and multimodal synchronization, where region-level editing via masking and latent interpolation allows precise manipulation of individual objects without disrupting the background, and systems likeKling 2.6[157] andSora 2further advance native audio-visual alignment and timeline-based control, enabling detailed spatiotemporal editing and dynamic object-level animation. 3 Object-Centric Visual Understanding 3.1 Architecture Unlike general vision-language tasks, object-centric understanding necessitates the construction of precise region-level representations."},{"citing_arxiv_id":"2604.08995","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory","primary_cat":"cs.CV","submitted_at":"2026-04-10T06:00:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DiT quantization, V AE pruning, retrieval via GPU, etc.) 3 2 Related Works 2.1 Video Generation Models Recent video generation models have largely converged toward Diffusion Transformer (DiT)-based architectures [31], which directly model spatiotemporal tokens and enable scalable high-resolution and high-quality video synthesis. Closed-source models such as Sora [29], Kling [36], and Hailuo have achieved significant progress in complex motion modeling and high-quality video generation through large-scale data and model scaling. However, these approaches are primarily designed for offline generation and lack explicit modeling of actions and interaction. Moreover, their long-horizon consistency typically relies on implicit modeling mechanisms, making it difficult to maintain stable"},{"citing_arxiv_id":"2604.08646","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lucy Edit: Open-Weight Text-Guided Video Edit- ing. https://d2drjpuinn46lb.cloudfront.net/Lucy_Edit__High_Fidelity_Text_ Guided_Video_Editing.pdf [31] Kling Team. 2025. Kling-Omni Technical Report. arXiv:2512.16776 [cs.CV] [32] Tencent Hunyuan Foundation Model Team. 2025. HunyuanVideo 1.5 Technical Report. arXiv:2511.18870 [cs.CV] https://arxiv.org/abs/2511.18870 [33] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. 2025. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314(2025). [34] Yuhan Wang, Siwei Yang, Bingchen Zhao, Letian Zhang, Qing Liu, Yuyin Zhou, and Cihang Xie. 2025. Gpt-image-edit-1.5 m: A million-scale, gpt-generated"},{"citing_arxiv_id":"2604.08641","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On Semiotic-Grounded Interpretive Evaluation of Generative Art","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:30:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07958","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ants [8, 11, 29], significantly improved spatiotemporal mod- eling and enabled the generation of high-quality short video clips. More recently, Diffusion Transformers (DiTs) and large-scale generative architectures [24] have driven rapid advancements in video foundation models. Representative systems such as HunyuanVideo [14], Cosmos [1], Wan [36], and Kling [34] demonstrate strong capabilities in synthesiz- ing high-resolution, and physically plausible videos. These models benefit from scaling model capacity, improved ar- chitecture design, and training on large-scale curated video datasets, leading to substantial gains in realism, motion con- sistency, and multimodal alignment. 2.2. Video Editing Early diffusion-based video editing paradigms primarily"},{"citing_arxiv_id":"2602.07064","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization","primary_cat":"cs.CV","submitted_at":"2026-02-05T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}