{"total":19,"items":[{"citing_arxiv_id":"2607.02517","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory","primary_cat":"cs.CV","submitted_at":"2026-07-02T17:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A video world model framework that uses LLM-orchestrated 3D trajectories as control signals for generation to achieve persistent dynamic object memory and viewpoint freedom.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.16449","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory","primary_cat":"cs.CV","submitted_at":"2026-06-15T09:20:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03911","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching","primary_cat":"cs.CV","submitted_at":"2026-06-02T17:07:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ByG enables unpaired training of flow matching editing models by pairing self-extracted instruction-following cues with cycle-consistency and routing gradients from clean predictions to noisy 
states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03168","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation","primary_cat":"cs.CV","submitted_at":"2026-06-02T05:26:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JAVEdit-100k is the first large-scale dataset for instruction-guided joint audio-visual video editing, accompanied by JAVEditBench and the JAVEdit model that outperforms baselines on five of six metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30409","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25193","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing","primary_cat":"cs.CV","submitted_at":"2026-05-24T17:50:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpongeBob introduces the first end-to-end audio-visual joint editing framework using sync-aware bidirectional attention and context-aware modules, plus a new dataset and benchmark, 
claiming 30% Sync-C and 12.5% Ctx-F1 gains over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24674","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-23T17:22:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23192","ref_index":120,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-22T03:19:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22344","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bernini: Latent Semantic Planning for Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-21T11:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18748","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aurora: Unified Video Editing with a Tool-Using Agent","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18467","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InstructAV2AV: Instruction-Guided Audio-Video Joint Editing","primary_cat":"cs.CV","submitted_at":"2026-05-18T14:27:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after building the InsAVE-80K dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14664","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiVE: Multiscale Vision-language features for reference-guided video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-14T10:19:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided 
video editing, claiming top human preference scores over prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06535","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04569","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention","primary_cat":"cs.CV","submitted_at":"2026-05-06T07:15:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02641","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video 
generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"81 59.81 Veo3 [50] 2025-09 60.85 69.48 47.04 86.88 69.35 66.72 HunyuanVideo [3] 2025-03 41.84 63.44 28.60 82.41 60.20 55.30 Wan2.1 [4] 2025-03 55.25 63.98 37.32 81.60 62.84 60.20 LongCat-Video [22] 2025-10 54.73 70.94 44.79 80.20 59.92 62.11 Mamoda2.5 2026-02 53.81 69.19 38.61 84.56 62.05 61.64 VACE-14B [38], InsViE [52], Lucy-Edit [53], ICVE [54], Ditto [35], OpenVE-Edit [37], and VInO [17], while the closed-source baselines include PixVerse [55], Kling O1, and a top-tier proprietary model. Across the seven spatially aligned task categories, Mamoda2.5 achieves state-of-the-art overall performance, with an Overall score of 3.86, the highest among all evaluated models, surpassing both a top-tier proprietary"},{"citing_arxiv_id":"2604.17021","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing","primary_cat":"cs.CV","submitted_at":"2026-04-18T15:09:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"efforts focus on automated data synthesis pipelines to generate large-scale video pairs. Early, InsV2V [8] employs the Prompt2Prompt [12] to create data by altering text prompts. InsViE-1M [47] generates target videos by editing the first frame and propagating changes through DDIM inversion across the temporal dimension. 
Recently, OpenVE-3M [11] and Ditto [2] introduce control signals to guide controllable video models, producing editing pairs with higher quality. Despite persistent efforts to enhance data quality, instruction-based video editing remains constrained by the scarcity of large-scale, diverse datasets com-"},{"citing_arxiv_id":"2604.11789","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-13T17:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"EditAR [116] unifies segmentation-to-image synthesis, instructed editing, and inpainting under a single conditional generation paradigm, demonstrating the versatility of modern autoregressive architectures. 5.3 Video Object Editing Video editing introduces temporal complexity beyond static image manipulation, requiring modifications to maintain consistency across frames while respecting motion dynamics and temporal coherence [6, 108, 208]. The challenge lies in balancing editing flexibility with temporal stability. 
5.3.1 Temporal Propagation TokenFlow [48] enforces temporal consistency through diffusion feature manipulation, propagating keyframe features to intermediate frames via linear combination. ContextFlow [30] addresses background conflicts through adaptive context enrichment mechanisms that preserve scene coherence during object editing."},{"citing_arxiv_id":"2604.07958","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"towards the end-to-end training of native video generative models. Approaches such as [13, 31, 37, 44] integrate Multimodal Large Language Models (MLLMs) with Diffusion Transformers to unify diverse editing tasks into a single architecture. Moreover, to overcome the data scarcity bottleneck in training end-to-end video editing models, methods like [3, 9, 18] have proposed large-scale synthetic video generation pipelines. However, end-to-end training on such massive video datasets inevitably incurs substantial computational overhead and high data generation costs. 
To circumvent this heavy reliance on exhaustive video-level optimization, we propose a novel approach that effectively achieves temporally coherent video editing by training ex-"},{"citing_arxiv_id":"2601.20540","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advancing Open-source World Models","primary_cat":"cs.CV","submitted_at":"2026-01-28T12:37:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}