{"total":13,"items":[{"citing_arxiv_id":"2605.30409","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25193","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing","primary_cat":"cs.CV","submitted_at":"2026-05-24T17:50:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpongeBob introduces the first end-to-end audio-visual joint editing framework using sync-aware bidirectional attention and context-aware modules, plus a new dataset and benchmark, claiming 30% Sync-C and 12.5% Ctx-F1 gains over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24674","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-23T17:22:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized 
edits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23192","ref_index":122,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-22T03:19:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22344","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bernini: Latent Semantic Planning for Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-21T11:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14664","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiVE: Multiscale Vision-language features for reference-guided video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-14T10:19:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06535","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04569","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention","primary_cat":"cs.CV","submitted_at":"2026-05-06T07:15:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02641","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 
95.9x faster than baselines.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"84 77.06 64.81 59.81 Veo3 [50] 2025-09 60.85 69.48 47.04 86.88 69.35 66.72 HunyuanVideo [3] 2025-03 41.84 63.44 28.60 82.41 60.20 55.30 Wan2.1 [4] 2025-03 55.25 63.98 37.32 81.60 62.84 60.20 LongCat-Video [22] 2025-10 54.73 70.94 44.79 80.20 59.92 62.11 Mamoda2.5 2026-02 53.81 69.19 38.61 84.56 62.05 61.64 VACE-14B [38], InsViE [52], Lucy-Edit [53], ICVE [54], Ditto [35], OpenVE-Edit [37], and VInO [17], while the closed-source baselines include PixVerse [55], Kling O1, and a top-tier proprietary model. Across the seven spatially aligned task categories, Mamoda2.5 achieves state-of-the-art overall performance, with an Overall score of 3.86, the highest among all evaluated models, surpassing both a top-tier proprietary"},{"citing_arxiv_id":"2604.15871","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-17T09:21:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 Evaluation Benchmarks for Visual Editing Evaluating visual editing quality remains challenging due to the complexity of disentangling intended edits from content preservation. Traditional metrics such as PSNR [41] and CLIP Score [16] often show limited alignment with human perception. To address this, several task-specific benchmarks, including EditVal [3], PIE-Bench, and EditBench [30], have been proposed. 
However, these UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs Figure 2: The overall pipeline of UniEditBench. (A) Multi-Source Data Aggregation: A comprehensive dataset is constructed by aggregating assets from the internet, AI generation models (e.g., FLUX, SD3, Wan-video [42], HunyuanVideo [21]), and existing"},{"citing_arxiv_id":"2604.08646","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"These studies provide a useful foundation for instruction following, but they work on static images and therefore do not address consistency over time or edits that happen only in part of a video. An emerging line of work tries to unify visual understanding, generation, and editing across different modalities. Representative examples include UniWorld [21], DreamVE [40], InstructX [26], Uni- Video [35], OmniV2V [19], VACE [13], UNIC [44], EditVerse [14], UniVid [24], and Kling-Omni [31]. These studies suggest that image and video editing can benefit from shared backbones and shared instruction-following ability. 
However, existing unified frameworks focus mainly on sharing the architecture, whereas our focus is to"},{"citing_arxiv_id":"2604.07958","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"approximately 20 GB of VRAM per GPU, making it accessible to train ImVideoEdit even on a single 3090 GPU. Baseline Settings. To comprehensively evaluate the superiority of ImVideoEdit, we benchmark our framework against several recent state-of-the-art video editing models, including VACE (1.3B & 14B) [13], OmniVideo2-1.3B [31, 40], Lucy-Edit-Dev [33], Kiwi-Edit [17], DITTO [2], and ICVE [16]. To ensure a strictly fair comparison, all baseline methods are evaluated utilizing their official codebases and default inference hyperparameters. Evaluation Dataset. We construct a meticulously curated testing benchmark encompassing 10 predefined video editing categories, with 25 high-quality samples allocated for each task. 
To guarantee a diverse range of scenes and high-"},{"citing_arxiv_id":"2512.07469","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoCoF: Unified Video Editing with Temporal Reasoner","primary_cat":"cs.CV","submitted_at":"2025-12-08T11:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}