JAVEdit-100k is the first large-scale dataset for instruction-guided joint audio-visual video editing, accompanied by JAVEditBench and the JAVEdit model that outperforms baselines on five of six metrics.
hub
arXiv preprint arXiv:2510.15742 , year=
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 19years
2026 19verdicts
UNVERDICTED 19roles
background 4polarities
background 4representative citing papers
Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after building the InsAVE-80K dataset.
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.
SpongeBob introduces the first end-to-end audio-visual joint editing framework using sync-aware bidirectional attention and context-aware modules, plus a new dataset and benchmark, claiming 30% Sync-C and 12.5% Ctx-F1 gains over baselines.
RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior methods.
LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.
LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
A video world model framework that uses LLM-orchestrated 3D trajectories as control signals for generation to achieve persistent dynamic object memory and viewpoint freedom.
ByG enables unpaired training of flow matching editing models by pairing self-extracted instruction-following cues with cycle-consistency and routing gradients from clean predictions to noisy states.
A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.
Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
citing papers explorer
-
JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation
JAVEdit-100k is the first large-scale dataset for instruction-guided joint audio-visual video editing, accompanied by JAVEditBench and the JAVEdit model that outperforms baselines on five of six metrics.
-
Aurora: Unified Video Editing with a Tool-Using Agent
Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
-
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after building the InsAVE-80K dataset.
-
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
-
PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
-
SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.
-
SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing
SpongeBob introduces the first end-to-end audio-visual joint editing framework using sync-aware bidirectional attention and context-aware modules, plus a new dataset and benchmark, claiming 30% Sync-C and 12.5% Ctx-F1 gains over baselines.
-
Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing
RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.
-
MiVE: Multiscale Vision-language features for reference-guided video Editing
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior methods.
-
LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention
LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.
-
LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing
LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory
A video world model framework that uses LLM-orchestrated 3D trajectories as control signals for generation to achieve persistent dynamic object memory and viewpoint freedom.
-
Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching
ByG enables unpaired training of flow matching editing models by pairing self-extracted instruction-following cues with cycle-consistency and routing gradients from clean predictions to noisy states.
-
Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing
A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.
-
Bernini: Latent Semantic Planning for Video Diffusion
Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
Advancing Open-source World Models
LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.