Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
Canonical reference
Unic: Unified in-context video editing
Canonical reference. 71% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
fields
cs.CV 9verdicts
UNVERDICTED 9representative citing papers
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
Smart-Insertion-V is a dual-stream closed-loop framework with Dual-World-View RoPE and a Decoupled Guidance Module that inserts reference objects into videos while achieving stylistic harmony despite domain gaps.
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
citing papers explorer
-
VideoCoF: Unified Video Editing with Temporal Reasoner
VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.