
Kiwi-edit: Versatile video editing via instruction and reference guidance

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.

citation-role summary

baseline 2 · background 1

citation-polarity summary

fields

cs.CV 7 · cs.GR 1

years

2026 8

verdicts

UNVERDICTED 8

representative citing papers

Aurora: Unified Video Editing with a Tool-Using Agent

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.

Sound Sparks Motion: Audio and Text Tuning for Video Editing

cs.GR · 2026-05-14 · unverdicted · novelty 6.0

Sound Sparks Motion is a test-time tuning approach that adjusts audio and text conditioning signals in multimodal video models using VLM feedback to produce specific motion edits while preserving content.
