SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

· 2026 · cs.CV · arXiv 2605.25193

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.

representative citing papers

MMAE: A Massive Multitask Audio Editing Benchmark

cs.SD · 2026-06-05 · conditional · novelty 8.0

MMAE is a new multitask audio editing benchmark showing that leading models achieve under 5% exact match rate, with 0% on complex mixed-modality tasks.

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

cs.CV · 2026-06-29 · unverdicted · novelty 7.0 · 2 refs

Goku provides a 2M-pair dataset for multi-task structural video editing, Goku-Edit model with MLLM and dual-branch design, and Goku-Bench yielding up to 8% gains in instruction following.

citing papers explorer

MMAE: A Massive Multitask Audio Editing Benchmark

cs.SD · 2026-06-05 · conditional · none · ref 26 · internal anchor

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

cs.CV · 2026-06-29 · unverdicted · none · ref 20 · 2 links · internal anchor
