SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

Cong Wang; Fengbin Guan; Sen Liang; Xin Li; Yiting Lu; Yuanzhi Wang; Yuan Zhou; Zhentao Yu; Zhibo Chen

arxiv: 2605.25193 · v2 · pith:TT4T4MN5new · submitted 2026-05-24 · 💻 cs.CV

keywords audio-visual · bidirectional · editing · spongebob · visual · acoustic · alignment · attention

Abstract

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.
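
The abstract does not describe how the bidirectional attention is built, so the following is only a generic sketch of bidirectional cross-modal attention, not SpongeBob's implementation. It assumes each modality is a list of equal-width token vectors, omits learned projections, and uses hypothetical names throughout (`cross_attend`, `bidirectional_block`, the toy `video` and `audio` tokens). The point is that both streams are updated in the same step, each reading from the other, in contrast to a decoupled pipeline where information flows one way only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query token attends over the other modality's tokens.

    Keys and values are the same vectors here (no learned projections),
    a simplification made for illustration only.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score against every token of the other modality.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the other modality's tokens.
        mixed = [sum(w * kv[d] for w, kv in zip(weights, keys_values))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs


def bidirectional_block(video_tokens, audio_tokens):
    """Residual update of both streams, each reading from the other."""
    video_ctx = cross_attend(video_tokens, audio_tokens)  # video <- audio
    audio_ctx = cross_attend(audio_tokens, video_tokens)  # audio <- video
    new_video = [[v + c for v, c in zip(tok, ctx)]
                 for tok, ctx in zip(video_tokens, video_ctx)]
    new_audio = [[a + c for a, c in zip(tok, ctx)]
                 for tok, ctx in zip(audio_tokens, audio_ctx)]
    return new_video, new_audio


# Hypothetical toy inputs: 3 video tokens and 2 audio tokens of width 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
new_video, new_audio = bidirectional_block(video, audio)
```

Both outputs keep the token count of their own modality, so a block like this can be stacked; a real model would add learned query, key, and value projections and multiple heads.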

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMAE: A Massive Multitask Audio Editing Benchmark
cs.SD 2026-06 conditional novelty 8.0

MMAE is a new multitask audio editing benchmark showing that leading models achieve under 5% exact match rate, with 0% on complex mixed-modality tasks.

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing
cs.CV 2026-06 unverdicted novelty 7.0

Goku supplies a 2M-scale dataset, synthesis pipeline, decoupled dual-branch model, and 1000-case benchmark for multi-task instruction-based video editing, reporting up to 8% gains in instruction following.

LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing
cs.CV 2026-06 unverdicted novelty 6.0

LiveEdit distills a bidirectional video foundation model into a unidirectional streaming editor via three-stage training plus mask caching to reach 12.66 FPS with stable edits.