Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Chengjing Wu; Fei Che; Jiangtao Yao; Jiaxin Wang; Luoqi Liu; Tianbao Liu; Ting Liu; Xiangwei Feng; Xiaochao Qu; Zijie Lou

arxiv: 2601.12066 · v4 · pith:RCWK3COE · submitted 2026-01-17 · 💻 cs.CV


Zijie Lou, Xiangwei Feng, Jiaxin Wang, Jiangtao Yao, Fei Che, Tianbao Liu, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Ting Liu



keywords: video, removal, methods, object, bridge, input, objects, stochastic


Abstract

Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency. The project page is https://bridgeremoval.github.io/.
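To make the contrast with noise-initialized generation concrete, here is a minimal sketch of sampling an intermediate state on a stochastic bridge between a source clip and a target clip. The abstract does not specify the bridge's form, so the Brownian bridge used here, the function name `brownian_bridge_sample`, the `sigma` parameter, and the toy array shapes are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def brownian_bridge_sample(x_src, x_tgt, t, sigma=1.0, rng=None):
    """Sample x_t on a Brownian bridge pinned at x_src (t=0) and x_tgt (t=1).

    Assumed form (not taken from the paper):
        x_t = (1 - t) * x_src + t * x_tgt + sigma * sqrt(t * (1 - t)) * eps
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * x_src + t * x_tgt      # interpolate between the endpoints
    std = sigma * np.sqrt(t * (1.0 - t))      # noise vanishes at both endpoints
    return mean + std * rng.standard_normal(x_src.shape)

# Toy stand-ins for video latents with shape (frames, height, width).
rng = np.random.default_rng(0)
x_src = rng.standard_normal((4, 8, 8))  # source clip (objects present)
x_tgt = rng.standard_normal((4, 8, 8))  # target clip (objects removed)

x_start = brownian_bridge_sample(x_src, x_tgt, 0.0, rng=rng)
x_mid = brownian_bridge_sample(x_src, x_tgt, 0.5, rng=rng)
x_end = brownian_bridge_sample(x_src, x_tgt, 1.0, rng=rng)
```

Under this assumed form the path starts exactly at the source clip and ends exactly at the target clip, so every intermediate state carries structure from the input rather than starting from pure Gaussian noise.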


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver
cs.CV · 2026-05 · unverdicted · novelty 5.0

GenEraser proposes MC-MoE with bipartite text guidance, LD-CFG fusion, and a decoupled locator-preserver architecture for generalizable video object and effect removal, claiming 2.16 dB and 1.44 dB gains on ROSE and V...