VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

Daiqing Qi; Jincen Song; Lehan Yang; Sheng Li; Tianlong Wang; Weili Shi; Yuheng Liu

arxiv: 2503.10678 · v1 · pith:L3ZLRP3Mnew · submitted 2025-03-11 · 💻 cs.CV

Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li

classification 💻 cs.CV

keywords: matting, video, referring, alpha, dataset, diffusion, generation, instances


We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
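As background on the output the abstract refers to: an alpha matte assigns each pixel an opacity in [0, 1], and a frame is conventionally modeled as I = alpha * F + (1 - alpha) * B, where F is the foreground and B the background. The sketch below only illustrates that standard compositing equation on a toy clip. It is not the paper's method, and the names `composite_frame` and `composite_video` are hypothetical, chosen for this illustration.

```python
from typing import List

# One grayscale frame: rows of pixel values in [0, 1] (an assumption for this toy example).
Frame = List[List[float]]


def composite_frame(alpha: Frame, foreground: Frame, background: Frame) -> Frame:
    """Blend one frame with the standard matting equation I = a*F + (1 - a)*B."""
    return [
        [a * f + (1.0 - a) * b for a, f, b in zip(a_row, f_row, b_row)]
        for a_row, f_row, b_row in zip(alpha, foreground, background)
    ]


def composite_video(
    alphas: List[Frame], foregrounds: List[Frame], backgrounds: List[Frame]
) -> List[Frame]:
    """Apply the per-frame blend to every frame of a clip."""
    return [
        composite_frame(a, f, b)
        for a, f, b in zip(alphas, foregrounds, backgrounds)
    ]


# Toy clip: two frames of size 1x3, white foreground over black background.
# Alpha 1 keeps the foreground, alpha 0 keeps the background, 0.5 mixes them.
alphas = [[[1.0, 0.5, 0.0]], [[0.0, 0.5, 1.0]]]
foregrounds = [[[1.0, 1.0, 1.0]], [[1.0, 1.0, 1.0]]]
backgrounds = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]

result = composite_video(alphas, foregrounds, backgrounds)
```

In a referring-matting setting, the per-frame alpha values would come from a model conditioned on the caption; here they are hard-coded so the blend itself is easy to follow.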
