Robust Online Video Instance Segmentation with Track Queries

Daniel McKee; Svetlana Lazebnik; Zitong Zhan

arxiv: 2211.09108 · v1 · pith:ZJRODK3Lnew · submitted 2022-11-16 · 💻 cs.CV

Zitong Zhan, Daniel McKee, Svetlana Lazebnik

classification 💻 cs.CV

keywords: segmentation, instance, video, track, methods, online, queries, frame

Recently, transformer-based methods have achieved impressive results on Video Instance Segmentation (VIS). However, most of these top-performing methods run in an offline manner by processing the entire video clip at once to predict instance mask volumes. This makes them incapable of handling the long videos that appear in challenging new video instance segmentation datasets like UVO and OVIS. We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark and considerably outperforms them on UVO and OVIS. This method, called Robust Online Video Segmentation (ROVIS), augments the Mask2Former image instance segmentation model with track queries, a lightweight mechanism for carrying track information from frame to frame, originally introduced by the TrackFormer method for multi-object tracking. We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
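
The track-query idea in the abstract can be illustrated with a minimal sketch. This is not the ROVIS or TrackFormer implementation: `Query`, `run_online`, `toy_decode`, and `keep_threshold` are hypothetical names, the toy decoder stands in for a real transformer decoder, and mask prediction is omitted. It only shows the bookkeeping: each frame is processed once, output embeddings of confident predictions are carried forward as track queries that keep their identity, and fresh queries pick up newly appearing objects.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Embedding = Tuple[float, ...]


@dataclass
class Query:
    embedding: Embedding
    track_id: int  # -1 marks a fresh query that has no identity yet


def run_online(
    frames: Sequence[Sequence[float]],
    decode: Callable[[Sequence[float], List[Query]], List[Tuple[Embedding, float]]],
    num_new_queries: int,
    keep_threshold: float = 0.5,  # assumed confidence cutoff, not from the paper
) -> List[List[int]]:
    """Process frames one at a time; return the track ids kept in each frame."""
    track_queries: List[Query] = []
    next_id = 0
    results: List[List[int]] = []
    for frame in frames:
        # Fresh queries give the decoder a chance to find new objects.
        fresh = [Query(embedding=(0.0,), track_id=-1) for _ in range(num_new_queries)]
        queries = track_queries + fresh
        outputs = decode(frame, queries)  # one (embedding, score) pair per query
        track_queries = []
        frame_ids: List[int] = []
        for query, (embedding, score) in zip(queries, outputs):
            if score < keep_threshold:
                continue  # low score: the track ends or the detection is dropped
            track_id = query.track_id
            if track_id < 0:  # a fresh query found an object: start a new track
                track_id = next_id
                next_id += 1
            # The output embedding becomes the track query for the next frame.
            track_queries.append(Query(embedding, track_id))
            frame_ids.append(track_id)
        results.append(frame_ids)
    return results


def toy_decode(frame, queries):
    """Toy stand-in for a decoder: each object is a single float 'signature'."""
    claimed = {q.embedding[0] for q in queries if q.track_id >= 0}
    unclaimed = [obj for obj in frame if obj not in claimed]
    outputs = []
    for q in queries:
        if q.track_id >= 0:
            # A track query scores high only if its object is still visible.
            outputs.append((q.embedding, 1.0 if q.embedding[0] in frame else 0.0))
        elif unclaimed:
            outputs.append(((unclaimed.pop(0),), 1.0))
        else:
            outputs.append((q.embedding, 0.0))
    return outputs


# Three frames: a third object appears in frame 2, the first leaves in frame 3.
frames = [[10.0, 20.0], [10.0, 20.0, 30.0], [20.0, 30.0]]
ids_per_frame = run_online(frames, toy_decode, num_new_queries=2)
```

Because only the current set of track queries is kept between frames, the memory needed by this loop does not grow with video length, which is the property the abstract contrasts with offline clip-level processing.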

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation
cs.CV 2026-06 unverdicted novelty 6.0

SA-VIS trains video instance segmentation models on sparse frame annotations via a Past-frames Feature Propagation module and frame-specific instance queries, showing only a 0.4% AP drop versus dense training on YouTu...

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation
cs.CV 2026-06 unverdicted novelty 5.0

SA-VIS uses Past-frames Feature Propagation and lightweight instance queries to achieve only a 0.4% performance drop in video instance segmentation when trained on 1/5 of the usual frame annotations.