GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation

Bjoern Menze; Matthias Schubert; Maximilian Bernhard; Rajat Koner; Suprosanna Shit; Tanveer Hannan; Thomas Seidl; Volker Tresp

arxiv: 2305.17096 · v1 · pith:O7QHMYKXnew · submitted 2023-05-26 · 💻 cs.CV

Tanveer Hannan, Rajat Koner, Maximilian Bernhard, Suprosanna Shit, Bjoern Menze, Volker Tresp, Matthias Schubert, Thomas Seidl

classification 💻 cs.CV

keywords instance, attention, features, gate, gratt-vis, methods, residual

Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation in online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation offers a promising direction, at the cost of quadratic memory attention. Yet such methods are susceptible to the degradation of instance features due to the above-mentioned challenges and suffer from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation. Firstly, we leverage a Gumbel-Softmax-based gate to detect possible errors in the current frame. Next, based on the gate activation, we rectify degraded features using their past representation. Such a residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Secondly, we propose a novel inter-instance interaction using gate activation as a mask for self-attention. This masking strategy dynamically restricts the unrepresentative instance queries in the self-attention and preserves vital information for long-term tracking. We refer to this novel combination of Gated Residual Connection and Masked Self-Attention as the GRAtt block, which can easily be integrated into existing propagation-based frameworks. Further, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods. Code is available at https://github.com/Tanveer81/GRAttVIS.
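The three ingredients named in the abstract (a Gumbel-Softmax gate, a gated residual connection, and gate-masked self-attention) can be illustrated with a small standard-library sketch. This is not the authors' implementation; the linked repository is the reference. Several details here are assumptions made for illustration: the function names are hypothetical, a gate value of 1 is taken to mean "keep the current-frame feature" and 0 to mean "fall back to the previous frame", gated-off queries are excluded as attention keys, and the attention has no learned projections.

```python
import math
import random


def gumbel_softmax_gate(logits, tau=1.0, rng=None):
    """Sample a binary gate from two-class logits [off, on].

    Returns (hard, soft): the hard 0/1 decision and the tempered
    softmax probability of the "on" class.
    """
    rng = rng or random.Random(0)
    # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
    noisy = [
        (l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / tau
        for l in logits
    ]
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    soft = exps[1] / sum(exps)
    hard = 1 if noisy[1] > noisy[0] else 0
    return hard, soft


def gated_residual(current, previous, gate):
    """Assumed convention: gate 1 keeps the current feature, gate 0 reuses the past one."""
    return [gate * c + (1 - gate) * p for c, p in zip(current, previous)]


def masked_self_attention(queries, gates):
    """Dot-product self-attention in which gated-off queries are masked out as keys."""
    dim = len(queries[0])
    keep = [j for j, g in enumerate(gates) if g == 1]
    if not keep:
        # No representative query to attend to: pass features through unchanged.
        return [list(q) for q in queries]
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, queries[j])) / math.sqrt(dim)
            for j in keep
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(weights[k] / total * queries[j][d] for k, j in enumerate(keep))
            for d in range(dim)
        ])
    return out


# Toy example: two instance queries, the second one degraded in the current frame.
prev = [[1.0, 0.0], [0.0, 1.0]]
curr = [[0.9, 0.1], [5.0, -5.0]]
gates = [1, 0]  # hand-set here; gumbel_softmax_gate would produce these in practice
rectified = [gated_residual(c, p, g) for c, p, g in zip(curr, prev, gates)]
attended = masked_self_attention(rectified, gates)
```

In the toy example the second query is restored from its previous-frame feature, and both queries then attend only to the first, unmasked query. Because the residual path reuses the previous frame's feature directly, no separate memory bank appears in the sketch, which mirrors the abstract's remark that the residual configuration alleviates the need for dedicated memory.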

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation
cs.CV 2026-06 unverdicted novelty 6.0

SA-VIS trains video instance segmentation models on sparse frame annotations via a Past-frames Feature Propagation module and frame-specific instance queries, showing only a 0.4% AP drop versus dense training on YouTu...

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation
cs.CV 2026-06 unverdicted novelty 5.0

SA-VIS uses Past-frames Feature Propagation and lightweight instance queries to achieve only a 0.4% performance drop in video instance segmentation when trained on 1/5 of the usual frame annotations.