SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Ajad Chhatkuli; Edoardo Mello Rella; Ender Konukoglu; Luc Van Gool; Shipra Jain

arxiv: 2606.20140 · v2 · pith:ENWDJZJ2new · submitted 2026-06-18 · 💻 cs.CV

Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool

Pith reviewed 2026-06-30 10:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords video instance segmentation · sparse annotations · feature propagation · online VIS · YouTube-VIS · OVIS · instance queries

The pith

A simple feature propagation module trains video instance segmentation on one-fifth the labels with only a 0.4 percent accuracy drop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that video instance segmentation does not need dense per-frame labels during training. Its Past-frames Feature Propagation module pulls low-dimensional features from sparsely labeled past frames to track instance evolution and resolve ambiguities. Combined with light frame-specific queries, the resulting SA-VIS model closes most of the performance gap to dense-annotation training. The approach yields large gains over its own baseline on standard benchmarks while cutting annotation cost dramatically.

Core claim

The Past-frames Feature Propagation module aggregates low-dimensional features across multiple past frames inside the image encoder, supplying temporal context that lets the model learn instance motion and identity from sparse video labels. When paired with frame-specific Instance Queries, this yields end-to-end training whose performance with labels on only one-fifth of the frames falls just 0.4 percent below the same architecture trained on fully dense annotations. Gains over the baseline are reported on YouTube-VIS 2019/2021/2022 and OVIS.

What carries the argument

Past-frames Feature Propagation (PFP) module, which aggregates low-dimensional features from the image encoder of multiple frames to supply temporal context for instance modeling.
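
To make the idea of reusing low-dimensional features from past frames concrete, here is a toy sketch in plain Python. It is not the authors' PFP module: the summary above does not describe the aggregation operator, so the chunk-averaging projection (`project`), the mean over a fixed window, and the concatenation in `step` are all assumptions chosen for illustration, and every name is hypothetical.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]


def project(feature: Vector, out_dim: int) -> Vector:
    """Hypothetical stand-in for a learned projection: average equal-sized chunks.

    Assumes len(feature) >= out_dim; trailing elements that do not fill a chunk
    are ignored.
    """
    chunk = len(feature) // out_dim
    return [sum(feature[i * chunk:(i + 1) * chunk]) / chunk for i in range(out_dim)]


class PastFramesBuffer:
    """Keeps low-dimensional features of the last `window` frames (assumed design)."""

    def __init__(self, window: int, low_dim: int) -> None:
        self.low_dim = low_dim
        self.history: Deque[Vector] = deque(maxlen=window)

    def context(self) -> Vector:
        """Mean of the stored past features; zeros when no past frame was seen."""
        if not self.history:
            return [0.0] * self.low_dim
        n = len(self.history)
        return [sum(f[d] for f in self.history) / n for d in range(self.low_dim)]

    def step(self, frame_feature: Vector) -> Vector:
        """Return the current feature concatenated with the past context,
        then store the current frame's low-dimensional feature."""
        ctx = self.context()
        self.history.append(project(frame_feature, self.low_dim))
        return frame_feature + ctx
```

The buffer only stores the projected features, which is the point of the sketch: the temporal context costs `window * low_dim` numbers per position rather than a full feature map per past frame.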

If this is right

SA-VIS raises accuracy over its baseline on YouTube-VIS 2019/2021/2022 and on OVIS.
In low-annotation regimes the method improves AP by more than 1 percent over prior state-of-the-art.
The same architecture trained on 20 percent of frames retains nearly all accuracy of its dense counterpart.
No additional temporal modeling components are required once the propagation module is added.
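
The "20 percent of frames" setting can be illustrated with a small helper that decides which frames keep their labels. This is a sketch under an assumption: the summary does not say how the annotated frames are chosen, so uniform stride sampling is used here, and the function names are hypothetical.

```python
from typing import List


def annotated_frame_indices(num_frames: int, stride: int = 5, offset: int = 0) -> List[int]:
    """Indices of frames that keep their labels under uniform stride sampling (assumed)."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(offset % stride, num_frames, stride))


def annotation_fraction(num_frames: int, stride: int = 5) -> float:
    """Fraction of frames that carry labels for a video of the given length."""
    if num_frames == 0:
        return 0.0
    return len(annotated_frame_indices(num_frames, stride)) / num_frames
```

With `stride=5` the fraction is exactly one-fifth only when the video length is a multiple of five; shorter or irregular videos end up slightly above that.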

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result implies that many video tasks may tolerate far sparser supervision once simple feature reuse is introduced.
Extending the propagation window to longer sequences could test whether the low-dimensional cue remains sufficient.
The same module might reduce labeling needs in related tasks such as video object tracking.

Load-bearing premise

Low-dimensional features from sparsely labeled past frames are enough to capture how instances move and stay distinct without dense labels or extra temporal networks.

What would settle it

Train SA-VIS on 1/5 annotations and measure AP on YouTube-VIS 2019; if the drop versus the dense version exceeds 2 percent, the core claim fails.
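
The test above reduces to a single comparison, sketched below. The function names are hypothetical, and the sketch assumes that "2 percent" means absolute AP points rather than a relative change, which the text does not specify. No measured values are included; the caller supplies them.

```python
def ap_drop(dense_ap: float, sparse_ap: float) -> float:
    """Absolute AP difference: dense-annotation training minus sparse-annotation training."""
    return dense_ap - sparse_ap


def core_claim_survives(dense_ap: float, sparse_ap: float, max_drop: float = 2.0) -> bool:
    """True unless the drop exceeds `max_drop` AP points (assumed to be absolute points)."""
    return ap_drop(dense_ap, sparse_ap) <= max_drop
```

A drop of exactly `max_drop` counts as surviving, matching the wording "exceeds 2 percent".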

Figures

Figures reproduced from arXiv: 2606.20140 by Ajad Chhatkuli, Edoardo Mello Rella, Ender Konukoglu, Luc Van Gool, Shipra Jain.

**Figure 1.** SA-VIS: we propose a method that includes past frames awareness and generates frame-specific instance queries for the task of video instance segmentation. Thanks to the addition of the shown modules, we propose a method that i) generates queries based on the objects actually visible in the image (FSI Queries), and ii) uses the past frames to build useful and lightweight contextual information (PFP). The pr…

**Figure 2.** Qualitative results: Visualization of SA-VIS on a set of challenging scenes.

read the original abstract

Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However,such a training setup of VIS is expensive in the sense of compute as well as dense annotations required. In order to solve these major flaws, we argue that the effective modeling of the instances and their evolution in videos do not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP) which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with a light-weight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS) and an over 1% improvement in AP on the state-of-the-art in a limited annotations scenario.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SA-VIS shows a simple past-frames feature propagation module closes most of the gap between sparse and dense training for video instance segmentation.

The main thing to know is that this paper reports training VIS models on only one-fifth the annotated frames while losing just 0.4% AP on standard benchmarks, using a lightweight Past-frames Feature Propagation module plus per-frame instance queries.

The new element is the PFP aggregation of low-dimensional encoder features from past frames, applied specifically to the sparse-annotation regime. It pairs that with frame-specific queries to avoid heavier temporal modeling. The work does well on the empirical side by showing gains over its own baseline on YouTube-VIS 2019/2021/2022 and OVIS, plus more than 1% AP over prior methods in the limited-annotation setting. The design stays deliberately simple, which matches the goal of cutting annotation cost without adding compute.

The soft spots are modest. The abstract gives no numbers on run variance or exact baseline configurations, so the 0.4% figure needs the full tables and ablations to be fully convincing. The assumption that propagated low-dim features suffice for instance evolution holds in their reported results, but it would be worth checking sensitivity to frame selection or longer videos. No sign of circular metrics or mismatched controls.

This is for VIS researchers focused on data efficiency and anyone building video datasets. A reader who cares about annotation budgets gets direct value from the numbers. It deserves peer review because the empirical result on sparse training is concrete and testable.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SA-VIS for online video instance segmentation, introducing a Past-frames Feature Propagation (PFP) module that aggregates low-dimensional features from the image encoder across multiple (sparsely annotated) frames, combined with lightweight frame-specific instance queries. The central empirical claim is that this simple design bridges the performance gap between sparse and dense annotation regimes, yielding only a 0.4% AP drop when training on annotations for 1/5 of the frames versus full dense supervision, while also outperforming baselines and prior SOTA by over 1% AP in the limited-annotation setting on YouTube-VIS 2019/2021/2022 and OVIS.

Significance. If the reported numbers hold after verification, the result would be significant for reducing the annotation and compute burden of VIS training. The approach demonstrates that low-dimensional feature aggregation plus per-frame queries can nearly close the dense-vs-sparse gap without extra temporal modules. Credit is given for evaluating on multiple standard benchmarks (YouTube-VIS 2019/2021/2022 and OVIS) and for the reproducible experimental setup implied by the benchmark comparisons.

major comments (2)

[Abstract] The central claim of a 'mere 0.4% drop' when using annotations for only 1/5 of the images is presented without error bars, number of runs, statistical significance tests, or explicit baseline definitions (dense vs. sparse SA-VIS), which is load-bearing for the assertion that the gap is effectively bridged.
[§4] Experiments: the reported gains over baselines and SOTA in the sparse regime lack details on ablation controls isolating PFP from the frame-specific queries, as well as variance across runs, undermining verification of the 0.4% claim and the 'simple design' sufficiency argument.

minor comments (2)

[Abstract] Missing space after the comma in 'However,such a training setup'.
[Abstract] 'an over 1% improvement' appears to be a typographical error and should read 'and over 1% improvement'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the empirical claims. We address each major comment below.

Referee: [Abstract] The central claim of a 'mere 0.4% drop' when using annotations for only 1/5 of the images is presented without error bars, number of runs, statistical significance tests, or explicit baseline definitions (dense vs. sparse SA-VIS), which is load-bearing for the assertion that the gap is effectively bridged.

Authors: We agree the abstract would benefit from explicit context. In revision we will define the dense vs. sparse SA-VIS comparison (same architecture and training protocol, differing only in annotation density) and report error bars plus number of runs. The 0.4% figure is the direct AP difference on YouTube-VIS between full dense supervision and 1/5-frame annotations for the identical SA-VIS model.

revision: yes

Referee: [§4] Experiments: the reported gains over baselines and SOTA in the sparse regime lack details on ablation controls isolating PFP from the frame-specific queries, as well as variance across runs, undermining verification of the 0.4% claim and the 'simple design' sufficiency argument.

Authors: We will add the requested ablation controls that isolate PFP from frame-specific queries and will report standard deviation across multiple runs in the revised §4. These additions will allow direct verification of the 0.4% gap and the contribution of each component.

revision: yes
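
The run-to-run reporting discussed in this exchange could look roughly like the sketch below, which uses only Python's standard `statistics` module. The helper names are hypothetical, the runs are assumed to be independent, and no AP values from the paper are included. The point is only to show how a small gap would be set against its spread.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(ap_runs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of AP over repeated training runs."""
    if len(ap_runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(ap_runs), stdev(ap_runs)


def gap_with_spread(dense_runs: Sequence[float],
                    sparse_runs: Sequence[float]) -> Tuple[float, float]:
    """Mean AP gap (dense minus sparse) and the standard deviation of the
    difference between one dense run and one sparse run, assuming independence."""
    dense_mean, dense_sd = summarize(dense_runs)
    sparse_mean, sparse_sd = summarize(sparse_runs)
    combined_sd = (dense_sd ** 2 + sparse_sd ** 2) ** 0.5
    return dense_mean - sparse_mean, combined_sd
```

If the returned gap is small relative to the combined spread, a single-run difference such as 0.4 AP cannot be separated from seed noise without more runs.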

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes an empirical architecture (PFP module for aggregating low-dimensional encoder features plus frame-specific queries) and reports benchmark results on YouTube-VIS and OVIS. No derivation chain, equations, or first-principles claims exist that reduce by construction to fitted inputs or self-citations. Performance numbers (e.g., 0.4% AP drop on 1/5 annotations) are external comparisons against baselines and prior SOTA; the central claim is falsifiable on held-out data and does not rely on self-referential definitions or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the design appears to reuse standard image-encoder features and instance-query mechanisms.

[1] [1]

In: Computer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16

Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y ., Shao, L.: Sipmask: Spatial information preservation for fast image and video instance segmentation. In: Computer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. pp. 1–18. Springer (2020)

2020

[2] [2]

In: European conference on computer vision

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)

2020

[3] [3]

ArXivabs/2112.10764(2021), https://api

Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., Schwing, A.G.: Mask2former for video instance segmentation. ArXivabs/2112.10764(2021), https://api. semanticscholar.org/CorpusID:245335013

work page arXiv 2021

[4] [4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

Cheng, B., Collins, M.D., Zhu, Y ., Liu, T., Huang, T.S., Adam, H., Chen, L.C.: Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

2020

[5] [5]

Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask trans- former for universal image segmentation (2022)

2022

[6] [6]

Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation (2021)

2021

[7] [7]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Choudhuri, A., Chowdhary, G., Schwing, A.G.: Context-aware relative object queries to unify video instance and panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6377–6386 (June 2023)

2023

[8] [8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Choudhuri, A., Chowdhary, G., Schwing, A.G.: Context-aware relative object queries to unify video instance and panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6377–6386 (2023)

2023

[9] [9]

IEEE Transactions on Circuits and Systems for Video Technology pp

Fang, H., Zhang, T., Zhou, X., Zhang, X.: Learning better video query with sam for video instance segmentation. IEEE Transactions on Circuits and Systems for Video Technology pp. 1–1 (2024). https://doi.org/10.1109/TCSVT.2024.3361076

work page doi:10.1109/tcsvt.2024.3361076 2024

[10] [10]

Fischer, T., Huang, T.E., Pang, J., Qiu, L., Chen, H., Darrell, T., Yu, F.: Qdtrack: Quasi-dense similarity learning for appearance-only multiple object tracking (2023)

2023

[11] [11]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Fu, Y ., Yang, L., Liu, D., Huang, T.S., Shi, H.: Compfeat: Comprehensive feature aggrega- tion for video instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1361–1369 (2021)

2021

[12] [12]

In: Proceedings of the IEEE International Conference on Computer Vision

Gadde, R., Jampani, V ., Gehler, P.V .: Semantic video cnns through representation warping. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4453–4462 (2017)

2017

[13] [13]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Han, S.H., Hwang, S., Oh, S.W., Park, Y ., Kim, H., Kim, M.J., Kim, S.J.: Visolo: Grid-based space-time aggregation for efficient online video instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2896–2905 (2022)

2022

[14] [14]

arXiv preprint arXiv:2305.17096 (2023)

Hannan, T., Koner, R., Bernhard, M., Shit, S., Menze, B., Tresp, V ., Schubert, M., Seidl, T.: Gratt-vis: Gated residual attention for auto rectifying video instance segmentation. arXiv preprint arXiv:2305.17096 (2023)

work page arXiv 2023

[15]

He, F., Zhang, H., Gao, N., Jia, J., Shan, Y., Zhao, X., Huang, K.: Inspro: Propagating instance query and proposal for online video instance segmentation. Advances in Neural Information Processing Systems 35, 19370–19383 (2022)

[16]

He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

[17]

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)

[18]

Heo, M., Hwang, S., Hyun, J., Kim, H., Oh, S.W., Lee, J.Y., Kim, S.J.: A generalized framework for video instance segmentation. In: CVPR (2023)

[19]

Heo, M., Hwang, S., Oh, S.W., Lee, J.Y., Kim, S.J.: Vita: Video instance segmentation via object token association. In: Advances in Neural Information Processing Systems (2022)

[20]

Huang, D.A., Yu, Z., Anandkumar, A.: Minvis: A minimal video instance segmentation framework without video-based training (2022)

[21]

Hwang, S., Heo, M., Oh, S.W., Kim, S.J.: Video instance segmentation using inter-frame communication transformers. Advances in Neural Information Processing Systems 34, 13352–13363 (2021)

[22]

Jiang, Z., Gu, Z., Peng, J., Zhou, H., Liu, L., Wang, Y., Tai, Y., Wang, C., Zhang, L.: Stc: Spatio-temporal contrastive learning for video instance segmentation. In: Computer Vision – ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV (2023)

[23]

Jin, X., Li, X., Xiao, H., Shen, X., Lin, Z., Yang, J., Chen, Y., Dong, J., Liu, L., Jie, Z., et al.: Video scene parsing with predictive feature learning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5580–5588 (2017)

[24]

Ke, L., Danelljan, M., Ding, H., Tai, Y.W., Tang, C.K., Yu, F.: Mask-free video instance segmentation. In: CVPR (2023)

[25]

Koner, R., Hannan, T., Shit, S., Sharifzadeh, S., Schubert, M., Seidl, T., Tresp, V.: Instanceformer: An online video instance segmentation framework. arXiv preprint arXiv:2208.10547 (2022)

[26]

Lin, C.C., Hung, Y., Feris, R., He, L.: Video instance segmentation tracking with a modified vae architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13147–13157 (2020)

[27]

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)

[28]

Liu, D., Cui, Y., Tan, W., Chen, Y.: Sg-net: Spatial granularity network for one-stage video instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9816–9825 (2021)

[29]

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

[30]

Nilsson, D., Sminchisescu, C.: Semantic video segmentation by gated recurrent flow propagation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6819–6828 (2018)

[31]

Porzi, L., Hofinger, M., Ruiz, I., Serrat, J., Bulo, S.R., Kontschieder, P.: Learning multi-object tracking and segmentation from automatic annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6846–6855 (2020)

[32]

Qi, J., Gao, Y., Hu, Y., Wang, X., Liu, X., Bai, X., Belongie, S., Yuille, A., Torr, P., Bai, S.: Occluded video instance segmentation: Dataset and ICCV 2021 challenge. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round

[33]

(2021), https://openreview.net/forum?id=IfzTefIU_3j

[34]

Tian, Z., Shen, C., Chen, H.: Conditional convolutions for instance segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. pp. 282–298. Springer (2020)

[35]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017), https://proceedings.neur...

[36]

Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.B.G., Geiger, A., Leibe, B.: Mots: Multi-object tracking and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

[37]

Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., Xia, H.: End-to-end video instance segmentation with transformers. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2021)

[38]

Wu, D., Wang, T., Zhang, Y., Zhang, X., Shen, J.: Onlinerefer: A simple online baseline for referring video object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2761–2770 (2023)

[39]

Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12352–12361 (2021)

[40]

Wu, J., Yarram, S., Liang, H., Lan, T., Yuan, J., Eledath, J., Medioni, G.: Efficient video instance segmentation via tracklet query and proposal. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 959–968 (2022)

[41]

Wu, J., Jiang, Y., Zhang, W., Bai, X., Bai, S.: Seqformer: a frustratingly simple model for video instance segmentation. arXiv preprint arXiv:2112.08275 (2021)

[42]

Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., Bai, X.: In defense of online models for video instance segmentation. In: ECCV (2022)

[43]

Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

[44]

Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., Liu, W.: Crossover learning for fast online video instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8043–8052 (October 2021)

[45]

Yang, S., Wang, X., Li, Y., Fang, Y., Fang, J., Liu, W., Zhao, X., Shan, Y.: Temporally efficient vision transformer for video instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2885–2895 (2022)

[46]

Yang, Z., Yang, Y.: Decoupling features in hierarchical propagation for video object segmentation. Advances in Neural Information Processing Systems 35, 36324–36336 (2022)

[47]

Ying, K., Zhong, Q., Mao, W., Wang, Z., Chen, H., Wu, L.Y., Liu, Y., Fan, C., Zhuge, Y., Shen, C.: CTVIS: Consistent Training for Online Video Instance Segmentation (2023)

[48]

Zhan, Z., McKee, D., Lazebnik, S.: Robust online video instance segmentation with track queries. arXiv preprint arXiv:2211.09108 (2022)

[49]

Zhang, T., Tian, X., Wu, Y., Ji, S., Wang, X., Zhang, Y., Wan, P.: Dvis: Decoupled video instance segmentation framework. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1282–1291 (October 2023)

[50]

Zhang, X., Han, G., He, W.: Unsupervised feature propagation for fast video object detection using generative adversarial networks. In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part I 26. pp. 617–627. Springer (2020)

[51]

Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=gZ9hCDWe6ke