arxiv: 2606.07394 · v1 · pith:AEIZA667new · submitted 2026-06-05 · 💻 cs.CV

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Pith reviewed 2026-06-27 22:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords video instance segmentation · tracking instability · performance diagnosis · integer linear programming · online methods · occlusion · temporal association

The pith

Tracking instability creates gaps exceeding 20 AP for online video instance segmentation under occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a diagnostic framework that uses an integer linear program to isolate the separate effects of classification, segmentation, and tracking errors on overall performance in video instance segmentation. When applied to seven methods across standard benchmarks and occlusion-heavy subsets, the analysis identifies tracking instability as the dominant source of loss for online approaches, with the gaps widening in longer videos and denser scenes. Stronger backbones raise baseline scores but leave the tracking gaps largely unchanged, showing the limitation lies in temporal association rather than feature representation. This separation matters because it shows where algorithmic effort should focus to close the remaining performance shortfalls.

Core claim

Formulating identity and class assignment as an integer linear program produces a model-agnostic oracle that decomposes performance loss hierarchically by error source. The resulting measurements on online and offline VIS methods demonstrate that tracking instability is the primary bottleneck, producing gaps larger than 20 AP under heavy occlusion that increase sharply with video length and instance density, while classification contributes less once tracking has already failed.
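
The hierarchical decomposition can be read as a chain of AP differences. The following is a minimal sketch, assuming the oracle corrections are applied cumulatively in a fixed order; the function name `decompose_gaps`, the stage order, and all AP values are hypothetical illustrations, not results or code from the paper.

```python
from typing import Dict, List, Tuple


def decompose_gaps(default_ap: float, oracle_stages: List[Tuple[str, float]]) -> Dict[str, float]:
    """Attribute AP gains to each oracle stage, applied cumulatively in order."""
    gaps: Dict[str, float] = {}
    previous = default_ap
    for name, ap in oracle_stages:
        # The gap for a stage is the AP gained over the previous stage.
        gaps[name] = round(ap - previous, 2)
        previous = ap
    return gaps


# Hypothetical numbers for illustration only (not results from the paper).
stages = [("classification", 50.0), ("tracking", 72.5)]
gaps = decompose_gaps(default_ap=45.0, oracle_stages=stages)
print(gaps)
```

Because each gap is measured relative to the previous stage, the attribution depends on the order in which errors are corrected, which is one reason the order needs to be stated explicitly.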

What carries the argument

The integer linear program oracle that isolates each error source in identity and class assignment without reference to any particular model.
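
For orientation, a generic one-to-one assignment ILP has the shape below. This is a textbook form offered as an assumption about the structure, not the paper's actual objective or constraints. The symbols $x_{ij}$, $s_{ij}$, $P$, and $G$ are introduced here for illustration, with $s_{ij}$ a match score (for example, mask IoU) between predicted track $i \in P$ and ground-truth instance $j \in G$.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i \in P} \sum_{j \in G} s_{ij}\, x_{ij} \\
\text{s.t.}\quad & \sum_{j \in G} x_{ij} \le 1 \quad \forall i \in P, \\
& \sum_{i \in P} x_{ij} \le 1 \quad \forall j \in G, \\
& x_{ij} \in \{0, 1\}.
\end{aligned}
```

The binary variable $x_{ij}$ selects whether prediction $i$ is assigned to ground-truth $j$, and the two inequality families enforce that each side is used at most once.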

If this is right

  • Fixing temporal association would produce the largest AP gains for online VIS methods.
  • Semantic classification improvements yield diminishing returns on benchmarks where tracking already fails.
  • Replacing the backbone leaves the size of the tracking gaps essentially unchanged.
  • The magnitude of tracking gaps grows with sequence length and instance density.
  • Offline methods exhibit smaller tracking gaps than online ones on the same data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ILP decomposition could be applied to other video tasks that combine detection and association to locate their dominant failure modes.
  • Models that already achieve high per-frame accuracy may still need explicit long-horizon association modules to realize those gains in full video metrics.
  • TrackLens visualizations could be used during training to surface specific failure queries for targeted data collection.

Load-bearing premise

The integer linear program produces an oracle that separates the error sources without introducing its own assignment biases or artifacts that change the measured gaps.

What would settle it

Re-running the oracle after manually correcting all tracking assignments on a held-out set of videos and observing whether the reported tracking gaps drop to near zero while other gaps remain.

Figures

Figures reproduced from arXiv: 2606.07394 by Danial Hamdi, Fardin Ayar, Mahdi Javanmardi.

Figure 1: Overview of our hierarchical error decomposition framework. The figure illustrates the stan…
Figure 2: Overview of TrackLens. Rows represent query tracks across time, columns show video frames.
Figure 3: Tracking and classification gaps across backbone scales on YouTube-VIS 2021. For each method, …
Figure 4: AP tracking gap by instance density and video duration on the OVIS diagnostic split (online …
Original abstract

In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces an ILP-based diagnostic oracle to hierarchically decompose performance loss in video instance segmentation into classification, segmentation, and tracking components. Applied across seven methods (online and offline) on YouTube-VIS 2019/2021 and a diagnostic OVIS subset, the analysis concludes that tracking instability is the dominant bottleneck for online methods (gaps >20 AP under heavy occlusion, increasing with video length and instance density), that semantic classification impact is secondary where tracking fails, and that stronger backbones do not close the tracking gap, indicating an algorithmic rather than representational issue. TrackLens is presented as a complementary visualization tool.

Significance. If the ILP oracle is shown to isolate error sources without attribution bias, the framework would offer a useful model-agnostic diagnostic for VIS research, clarifying why temporal association remains the core challenge and providing concrete targets for improvement. The multi-method, multi-dataset consistency and the addition of TrackLens strengthen its potential utility as a tool for the community.

major comments (1)
  1. [ILP formulation and oracle validation (likely §3)] The central claim that tracking gaps exceed 20 AP and grow with length/density/occlusion rests on the ILP oracle correctly attributing errors without its own biases (e.g., in identity-switch costs or occlusion handling). The manuscript must include the full ILP objective, all constraints, and validation experiments (such as sensitivity to cost parameters or comparison against alternative oracles) demonstrating that measured gaps are not inflated by the diagnostic formulation itself, particularly in the high-occlusion regimes highlighted in the results.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the ILP oracle's formulation and validation. We agree that full transparency on the diagnostic is necessary to substantiate the tracking gap claims, particularly under occlusion. We address the single major comment below and will revise the manuscript accordingly.

  1. Referee: The central claim that tracking gaps exceed 20 AP and grow with length/density/occlusion rests on the ILP oracle correctly attributing errors without its own biases (e.g., in identity-switch costs or occlusion handling). The manuscript must include the full ILP objective, all constraints, and validation experiments (such as sensitivity to cost parameters or comparison against alternative oracles) demonstrating that measured gaps are not inflated by the diagnostic formulation itself, particularly in the high-occlusion regimes highlighted in the results.

    Authors: We agree that the full ILP formulation and validation are required for the claims to be credible. In the revised version we will add the complete objective function (including all terms for classification, segmentation, and tracking costs) together with the full set of constraints to Section 3. We have performed sensitivity analyses on the identity-switch and occlusion-handling cost parameters; the resulting tracking gaps vary by less than 2 AP across the tested range and remain above 18 AP in the high-occlusion OVIS subset. We will also include a comparison against a greedy bipartite-matching oracle, which produces qualitatively identical gap rankings and magnitudes. These additions will be placed in a new subsection of §3 and an appendix, directly addressing potential attribution bias in the regimes highlighted in the results.

    revision: yes
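
To make the greedy-versus-optimal comparison concrete, the sketch below contrasts a greedy bipartite matcher with an exhaustive search over one-to-one assignments. It is an illustration under stated assumptions: the function names `greedy_match` and `optimal_match` and the score matrix are hypothetical, and exhaustive search stands in for an ILP solver on a toy-sized problem.

```python
from itertools import permutations
from typing import List, Tuple

Matrix = List[List[float]]


def greedy_match(scores: Matrix) -> Tuple[float, List[Tuple[int, int]]]:
    """Repeatedly take the highest-scoring unused (prediction, ground-truth) pair."""
    pairs = sorted(
        ((scores[i][j], i, j) for i in range(len(scores)) for j in range(len(scores[0]))),
        reverse=True,
    )
    used_rows, used_cols, chosen, total = set(), set(), [], 0.0
    for score, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            chosen.append((i, j))
            total += score
    return total, sorted(chosen)


def optimal_match(scores: Matrix) -> Tuple[float, List[Tuple[int, int]]]:
    """Exhaustive search over one-to-one assignments (square toy matrices only)."""
    n = len(scores)
    best_total, best_perm = max(
        (sum(scores[i][p[i]] for i in range(n)), p) for p in permutations(range(n))
    )
    return best_total, [(i, best_perm[i]) for i in range(n)]


# Hypothetical IoU-like scores: rows are predicted tracks, columns are ground-truth instances.
toy_scores = [
    [0.9, 0.8, 0.0],
    [0.8, 0.1, 0.0],
    [0.0, 0.0, 0.5],
]
greedy_total, greedy_pairs = greedy_match(toy_scores)
optimal_total, optimal_pairs = optimal_match(toy_scores)
print(greedy_pairs, round(greedy_total, 2))
print(optimal_pairs, round(optimal_total, 2))
```

On this contrived matrix the greedy matcher commits to the single best pair and ends with a lower total than the exhaustive optimum. A comparison on real data would show whether discrepancies of this kind are large enough to change the measured gaps.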

Circularity Check

0 steps flagged

No significant circularity; the ILP oracle is an independent diagnostic

The paper formulates an ILP as a model-agnostic oracle to hierarchically isolate classification, segmentation, and tracking errors, then applies it to measure gaps on YouTube-VIS and OVIS for seven existing methods. No equations or steps reduce the reported tracking gaps (>20 AP under occlusion, scaling with length/density) to quantities defined by the paper's own fitted parameters or self-citations. The oracle is presented as an external decomposition tool rather than a self-referential fit, and no load-bearing uniqueness theorems or ansatzes from prior author work are invoked. The derivation chain remains self-contained against the external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that an ILP can serve as an unbiased oracle for error decomposition; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Integer Linear Programming can be solved to optimality to produce a model-agnostic assignment oracle for identity and class labels.
    The diagnostic framework is built directly on this premise.
