Open-World Video Segmentation
Pith reviewed 2026-06-27 04:09 UTC · model grok-4.3
The pith
Savvy maintains stable object identities across long dynamic videos through hierarchical mask discovery, deferred admission, and track consolidation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance in zero-shot open-world long-horizon video segmentation. The paper also introduces OGA, whose Granularity-Agnostic matching protocol relaxes conventional 1:1 matching to n:1 mapping, detects support discontinuities through sever points, and scores each reference object via its dominant coherent fragment. This enables GA-adapted metrics including identity persistence and identity concentration. On VIPSeg the new protocol recovers performance suppressed by 1:1 scoring; on ScanNet and HM3D Savvy outperfor
What carries the argument
Savvy's three coordinated mechanisms (hierarchical mask discovery, deferred admission, track consolidation) together with OGA's Granularity-Agnostic matching protocol that uses sever points and dominant coherent fragments.
If this is right
- Standard 1:1 matching underestimates open-world methods on VIPSeg.
- GA evaluation recovers much of the performance suppressed by rigid matching.
- Savvy outperforms baselines across both classical metrics and the new IP/IC diagnostics on long-horizon data.
- The same mechanisms support zero-shot operation without closed-set assumptions.
Where Pith is reading between the lines
- The GA protocol could be applied to improve evaluation fairness in other open-set tasks such as instance segmentation or 3D scene understanding.
- If track consolidation proves robust, the approach may reduce reliance on post-processing heuristics in deployed video systems.
- Long-horizon identity maintenance could enable downstream tasks like persistent object querying in robotics without retraining.
Load-bearing premise
The three mechanisms can maintain stable long-range object identities in dynamic ego-motion videos without systematic errors or dataset-specific tuning.
What would settle it
A controlled experiment on ScanNet or HM3D sequences showing frequent identity switches, dropped objects, or systematic promotion of erroneous tracks when Savvy is applied would falsify the stability claim.
read the original abstract
While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos of dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and structural diagnostics: identity persistence (IP), and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks: ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that open-world video segmentation for long-horizon dynamic ego-motion videos is underexplored, with existing methods and 1:1 evaluation protocols inadequate. It introduces Savvy, which integrates hierarchical mask discovery, deferred admission, and track consolidation for persistent discovery and stable long-range identity maintenance, along with OGA, a granularity-aware evaluation suite based on Granularity-Agnostic (GA) n:1 matching that detects support discontinuities via sever points and scores via dominant coherent fragments. This enables adapted metrics including identity persistence (IP) and identity concentration (IC). The paper reports that 1:1 matching underestimates open-world methods on VIPSeg while GA recovers performance, and that Savvy outperforms strong baselines on ScanNet and HM3D across STQ, VPQ∞, IP, and IC.
Significance. If the empirical robustness holds, the work would establish a practical baseline and more appropriate evaluation protocol for an underexplored setting. The GA matching and new structural diagnostics (IP, IC) address a genuine mismatch between open-world predictions and rigid 1:1 protocols, potentially influencing future benchmarks. The combination of existing techniques into a deployable system for long-horizon ego-motion is a concrete contribution if the identity stability claim is substantiated.
major comments (2)
- [Abstract] Abstract: the headline claim of consistent outperformance on ScanNet and HM3D across STQ, VPQ∞, IP, and IC rests on the assertion that hierarchical mask discovery + deferred admission + track consolidation together produce stable long-range identities without systematic fragmentation or swaps under ego-motion. No ablation, failure-mode analysis, or quantitative evidence against mask drift or scale-induced errors is referenced, making this load-bearing assumption unverifiable from the provided description.
- [Abstract] Abstract: the statement that GA evaluation 'recovers much of their suppressed performance' on VIPSeg is presented as a key result demonstrating the protocol's value, yet no quantitative deltas, per-method scores, or comparison tables are supplied to allow assessment of effect size or whether the recovery is uniform across baselines.
Simulated Author's Rebuttal
We thank the referee for highlighting areas where the abstract could better reference supporting evidence. We address each comment below and will revise the abstract accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of consistent outperformance on ScanNet and HM3D across STQ, VPQ∞, IP, and IC rests on the assertion that hierarchical mask discovery + deferred admission + track consolidation together produce stable long-range identities without systematic fragmentation or swaps under ego-motion. No ablation, failure-mode analysis, or quantitative evidence against mask drift or scale-induced errors is referenced, making this load-bearing assumption unverifiable from the provided description.
Authors: We agree the abstract would benefit from explicit pointers to the supporting analyses. The manuscript includes ablations in Section 4.3 quantifying each component's contribution to identity stability, plus failure-mode analysis in Section 5 addressing mask drift and scale errors under ego-motion. We will revise the abstract to cite these sections and results. revision: yes
-
Referee: [Abstract] Abstract: the statement that GA evaluation 'recovers much of their suppressed performance' on VIPSeg is presented as a key result demonstrating the protocol's value, yet no quantitative deltas, per-method scores, or comparison tables are supplied to allow assessment of effect size or whether the recovery is uniform across baselines.
Authors: The quantitative deltas, per-method scores, and tables for VIPSeg under 1:1 vs. GA matching appear in Section 4.1 and Table 1. We will update the abstract to include specific effect sizes (e.g., average recovery percentages) and reference the table. revision: yes
Circularity Check
No circularity: empirical system description with no derivation chain or fitted predictions
full rationale
The paper introduces Savvy as a composition of hierarchical mask discovery, deferred admission, and track consolidation for open-world video segmentation, evaluated empirically on benchmarks like ScanNet and HM3D using proposed OGA metrics. No equations, first-principles derivations, parameter fitting to subsets followed by 'predictions,' or self-citation chains appear in the provided text. Claims rest on benchmark outperformance rather than any reduction of outputs to inputs by construction. This is a standard engineering/integration paper with independent empirical content.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Mask2former for video instance segmentation
Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, and Alexander G Schwing. Mask2former for video instance segmentation. arXiv preprint arXiv:2112. 10764, 2021. 2
2021
-
[2]
Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model
Ho Kei Cheng and Alexander G Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In European conference on computer vision , pages 640–658. Springer, 2022. 1, 3
2022
-
[3]
Tracking anything with decoupled video segmentation
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 1316–1326, 2023. 1, 3, 7, 8
2023
-
[4]
Putting the object back into video object segmentation
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3151– 3161, 2024. 2, 3
2024
-
[5]
Scannet: Richly- annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 1, 2, 7
2017
-
[6]
The epic-kitchens dataset: Collection, challenges and baselines
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. The epic-kitchens dataset: Collection, challenges and baselines. arXiv preprint arXiv:2005.00343, 2020. 63
arXiv 2005
-
[7]
Solving the maximum-weight connected subgraph problem to optimality
Mohammed El-Kebir and Gunnar W Klau. Solving the maximum-weight connected subgraph problem to optimality. arXiv preprint arXiv:1409.5308, 2014. 30
Pith/arXiv arXiv 2014
-
[8]
Approximation algorithms for finding highly connected subgraphs
Samir Khuller. Approximation algorithms for finding highly connected subgraphs. 1998. 30
1998
-
[9]
Video panoptic segmentation
Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Video panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 9859–9868, 2020. 2, 3
2020
-
[10]
Segment anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023. 1, 3
2023
-
[11]
Video k-net: A simple, strong, and unified baseline for video segmentation
Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng, Yunhai Tong, and Chen Change Loy. Video k-net: A simple, strong, and unified baseline for video segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18847–18857, 2022. 2
2022
-
[12]
Large-scale video panoptic segmentation in the wild: A benchmark
Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 21033–21043, 2022. 1, 2, 3, 7, 8
2022
-
[13]
Video object segmentation using space-time memory networks
Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF international conference on computer vision , pages 9226–9235, 2019. 2, 3
2019
-
[14]
Dinov2: Learning robust visual features without supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 3
Pith/arXiv arXiv 2023
-
[15]
Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d envi- ronments for embodied ai
Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d envi- ronments for embodied ai. arXiv preprint arXiv:2109.08238, 2021. 2, 7
Pith/arXiv arXiv 2021
-
[16]
Sam 2: Segment anything in images and videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 1, 3, 7, 8
Pith/arXiv arXiv 2024
-
[17]
Step: Segmenting and tracking every pixel
Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, et al. Step: Segmenting and tracking every pixel. arXiv preprint arXiv:2102. 11859, 2021. 2, 3
2021
-
[18]
Associating objects with transformers for video object segmentation
Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object segmentation. Advances in Neural Information Processing Systems, 34:2491–2502, 2021. 2, 3
2021
-
[19]
Entitysam: Segment everything in video
Mingqiao Ye, Seoung Wug Oh, Lei Ke, and Joon-Young Lee. Entitysam: Segment everything in video. In Proceedings of the Computer Vision and Pattern Recognition Conference , pages 24234–24243, 2025. 1, 2, 3, 7, 8 11
2025
-
[20]
cooldown
Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. Advances in Neural Information Processing Systems , 36:32215–32234, 2023. 2 12 Appendix Contents A Details of Savvy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A. 1 Pi...
2023
-
[21]
13 6.26 39.20 19.28 82.81 71
10 54. 13 6.26 39.20 19.28 82.81 71. 11 51.25 60.22 30.97
-
[22]
11 16.42 78.23 69.88 47.41 59.22 30.22 0.20 49
15 51.81 4.26 35. 11 16.42 78.23 69.88 47.41 59.22 30.22 0.20 49. 16 2.94 30.99 13.73 73.26 68.50 43.54 58.48 29.91 Flicker 0.025 58.53 2.72 36. 10 14.54 91.59 74.65 56.58 75.22 16.78 0.050 58.53 0.71 28.54 9. 13 91.59 75. 16 55.20 78.28 12. 15 0.075 58.53 0.38 23.74 6.77 91.59 75.42 53.79 79.39 10.28
-
[23]
11 44.94 22.54 91.59 73.84 44.44 68.93 21.84 2 58.53 7.03 40.29 17.97 91.59 74.33 38.77 72.08 18.57 3 58.53 5.09 37.36 15.53 91.59 74.59 35.47 74.31 16.81 4 58.53 3
100 58.53 0.05 18.94 4.05 91.59 75.68 52.37 80.91 8.41 Sever 1 58.53 13. 11 44.94 22.54 91.59 73.84 44.44 68.93 21.84 2 58.53 7.03 40.29 17.97 91.59 74.33 38.77 72.08 18.57 3 58.53 5.09 37.36 15.53 91.59 74.59 35.47 74.31 16.81 4 58.53 3. 15 34.43 11.29 91.59 74.72 34.54 76.62 16.56 Void 1 58.64 31.57 57.57 35.94 95.25 72.21 59.89 61.07 31.51 2 58.46 31.5...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.