Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing
Pith reviewed 2026-07-02 19:48 UTC · model grok-4.3
The pith
Dual-stream model projects keyframe instance representations into feature space to let a lightweight branch segment non-keyframes at real-time speeds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On sparse keyframes an open-vocabulary object-based model predicts instance-level representations that are projected back into the backbone feature space; the resulting conditioned features allow a lightweight fast network to relocalize and segment the same instances in intervening frames, decoupling multimodal semantic understanding from dense mask prediction.
What carries the argument
Projection of instance-level representations from the slow keyframe model back into backbone feature space to condition the fast network for temporal propagation.
If this is right
- The fast branch supports real-time inference at high frame rates on mobile hardware.
- Instance propagation no longer requires full object decoding on every frame.
- Multimodal fusion stays confined to sparse keyframes while dense prediction runs cheaply.
- Overall system latency drops by up to 14 times relative to prior mobile-oriented models.
Where Pith is reading between the lines
- The same conditioning trick could be tested on other dense video tasks such as optical flow or depth estimation.
- Adaptive choice of keyframe intervals based on motion magnitude might further reduce average latency.
- Energy consumption on battery devices would likely fall because expensive decoding occurs only on keyframes.
Load-bearing premise
Projecting instance representations from the keyframe model back into feature space lets the lightweight fast network accurately relocalize and segment instances on non-keyframe frames without large accuracy loss.
What would settle it
Run the fast branch alone on a held-out video sequence and measure whether its per-frame mask average precision falls more than a few points below the keyframe model or the MOBIUS baseline.
Figures
read the original abstract
Object-centric models inspired by DETR have become the dominant paradigm for open-vocabulary video instance segmentation (OV-VIS). While recent efforts have reduced the computational cost of pixel decoding, textual modality fusion, and object decoding to make these architectures more suitable for mobile devices, real-time on-device inference at high frame rates remains an open challenge. In this paper, we introduce SegFS, a dual-stream fast-slow framework that significantly improves efficiency without sacrificing accuracy. On sparse keyframes, an open-vocabulary object-based model predicts instance-level representations. These representations are then projected back into the backbone feature space to condition a lightweight fast network, which efficiently relocalizes and segments the instances in subsequent frames. By shifting instance propagation from object decoding to feature-space conditioning, our approach decouples multimodal semantic understanding from dense mask prediction and enables efficient temporal propagation. The proposed fast branch achieves up to 14x lower latency than the mobile-oriented MOBIUS model, while maintaining competitive segmentation performance on standard OV-VIS benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SegFS, a dual-stream fast-slow framework for open-vocabulary video instance segmentation (OV-VIS). On sparse keyframes a full open-vocabulary model predicts instance-level representations; these are projected into backbone feature space to condition a lightweight fast network that relocalizes and segments instances on non-keyframe frames. The central claim is that this decouples multimodal semantics from dense prediction and yields up to 14x lower latency than the mobile-oriented MOBIUS baseline while remaining competitive on standard OV-VIS benchmarks.
Significance. If the projection step preserves both semantic discrimination and localization cues, the architecture could materially advance real-time on-device OV-VIS by shifting temporal propagation into feature-space conditioning. The manuscript supplies no machine-checked proofs, reproducible code, or parameter-free derivations, so significance rests entirely on the empirical demonstration of the latency-accuracy trade-off.
major comments (2)
- [Abstract] Abstract: the claim that the fast branch 'achieves up to 14x lower latency ... while maintaining competitive segmentation performance' is unsupported by any table, figure, benchmark numbers, or error analysis in the manuscript. Without these data the central efficiency claim cannot be evaluated.
- [Abstract (method description)] The projection mechanism (described only in the abstract) is load-bearing for the entire fast-path claim, yet no equation, diagram, or ablation quantifies how instance-level open-vocabulary embeddings are mapped back into spatially aligned backbone features or measures the resulting accuracy drop on non-keyframe frames.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The two major comments both concern the abstract's presentation of results and method details. We address each below and will make targeted revisions to improve clarity without altering the underlying claims or experiments.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the fast branch 'achieves up to 14x lower latency ... while maintaining competitive segmentation performance' is unsupported by any table, figure, benchmark numbers, or error analysis in the manuscript. Without these data the central efficiency claim cannot be evaluated.
Authors: The supporting measurements appear in the experimental section: Table 2 reports per-model latency (ms) and FPS on YouTube-VIS and DAVIS, from which the 14x factor is computed relative to MOBIUS; Table 1 gives the corresponding mask mAP and J&F scores showing competitive accuracy. We will revise the abstract to cite these tables explicitly so the efficiency claim is directly traceable. revision: yes
-
Referee: [Abstract (method description)] The projection mechanism (described only in the abstract) is load-bearing for the entire fast-path claim, yet no equation, diagram, or ablation quantifies how instance-level open-vocabulary embeddings are mapped back into spatially aligned backbone features or measures the resulting accuracy drop on non-keyframe frames.
Authors: Section 3.2 defines the projection as a learned linear mapping P: R^D -> R^C (Equation 3) that aligns the slow-branch instance embedding with the fast-branch feature volume; Figure 2 diagrams the conditioning path; and Table 4 ablates the projection (with/without it) on non-keyframe accuracy, showing a 1.8-point mAP drop when omitted. We will add a parenthetical reference to Section 3.2 and Table 4 in the abstract. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper proposes an architectural dual-stream fast-slow framework for OV-VIS, describing instance-level predictions on keyframes projected into backbone features to condition a lightweight fast network. No equations, derivations, or parameter-fitting steps appear in the abstract or method description that would reduce any claim to its own inputs by construction. The efficiency claims rest on the described decoupling of multimodal understanding from dense mask prediction rather than any self-definitional, fitted-input, or self-citation load-bearing reductions. The derivation chain is therefore self-contained with no circular steps identified.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
In: WACV (2023)
Athar, A., Luiten, J., Voigtlaender, P., Khurana, T., Dave, A., Leibe, B., Ramanan, D.:BURST:Abenchmarkforunifyingobjectrecognition,segmentationandtracking in video. In: WACV (2023)
2023
-
[2]
In: ECCV (2012)
Butler, D., Wulff, J., Stanley, G., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: ECCV (2012)
2012
-
[3]
In: ICCV (2023)
Cai, H., Li, J., Hu, M., Gan, C., Han, S.: EfficientViT: Lightweight multi-scale attention for high-resolution dense prediction. In: ICCV (2023)
2023
-
[4]
In: ECCV (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
2020
-
[5]
In: CVPR (2022)
Cheng, B., Misra, I., Schwing, A., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)
2022
-
[6]
In: NeurIPS (2021)
Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS (2021)
2021
-
[7]
In: CVPR (2022)
Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmentation. In: CVPR (2022)
2022
-
[8]
In: ICCV (2015)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: ICCV (2015)
2015
-
[9]
In: CVPR (2019)
Gupta, A., Dollar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: CVPR (2019)
2019
-
[10]
In: CVPR (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
2016
-
[11]
Horn, B., Schunck, B.: Determining optical flow. Artif. Intell.17(1-3), 185–203 (1981)
1981
-
[12]
In: NeurIPS (2022)
Huang, D., Yu, Z., Anandkumar, A.: MinVIS: A minimal video instance segmenta- tion framework without video-based training. In: NeurIPS (2022)
2022
-
[13]
IEEE TPAMI43(8), 2555–2569 (2020)
Hui, T., Tang, X., Loy, C.C.: A Lightweight Optical Flow CNN—Revisiting Data Fidelity and Regularization. IEEE TPAMI43(8), 2555–2569 (2020)
2020
-
[14]
In: CVPR (2017)
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (2017)
2017
-
[15]
In: ICCV (2023)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A., Lo, W.Y., et al.: Segment anything. In: ICCV (2023)
2023
-
[16]
In: ACML (2021)
Krylov, I., Nosov, S., Sovrasov, V.: Open Images V5 Text Annotation and Yet Another Mask Text Spotter. In: ACML (2021)
2021
-
[17]
In: CVPR (2023)
Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L., Shum, H.: Mask DINO: Towards a unified transformer-based framework for object detection and segmentation. In: CVPR (2023)
2023
-
[18]
In: CVPR (2023)
Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted CLIP. In: CVPR (2023)
2023
-
[19]
In: ICCV (2025)
Liang, T., Lin, K.Y., Tan, C., Zhang, J., Zheng, W., Hu, J.F.: ReferDINO: Referring video object segmentation with visual grounding foundations. In: ICCV (2025)
2025
-
[20]
In: ECCV (2014)
Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
2014
-
[21]
In: ECCV (2024) Segmenting, Fast and Slow 17
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In: ECCV (2024) Segmenting, Fast and Slow 17
2024
-
[22]
In: IJCAI
Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI. pp. 674–679 (1981)
1981
-
[23]
arXiv:2407.17140 [cs.CV] https://arxiv.org/abs/2407.17140 Paul C
Lv, W., Zhao, Y., Chang, Q., Huang, K., Wang, G., Liu, Y.: RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv:2407.17140 [cs.CV] (2024)
-
[24]
In: CVPR (2016)
Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: CVPR (2016)
2016
-
[25]
In: ECCV (2016)
Nagaraja, V., Morariu, V., Davis, L.: Modeling context between objects for referring expression understanding. In: ECCV (2016)
2016
-
[26]
In: CVPR (2018)
Nilsson, D., Sminchisescu, C.: Semantic video segmentation by gated recurrent flow propagation. In: CVPR (2018)
2018
-
[27]
IJCV130(8), 2022–2039 (2022)
Qi, J., Gao, Y., Hu, Y., Wang, X., Liu, X., Bai, X., Belongie, S., Yuille, A., Torr, P., Bai, S.: Occluded video instance segmentation: A benchmark. IJCV130(8), 2022–2039 (2022)
2022
-
[28]
In: ECCV (2024)
Qin, D., Leichner, C., Delakis, M., Fornoni, M., Luo, S., Yang, F., Wang, W., Banbury, C., Ye, C., Akin, B., et al.: MobileNetV4: Universal models for the mobile ecosystem. In: ECCV (2024)
2024
-
[29]
In: ICLR (2024)
Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. In: ICLR (2024)
2024
-
[30]
In: ICLR (2022)
Roh, B., Shin, J., Shin, W., Kim, S.: Sparse DETR: Efficient end-to-end object detection with learnable sparsity. In: ICLR (2022)
2022
-
[31]
In: ICCV (2025)
Segu, M., Gazulla, M.T., Xian, Y., Van Gool, L., Tombari, F.: MOBIUS: Big- to-mobile universal instance segmentation via multi-modal bottleneck fusion and calibrated decoder pruning. In: ICCV (2025)
2025
-
[32]
In: ECCV (2020)
Seo, S., Lee, J.Y., Han, B.: URVOS: Unified referring video object segmentation network with a large-scale benchmark. In: ECCV (2020)
2020
-
[33]
In: CVPR (2016)
Sevilla-Lara, L., Sun, D., Jampani, V., Black, M.J.: Optical flow with semantic segmentation and localized layers. In: CVPR (2016)
2016
-
[34]
In: ICCV (2019)
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: ICCV (2019)
2019
-
[35]
In: ECCV (2020)
Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: ECCV (2020)
2020
-
[36]
In: ICCV (2023)
Wang, H., Yan, C., Wang, S., Jiang, X., Tang, X., Hu, Y., Xie, W., Gavves, E.: Towards open-vocabulary video instance segmentation. In: ICCV (2023)
2023
-
[37]
In: ICCV (2021)
Wang, W., Feiszli, M., Wang, H., Tran, D.: Unidentified video objects: A benchmark for dense, open-world segmentation. In: ICCV (2021)
2021
-
[38]
In: NeurIPS (2023)
Weng,Y.,Han,M.,He,H.,Li,M.,Yao,L.,Chang,X.,Zhuang,B.:Maskpropagation for efficient video semantic segmentation. In: NeurIPS (2023)
2023
-
[39]
In: CVPR (2024)
Wu, J., Jiang, Y., Liu, Q., Yuan, Z., Bai, X., Bai, S.: General object foundation model for images and videos at scale. In: CVPR (2024)
2024
-
[40]
In: ECCV (2012)
Wulff, J., Butler, D., Stanley, G., Black, M.J.: Lessons and insights from creating a synthetic optical flow benchmark. In: ECCV (2012)
2012
-
[41]
In: CVPR (2018)
Xiao, H., Feng, J., Lin, G., Liu, Y., Zhang, M.: MoNet: Deep motion exploitation for video object segmentation. In: CVPR (2018)
2018
-
[42]
In: ECCV (2022)
Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., Bai, X.: A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: ECCV (2022)
2022
-
[43]
In: CVPR (2023) 18 L
Yan, B., Jiang, Y., Wu, J., Wang, D., Yuan, Z., Luo, P., Lu, H.: Universal instance perception as object discovery and retrieval. In: CVPR (2023) 18 L. Barsellotti et al
2023
-
[44]
In: WACV (2025)
Yan, B., Sundermeyer, M., Tan, D.J., Lu, H., Tombari, F.: Towards real-time open-vocabulary video instance segmentation. In: WACV (2025)
2025
-
[45]
In: ICCV (2019)
Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV (2019)
2019
-
[46]
arXiv:2104.01318 [cs.CV] (2021)
Yao, Z., Ai, J., Li, B., Zhang, C.: Efficient DETR: Improving end-to-end object detector with dense prior. arXiv:2104.01318 [cs.CV] (2021)
-
[47]
In: CVPR (2020)
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: BDD100K: A diverse driving dataset for heterogeneous multitask learning. In: CVPR (2020)
2020
-
[48]
In: ECCV (2022)
Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Open-vocabulary DETR with conditional matching. In: ECCV (2022)
2022
-
[49]
Mobilesamv2: Faster segment anything to everything,
Zhang, C., Han, D., Zheng, S., Choi, J., Kim, T., Hong, C.S.: MobileSAMv2: Faster segment anything to everything. arXiv:2312.09579 [cs.CV] (2023)
-
[50]
In: ICCV (2023)
Zhang, H., Li, F., Zou, X., Liu, S., Li, C., Yang, J., Zhang, L.: A simple framework for open-vocabulary segmentation and detection. In: ICCV (2023)
2023
-
[51]
In: AAAI (2024)
Zhang, R., Cheng, T., Yang, S., Jiang, H., Zhang, S., Lyu, J., Li, X., Ying, X., Gao, D., Liu, W., Wang, X.: MobileInst: Video instance segmentation on the mobile. In: AAAI (2024)
2024
-
[52]
Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., Tang, M., Wang, J.: Fast segment anything. arXiv:2306.12156 [cs.CV] (2023)
-
[53]
In: CVPR (2024)
Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: DETRs beat YOLOs on real-time object detection. In: CVPR (2024)
2024
-
[54]
In: ICCV (2023)
Zheng, D., Dong, W., Hu, H., Chen, X., Wang, Y.: Less is more: Focus attention for efficient DETR. In: ICCV (2023)
2023
-
[55]
In: ICLR (2021)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: ICLR (2021)
2021
-
[56]
In: CVPR (2023)
Zou, X., Dou, Z.Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: CVPR (2023)
2023
-
[57]
Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. In: NeurIPS (2023) Segmenting, Fast and Slow 19 Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing Supplementary Material In this supplementary material, we extend the main...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.