
arxiv: 2605.30352 · v1 · pith:OGH5VR6Dnew · submitted 2026-05-28 · 💻 cs.CV

GMOS: Grounding Moving Object Segmentation in 3D Space and Time

Pith reviewed 2026-06-29 08:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords moving object segmentation · 3D grounding · RGB video · temporal motion annotations · video object segmentation · online inference · unsupervised VOS

The pith

GMOS produces 3D-aware, temporally fine-grained segmentation of moving objects directly from RGB video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current moving object segmentation methods depend on pre-computed 2D cues such as optical flow, which carry no 3D geometry, and treat motion as a sequence-level attribute. GMOS instead processes RGB video directly to deliver 3D-aware, temporally fine-grained segmentation of multiple independently moving objects. The authors also release GMOS-2K, a dataset with per-object temporal motion labels drawn from existing VOS benchmarks, and introduce the MOS-I protocol for fine-grained evaluation. If the approach holds up, it enables faster, online-capable systems that need no auxiliary motion inputs.

Core claim

GMOS is a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, achieving state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.

What carries the argument

The GMOS framework that grounds moving object segmentation in 3D space and time by processing RGB video directly.

If this is right

  • Delivers segmentation without pre-computed 2D auxiliary modalities such as optical flow or point trajectories.
  • Accounts for the instantaneous motion state of each object rather than sequence-level attributes.
  • Achieves state-of-the-art results on MOS, MOS-I, and unsupervised VOS benchmarks.
  • Runs significantly faster than prior multi-object MOS methods and supports online inference for streaming deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Video analysis pipelines could eliminate separate optical flow computation steps if implicit 3D cues from RGB prove reliable.
  • The MOS-I protocol may encourage finer temporal metrics in future motion segmentation evaluations.
  • Repurposing VOS datasets with added per-object motion labels offers a scalable path to train 3D-grounded models.
  • The faster GMOS-S variant could enable deployment in real-time settings like live video monitoring.

Load-bearing premise

That RGB video by itself supplies enough 3D geometric information and instantaneous motion cues to segment moving objects accurately without pre-computed 2D auxiliary modalities.

What would settle it

A benchmark comparison on video sequences with ambiguous 2D projections of distinct 3D motions, checking whether GMOS accuracy falls below flow-based methods.

Figures

Figures reproduced from arXiv: 2605.30352 by Andrew Zisserman, Junyu Xie, Tengda Han, Weidi Xie.

Figure 1: Grounding Moving Object Segmentation (MOS) in 3D space and time. Left: GMOS grounds MOS in 3D space, enabling reliable segmentation under challenging viewpoints and heavy depth parallax from moving cameras. Right: GMOS also grounds MOS in time: conventional MOS masks every object that moves at some point in the sequence, regardless of whether it is currently in motion. In contrast, our proposed MOS-I (“I” …
Figure 2: Overview of GMOS. Left: The overall proposer–propagator design. The proposer operates on a short temporal window (around 0.5 s) and outputs per-frame object proposals, which the propagator links into coherent tracks across the full video. Right: The GMOS proposer. A frozen π3 geometric encoder ingests frames for a short temporal window and a frozen SAM2 segmentation encoder processes the middle frame of …
Figure 3: GMOS-2K overview. Left: Curation pipeline. 5,001 videos from five VOS datasets are filtered by two criteria, with 743 added directly to GMOS-2K and 1,467 requiring Temporal Fine-grained Annotation (TFA), which labels per-object motion intervals along the time axis. Right: Dataset statistics. Pie charts show the per-subset distribution of videos and objects. The upper histogram reports the number of moving …
Figure 4: Qualitative comparison on the MOS task. Example videos are sampled from YTVOS19 (first two columns), DAVIS17 (third column), and MoCA (last column). The middle block shows multi-object results, and the bottom block shows foreground–background results.
Figure 5: GMOS on in-the-wild videos. Two in-the-wild sequences (sourced outside our training or test datasets) illustrate our MOS and MOS-I predictions. Under MOS-I, instantaneously moving objects are segmented, with object identities consistently associated across frames. Under MOS, GMOS produces full segmentation masks for every object that moves at any point in the sequence, regardless of its motion state at a g…
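
The distinction the captions above draw between MOS (every object that moves at some point) and MOS-I (only objects moving at the current frame) can be illustrated with a small sketch. It assumes a hypothetical annotation format in which each object carries a list of inclusive frame intervals during which it moves, loosely modelled on the per-object motion intervals described for GMOS-2K. The names `Intervals`, `moving_now`, and `moving_ever` are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format (an assumption, not the GMOS-2K schema):
# object id -> list of inclusive (start, end) frame intervals of motion.
Intervals = Dict[int, List[Tuple[int, int]]]


def moving_now(intervals: Intervals, frame: int) -> List[int]:
    """Objects in motion at this exact frame (the MOS-I view)."""
    return sorted(
        obj
        for obj, spans in intervals.items()
        if any(start <= frame <= end for start, end in spans)
    )


def moving_ever(intervals: Intervals) -> List[int]:
    """Objects that move at any point in the sequence (the conventional MOS view)."""
    return sorted(obj for obj, spans in intervals.items() if spans)


# Toy data: object 1 moves in frames 0-4 and 10-12, object 2 in frames 3-8,
# and object 3 never moves.
toy: Intervals = {1: [(0, 4), (10, 12)], 2: [(3, 8)], 3: []}

print(moving_ever(toy))     # objects masked in every frame under MOS
print(moving_now(toy, 3))   # objects masked at frame 3 under MOS-I
print(moving_now(toy, 9))   # objects masked at frame 9 under MOS-I
```

In this toy sequence, the conventional view would keep objects 1 and 2 masked throughout, while the instantaneous view changes the set of masked objects from frame to frame. How the paper's three MOS-I metrics score such predictions is not stated in the text available here.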
Abstract

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground–background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.
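
The Figure 2 caption describes a proposer that works on a short temporal window (around 0.5 s), with the segmentation encoder processing the middle frame. The sketch below shows one way such windows might be laid out over a video, one window per middle frame. The function name `iter_windows`, the stride of one frame, and the clipping at video boundaries are all assumptions made for illustration. The paper's actual window length in frames and its boundary handling are not given in the text available here.

```python
from typing import Iterator, Tuple


def iter_windows(
    num_frames: int, fps: float, window_seconds: float = 0.5
) -> Iterator[Tuple[int, int, int]]:
    """Yield (start, middle, end) frame indices, one window per middle frame.

    `end` is exclusive. Windows are clipped at the video boundaries
    (an assumed policy; padding would be another option).
    """
    # Half-width in frames on each side of the middle frame.
    half = max(1, round(fps * window_seconds)) // 2
    for middle in range(num_frames):
        start = max(0, middle - half)
        end = min(num_frames, middle + half + 1)
        yield start, middle, end


# Example: a 20-frame clip at 24 fps gives a half-width of 6 frames.
windows = list(iter_windows(num_frames=20, fps=24.0))
print(windows[0], windows[10], windows[19])
```

Under this layout, a centred window needs `half` future frames before a middle frame can be processed, which would bound the latency of a streaming deployment. Whether GMOS uses centred or causal windows online is not specified here.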

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces GMOS, a framework for moving object segmentation (MOS) that grounds the task in 3D space and time by operating directly on RGB video without pre-computed 2D auxiliary modalities such as optical flow. It proposes a foreground-background variant GMOS-S, curates the GMOS-2K dataset (2,210 videos with per-object temporal motion annotations from existing VOS benchmarks), and formalizes the MOS-I protocol with three complementary metrics for temporally fine-grained evaluation. The authors report state-of-the-art results on MOS, MOS-I, and unsupervised VOS benchmarks, along with significantly faster runtime than prior multi-object methods and support for online inference.

Significance. If the results hold, the work is significant for shifting MOS away from reliance on 2D modalities and sequence-level motion attributes toward 3D-aware, instantaneous segmentation from RGB alone. The curation of GMOS-2K and formalization of the MOS-I protocol with three metrics provide concrete resources that could standardize fine-grained evaluation in the field. The reported efficiency gains and online capability are practical strengths for streaming deployment. These elements are load-bearing for the paper's contribution and are explicitly supported by the dataset curation and metric definitions.

minor comments (2)
  1. [Abstract] The claim of 'significantly faster' runtime would benefit from a specific comparison (e.g., FPS or a runtime table reference) so that the efficiency advantage is immediately quantifiable.
  2. [§3] The paper should clarify in §3 (method) whether any implicit depth or 3D cues are learned end-to-end or whether the 3D grounding is achieved purely through the temporal modeling and new loss terms.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, the recognition of its contributions to 3D-aware MOS, the GMOS-2K dataset, and the MOS-I protocol, and the recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a new framework (GMOS) for 3D-aware moving object segmentation directly from RGB video, along with a curated dataset (GMOS-2K) and a new evaluation protocol (MOS-I). No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the provided text. The central claims rest on empirical SOTA results and efficiency measurements rather than any self-referential construction. This is a standard empirical CV contribution with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; standard computer-vision assumptions about RGB video containing implicit 3D cues are implicit but unstated.

pith-pipeline@v0.9.1-grok · 5752 in / 1215 out tokens · 30148 ms · 2026-06-29T08:12:05.233082+00:00 · methodology

