GMOS: Grounding Moving Object Segmentation in 3D Space and Time

Andrew Zisserman; Junyu Xie; Tengda Han; Weidi Xie

arxiv: 2605.30352 · v1 · pith:OGH5VR6Dnew · submitted 2026-05-28 · 💻 cs.CV


Pith reviewed 2026-06-29 08:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords moving object segmentation · 3D grounding · RGB video · temporal motion annotations · video object segmentation · online inference · unsupervised VOS


The pith

GMOS produces 3D-aware, temporally fine-grained segmentation of moving objects directly from RGB video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current moving object segmentation methods depend on pre-computed 2D cues such as optical flow, which lack 3D geometry, and they treat motion as a sequence-level attribute. GMOS instead processes RGB video directly to deliver 3D-aware, temporally precise segmentation of multiple independently moving objects. The authors also release GMOS-2K, a dataset with per-object temporal motion labels built from existing VOS sources, and introduce the MOS-I protocol for fine-grained evaluation. If the approach holds up, it enables faster, online-capable systems that need no auxiliary motion inputs.

Core claim

GMOS is a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, achieving state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.

What carries the argument

The GMOS framework that grounds moving object segmentation in 3D space and time by processing RGB video directly.

If this is right

Delivers segmentation without pre-computed 2D auxiliary modalities such as optical flow or point trajectories.
Accounts for the instantaneous motion state of each object rather than sequence-level attributes.
Achieves state-of-the-art results on MOS, MOS-I, and unsupervised VOS benchmarks.
Runs significantly faster than prior multi-object MOS methods and supports online inference for streaming deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Video analysis pipelines could eliminate separate optical flow computation steps if implicit 3D cues from RGB prove reliable.
The MOS-I protocol may encourage finer temporal metrics in future motion segmentation evaluations.
Repurposing VOS datasets with added per-object motion labels offers a scalable path to train 3D-grounded models.
The faster GMOS-S variant could enable deployment in real-time settings like live video monitoring.

Load-bearing premise

That RGB video by itself supplies enough 3D geometric information and instantaneous motion cues to segment moving objects accurately without pre-computed 2D auxiliary modalities.

What would settle it

A benchmark comparison on video sequences with ambiguous 2D projections of distinct 3D motions, checking whether GMOS accuracy falls below flow-based methods.
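Such a comparison could be scored roughly as sketched below. This is only an illustration: the page does not say which accuracy measure would be used, so mask intersection-over-union (IoU) is an assumption here, as are the function names `mask_iou` and `mean_iou`, the empty-mask convention, and the toy masks for two placeholder methods.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask, rows of 0/1 values


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Average per-frame IoU over one sequence."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need equally many, non-empty prediction and ground-truth frames")
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Toy data only: two 2x2 frames and two made-up sets of predictions.
gt = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
method_a_pred = [[[1, 1], [0, 0]], [[0, 1], [0, 0]]]
method_b_pred = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]

method_a_score = mean_iou(method_a_pred, gt)
method_b_score = mean_iou(method_b_pred, gt)
```

Running the same scoring on the ambiguous-projection subset for GMOS and for a flow-based baseline would show which one degrades more; the numbers above carry no meaning beyond exercising the functions.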

Figures

Figures reproduced from arXiv: 2605.30352 by Andrew Zisserman, Junyu Xie, Tengda Han, Weidi Xie.

**Figure 1.** Grounding Moving Object Segmentation (MOS) in 3D space and time. Left: GMOS grounds MOS in 3D space, enabling reliable segmentation under challenging viewpoints and heavy depth parallax from moving cameras. Right: GMOS also grounds MOS in time: conventional MOS masks every object that moves at some point in the sequence, regardless of whether it is currently in motion. In contrast, our proposed MOS-I (“I” …

**Figure 2.** Overview of GMOS. Left: The overall proposer–propagator design. The proposer operates on a short temporal window (around 0.5 s) and outputs per-frame object proposals, which the propagator links into coherent tracks across the full video. Right: The GMOS proposer. A frozen π3 geometric encoder ingests frames for a short temporal window and a frozen SAM2 segmentation encoder processes the middle frame of …

**Figure 3.** GMOS-2K overview. Left: Curation pipeline. 5,001 videos from five VOS datasets are filtered by two criteria, with 743 added directly to GMOS-2K and 1,467 requiring Temporal Fine-grained Annotation (TFA), which labels per-object motion intervals along the time axis. Right: Dataset statistics. Pie charts show the per-subset distribution of videos and objects. The upper histogram reports the number of moving …

**Figure 4.** Qualitative comparison on the MOS task. Example videos are sampled from YTVOS19 (first two columns), DAVIS17 (third column), and MoCA (last column). The middle block shows multi-object results, and the bottom block shows foreground–background results.

**Figure 5.** GMOS on in-the-wild videos. Two in-the-wild sequences (sourced outside our training or test datasets) illustrate our MOS and MOS-I predictions. Under MOS-I, instantaneously moving objects are segmented, with object identities consistently associated across frames. Under MOS, GMOS produces full segmentation masks for every object that moves at any point in the sequence, regardless of its motion state at a g…
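The difference between the two targets described in the Figure 1 and Figure 5 captions can be made concrete with a small sketch. It assumes per-object motion intervals stored as inclusive `(start, end)` frame pairs, which is one plausible encoding of the temporal annotations and not necessarily the dataset's actual format; the function names and the example objects are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) frame indices, assumed encoding


def mos_i_flags(intervals: List[Interval], num_frames: int) -> List[bool]:
    """Per-frame flag: is the object moving at this instant (MOS-I style)?"""
    flags = [False] * num_frames
    for start, end in intervals:
        for t in range(max(start, 0), min(end, num_frames - 1) + 1):
            flags[t] = True
    return flags


def mos_flags(intervals: List[Interval], num_frames: int) -> List[bool]:
    """Sequence-level flag: the object counts in every frame if it ever moves."""
    return [any(mos_i_flags(intervals, num_frames))] * num_frames


# Hypothetical 6-frame clip with three annotated objects.
annotations: Dict[str, List[Interval]] = {
    "car": [(2, 4)],             # moves only in frames 2 to 4
    "person": [],                # never moves
    "dog": [(0, 1), (5, 5)],     # moves, pauses, then moves again
}

mos_i = {name: mos_i_flags(iv, 6) for name, iv in annotations.items()}
mos = {name: mos_flags(iv, 6) for name, iv in annotations.items()}
```

Under the sequence-level reading the car is a target in all six frames, while under the instantaneous reading it is a target only in frames 2 to 4.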

Original abstract

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground-background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.
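To picture how a short-window design could serve streaming input, the sketch below selects the frames around a given middle frame for a window of about 0.5 s, the duration quoted in the Figure 2 caption. The centring, the clipping at video boundaries, and the name `window_indices` are assumptions made for illustration; the page does not describe how GMOS actually builds its windows.

```python
from typing import List


def window_indices(center: int, num_frames: int, fps: float,
                   window_seconds: float = 0.5) -> List[int]:
    """Frame indices of a short window centred on `center`, clipped to the video."""
    if not 0 <= center < num_frames:
        raise ValueError("center must be a valid frame index")
    # Frames taken on each side of the middle frame (assumed symmetric window).
    half = int(round(fps * window_seconds / 2))
    start = max(0, center - half)
    end = min(num_frames - 1, center + half)
    return list(range(start, end + 1))


# At 24 fps a 0.5 s window spans 6 frames either side of the middle frame.
middle_of_clip = window_indices(center=10, num_frames=100, fps=24.0)
start_of_clip = window_indices(center=0, num_frames=100, fps=24.0)
```

A centred window needs a few future frames, so a streaming system built this way would emit results with a short delay; a purely causal window is an equally plausible choice.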

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GMOS adds a useful new dataset and instantaneous evaluation protocol to MOS, with an RGB-only method that claims SOTA without flow auxiliaries.


The core new pieces here are the GMOS-2K dataset of 2,210 videos with added per-object temporal motion labels pulled from existing VOS sources, plus the MOS-I protocol that scores instantaneous motion state instead of sequence-level. GMOS itself is framed as operating straight on RGB to deliver multi-object segmentations that are both 3D-positioned and time-resolved, with a lighter GMOS-S variant for speed.

The dataset curation and protocol definition look like the most concrete contributions; they directly tackle the two limitations called out in the abstract. The efficiency numbers and online inference support are practical if the experiments back them. The paper also reports gains on standard MOS, MOS-I, and unsupervised VOS benchmarks.

The main soft spot is the 3D claim. The method still starts from RGB only, so any geometric awareness has to be inferred rather than measured; whether that actually outperforms flow-based priors in practice will depend on the ablations and error analysis, which the abstract leaves out. The stress-test note finds no internal mismatch, but the soundness score stayed low because details are thin.

This is aimed at the MOS/VOS community and anyone building streaming video analytics or robotics pipelines that need moving-object masks without extra modalities. The new data and protocol give other groups something concrete to use or extend.

I would send it to peer review. The dataset and protocol are real additions worth checking, even if the method results need closer scrutiny on the 3D and runtime claims.
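The 2,210-video figure quoted in the letter can be cross-checked against the counts in the Figure 3 caption. The snippet below is a small arithmetic check using only those caption numbers; the variable names are illustrative, and what happens to the source videos outside the two listed groups is not stated on this page.

```python
# Counts taken from the Figure 3 caption.
SOURCE_VIDEOS = 5001      # videos drawn from five VOS datasets
ADDED_DIRECTLY = 743      # added to GMOS-2K without further annotation
NEEDING_TFA = 1467        # required Temporal Fine-grained Annotation (TFA)

gmos_2k_total = ADDED_DIRECTLY + NEEDING_TFA
outside_both_groups = SOURCE_VIDEOS - gmos_2k_total
retained_fraction = gmos_2k_total / SOURCE_VIDEOS

print(f"GMOS-2K videos: {gmos_2k_total}")
print(f"Source videos in neither group: {outside_both_groups}")
print(f"Share of source videos retained: {retained_fraction:.1%}")
```

The two groups sum to 2,210, matching the dataset size given in the abstract, which is about 44% of the 5,001 source videos.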

Referee Report

0 major / 2 minor

Summary. The manuscript introduces GMOS, a framework for moving object segmentation (MOS) that grounds the task in 3D space and time by operating directly on RGB video without pre-computed 2D auxiliary modalities such as optical flow. It proposes a foreground-background variant GMOS-S, curates the GMOS-2K dataset (2,210 videos with per-object temporal motion annotations from existing VOS benchmarks), and formalizes the MOS-I protocol with three complementary metrics for temporally fine-grained evaluation. The authors report state-of-the-art results on MOS, MOS-I, and unsupervised VOS benchmarks, along with significantly faster runtime than prior multi-object methods and support for online inference.

Significance. If the results hold, the work is significant for shifting MOS away from reliance on 2D modalities and sequence-level motion attributes toward 3D-aware, instantaneous segmentation from RGB alone. The curation of GMOS-2K and formalization of the MOS-I protocol with three metrics provide concrete resources that could standardize fine-grained evaluation in the field. The reported efficiency gains and online capability are practical strengths for streaming deployment. These elements are load-bearing for the paper's contribution and are explicitly supported by the dataset curation and metric definitions.

minor comments (2)

[Abstract] The claim of 'significantly faster' runtime would benefit from a specific comparison (e.g., FPS or a runtime table reference) to make the efficiency advantage immediately quantifiable.
[§3] The paper should clarify in §3 (method) whether any implicit depth or 3D cues are learned end-to-end or whether the 3D grounding is achieved purely through the temporal modeling and new loss terms.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, the recognition of its contributions to 3D-aware MOS, the GMOS-2K dataset, and the MOS-I protocol, and the recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity


The paper introduces a new framework (GMOS) for 3D-aware moving object segmentation directly from RGB video, along with a curated dataset (GMOS-2K) and a new evaluation protocol (MOS-I). No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the provided text. The central claims rest on empirical SOTA results and efficiency measurements rather than any self-referential construction. This is a standard empirical CV contribution with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the standard computer-vision assumption that RGB video contains implicit 3D cues is relied on but never stated.

pith-pipeline@v0.9.1-grok · 5752 in / 1215 out tokens · 30148 ms · 2026-06-29T08:12:05.233082+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 7 canonical work pages · 2 internal anchors

[1]

It’s moving! a probabilistic model for causal motion segmentation in moving camera videos

Pia Bideau and Erik Learned-Miller. It’s moving! a probabilistic model for causal motion segmentation in moving camera videos. InECCV, 2016

2016
[2]

Object segmentation by long term analysis of point trajectories

Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. InECCV, 2010

2010
[3]

SAM 3: Segment anything with concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll-Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Ro- man Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu ...

2026
[4]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, 2021

2021
[5]

Moving object segmentation in 3D LiDAR data: A learning-based approach exploiting sequential data

Xieyuanli Chen, Shijie Li, Benedikt Mersch, Louis Wiesmann, Jürgen Gall, Jens Behley, and Cyrill Stachniss. Moving object segmentation in 3D LiDAR data: A learning-based approach exploiting sequential data. InIROS, 2021

2021
[6]

Easi3r: Estimating disentangled motion from dust3r without training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training. InICCV, 2025

2025
[7]

Schwing, and Alexander Kirillov

Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. InNeurIPS, 2021

2021
[8]

Ho Kei Cheng and Alexander G. Schwing. XMem: Long-term video object segmentation with an atkinson-shiffrin memory model. InECCV, 2022

2022
[9]

Tracking anything with decoupled video segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. InICCV, 2023

2023
[10]

Putting the object back into video object segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. InCVPR, 2024

2024
[11]

Implicit motion handling for video camouflaged object detection

Xuelian Cheng, Huan Xiong, Deng-Ping Fan, Yiran Zhong, Mehrtash Harandi, Tom Drummond, and Zongyuan Ge. Implicit motion handling for video camouflaged object detection. InCVPR, 2022

2022
[12]

S. Cho, M. Lee, S. Lee, C. Park, D. Kim, and S. Lee. Treating motion as option to reduce motion dependency in unsupervised video object segmentation. InWACV, 2023

2023
[13]

Dual prototype attention for unsupervised video object segmentation

Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Dogyoon Lee, Heeseung Choi, Ig-Jae Kim, and Sangyoun Lee. Dual prototype attention for unsupervised video object segmentation. InCVPR, 2024

2024
[14]

Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion

Subhabrata Choudhury, Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion. In BMVC, 2022

2022
[15]

Towards segmenting anything that moves

Achal Dave, Pavel Tokmakov, and Deva Ramanan. Towards segmenting anything that moves. InICCV, 2019

2019
[16]

Tap-vid: A benchmark for tracking any point in a video

Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking any point in a video. InNeurIPS, 2022. 10

2022
[17]

Tapir: Tracking any point with per-frame initialization and temporal refinement

Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. InICCV, 2023

2023
[18]

The via annotation software for images, audio and video

Abhishek Dutta and Andrew Zisserman. The via annotation software for images, audio and video. InACM MM, 2019

2019
[19]

Shifting more attention to video salient object detection

Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video salient object detection. InCVPR, 2019

2019
[20]

Video segmentation by tracing discontinuities in a trajectory embedding

Katerina Fragkiadaki, Geng Zhang, and Jianbo Shi. Video segmentation by tracing discontinuities in a trajectory embedding. InCVPR, 2012

2012
[21]

Fleet, Saurabh Saxena, and Andrea Tagliasacchi

Lily Goli, Sara Sabour, Mark Matthews, Brubaker Marcus, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, and Andrea Tagliasacchi. RoMo: Robust motion segmentation improves structure from motion. InICCV, 2025

2025
[22]

Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Me...

2022
[23]

Geomotion: Rethinking motion segmentation via latent 4d geometry

Xiankang He, Peile Lin, Ying Cui, Dongyan Guo, Chunhua Shen, and Xiaoqin Zhang. Geomotion: Rethinking motion segmentation via latent 4d geometry. InCVPR, 2026

2026
[24]

Segment any motion in videos

Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. InCVPR, 2025

2025
[25]

Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos

Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. InCVPR, 2017

2017
[26]

Full-duplex strategy for video object segmentation

Ge-Peng Ji, Keren Fu, Zhe Wu, Deng-Ping Fan, Jianbing Shen, and Ling Shao. Full-duplex strategy for video object segmentation. InICCV, 2021

2021
[27]

Dynamicstereo: Consistent dynamic depth from stereo videos

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Chris- tian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. InCVPR, 2023

2023
[28]

Cotracker3: Simpler and better point tracking by pseudo-labeling real videos

Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labeling real videos. In ICCV, 2025

2025
[29]

Learning segmentation from point trajectories

Laurynas Karazija, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Learning segmentation from point trajectories. InNeurIPS, 2024

2024
[30]

Motion trajectory segmentation via minimum cost multicuts

Margret Keuper, Bjoern Andres, and Thomas Brox. Motion trajectory segmentation via minimum cost multicuts. InICCV, 2015

2015
[31]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. InICCV, 2023

2023
[32]

Betrayed by motion: Camouflaged object discovery via motion segmentation

Hala Lamdouar, Charig Yang, Weidi Xie, and Andrew Zisserman. Betrayed by motion: Camouflaged object discovery via motion segmentation. InACCV, 2020

2020
[33]

Divided attention: Unsupervised multi-object discovery with contextually separated slots.arXiv preprint arxiv:2304.01430, 2023

Dong Lao, Zhengyang Hu, Francesco Locatello, Yanchao Yang, and Stefano Soatto. Divided attention: Unsupervised multi-object discovery with contextually separated slots.arXiv preprint arxiv:2304.01430, 2023

work page arXiv 2023
[34]

M. Lee, S. Cho, S. Lee, C. Park, and S. Lee. Unsupervised video object segmentation via prototype memory network. InWACV, 2023

2023
[35]

Guided slot attention for unsupervised video object segmentation

Minhyeok Lee, Suhwan Cho, Dogyoon Lee, Chaewon Park, Jungho Lee, and Sangyoun Lee. Guided slot attention for unsupervised video object segmentation. InCVPR, 2024. 11

2024
[36]

Ni, and Heung-Yeung Shum

Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. InCVPR, 2023

2023
[37]

Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M. Rehg. Video segmentation by tracking many figure-ground segments. InICCV, 2013

2013
[38]

Multi-sensor fusion for robust localization with moving object segmentation in complex dynamic 3D scenes.International Journal of Applied Earth Observation and Geoinformation, 2023

Qipeng Li, Yuan Zhuang, You Chen, Jianzhu Huai, Miaomiao Li, Tianxiang Ma, Yufei Tang, and Xinlian Liang. Multi-sensor fusion for robust localization with moving object segmentation in complex dynamic 3D scenes.International Journal of Applied Earth Observation and Geoinformation, 2023

2023
[39]

Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. InCVPR, 2019

2019
[40]

Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. InCVPR, 2025

2025
[41]

Long Lian, Zhirong Wu, and Stella X. Yu. Bootstrapping objectness from videos by relaxed common fate and visual grouping. InCVPR, 2023

2023
[42]

Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. InICLR, 2026

2026
[43]

F2net: Learning to focus on the foreground for unsupervised video object segmentation

Daizong Liu, Dongdong Yu, Changhu Wang, and Pan Zhou. F2net: Learning to focus on the foreground for unsupervised video object segmentation. InAAAI, 2021

2021
[44]

Depth-aware test-time training for zero-shot video object segmentation

Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi-Man Pun, and Xiaodong Cun. Depth-aware test-time training for zero-shot video object segmentation. InCVPR, 2024

2024
[45]

Hoi4d: A 4d egocentric dataset for category-level human-object interaction

Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. InCVPR, 2022

2022
[46]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

2019
[47]

See more, know more: Unsupervised video object segmentation with co-attention siamese networks

Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In CVPR, 2019

2019
[48]

arXiv preprint arXiv:2310.01040 , year =

Etienne Meunier and Patrick Bouthemy. Unsupervised motion segmentation in one go: Smooth long-term model over a video.arXiv preprint arXiv:2310.01040, 2023

work page arXiv 2023
[49]

Em-driven unsupervised learning for efficient motion segmentation.IEEE TPAMI, 2022

Etienne Meunier, Anaïs Badoual, and Patrick Bouthemy. Em-driven unsupervised learning for efficient motion segmentation.IEEE TPAMI, 2022

2022
[50]

Deep anomaly detection through visual attention in surveillance videos.Journal of Big Data, 2020

Nasaruddin Nasaruddin, Kahlil Muchtar, Afdhal Afdhal, and Alvin Prayuda Juniarta Dwiyantoro. Deep anomaly detection through visual attention in surveillance videos.Journal of Big Data, 2020

2020
[51]

P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE TPAMI, 2014

2014
[52]

Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions

Peter Ochs and Thomas Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. InICCV, 2011

2011
[53]

Video object segmentation using space-time memory networks

Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. InICCV, 2019

2019
[54]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

2024
[55]

Fast object segmentation in unconstrained video

Anestis Papazoglou and Vittorio Ferrari. Fast object segmentation in unconstrained video. In ICCV, 2013

2013
[56]

Hier- archical feature alignment network for unsupervised video object segmentation

Gensheng Pei, Fumin Shen, Yazhou Yao, Guo-Sen Xie, Zhenmin Tang, and Jinhui Tang. Hier- archical feature alignment network for unsupervised video object segmentation. InECCV, 2022

2022
[57]

A benchmark dataset and evaluation methodology for video object segmentation

Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. InCVPR, 2016

2016
[58]

The 2017 DAVIS Challenge on Video Object Segmentation

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[59]

Occluded video instance segmentation: A benchmark.IJCV, 2022

Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip Torr, and Song Bai. Occluded video instance segmentation: A benchmark.IJCV, 2022

2022
[60]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Reciprocal transformations for unsupervised video object segmentation

Sucheng Ren, Wenxi Liu, Yongtuo Liu, Haoxin Chen, Guoqiang Han, and Shengfeng He. Reciprocal transformations for unsupervised video object segmentation. InCVPR, 2021

2021
[62]

Hiera: A hierarchical vision transformer without the bells-and-whistles

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. InICML, 2023

2023
[63]

Multi-object discovery by low-dimensional object motion

Sadra Safadoust and Fatma Güney. Multi-object discovery by low-dimensional object motion. InICCV, 2023

2023
[64]

Generalizable fourier augmentation for unsupervised video object segmentation

Huihui Song, Tiankang Su, Yuhui Zheng, Kaihua Zhang, Bo Liu, and Dong Liu. Generalizable fourier augmentation for unsupervised video object segmentation. InAAAI, 2024

2024
[65]

Learning video object segmentation with visual memory

Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning video object segmentation with visual memory. InICCV, 2017

2017
[66]

Learning to segment moving objects

Pavel Tokmakov, Cordelia Schmid, and Karteek Alahari. Learning to segment moving objects. IJCV, 2019

2019
[67]

Self- supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics

Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self- supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. InCVPR, 2019

2019
[68]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025
[69]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024

2024
[70]

Videocutler: Surpris- ingly simple unsupervised video instance segmentation.arXiv preprint arXiv:2308.14710, 2023

Xudong Wang, Ishan Misra, Zizun Zeng, Rohit Girdhar, and Trevor Darrell. Videocutler: Surpris- ingly simple unsupervised video instance segmentation.arXiv preprint arXiv:2308.14710, 2023

work page arXiv 2023
[71]

π3: Permutation-equivariant visual geometry learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning. InICLR, 2026

2026
[72]

Attention-based temporal encoding network with background-independent motion mask for action recognition.Computational Intelligence and Neuroscience, 2021

Zhengkui Weng, Zhipeng Jin, Shuangxi Chen, Quanquan Shen, Xiangyang Ren, and Wuzhao Li. Attention-based temporal encoding network with background-independent motion mask for action recognition.Computational Intelligence and Neuroscience, 2021

2021
[73]

Segmenting moving objects via an object-centric layered representation

Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. InNeurIPS, 2022. 13

2022
[74]

Appearance-based refinement for object-centric motion segmentation

Junyu Xie, Weidi Xie, and Andrew Zisserman. Appearance-based refinement for object-centric motion segmentation. InECCV, 2024

2024
[75]

Moving object segmentation: All you need is sam (and flow)

Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is sam (and flow). InACCV, 2024

2024
[76]

Das3r: Dynamics-aware gaussian splatting for static scene reconstruction.arXiv preprint arxiv:2412.19584, 2024

Kai Xu, Tze Ho Elden Tse, Jizong Peng, and Angela Yao. Das3r: Dynamics-aware gaussian splatting for static scene reconstruction.arXiv preprint arxiv:2412.19584, 2024

work page arXiv 2024
[77]

Youtube-vos: A large-scale video object segmentation benchmark

Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. InECCV, 2018

2018
[78]

Self-supervised video object segmentation by motion grouping

Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, and Weidi Xie. Self-supervised video object segmentation by motion grouping. InICCV, 2021

2021
[79]

S. Yang, L. Zhang, J. Qi, H. Lu, S. Wang, and X. Zhang. Learning motion-appearance co-attention for zero-shot video object segmentation. InICCV, 2021

2021
[80]

Unsupervised moving object detection via contextual information separation

Yanchao Yang, Antonio Loquercio, Davide Scaramuzza, and Stefano Soatto. Unsupervised moving object detection via contextual information separation. InCVPR, 2019

2019

