Putting the Object Back into Video Object Segmentation

Alexander Schwing; Brian Price; Ho Kei Cheng; Joon-Young Lee; Seoung Wug Oh

arxiv: 2310.12982 · v2 · pith:NLZ3O65Fnew · submitted 2023-10-19 · 💻 cs.CV

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, Alexander Schwing

classification 💻 cs.CV

keywords: object, cutie, memory, segmentation, reading, video, back, bottom-up

We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance in more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those, it interacts with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: https://hkchengrex.github.io/Cutie
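
The foreground-background masked attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that object queries are split into a foreground group and a background group, and that each query attends only to pixels on its own side of a given mask. The function name `masked_cross_attention` and its parameters are hypothetical, chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(queries, pixel_feats, fg_mask, query_is_fg):
    """Hypothetical sketch: each query reads only from pixels on its side of the mask.

    queries:      list of Q vectors, each of length D
    pixel_feats:  list of P vectors, each of length D
    fg_mask:      list of P booleans, True marks a foreground pixel
    query_is_fg:  list of Q booleans, True marks a foreground query (assumed split)
    Returns the updated queries as a list of Q vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q, is_fg in zip(queries, query_is_fg):
        # Keep only the pixels whose mask value matches this query's group.
        allowed = [i for i, m in enumerate(fg_mask) if m == is_fg]
        if not allowed:
            updated.append(list(q))  # nothing to attend to, keep the query unchanged
            continue
        # Scaled dot-product scores against the allowed pixels only.
        scores = [scale * sum(a * b for a, b in zip(q, pixel_feats[i])) for i in allowed]
        weights = softmax(scores)
        # Attention-weighted average of the allowed pixel features.
        readout = [
            sum(w * pixel_feats[i][d] for w, i in zip(weights, allowed))
            for d in range(dim)
        ]
        # Residual update of the query with the readout.
        updated.append([a + b for a, b in zip(q, readout)])
    return updated


queries = [[1.0, 0.0], [0.0, 1.0]]
pixel_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
fg_mask = [True, True, False]
query_is_fg = [True, False]
result = masked_cross_attention(queries, pixel_feats, fg_mask, query_is_fg)
```

In this toy setup the foreground query never sees the background pixel and vice versa, which is the separation the abstract attributes to masked attention. A real model would use learned projections, multiple heads, and repeated query-pixel interaction, none of which are shown here.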

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Enabling Extensible Embodied Capabilities with Tools
cs.RO 2026-05 unverdicted novelty 6.0

Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.

SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision
cs.RO 2026-05 unverdicted novelty 6.0

SigLoMa enables dynamic loco-manipulation on quadrupeds from ego-centric 5 Hz vision alone by using Sigma Points for scalable exteroception, an ego-centric Kalman Filter for high-rate state estimation, and an active s...

4D Vessel Reconstruction for Benchtop Thrombectomy Analysis
eess.IV 2026-04 conditional novelty 5.0

A nine-camera multi-view workflow with 4D Gaussian Splatting reconstructs dynamic vessel surfaces in thrombectomy phantoms to enable standardized comparative displacement and stress-proxy tracking.