arxiv: 2310.12982 · v2 · pith:NLZ3O65Fnew · submitted 2023-10-19 · 💻 cs.CV

Putting the Object Back into Video Object Segmentation

classification 💻 cs.CV
keywords object, cutie, memory, segmentation, reading, video, back, bottom-up

We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles with matching noise, especially in the presence of distractors, and therefore performs worse on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Through these queries, it interacts iteratively with the bottom-up pixel features using a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: https://hkchengrex.github.io/Cutie
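
The foreground-background masked attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `masked_cross_attention`, the even split of queries into a foreground half and a background half, the residual update, and the fallback for empty masks are all assumptions made here for illustration. The sketch only shows the general idea of restricting each object query to one side of a predicted mask when it reads from pixel features.

```python
import numpy as np

def masked_cross_attention(queries, pixel_feats, fg_mask):
    """Single-head cross-attention with a foreground/background mask.

    queries:     (N, C) object queries. ASSUMPTION: the first N // 2 read
                 only foreground pixels, the rest only background pixels.
    pixel_feats: (P, C) flattened pixel features.
    fg_mask:     (P,) boolean, True where a pixel is predicted foreground.
    Returns the updated queries (N, C) and the attention weights (N, P).
    """
    n, c = queries.shape
    logits = queries @ pixel_feats.T / np.sqrt(c)  # (N, P) scaled dot products

    # Build the per-query "allowed pixels" mask.
    half = n // 2
    allowed = np.empty((n, fg_mask.shape[0]), dtype=bool)
    allowed[:half] = fg_mask    # foreground queries
    allowed[half:] = ~fg_mask   # background queries

    # ASSUMPTION: a query with no allowed pixel attends everywhere,
    # which avoids a softmax over an empty set.
    allowed[~allowed.any(axis=1)] = True

    # Masked, numerically stable softmax over pixels.
    logits = np.where(allowed, logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Residual update of the queries from the pixels they attended to.
    return queries + weights @ pixel_feats, weights
```

In this sketch, disallowed pixels receive exactly zero attention weight, so a foreground query's summary cannot be contaminated by background features, and vice versa. An iterative variant would re-estimate `fg_mask` from the updated queries and repeat the call.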


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Enabling Extensible Embodied Capabilities with Tools

    cs.RO 2026-05 unverdicted novelty 6.0

    Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.

  2. SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

    cs.RO 2026-05 unverdicted novelty 6.0

    SigLoMa enables dynamic loco-manipulation on quadrupeds from ego-centric 5 Hz vision alone by using Sigma Points for scalable exteroception, an ego-centric Kalman Filter for high-rate state estimation, and an active s...

  3. 4D Vessel Reconstruction for Benchtop Thrombectomy Analysis

    eess.IV 2026-04 conditional novelty 5.0

    A nine-camera multi-view workflow with 4D Gaussian Splatting reconstructs dynamic vessel surfaces in thrombectomy phantoms to enable standardized comparative displacement and stress-proxy tracking.