YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark

17 Pith papers cite this work. Polarity classification is still indexing.

abstract

Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods capturing temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset only contains 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. This is by far the largest video object segmentation dataset to our knowledge and has been released at http://youtube-vos.org. We further evaluate several existing state-of-the-art video object segmentation algorithms on this dataset to establish baselines for the development of new algorithms in the future.

hub tools

citation-role summary

dataset: 2 · background: 1 · baseline: 1

fields

cs.CV: 17

representative citing papers

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

3AM: 3egment Anything with Geometric Consistency in Videos

cs.CV · 2026-01-13 · unverdicted · novelty 7.0

3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

Recurrent Video Masked Autoencoders

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

SAM 2++: Tracking Anything at Any Granularity

cs.CV · 2025-10-21 · conditional · novelty 7.0

SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

Robust Promptable Video Object Segmentation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

The paper creates a real-world corruption benchmark for promptable video object segmentation and proposes MoGA, which uses object-specific memory to improve robustness and temporal consistency under adverse conditions.

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.

X2SAM: Any Segmentation in Images and Videos

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.

SAM 2: Segment Anything in Images and Videos

cs.CV · 2024-08-01 · conditional · novelty 6.0

SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.

citing papers explorer

Showing 17 of 17 citing papers.

  • Functionalization via Structure Completion and Motion Rectification cs.CV · 2026-05-18 · unverdicted · none · ref 281 · internal anchor

    Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

  • 3AM: 3egment Anything with Geometric Consistency in Videos cs.CV · 2026-01-13 · unverdicted · none · ref 92 · internal anchor

    3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

  • Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models cs.CV · 2025-12-26 · conditional · none · ref 57 · internal anchor

    BadVSFM is the first effective backdoor attack on prompt-driven video segmentation foundation models, using a two-stage encoder-decoder strategy to achieve high attack success rates with limited clean performance loss.

  • Recurrent Video Masked Autoencoders cs.CV · 2025-12-15 · unverdicted · none · ref 75 · internal anchor

    RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.

  • SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 145 · internal anchor

    SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

  • SAM 2++: Tracking Anything at Any Granularity cs.CV · 2025-10-21 · conditional · none · ref 60 · internal anchor

    SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

  • SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning cs.CV · 2025-06-05 · conditional · none · ref 54 · internal anchor

    SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.

  • Robust Promptable Video Object Segmentation cs.CV · 2026-05-12 · unverdicted · none · ref 47

    The paper creates a real-world corruption benchmark for promptable video object segmentation and proposes MoGA, which uses object-specific memory to improve robustness and temporal consistency under adverse conditions.

  • VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing cs.CV · 2026-05-05 · unverdicted · none · ref 38 · 2 links

    VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.

  • YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal cs.CV · 2026-04-30 · unverdicted · none · ref 23

    YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.

  • Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline cs.CV · 2025-04-16 · unverdicted · none · ref 44 · internal anchor

    A self-supervised Degradation Estimation Network estimates parameters for physics-informed noise distributions to generate realistic synthetic low-light data, showing gains on noise replication, enhancement, and detection tasks.

  • Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners cs.CV · 2026-04-29 · unverdicted · none · ref 44

    LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.

  • X2SAM: Any Segmentation in Images and Videos cs.CV · 2026-04-27 · unverdicted · none · ref 42

    X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.

  • CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation cs.CV · 2026-04-16 · unverdicted · none · ref 23

    Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.

  • PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation cs.CV · 2026-04-09 · unverdicted · none · ref 36

    PanoSAM2 adapts SAM2 with a Pano-Aware Decoder, Distortion-Guided Mask Loss, and Long-Short Memory Module to improve 360 video object segmentation, reporting +5.6 and +6.7 gains over base SAM2 on two benchmarks.

  • SAM 2: Segment Anything in Images and Videos cs.CV · 2024-08-01 · conditional · none · ref 29

    SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.

  • Understanding Deep Learning Techniques for Image Segmentation cs.CV · 2019-07-13 · unverdicted · none · ref 209 · internal anchor

    A 2019 survey that categorizes and intuitively explains major deep learning techniques for image segmentation, progressing from classical methods to modern neural architectures.