Pith: machine review for the scientific record

arxiv: 1704.00675 · v3 · submitted 2017-04-03 · 💻 cs.CV


The 2017 DAVIS Challenge on Video Object Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords video object segmentation · DAVIS · benchmark · dataset · competition · evaluation metrics · dense video annotation

The pith

The 2017 DAVIS Challenge introduces a dataset, benchmark, and competition to advance video object segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the 2017 DAVIS Challenge on Video Object Segmentation, which includes a public dataset of densely annotated videos, an evaluation methodology, and a competition co-located with CVPR 2017. This setup follows successful precedents like ILSVRC and PASCAL VOC that standardized research in related fields. By providing this resource, the challenge aims to foster the development of new techniques for segmenting objects in video sequences. A reader would care because such benchmarks enable consistent comparisons across methods and have historically led to rapid advances in computer vision. The paper concludes with an analysis of the results from challenge participants.

Core claim

The authors establish the DAVIS 2017 Challenge as a new standard for evaluating video object segmentation methods through a dedicated dataset, defined metrics, and public competition, building directly on the prior DAVIS release that has already enabled state-of-the-art techniques.

What carries the argument

The DAVIS dataset of video sequences with dense pixel-level annotations together with the evaluation protocol for measuring segmentation accuracy over time.
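As a rough illustration of how such a protocol scores a sequence, here is a minimal sketch assuming the region measure is the Jaccard index (intersection-over-union) averaged over the annotated frames, as in the DAVIS evaluation; the function names and toy masks are illustrative, not the official toolkit:

```python
import numpy as np

def frame_jaccard(pred, gt):
    """Jaccard index (IoU) between two binary masks for a single frame."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def sequence_mean_j(pred_masks, gt_masks):
    """Mean region similarity J over a video sequence: the per-frame
    Jaccard index averaged across all annotated frames."""
    return float(np.mean([frame_jaccard(p, g)
                          for p, g in zip(pred_masks, gt_masks)]))

# Toy sequence of three 4x4 frames where the prediction drifts
# by one column on the last frame.
gt = np.zeros((3, 4, 4), bool)
gt[:, 1:3, 1:3] = True
pred = gt.copy()
pred[2] = np.roll(gt[2], 1, axis=1)  # last frame shifted right

print(round(sequence_mean_j(pred, gt), 4))  # 0.7778 (= mean of 1, 1, 1/3)
```

Averaging per frame, rather than pooling pixels across the sequence, means a method that loses the object briefly is penalized on exactly those frames, which is what "segmentation accuracy over time" requires.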

If this is right

  • New video object segmentation methods can be fairly compared using the same dataset and metrics.
  • The competition encourages development of techniques that handle the specific challenges in the DAVIS videos.
  • Analysis of participant results reveals current performance levels and areas for improvement.
  • The workshop format allows for direct exchange of ideas among researchers in the field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark may become a de facto standard for training and testing deep learning models for video segmentation.
  • Insights from the challenge could extend to related problems such as video instance segmentation or object tracking.
  • Future iterations might incorporate additional challenges like occlusions or camera motion variations.

Load-bearing premise

The chosen set of videos and the defined evaluation metrics adequately represent the range of difficulties in real-world video object segmentation.

What would settle it

Demonstrating that top-performing methods on the DAVIS challenge perform inconsistently or poorly on a new set of videos with different characteristics would call the benchmark's representativeness into question.

read the original abstract

We present the 2017 DAVIS Challenge on Video Object Segmentation, a public dataset, benchmark, and competition specifically designed for the task of video object segmentation. Following the footsteps of other successful initiatives, such as ILSVRC and PASCAL VOC, which established the avenue of research in the fields of scene classification and semantic segmentation, the DAVIS Challenge comprises a dataset, an evaluation methodology, and a public competition with a dedicated workshop co-located with CVPR 2017. The DAVIS Challenge follows up on the recent publication of DAVIS (Densely-Annotated VIdeo Segmentation), which has fostered the development of several novel state-of-the-art video object segmentation techniques. In this paper we describe the scope of the benchmark, highlight the main characteristics of the dataset, define the evaluation metrics of the competition, and present a detailed analysis of the results of the participants to the challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents the 2017 DAVIS Challenge on Video Object Segmentation, which supplies a public dataset of 50 densely annotated video sequences, an evaluation protocol using region similarity (J), contour accuracy (F), and temporal stability (T) metrics, and a competition whose participant results are analyzed in detail. The work positions the challenge as a standardized benchmark following the model of ILSVRC and PASCAL VOC, building directly on the prior DAVIS dataset release.

Significance. If the dataset and metrics prove representative, the challenge supplies a reproducible, public benchmark that has already stimulated new video object segmentation methods. The explicit provision of the dataset, evaluation code, and competition results constitutes a concrete community resource of the kind that has accelerated progress in related vision tasks.

minor comments (2)
  1. [Dataset characteristics] In the dataset description section, the selection criteria for the 50 sequences are stated, yet no quantitative comparison (e.g., histograms of optical-flow magnitude, object-size distribution, or scene-type coverage) against larger video corpora is supplied; adding such statistics would directly address concerns about coverage of real-world variability.
  2. [Introduction and Dataset] The abstract and introduction refer to 'main characteristics of the dataset' without a dedicated table summarizing sequence-level properties (length, number of objects, motion type); a compact summary table would improve clarity.
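To make the contour-accuracy measure F named in the summary concrete, here is a simplified sketch that scores the F-measure between boundary-pixel sets under exact overlap; the official DAVIS implementation additionally tolerates small boundary misalignments via a morphological matching window, so this version is stricter, and the helper names are illustrative:

```python
import numpy as np

def boundary(mask):
    """Boundary pixels: mask cells with at least one 4-neighbour
    outside the mask (computed via a zero-padded neighbour check)."""
    m = np.asarray(mask, bool)
    p = np.pad(m, 1)  # pad with False so edge pixels count as boundary
    interior = (p[:-2, 1:-1] & p[2:, 1:-1] &
                p[1:-1, :-2] & p[1:-1, 2:])
    return m & ~interior

def contour_f(pred, gt):
    """Simplified contour accuracy F: F-measure between the two
    boundary-pixel sets, matched exactly (no tolerance window)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    tp = np.logical_and(bp, bg).sum()
    precision = tp / bp.sum() if bp.sum() else 0.0
    recall = tp / bg.sum() if bg.sum() else 0.0
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom

# Toy example: prediction is the ground-truth square shifted one column.
gt = np.zeros((8, 8), bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), bool)
pred[2:6, 3:7] = True

print(round(contour_f(pred, gt), 2))  # 0.5
```

Because F scores only boundary pixels, it complements the region measure J: a mask can have high overlap yet a jagged outline, and F is the metric that exposes it.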

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review of our manuscript and their recommendation to accept it.

Circularity Check

0 steps flagged

No circularity: benchmark and dataset presentation is self-contained

full rationale

The paper introduces a new public dataset, evaluation protocol, and competition for video object segmentation. Its core claims consist of describing the 50-video DAVIS 2017 sequences, defining the three metrics (Jaccard J, contour F, temporal stability T), and reporting participant results. No equations, fitted parameters, or predictions appear; the work does not derive any quantity from prior results by construction. The reference to the earlier DAVIS paper supplies historical context rather than load-bearing justification for any claim. All elements are externally verifiable through the released data and competition outcomes, satisfying the criteria for a non-circular benchmark description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark and dataset paper with no mathematical derivations; it relies on empirical design choices for videos and metrics rather than axioms or free parameters.

pith-pipeline@v0.9.0 · 5472 in / 919 out tokens · 68199 ms · 2026-05-13T23:50:24.954342+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

    cs.CV 2026-05 unverdicted novelty 8.0

    TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

  2. PhysInOne: Visual Physics Learning and Reasoning in One Suite

    cs.CV 2026-04 unverdicted novelty 8.0

    PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...

  3. PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

    cs.CV 2026-05 unverdicted novelty 7.0

    PROVE proposes RC metrics for perceptual removal coherence and releases PROVE-Bench to better align automatic scores with human judgments on object removal tasks.

  4. FluxShard: Motion-Aware Feature Cache Reuse for Collaborative Video Analytics in Mobile Edge Computing

    cs.NI 2026-05 unverdicted novelty 7.0

    FluxShard uses per-block motion vectors and a Receptive Field Alignment Principle to manage feature cache reuse in edge-cloud video analytics, delivering 32.6-83.8% lower latency and 14.9-64.0% lower energy than basel...

  5. Online Reasoning Video Object Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.

  6. GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes

    cs.CV 2026-04 unverdicted novelty 7.0

    GP-4DGS uses variational Gaussian Processes with spatio-temporal kernels to provide uncertainty-aware reconstruction and prediction in 4D Gaussian Splatting for dynamic scenes.

  7. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  8. LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

    cs.CV 2026-05 unverdicted novelty 6.0

    LiBrA-Net achieves real-time native 4K video dehazing via Lie-algebraic bilateral affine fields and releases the first 4K paired dehazing video benchmark with per-frame annotations.

  9. Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentatio...

  10. CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.

  11. Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model

    cs.CV 2026-04 unverdicted novelty 6.0

    M-GDM uses motion vectors and frame types to guide a diffusion model in blind recovery of bitstream-corrupted videos without manual masks.

  12. Learning Long-term Motion Embeddings for Efficient Kinematics Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.

  13. PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    PanoSAM2 adapts SAM2 with a Pano-Aware Decoder, Distortion-Guided Mask Loss, and Long-Short Memory Module to improve 360 video object segmentation, reporting +5.6 and +6.7 gains over base SAM2 on two benchmarks.

  14. Perception Encoder: The best visual embeddings are not at the output of the network

    cs.CV 2025-04 unverdicted novelty 6.0

    Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...

  15. SAM 2: Segment Anything in Images and Videos

    cs.CV 2024-08 conditional novelty 6.0

    SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...

  16. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  17. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  18. PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines

    cs.CV 2026-04 unverdicted novelty 5.0

    PAT-VCM adds lightweight auxiliary tokens to a shared baseline video stream to support multiple downstream machine tasks without task-specific codecs.

  19. TAPNext++: What's Next for Tracking Any Point (TAP)?

    cs.CV 2026-04 unverdicted novelty 5.0

    TAPNext++ trains recurrent transformers on 1024-frame sequences with geometric augmentations and occluded-point supervision to achieve new state-of-the-art point tracking on long videos while adding a re-detection metric.

  20. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  21. APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

    cs.SD 2026-04 unverdicted novelty 3.0

    A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...

  22. Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

    cs.CV 2026-04 unverdicted novelty 2.0

    The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 22 Pith papers

  1. [1]

    ImageNet Large Scale Visual Recognition Challenge,

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, 2015

  2. [2]

    The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,

    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

  3. [3]

    A benchmark dataset and evaluation methodology for video object segmentation,

    F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in CVPR, 2016

  4. [4]

    Microsoft COCO: Common Objects in Context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick, “Microsoft COCO: Common Objects in Context,” in ECCV, 2014

  5. [5]

    One-shot video object segmentation,

    S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in CVPR, 2017

  6. [6]

    Learning video object segmentation from static images,

    F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung, “Learning video object segmentation from static images,” in CVPR, 2017

  7. [7]

    Bilateral space video segmentation,

    N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung, “Bilateral space video segmentation,” in CVPR, 2016

  8. [8]

    Fully connected object proposals for video segmentation,

    F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung, “Fully connected object proposals for video segmentation,” in ICCV, 2015

  9. [9]

    Video object segmentation with re-identification,

    X. Li, Y. Qi, Z. Wang, K. Chen, Z. Liu, J. Shi, P. Luo, C. C. Loy, and X. Tang, “Video object segmentation with re-identification,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  10. [10]

    Lucid data dreaming for object tracking,

    A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele, “Lucid data dreaming for object tracking,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  11. [11]

    Instance re-identification flow for video object segmentation,

    T.-N. Le, K.-T. Nguyen, M.-H. Nguyen-Phan, T.-V. Ton, T.-A. N., X.-S. Trinh, Q.-H. Dinh, V.-T. Nguyen, A.-D. Duong, A. Sugimoto, T. V. Nguyen, and M.-T. Tran, “Instance re-identification flow for video object segmentation,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  12. [12]

    Multiple-instance video segmentation with sequence-specific object proposals,

    A. Shaban, A. Firl, A. Humayun, J. Yuan, X. Wang, P. Lei, N. Dhanda, B. Boots, J. M. Rehg, and F. Li, “Multiple-instance video segmentation with sequence-specific object proposals,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  13. [13]

    Online adaptation of convolutional neural networks for the 2017 DAVIS Challenge on video object segmentation,

    P. Voigtlaender and B. Leibe, “Online adaptation of convolutional neural networks for the 2017 DAVIS Challenge on video object segmentation,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  14. [14]

    Learning to segment instances in videos with spatial propagation network,

    J. Cheng, S. Liu, Y.-H. Tsai, W.-C. Hung, S. Gupta, J. Gu, J. Kautz, S. Wang, and M.-H. Yang, “Learning to segment instances in videos with spatial propagation network,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  15. [15]

    Some promising ideas about multi-instance video segmentation,

    H. Zhao, “Some promising ideas about multi-instance video segmentation,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  16. [16]

    One-shot video object segmentation with iterative online fine-tuning,

    A. Newswanger and C. Xu, “One-shot video object segmentation with iterative online fine-tuning,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017

  17. [17]

    Video object segmentation using tracked object proposals,

    G. Sharir, E. Smolyansky, and I. Friedman, “Video object segmentation using tracked object proposals,” The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017