The 2017 DAVIS Challenge on Video Object Segmentation

2017 · cs.CV · arXiv:1704.00675

Tool reference. 90% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

48 Pith papers citing it

abstract

We present the 2017 DAVIS Challenge on Video Object Segmentation, a public dataset, benchmark, and competition specifically designed for the task of video object segmentation. Following the footsteps of other successful initiatives, such as ILSVRC and PASCAL VOC, which established the avenue of research in the fields of scene classification and semantic segmentation, the DAVIS Challenge comprises a dataset, an evaluation methodology, and a public competition with a dedicated workshop co-located with CVPR 2017. The DAVIS Challenge follows up on the recent publication of DAVIS (Densely-Annotated VIdeo Segmentation), which has fostered the development of several novel state-of-the-art video object segmentation techniques. In this paper we describe the scope of the benchmark, highlight the main characteristics of the dataset, define the evaluation metrics of the competition, and present a detailed analysis of the results of the participants to the challenge.

citation-role summary

dataset: 9 · background: 1

citation-polarity summary

use dataset: 9 · background: 1

representative citing papers

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

Emerging Properties in Self-Supervised Vision Transformers

cs.CV · 2021-04-29 · conditional · novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.

PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

PROVE proposes RC metrics for perceptual removal coherence and releases PROVE-Bench to better align automatic scores with human judgments on object removal tasks.

FluxShard: Motion-Aware Feature Cache Reuse for Collaborative Video Analytics in Mobile Edge Computing

cs.NI · 2026-05-07 · unverdicted · novelty 7.0

FluxShard uses per-block motion vectors and a Receptive Field Alignment Principle to manage feature cache reuse in edge-cloud video analytics, delivering 32.6-83.8% lower latency and 14.9-64.0% lower energy than baselines while preserving accuracy.

Online Reasoning Video Object Segmentation

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.

GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

GP-4DGS uses variational Gaussian Processes with spatio-temporal kernels to provide uncertainty-aware reconstruction and prediction in 4D Gaussian Splatting for dynamic scenes.

3AM: 3egment Anything with Geometric Consistency in Videos

cs.CV · 2026-01-13 · unverdicted · novelty 7.0

3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

cs.CV · 2025-12-26 · conditional · novelty 7.0

BadVSFM is the first effective backdoor attack on prompt-driven video segmentation foundation models, using a two-stage encoder-decoder strategy to achieve high attack success rates with limited clean performance loss.

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

cs.CV · 2025-12-18 · unverdicted · novelty 7.0

4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.

Recurrent Video Masked Autoencoders

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

SAM 2++: Tracking Anything at Any Granularity

cs.CV · 2025-10-21 · conditional · novelty 7.0

SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

cs.CV · 2025-04-17 · unverdicted · novelty 7.0

UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

cs.CV · 2023-10-17 · accept · novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

Self-supervised Training of Proposal-based Segmentation via Background Prediction

cs.CV · 2019-07-18 · unverdicted · novelty 7.0

A self-supervised loss based on background prediction trains proposal-based segmentation networks via Monte Carlo sampling for object detection in novel image appearances.

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

SimInsert is a training-free video object insertion technique that decouples the task into single-frame editing and semantic motion description, using image-to-video diffusion models with non-invasive guidance to achieve spatio-temporal coherence.

Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

Weighted Reverse Convolution for Feature Upsampling

cs.CV · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

Weighted Reverse Convolution is a spatially adaptive inverse operator for densifying high-level visual descriptors from vision foundation models, using weighted regularization and an FFT closed-form solution to improve dense prediction tasks.

LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

LiBrA-Net achieves real-time native 4K video dehazing via Lie-algebraic bilateral affine fields and releases the first 4K paired dehazing video benchmark with per-frame annotations.

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.

citing papers explorer

Showing 48 of 48 citing papers.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 57 · internal anchor
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
PhysInOne: Visual Physics Learning and Reasoning in One Suite cs.CV · 2026-04-10 · unverdicted · none · ref 67 · internal anchor
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
Emerging Properties in Self-Supervised Vision Transformers cs.CV · 2021-04-29 · conditional · none · ref 52 · internal anchor
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
Geo-Align: Video Generation Alignment via Metric Geometry Reward cs.CV · 2026-05-22 · unverdicted · none · ref 10 · internal anchor
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning cs.CV · 2026-05-21 · unverdicted · none · ref 87 · internal anchor
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media cs.CV · 2026-05-14 · unverdicted · none · ref 14 · internal anchor
PROVE proposes RC metrics for perceptual removal coherence and releases PROVE-Bench to better align automatic scores with human judgments on object removal tasks.
FluxShard: Motion-Aware Feature Cache Reuse for Collaborative Video Analytics in Mobile Edge Computing cs.NI · 2026-05-07 · unverdicted · none · ref 26 · internal anchor
FluxShard uses per-block motion vectors and a Receptive Field Alignment Principle to manage feature cache reuse in edge-cloud video analytics, delivering 32.6-83.8% lower latency and 14.9-64.0% lower energy than baselines while preserving accuracy.
Online Reasoning Video Object Segmentation cs.CV · 2026-04-13 · unverdicted · none · ref 35 · internal anchor
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes cs.CV · 2026-04-03 · unverdicted · none · ref 36 · internal anchor
GP-4DGS uses variational Gaussian Processes with spatio-temporal kernels to provide uncertainty-aware reconstruction and prediction in 4D Gaussian Splatting for dynamic scenes.
3AM: 3egment Anything with Geometric Consistency in Videos cs.CV · 2026-01-13 · unverdicted · none · ref 61 · internal anchor
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models cs.CV · 2025-12-26 · conditional · none · ref 41 · internal anchor
BadVSFM is the first effective backdoor attack on prompt-driven video segmentation foundation models, using a two-stage encoder-decoder strategy to achieve high attack success rates with limited clean performance loss.
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation cs.CV · 2025-12-18 · unverdicted · none · ref 71 · internal anchor
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
Recurrent Video Masked Autoencoders cs.CV · 2025-12-15 · unverdicted · none · ref 59 · internal anchor
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 107 · internal anchor
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
SAM 2++: Tracking Anything at Any Granularity cs.CV · 2025-10-21 · conditional · none · ref 48 · internal anchor
SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.
UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models cs.CV · 2025-04-17 · unverdicted · none · ref 45 · internal anchor
UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V cs.CV · 2023-10-17 · accept · none · ref 39 · internal anchor
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
Self-supervised Training of Proposal-based Segmentation via Background Prediction cs.CV · 2019-07-18 · unverdicted · none · ref 19 · internal anchor
A self-supervised loss based on background prediction trains proposal-based segmentation networks via Monte Carlo sampling for object detection in novel image appearances.
SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion cs.CV · 2026-05-22 · unverdicted · none · ref 20 · internal anchor
SimInsert is a training-free video object insertion technique that decouples the task into single-frame editing and semantic motion description, using image-to-video diffusion models with non-invasive guidance to achieve spatio-temporal coherence.
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains cs.CV · 2026-05-19 · unverdicted · none · ref 54 · internal anchor
A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 51 · internal anchor
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
Weighted Reverse Convolution for Feature Upsampling cs.CV · 2026-05-17 · unverdicted · none · ref 35 · 2 links · internal anchor
Weighted Reverse Convolution is a spatially adaptive inverse operator for densifying high-level visual descriptors from vision foundation models, using weighted regularization and an FFT closed-form solution to improve dense prediction tasks.
LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing cs.CV · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
LiBrA-Net achieves real-time native 4K video dehazing via Lie-algebraic bilateral affine fields and releases the first 4K paired dehazing video benchmark with per-frame annotations.
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners cs.CV · 2026-04-29 · unverdicted · none · ref 46 · internal anchor
LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.
CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation cs.CV · 2026-04-16 · unverdicted · none · ref 24 · internal anchor
Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.
Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model cs.CV · 2026-04-15 · unverdicted · none · ref 17 · internal anchor
M-GDM uses motion vectors and frame types to guide a diffusion model in blind recovery of bitstream-corrupted videos without manual masks.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation cs.CV · 2026-04-13 · unverdicted · none · ref 32 · internal anchor
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation cs.CV · 2026-04-09 · unverdicted · none · ref 29 · internal anchor
PanoSAM2 adapts SAM2 with a Pano-Aware Decoder, Distortion-Guided Mask Loss, and Long-Short Memory Module to improve 360 video object segmentation, reporting +5.6 and +6.7 gains over base SAM2 on two benchmarks.
From Ideal to Real: Stable Video Object Removal under Imperfect Conditions cs.CV · 2026-03-10 · unverdicted · none · ref 26 · internal anchor
SVOR achieves stable, shadow-free video object removal under real-world imperfections via MUSE mask handling, DA-Seg localization, and curriculum training on real and synthetic data.
MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization cs.CV · 2026-01-05 · unverdicted · none · ref 29 · internal anchor
MotionAdapter transfers reference video motions into target videos inside DiT diffusion models by isolating attention-derived motion fields and refining them via DINO-guided semantic alignment.
V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence cs.CV · 2025-11-25 · unverdicted · none · ref 45 · internal anchor
V2-SAM adapts SAM2 to cross-view object correspondence with geometry-aware and appearance-based prompt generators plus a post-hoc cyclic consistency selector, reporting new state-of-the-art results on Ego-Exo4D, DAVIS-2017, and HANDAL-X.
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning cs.CV · 2025-07-18 · conditional · none · ref 73 · internal anchor
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
Perception Encoder: The best visual embeddings are not at the output of the network cs.CV · 2025-04-17 · unverdicted · none · ref 104 · internal anchor
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
SAM 2: Segment Anything in Images and Videos cs.CV · 2024-08-01 · conditional · none · ref 22 · internal anchor
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 175 · internal anchor
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
TokenFlow: Consistent Diffusion Features for Consistent Video Editing cs.CV · 2023-07-19 · conditional · none · ref 15 · internal anchor
TokenFlow produces consistent text-driven video edits by propagating diffusion features according to inter-frame correspondences extracted from the source video.
Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation cs.CV · 2019-07-02 · unverdicted · none · ref 40 · internal anchor
PTS is a cascaded neural network for video object segmentation that integrates proposal, tracking, and segmentation modules with a dynamic-reference adaptation scheme to report state-of-the-art results on DAVIS'17 and YouTube-VOS.
Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance cs.CV · 2026-05-15 · unverdicted · none · ref 33 · internal anchor
Proposes SNIS and NGM to enable tuning-free instruction-based video editing with improved visual quality and claimed SOTA results.
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation cs.CV · 2026-05-08 · unverdicted · none · ref 62 · internal anchor
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines cs.CV · 2026-04-14 · unverdicted · none · ref 23 · internal anchor
PAT-VCM adds lightweight auxiliary tokens to a shared baseline video stream to support multiple downstream machine tasks without task-specific codecs.
TAPNext++: What's Next for Tracking Any Point (TAP)? cs.CV · 2026-04-12 · unverdicted · none · ref 22 · internal anchor
TAPNext++ trains recurrent transformers on 1024-frame sequences with geometric augmentations and occluded-point supervision to achieve new state-of-the-art point tracking on long videos while adding a re-detection metric.
SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2 cs.CV · 2026-01-16 · conditional · none · ref 16 · internal anchor
SAMannot delivers a memory-efficient local framework for interactive video instance segmentation by optimizing SAM2 with persistent identity tracking, lock-and-refine workflows, and auto-prompting.
VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos cs.CV · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
VVitCutLER introduces VitCut as a temporally stable pseudo-label generator with cross-frame consistency and feature aggregation to improve unsupervised video object detection and segmentation.
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation cs.CV · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track cs.SD · 2026-04-20 · unverdicted · none · ref 19 · internal anchor
A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.
Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge cs.CV · 2026-04-21 · unverdicted · none · ref 6 · internal anchor
The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.
Understanding Deep Learning Techniques for Image Segmentation cs.CV · 2019-07-13 · unverdicted · none · ref 168 · internal anchor
A 2019 survey that categorizes and intuitively explains major deep learning techniques for image segmentation, progressing from classical methods to modern neural architectures.
HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos cs.CV · 2026-05-17 · unreviewed · ref 37 · internal anchor
