The 2017 DAVIS Challenge on Video Object Segmentation
Tool reference. 90% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.
abstract
We present the 2017 DAVIS Challenge on Video Object Segmentation, a public dataset, benchmark, and competition specifically designed for the task of video object segmentation. Following the footsteps of other successful initiatives, such as ILSVRC and PASCAL VOC, which established the avenue of research in the fields of scene classification and semantic segmentation, the DAVIS Challenge comprises a dataset, an evaluation methodology, and a public competition with a dedicated workshop co-located with CVPR 2017. The DAVIS Challenge follows up on the recent publication of DAVIS (Densely-Annotated VIdeo Segmentation), which has fostered the development of several novel state-of-the-art video object segmentation techniques. In this paper we describe the scope of the benchmark, highlight the main characteristics of the dataset, define the evaluation metrics of the competition, and present a detailed analysis of the results of the participants to the challenge.
citing papers explorer
- TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
  TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
- PhysInOne: Visual Physics Learning and Reasoning in One Suite
  PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
- Emerging Properties in Self-Supervised Vision Transformers
  Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
- Geo-Align: Video Generation Alignment via Metric Geometry Reward
  Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
- MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
  MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
- PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
  PROVE proposes RC metrics for perceptual removal coherence and releases PROVE-Bench to better align automatic scores with human judgments on object removal tasks.
- FluxShard: Motion-Aware Feature Cache Reuse for Collaborative Video Analytics in Mobile Edge Computing
  FluxShard uses per-block motion vectors and a Receptive Field Alignment Principle to manage feature cache reuse in edge-cloud video analytics, delivering 32.6-83.8% lower latency and 14.9-64.0% lower energy than baselines while preserving accuracy.
- Online Reasoning Video Object Segmentation
  The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
- GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes
  GP-4DGS uses variational Gaussian Processes with spatio-temporal kernels to provide uncertainty-aware reconstruction and prediction in 4D Gaussian Splatting for dynamic scenes.
- 3AM: 3egment Anything with Geometric Consistency in Videos
  3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
- Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
  BadVSFM is the first effective backdoor attack on prompt-driven video segmentation foundation models, using a two-stage encoder-decoder strategy to achieve high attack success rates with limited clean performance loss.
- 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
  4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
- Recurrent Video Masked Autoencoders
  RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
- SAM 3: Segment Anything with Concepts
  SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
- SAM 2++: Tracking Anything at Any Granularity
  SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.
- UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
  UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
  Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
- Self-supervised Training of Proposal-based Segmentation via Background Prediction
  A self-supervised loss based on background prediction trains proposal-based segmentation networks via Monte Carlo sampling for object detection in novel image appearances.
- SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion
  SimInsert is a training-free video object insertion technique that decouples the task into single-frame editing and semantic motion description, using image-to-video diffusion models with non-invasive guidance to achieve spatio-temporal coherence.
- Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
  A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.
- See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
  SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
- Weighted Reverse Convolution for Feature Upsampling
  Weighted Reverse Convolution is a spatially adaptive inverse operator for densifying high-level visual descriptors from vision foundation models, using weighted regularization and an FFT closed-form solution to improve dense prediction tasks.
- LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing
  LiBrA-Net achieves real-time native 4K video dehazing via Lie-algebraic bilateral affine fields and releases the first 4K paired dehazing video benchmark with per-frame annotations.
- Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
  LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.
- CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
  Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.
- Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model
  M-GDM uses motion vectors and frame types to guide a diffusion model in blind recovery of bitstream-corrupted videos without manual masks.
- Learning Long-term Motion Embeddings for Efficient Kinematics Generation
  A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
- PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation
  PanoSAM2 adapts SAM2 with a Pano-Aware Decoder, Distortion-Guided Mask Loss, and Long-Short Memory Module to improve 360 video object segmentation, reporting +5.6 and +6.7 gains over base SAM2 on two benchmarks.
- From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
  SVOR achieves stable, shadow-free video object removal under real-world imperfections via MUSE mask handling, DA-Seg localization, and curriculum training on real and synthetic data.
- MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization
  MotionAdapter transfers reference video motions into target videos inside DiT diffusion models by isolating attention-derived motion fields and refining them via DINO-guided semantic alignment.
- V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
  V2-SAM adapts SAM2 to cross-view object correspondence with geometry-aware and appearance-based prompt generators plus a post-hoc cyclic consistency selector, reporting new state-of-the-art results on Ego-Exo4D, DAVIS-2017, and HANDAL-X.
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
  Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
- Perception Encoder: The best visual embeddings are not at the output of the network
  Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
- SAM 2: Segment Anything in Images and Videos
  SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
- Vision Transformers Need Registers
  Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing
  TokenFlow produces consistent text-driven video edits by propagating diffusion features according to inter-frame correspondences extracted from the source video.
- Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation
  PTS is a cascaded neural network for video object segmentation that integrates proposal, tracking, and segmentation modules with a dynamic-reference adaptation scheme to report state-of-the-art results on DAVIS'17 and YouTube-VOS.
- Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance
  Proposes SNIS and NGM to enable tuning-free instruction-based video editing with improved visual quality and claimed SOTA results.
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
  ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
- PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines
  PAT-VCM adds lightweight auxiliary tokens to a shared baseline video stream to support multiple downstream machine tasks without task-specific codecs.
- TAPNext++: What's Next for Tracking Any Point (TAP)?
  TAPNext++ trains recurrent transformers on 1024-frame sequences with geometric augmentations and occluded-point supervision to achieve new state-of-the-art point tracking on long videos while adding a re-detection metric.
- SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2
  SAMannot delivers a memory-efficient local framework for interactive video instance segmentation by optimizing SAM2 with persistent identity tracking, lock-and-refine workflows, and auto-prompting.
- VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos
  VVitCutLER introduces VitCut as a temporally stable pseudo-label generator with cross-frame consistency and feature aggregation to improve unsupervised video object detection and segmentation.
- RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
  RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
- APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
  A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.
- Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
  The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.
- Understanding Deep Learning Techniques for Image Segmentation
  A 2019 survey that categorizes and intuitively explains major deep learning techniques for image segmentation, progressing from classical methods to modern neural architectures.
- HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos