Probing and Leveraging Video Diffusion Transformer Features for Robust Point Tracking

Chaehyun Kim; Dahyun Chung; Honggyu An; Hyunah Ko; Jisu Nam; Jung Yi; Junhwa Hur; Seungryong Kim; Siyoon Jin; Soowon Son

arxiv: 2512.20606 · v2 · pith:XRNC3X3Ynew · submitted 2025-12-23 · 💻 cs.CV

Soowon Son, Honggyu An, Jisu Nam, Hyunah Ko, Chaehyun Kim, Dahyun Chung, Siyoon Jin, Jung Yi, Junhwa Hur, Seungryong Kim

keywords: tracking, video, features, point, diffusion, real-world, robust, backbones

Abstract

Despite achieving strong results on standard benchmarks, current point tracking methods rely on feature backbones that are rarely designed with the temporal coherence needed for robust real-world performance. While recent works incorporate powerful visual foundation model (VFM) features into tracking pipelines, no prior work has systematically analyzed which VFM provides the most robust representations for point tracking. We present the first such analysis, evaluating diverse VFMs in a zero-shot setting on both standard and robustness benchmarks for point tracking. Our study reveals that video diffusion transformers (DiTs) consistently yield the most temporally coherent and discriminative features, even surpassing ResNet backbones explicitly supervised on tracking data. We hypothesize that this advantage stems from large-scale video pretraining, full 3D spatio-temporal attention, and a diffusion training objective. Motivated by this finding, we propose DiTracker, which integrates video DiT features into existing tracking frameworks through query-key matching cost computation, cost-level fusion with a lightweight ResNet branch, and LoRA adaptation. Under the same tracking head, DiTracker is trained solely on synthetic data with far fewer iterations, yet outperforms CoTracker3 trained with additional real-world videos, with the largest gains under challenging and corrupted scenarios. It further generalizes across tracking heads and scales with backbone size, confirming that generative video pretraining provides real-world priors that reduce the dependence on large-scale real-data supervision.
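
The abstract names query-key matching cost computation and cost-level fusion but gives no implementation detail. The following is a minimal illustrative sketch of the general idea, not the authors' method. It assumes cosine similarity as the matching cost, a fixed convex combination as the fusion rule, and random toy arrays in place of real DiT and ResNet features. The names `matching_cost`, `fuse_costs`, `argmax_track`, and `toy_branch` are hypothetical.

```python
import numpy as np


def matching_cost(query_vec, key_feats):
    """Cosine-similarity cost between one query vector and per-pixel keys.

    query_vec: (C,) feature sampled at the query point.
    key_feats: (T, H, W, C) per-frame feature maps.
    Returns a (T, H, W) cost volume with values in [-1, 1].
    """
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    k = key_feats / (np.linalg.norm(key_feats, axis=-1, keepdims=True) + 1e-8)
    return k @ q


def fuse_costs(cost_a, cost_b, weight=0.5):
    """Assumed fusion rule: convex combination of two cost volumes."""
    return weight * cost_a + (1.0 - weight) * cost_b


def argmax_track(cost):
    """Per-frame (y, x) location of the best match in a (T, H, W) volume."""
    T, H, W = cost.shape
    flat = cost.reshape(T, -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat, (H, W)), axis=1)


# Toy data: random features with one descriptor planted along a known track.
rng = np.random.default_rng(0)
T, H, W = 3, 8, 8
positions = [(1, 2), (3, 4), (5, 6)]  # ground-truth (y, x) per frame


def toy_branch(channels):
    """Stand-in for one backbone: returns (query vector, key feature maps)."""
    v = rng.normal(size=channels)
    keys = rng.normal(size=(T, H, W, channels))
    for t, (y, x) in enumerate(positions):
        keys[t, y, x] = v
    return v, keys


q_dit, k_dit = toy_branch(16)  # placeholder for video DiT features
q_res, k_res = toy_branch(8)   # placeholder for the lightweight ResNet branch

# Fusion happens on the cost volumes, so the two branches may have
# different channel widths.
cost = fuse_costs(matching_cost(q_dit, k_dit), matching_cost(q_res, k_res))
tracks = argmax_track(cost)
print(tracks)
```

In this toy setup the argmax of the fused cost recovers the planted positions in every frame. A real tracker would pass the cost volume to a learned tracking head rather than taking a hard argmax.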

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
cs.CV · 2026-05 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.