CDTB: A Color and Depth Visual Object Tracking Dataset and Benchmark
Pith reviewed 2026-05-25 12:06 UTC · model grok-4.3
The pith
New performance measures for long-term tracking generalize short-term metrics and remain robust to sparse annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Following a formal definition of long-term tracking, the authors design performance measures intended to maximize analytical strength. They report that the measures are easier to interpret than existing ones and better at distinguishing tracking behaviors, that they generalize short-term measures and so link the two problems, and that they remain robust to sparse temporal annotation, which allows much longer sequences to be annotated. The CDTB dataset of carefully selected sequences with frequent target disappearances supports an extensive evaluation of the largest number of long-term trackers, a comparison with state-of-the-art short-term trackers, an analysis of how architecture implementations affect long-term performance, and an exploration of how re-detection and visual model update strategies affect drift.
What carries the argument
The new long-term performance measures that generalize short-term ones and tolerate annotation sparsity, paired with the CDTB color-and-depth dataset containing many target disappearances.
If this is right
- The measures link short-term and long-term tracking problems.
- Sequences hundreds of times longer than those in current datasets can be annotated without increasing manual labor.
- Influence of tracking architecture implementations on long-term performance can be systematically analyzed.
- Re-detection strategies and visual model update strategies can be compared for their effect on long-term tracking drift.
- The methodology integrates into the VOT toolkit to automate experimental analysis.
Where Pith is reading between the lines
- Trackers that add explicit re-detection modules may show measurable gains on sequences with frequent disappearances.
- The color and depth modalities together could encourage development of multimodal trackers that maintain identity across occlusions.
- The proposed taxonomy may help classify existing trackers and guide creation of hybrid systems that adapt between short-term and long-term modes.
- Widespread adoption of the measures could standardize reporting across short-term and long-term tracking papers.
Load-bearing premise
The carefully selected sequences with many target disappearances form a representative and sufficiently challenging test of long-term tracking behavior.
What would settle it
An independent collection of long sequences with target disappearances in which the new measures fail to distinguish tracking behaviors better than prior measures or lose their generalization to short-term metrics.
Original abstract
A long-term visual object tracking performance evaluation methodology and a benchmark are proposed. Performance measures are designed by following a long-term tracking definition to maximize the analysis probing strength. The new measures outperform existing ones in interpretation potential and in better distinguishing between different tracking behaviors. We show that these measures generalize the short-term performance measures, thus linking the two tracking problems. Furthermore, the new measures are highly robust to temporal annotation sparsity and allow annotation of sequences hundreds of times longer than in the current datasets without increasing manual annotation labor. A new challenging dataset of carefully selected sequences with many target disappearances is proposed. A new tracking taxonomy is proposed to position trackers on the short-term/long-term spectrum. The benchmark contains an extensive evaluation of the largest number of long-term tackers and comparison to state-of-the-art short-term trackers. We analyze the influence of tracking architecture implementations to long-term performance and explore various re-detection strategies as well as influence of visual model update strategies to long-term tracking drift. The methodology is integrated in the VOT toolkit to automate experimental analysis and benchmarking and to facilitate future development of long-term trackers.
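The abstract says the new measures generalize short-term ones but does not give formulas. The sketch below is one plausible reading, assuming the tracking precision, recall, and F-score formulation common in the long-term tracking literature. The names `iou` and `precision_recall_f` are hypothetical, and using `None` to mean "target reported absent" or "target not visible" is a simplification that stands in for a confidence threshold.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height); assumed layout


def iou(a: Optional[Box], b: Optional[Box]) -> float:
    """Intersection over union; 0.0 when either box is absent."""
    if a is None or b is None:
        return 0.0
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f(
    preds: List[Optional[Box]], truths: List[Optional[Box]]
) -> Tuple[float, float, float]:
    """Long-term tracking precision, recall, and F-score (illustrative).

    preds[t] is None when the tracker reports the target as absent;
    truths[t] is None when the target is not visible in frame t.
    """
    total = sum(iou(p, g) for p, g in zip(preds, truths))
    n_pred = sum(p is not None for p in preds)   # frames with a prediction
    n_vis = sum(g is not None for g in truths)   # frames with a visible target
    precision = total / n_pred if n_pred else 0.0
    recall = total / n_vis if n_vis else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_score
```

Under these assumptions, when the target is visible in every frame and the tracker reports a box in every frame, both denominators equal the sequence length, so precision, recall, and F-score all collapse to the average overlap used in short-term evaluation. That is the sense in which such measures can link the two problems.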
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a long-term visual object tracking evaluation methodology and benchmark, including new performance measures designed to maximize analysis probing strength for long-term scenarios. These measures are claimed to outperform existing ones in interpretation and distinguishing tracking behaviors, while generalizing short-term measures and remaining robust to temporal annotation sparsity. A new challenging dataset CDTB is introduced with sequences featuring many target disappearances, along with a tracking taxonomy positioning trackers on the short-term/long-term spectrum. The work includes extensive evaluation of numerous long-term trackers versus short-term ones, analysis of architectures, re-detection strategies, and model updates, with integration into the VOT toolkit for automated benchmarking.
Significance. If the claims hold, this provides a valuable standardized benchmark and measures for long-term tracking, an area with limited prior resources compared to short-term tracking. The robustness to annotation sparsity and linkage between short- and long-term problems could facilitate scalable evaluation and development of trackers handling disappearances and drift. Credit is due for the empirical scale (largest number of long-term trackers evaluated) and practical integration with the VOT toolkit.
major comments (1)
- [Dataset and Benchmark sections] The central claim that the new measures outperform existing ones and generalize short-term measures rests on evaluations using the CDTB dataset of 'carefully selected' sequences. However, without explicit criteria, statistical representativeness analysis, or comparison to broader distributions of long-term tracking challenges (e.g., disappearance patterns or scene types), it is unclear whether superior performance and robustness would transfer beyond this specific data selection.
minor comments (2)
- [Abstract] The abstract contains a typo: 'tackers' should be 'trackers'.
- [Evaluation] Clarify in the results how the new measures were quantitatively compared to prior ones (e.g., specific tables or figures showing interpretation potential and behavior distinction).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment of the work. We address the major comment below.
-
Referee: [Dataset and Benchmark sections] The central claim that the new measures outperform existing ones and generalize short-term measures rests on evaluations using the CDTB dataset of 'carefully selected' sequences. However, without explicit criteria, statistical representativeness analysis, or comparison to broader distributions of long-term tracking challenges (e.g., disappearance patterns or scene types), it is unclear whether superior performance and robustness would transfer beyond this specific data selection.
Authors: We agree that the manuscript would be strengthened by explicitly stating the sequence selection criteria. The CDTB sequences were chosen to emphasize long-term tracking challenges, specifically a high frequency of target disappearances and reappearances (on average several times per sequence), combined with diversity in environments, lighting conditions, and motion patterns while ensuring both color and depth data are available. We will revise the Dataset section to list these criteria in detail. A formal statistical analysis comparing the dataset's disappearance pattern distribution to a hypothetical global distribution of all possible long-term videos is outside the scope of this work, as it would require constructing and annotating a much larger corpus. However, the performance measures themselves are derived directly from the formal definition of long-term tracking (target may disappear and reappear) and are shown both mathematically and empirically to generalize the short-term measures; their robustness to annotation sparsity is validated via controlled subsampling experiments independent of the specific sequence selection. These properties support the broader applicability of the methodology beyond the particular dataset.
Revision: yes
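The rebuttal refers to controlled subsampling experiments without describing them. The sketch below shows what such a check could look like. It is not the authors' protocol: the function names are hypothetical, the data is synthetic, and frames are drawn independently, whereas real video frames are temporally correlated. It keeps every `step`-th annotated frame and compares the resulting F-score with the one computed on all frames.

```python
import random
from typing import List, Tuple


def f_score(overlaps: List[float], predicted: List[bool], visible: List[bool]) -> float:
    """F-score from per-frame overlaps and presence flags.

    With precision = total / n_pred and recall = total / n_vis, the harmonic
    mean simplifies to 2 * total / (n_pred + n_vis).
    """
    total = sum(overlaps)
    denom = sum(predicted) + sum(visible)
    return 2.0 * total / denom if denom else 0.0


def f_score_subsampled(
    overlaps: List[float], predicted: List[bool], visible: List[bool], step: int
) -> float:
    """F-score computed only on every `step`-th frame (sparse annotation)."""
    idx = range(0, len(overlaps), step)
    return f_score(
        [overlaps[i] for i in idx],
        [predicted[i] for i in idx],
        [visible[i] for i in idx],
    )


def synthetic_sequence(
    n_frames: int, seed: int = 0
) -> Tuple[List[float], List[bool], List[bool]]:
    """Synthetic per-frame results; all probabilities here are made up."""
    rng = random.Random(seed)
    overlaps, predicted, visible = [], [], []
    for _ in range(n_frames):
        vis = rng.random() < 0.80    # target visible in this frame
        pred = rng.random() < 0.85   # tracker reports a box in this frame
        # Overlap is nonzero only when a box is reported on a visible target.
        overlaps.append(rng.uniform(0.4, 0.9) if (vis and pred) else 0.0)
        predicted.append(pred)
        visible.append(vis)
    return overlaps, predicted, visible


if __name__ == "__main__":
    o, p, v = synthetic_sequence(20000)
    for step in (1, 10, 50):
        print(step, round(f_score_subsampled(o, p, v, step), 4))
```

If the measure tolerates sparse annotation, the printed scores should stay close to each other as `step` grows. A real experiment would run this on actual tracker output and ground truth instead of synthetic draws.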
Circularity Check
Empirical benchmark paper with no derivation chain reducing to inputs
This is a dataset/benchmark paper that defines new performance measures following a long-term tracking definition, proposes a new dataset of selected sequences, and evaluates trackers empirically. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs. The generalization and robustness claims are shown via experiments on the new data rather than forced by definition. No load-bearing self-citation chains or ansatzes are invoked. This is the expected non-finding for empirical construction work.