arxiv: 1907.00618 · v1 · pith:KEVUO6OMnew · submitted 2019-07-01 · 💻 cs.CV

CDTB: A Color and Depth Visual Object Tracking Dataset and Benchmark

Pith reviewed 2026-05-25 12:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual object tracking · long-term tracking · performance measures · benchmark dataset · color and depth · re-detection · tracking taxonomy

The pith

New performance measures for long-term tracking generalize short-term metrics and remain robust to sparse annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a long-term visual object tracking evaluation methodology and benchmark. New performance measures are designed to maximize analysis probing strength, outperforming prior ones by better distinguishing tracking behaviors and offering greater interpretation potential. These measures generalize short-term performance measures, linking the two problems, while remaining highly robust to temporal annotation sparsity. This robustness allows annotation of sequences hundreds of times longer than existing datasets without added manual effort. A new challenging dataset of sequences with many target disappearances, using color and depth, is introduced along with a taxonomy to position trackers on the short-term to long-term spectrum.

Core claim

Following a long-term tracking definition, the authors design performance measures that provide stronger analysis, outperform existing ones in interpretation and behavior distinction, generalize short-term measures to link the problems, and stay robust to annotation sparsity for much longer sequences. The CDTB dataset of carefully selected sequences with frequent target disappearances supports an extensive evaluation of the largest number of long-term trackers, their comparison to short-term state-of-the-art, analysis of architecture implementations, and exploration of re-detection and model update strategies for drift.

What carries the argument

The new long-term performance measures that generalize short-term ones and tolerate annotation sparsity, paired with the CDTB color-and-depth dataset containing many target disappearances.
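
To make the idea concrete, here is a minimal sketch of a precision/recall style measure for long-term tracking. The figure captions mention tracking precision, recall, and F, but this page does not give the formulas, so the formulation below is an assumption rather than the authors' implementation. Precision averages overlap over frames where the tracker reports a position, and recall averages it over frames where the target is visible. The names `iou` and `precision_recall_f`, the `(x, y, width, height)` box format, and the confidence-threshold convention are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f(
    predictions: Sequence[Optional[Box]],
    confidences: Sequence[float],
    ground_truth: Sequence[Optional[Box]],
    threshold: float,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F) at one confidence threshold.

    A frame counts as "reported" when the tracker outputs a box with
    confidence >= threshold; a ground-truth entry of None means the
    target is not visible in that frame.
    """
    overlap_sum = 0.0
    n_reported = 0
    n_visible = 0
    for pred, conf, gt in zip(predictions, confidences, ground_truth):
        reported = pred is not None and conf >= threshold
        if reported:
            n_reported += 1
        if gt is not None:
            n_visible += 1
        if reported and gt is not None:
            overlap_sum += iou(pred, gt)
    precision = overlap_sum / n_reported if n_reported else 0.0
    recall = overlap_sum / n_visible if n_visible else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_score
```

Under this assumed formulation, if the target is visible in every frame and the tracker reports in every frame, both denominators equal the sequence length, so precision and recall both reduce to the mean overlap. That is one way to read the claim that the long-term measures generalize a short-term accuracy measure.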

If this is right

  • The measures link short-term and long-term tracking problems.
  • Annotation of sequences hundreds of times longer becomes possible without increasing manual labor.
  • Influence of tracking architecture implementations on long-term performance can be systematically analyzed.
  • Re-detection strategies and visual model update strategies can be compared for their effect on long-term tracking drift.
  • The methodology integrates into the VOT toolkit to automate experimental analysis.
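
The annotation-sparsity point in the list above can be probed with a small simulation. The sketch below is not the paper's experiment. It generates a synthetic tracker output with made-up probabilities, computes an F-score on every frame, and recomputes it when only every k-th frame is annotated. The names `f_measure`, `simulate_sequence`, and `subsample`, and all numeric settings, are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# (overlap with ground truth, tracker reported, target visible)
Frame = Tuple[float, bool, bool]


def f_measure(frames: Sequence[Frame]) -> float:
    """F-score from per-frame overlaps, using only the frames given."""
    overlap_sum = sum(o for o, rep, vis in frames if rep and vis)
    n_reported = sum(1 for _, rep, _ in frames if rep)
    n_visible = sum(1 for _, _, vis in frames if vis)
    if n_reported == 0 or n_visible == 0:
        return 0.0
    precision = overlap_sum / n_reported
    recall = overlap_sum / n_visible
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def simulate_sequence(n_frames: int, seed: int = 0) -> List[Frame]:
    """Synthetic tracker output; every probability here is illustrative."""
    rng = random.Random(seed)
    frames: List[Frame] = []
    visible = True
    for _ in range(n_frames):
        # Occasionally toggle visibility to mimic disappearance and reappearance.
        if rng.random() < 0.01:
            visible = not visible
        # The tracker reports more often when the target is actually visible.
        reported = rng.random() < (0.9 if visible else 0.2)
        overlap = rng.uniform(0.4, 0.9) if (visible and reported) else 0.0
        frames.append((overlap, reported, visible))
    return frames


def subsample(frames: List[Frame], step: int) -> List[Frame]:
    """Keep every step-th frame, as if only those frames were annotated."""
    return frames[::step]


full = simulate_sequence(20000)
for step in (1, 10, 100):
    print(step, round(f_measure(subsample(full, step)), 3))
```

If the printed scores stay close as `step` grows, the measure is behaving as a frame-averaged statistic that sparse annotation can estimate. How close they stay on real sequences is an empirical question this toy setup cannot answer.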

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Trackers that add explicit re-detection modules may show measurable gains on sequences with frequent disappearances.
  • The color and depth modalities together could encourage development of multimodal trackers that maintain identity across occlusions.
  • The proposed taxonomy may help classify existing trackers and guide creation of hybrid systems that adapt between short-term and long-term modes.
  • Widespread adoption of the measures could standardize reporting across short-term and long-term tracking papers.

Load-bearing premise

The carefully selected sequences with many target disappearances form a representative and sufficiently challenging test of long-term tracking behavior.

What would settle it

An independent collection of long sequences with target disappearances in which the new measures fail to distinguish tracking behaviors better than prior measures or lose their generalization to short-term metrics.

Figures

Figures reproduced from arXiv: 1907.00618 by Ahmed Durmush, Alan Lukežič, Jani Käpylä, Jiří Matas, Joni-Kristian Kämäräinen, Matej Kristan, Ugur Kart.

Figure 1: RGB and depth sequences from CDTB. Depth offers a …
Figure 2: Two of the three sensors used in dataset acquisition: ToF …
Figure 3: The overall tracking performance is presented as tracking …
Figure 4: Tracking precision and recall calculated at the optimal …
Figure 5: Tracking performance w.r.t. visual attributes. The first eleven attributes correspond to scenarios with a visible target (showing F …
Figure 6: No redetection experiment. Tracking recall is shown on …
Original abstract

A long-term visual object tracking performance evaluation methodology and a benchmark are proposed. Performance measures are designed by following a long-term tracking definition to maximize the analysis probing strength. The new measures outperform existing ones in interpretation potential and in better distinguishing between different tracking behaviors. We show that these measures generalize the short-term performance measures, thus linking the two tracking problems. Furthermore, the new measures are highly robust to temporal annotation sparsity and allow annotation of sequences hundreds of times longer than in the current datasets without increasing manual annotation labor. A new challenging dataset of carefully selected sequences with many target disappearances is proposed. A new tracking taxonomy is proposed to position trackers on the short-term/long-term spectrum. The benchmark contains an extensive evaluation of the largest number of long-term tackers and comparison to state-of-the-art short-term trackers. We analyze the influence of tracking architecture implementations to long-term performance and explore various re-detection strategies as well as influence of visual model update strategies to long-term tracking drift. The methodology is integrated in the VOT toolkit to automate experimental analysis and benchmarking and to facilitate future development of long-term trackers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a long-term visual object tracking evaluation methodology and benchmark, including new performance measures designed to maximize analysis probing strength for long-term scenarios. These measures are claimed to outperform existing ones in interpretation and distinguishing tracking behaviors, while generalizing short-term measures and remaining robust to temporal annotation sparsity. A new challenging dataset CDTB is introduced with sequences featuring many target disappearances, along with a tracking taxonomy positioning trackers on the short-term/long-term spectrum. The work includes extensive evaluation of numerous long-term trackers versus short-term ones, analysis of architectures, re-detection strategies, and model updates, with integration into the VOT toolkit for automated benchmarking.

Significance. If the claims hold, this provides a valuable standardized benchmark and measures for long-term tracking, an area with limited prior resources compared to short-term tracking. The robustness to annotation sparsity and linkage between short- and long-term problems could facilitate scalable evaluation and development of trackers handling disappearances and drift. Credit is due for the empirical scale (largest number of long-term trackers evaluated) and practical integration with the VOT toolkit.

major comments (1)
  1. [Dataset and Benchmark sections] The central claim that the new measures outperform existing ones and generalize short-term measures rests on evaluations using the CDTB dataset of 'carefully selected' sequences. However, without explicit criteria, statistical representativeness analysis, or comparison to broader distributions of long-term tracking challenges (e.g., disappearance patterns or scene types), it is unclear whether superior performance and robustness would transfer beyond this specific data selection.
minor comments (2)
  1. [Abstract] The abstract contains a typo: 'tackers' should be 'trackers'.
  2. [Evaluation] Clarify in the results how the new measures were quantitatively compared to prior ones (e.g., specific tables or figures showing interpretation potential and behavior distinction).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of the work. We address the major comment below.

Point-by-point responses
  1. Referee: [Dataset and Benchmark sections] The central claim that the new measures outperform existing ones and generalize short-term measures rests on evaluations using the CDTB dataset of 'carefully selected' sequences. However, without explicit criteria, statistical representativeness analysis, or comparison to broader distributions of long-term tracking challenges (e.g., disappearance patterns or scene types), it is unclear whether superior performance and robustness would transfer beyond this specific data selection.

    Authors: We agree that the manuscript would be strengthened by explicitly stating the sequence selection criteria. The CDTB sequences were chosen to emphasize long-term tracking challenges, specifically a high frequency of target disappearances and reappearances (on average several times per sequence), combined with diversity in environments, lighting conditions, and motion patterns while ensuring both color and depth data are available. We will revise the Dataset section to list these criteria in detail. A formal statistical analysis comparing the dataset's disappearance pattern distribution to a hypothetical global distribution of all possible long-term videos is outside the scope of this work, as it would require constructing and annotating a much larger corpus. However, the performance measures themselves are derived directly from the formal definition of long-term tracking (target may disappear and reappear) and are shown both mathematically and empirically to generalize the short-term measures; their robustness to annotation sparsity is validated via controlled subsampling experiments independent of the specific sequence selection. These properties support the broader applicability of the methodology beyond the particular dataset. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain reducing to inputs

full rationale

This is a dataset/benchmark paper that defines new performance measures following a long-term tracking definition, proposes a new dataset of selected sequences, and evaluates trackers empirically. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs. The generalization and robustness claims are shown via experiments on the new data rather than forced by definition. No load-bearing self-citation chains or ansatzes are invoked. This is the expected non-finding for empirical construction work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the new measures rest on the domain assumption that long-term tracking is defined by frequent target disappearances.

pith-pipeline@v0.9.0 · 5765 in / 1059 out tokens · 30777 ms · 2026-05-25T12:06:47.911244+00:00 · methodology
