Pedestrian Tracking by Probabilistic Data Association and Correspondence Embeddings
Pith reviewed 2026-05-24 20:49 UTC · model grok-4.3
The pith
In moving-camera sequences with unknown ego-motion, global nearest-neighbor tracking of deep correspondence embeddings outperforms kinematic cues for pedestrian tracking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that, for sequences captured by a moving camera whose ego-motion is unknown, the best tracking performance is obtained by discarding kinematic cues and instead performing global nearest-neighbor matching on deep correspondence embeddings. These embeddings are produced by fine-tuning the second block of ResNet-18 with an angular loss extended by a margin term. Direct insertion of the same embeddings into the JIPDA filter did not yield significant further gains, suggesting that the geometry of the embedding space for soft data association requires additional study.
What carries the argument
Global nearest-neighbor tracking of deep correspondence embeddings trained by angular loss with margin on ResNet-18 features.
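A minimal sketch of what global nearest-neighbor association on embeddings could look like. It assumes cosine distance between embedding vectors, a fixed gate of 0.5, and brute-force search over assignments; none of these choices is stated in the paper, and the function names are hypothetical.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def global_nearest_neighbor(track_embs, det_embs, gate=0.5):
    """Return {track_index: detection_index} with minimal total cost.

    Each track is matched to a distinct detection or left unmatched.
    Leaving a track unmatched costs `gate` (an assumed value), and pairs
    farther apart than `gate` are forbidden. The search is brute force,
    so this is only practical for a handful of objects.
    """
    n = len(track_embs)
    cost = [[cosine_distance(t, d) for d in det_embs] for t in track_embs]
    # One "None" column per track stands for "no detection assigned".
    columns = list(range(len(det_embs))) + [None] * n
    best_total, best_assignment = math.inf, {}
    for choice in permutations(columns, n):
        total = 0.0
        for i, j in enumerate(choice):
            if j is None:
                total += gate
            elif cost[i][j] <= gate:
                total += cost[i][j]
            else:
                total = math.inf
                break
        if total < best_total:
            best_total = total
            best_assignment = {
                i: j for i, j in enumerate(choice) if j is not None
            }
    return best_assignment
```

The assignment is global in the sense that it minimizes the summed distance over all tracks at once instead of greedily picking the closest detection per track. A practical tracker would replace the brute-force loop with a polynomial-time assignment solver.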
If this is right
- A fine-tuned convolutional detector combined with kinematic-only JIPDA produces the top-ranked submission on the fixed-camera 3DMOT2015 benchmark.
- Appearance embeddings trained with angular loss plus margin enable reliable frame-to-frame matching when ego-motion is unmodeled.
- Direct use of the embeddings inside the JIPDA association step brings no clear benefit over the nearest-neighbor approach.
Where Pith is reading between the lines
- Embeddings appear to encode identity information that is more robust to unmodeled camera motion than explicit position-velocity models.
- The same nearest-neighbor strategy could be tested on other moving-platform tasks such as vehicle or drone tracking.
- An adaptive system that selects kinematics or embeddings according to estimated ego-motion reliability might combine the strengths of both.
Load-bearing premise
The learned embeddings stay stable enough across viewpoint changes and occlusions that nearest-neighbor distances correctly identify the same pedestrian from one frame to the next.
What would settle it
A moving-camera sequence in which nearest-neighbor matching on these embeddings produces more identity switches and track breaks than a purely kinematic tracker would falsify the superiority claim.
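The comparison described above needs counts of identity switches and track breaks for each tracker. The following is a simplified sketch, not the benchmark's official evaluation: it assumes each frame has already been reduced to a hypothetical mapping from ground-truth pedestrian ID to matched tracker ID, and that a pedestrian missing from a frame's mapping is untracked in that frame.

```python
def count_switches_and_breaks(frames):
    """Count identity switches and track breaks.

    frames: list of dicts, one per frame, mapping gt_id -> track_id.
    An identity switch is counted when a ground-truth ID is matched to a
    different tracker ID than at its previous match. A track break is
    counted when a ground-truth ID is matched again after one or more
    frames without a match.
    """
    last_track = {}   # gt_id -> tracker ID at the most recent match
    last_frame = {}   # gt_id -> frame index of the most recent match
    switches = 0
    breaks = 0
    for t, matches in enumerate(frames):
        for gt_id, track_id in matches.items():
            if gt_id in last_track:
                if last_track[gt_id] != track_id:
                    switches += 1
                if last_frame[gt_id] < t - 1:
                    breaks += 1
            last_track[gt_id] = track_id
            last_frame[gt_id] = t
    return switches, breaks
```

Running this on the output of both the embedding-based and the kinematic tracker for the same sequence would give the two pairs of counts that the falsification test compares.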
Original abstract
This paper studies the interplay between kinematics (position and velocity) and appearance cues for establishing correspondences in multi-target pedestrian tracking. We investigate tracking-by-detection approaches based on a deep learning detector, joint integrated probabilistic data association (JIPDA), and appearance-based tracking of deep correspondence embeddings. We first addressed the fixed-camera setup by fine-tuning a convolutional detector for accurate pedestrian detection and combining it with kinematic-only JIPDA. The resulting submission ranked first on the 3DMOT2015 benchmark. However, in sequences with a moving camera and unknown ego-motion, we achieved the best results by replacing kinematic cues with global nearest neighbor tracking of deep correspondence embeddings. We trained the embeddings by fine-tuning features from the second block of ResNet-18 using angular loss extended by a margin term. We note that integrating deep correspondence embeddings directly in JIPDA did not bring significant improvement. It appears that geometry of deep correspondence embeddings for soft data association needs further investigation in order to obtain the best from both worlds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the combination of kinematic and appearance cues for multi-target pedestrian tracking in a tracking-by-detection framework. It reports that fine-tuning a convolutional detector and applying kinematic-only JIPDA yields first place on the 3DMOT2015 benchmark for fixed-camera sequences. For moving-camera sequences with unknown ego-motion, the authors claim best results are obtained by replacing kinematics with global nearest-neighbor association on deep correspondence embeddings extracted from the second block of a ResNet-18 fine-tuned with angular loss plus a margin term. Direct insertion of the same embeddings into JIPDA is reported to produce no significant gain, and the authors conclude that further work is needed on the geometry of embeddings for soft data association.
Significance. If the moving-camera results hold under additional validation, the work usefully demonstrates the breakdown of kinematic models under unknown ego-motion and the practical value of learned embeddings for appearance-based association when kinematics are unavailable. The top ranking on the public 3DMOT2015 benchmark is a concrete, reproducible strength that can be directly compared by other researchers. The explicit negative result on embedding integration into JIPDA is also valuable for guiding future work on hybrid association methods.
major comments (2)
- [Abstract (moving-camera paragraph)] The claim that global nearest-neighbor tracking of the learned correspondence embeddings outperforms kinematics in moving-camera sequences is load-bearing for the paper's central contribution, yet rests on the untested assumption that embeddings from ResNet-18 block 2 remain sufficiently invariant to viewpoint changes and occlusions. The manuscript itself notes that direct insertion of the same embeddings into JIPDA produced no significant improvement; this observation is consistent with only marginal robustness and requires an explicit ablation on viewpoint-augmented training data or cross-validation across camera-motion regimes to substantiate the generalization claim.
- [Abstract and results sections] The reported benchmark rankings that support the performance ordering between kinematic JIPDA and embedding-based NN tracking supply no error bars, statistical significance tests, or ablation tables. Without these, it is impossible to determine whether the observed ordering is robust to random seeds, detector variations, or sequence selection, weakening the evidence for the central claim that embeddings are preferable when ego-motion is unknown.
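One way to probe robustness to sequence selection is a paired bootstrap over sequences. The sketch below assumes that per-sequence scores for the two trackers (for example, a tracking-accuracy metric) are available as two equal-length lists; the function name, the resample count, and the 95% level are illustrative choices, not something the paper or the benchmark prescribes.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-sequence difference.

    scores_a[i] and scores_b[i] must refer to the same sequence.
    Sequences are resampled with replacement, and the mean of
    (scores_a - scores_b) is recomputed for every resample.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

An interval that excludes zero would suggest the ordering of the two trackers does not hinge on one or two sequences. With only a few benchmark sequences the interval will be wide, which is itself informative.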
minor comments (1)
- [Methods] The description of the angular loss and margin term would benefit from an explicit equation or pseudocode in the methods section to allow exact reproduction of the embedding training procedure.
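For orientation, the triplet form of the angular loss as defined by Wang et al. (ICCV 2017) is given below, followed by one plausible way a margin could enter. Here $x_a$, $x_p$, and $x_n$ are the anchor, positive, and negative embeddings, and $\alpha$ is the angle bound. The additive margin $m$ inside the hinge is an assumption made for illustration only; the review does not state how the paper actually defines its margin term.

```latex
\begin{align}
  x_c &= \tfrac{1}{2}\,(x_a + x_p), \\
  \ell_{\mathrm{ang}}(x_a, x_p, x_n)
      &= \Big[\, \lVert x_a - x_p \rVert^2
         - 4 \tan^2\!\alpha \, \lVert x_n - x_c \rVert^2 \,\Big]_+ , \\
  % Assumed margin extension (hypothetical, not confirmed by the source):
  \ell_{\mathrm{ang},m}(x_a, x_p, x_n)
      &= \Big[\, \lVert x_a - x_p \rVert^2
         - 4 \tan^2\!\alpha \, \lVert x_n - x_c \rVert^2 + m \,\Big]_+ ,
\end{align}
```

With $[\,\cdot\,]_+ = \max(0, \cdot)$, setting $m = 0$ recovers the original loss, while $m > 0$ keeps penalizing triplets that satisfy the angular constraint only barely.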
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below.
- Referee: [Abstract (moving-camera paragraph)] The claim that global nearest-neighbor tracking of the learned correspondence embeddings outperforms kinematics in moving-camera sequences is load-bearing for the paper's central contribution, yet rests on the untested assumption that embeddings from ResNet-18 block 2 remain sufficiently invariant to viewpoint changes and occlusions. The manuscript itself notes that direct insertion of the same embeddings into JIPDA produced no significant improvement; this observation is consistent with only marginal robustness and requires an explicit ablation on viewpoint-augmented training data or cross-validation across camera-motion regimes to substantiate the generalization claim.
Authors: The performance ordering is supported by the results on the 3DMOT2015 benchmark sequences with moving cameras, which feature real viewpoint changes, ego-motion, and occlusions. These sequences serve as a practical test of the embeddings' utility under the conditions described. We have explicitly noted the lack of improvement when integrating embeddings into JIPDA and concluded that further work on embedding geometry is required. We maintain that the benchmark results substantiate the claim for the evaluated scenarios without necessitating additional viewpoint-augmented ablations, which were not part of the original experimental design.
Revision: no
- Referee: [Abstract and results sections] The reported benchmark rankings that support the performance ordering between kinematic JIPDA and embedding-based NN tracking supply no error bars, statistical significance tests, or ablation tables. Without these, it is impossible to determine whether the observed ordering is robust to random seeds, detector variations, or sequence selection, weakening the evidence for the central claim that embeddings are preferable when ego-motion is unknown.
Authors: We agree that the absence of error bars and statistical tests limits the assessment of robustness. The 3DMOT2015 benchmark uses a fixed set of sequences and a standardized evaluation, which is the conventional way to report and compare tracking performance. To strengthen the manuscript, we will revise the results section to include a brief discussion of these limitations and the deterministic nature of the reported rankings. This addresses the concern without requiring new experiments.
Revision: yes
Circularity Check
No significant circularity; results rest on external benchmarks
The manuscript is an empirical tracking paper that reports performance on the public 3DMOT2015 benchmark after fine-tuning a convolutional detector and training correspondence embeddings on ResNet-18 features with angular loss. No derivation chain, uniqueness theorem, or fitted parameter is invoked to predict another quantity that is definitionally identical to the input. All load-bearing claims (e.g., superiority of global NN on embeddings when ego-motion is unknown) are evaluated by direct comparison against held-out test sequences rather than by algebraic reduction or self-citation. The single self-citation risk noted by the reader is minor and non-load-bearing.
Axiom & Free-Parameter Ledger
free parameters (1)
- margin term in angular loss
axioms (2)
- domain assumption: Fine-tuning a convolutional detector yields accurate pedestrian detections on the target domain
- domain assumption: Deep features from ResNet-18 block 2 can be turned into identity-preserving correspondence embeddings