
arXiv: 1907.07045 · v1 · pith:Y4EVAAQAnew · submitted 2019-07-16 · 💻 cs.CV · cs.LG · cs.RO

Pedestrian Tracking by Probabilistic Data Association and Correspondence Embeddings

Pith reviewed 2026-05-24 20:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO
keywords pedestrian tracking · data association · correspondence embeddings · JIPDA · multi-target tracking · deep features · moving camera · ego-motion

The pith

In moving-camera sequences with unknown ego-motion, global nearest-neighbor tracking of deep correspondence embeddings outperforms kinematic cues for pedestrian tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the relative value of position-velocity kinematics versus learned appearance features when linking detections of multiple pedestrians across video frames. In fixed-camera settings a fine-tuned detector paired with joint integrated probabilistic data association driven only by kinematics ranks first on the 3DMOT2015 benchmark. When the camera itself moves and its motion is unknown, the same kinematic approach is surpassed by switching to nearest-neighbor matching on embeddings trained from ResNet-18 features with angular loss plus a margin. The work also reports that feeding the embeddings directly into the probabilistic association step itself produces little additional benefit. This distinction matters because many practical tracking tasks occur from moving platforms whose ego-motion cannot be measured reliably.

Core claim

The central claim is that, for sequences captured by a moving camera whose ego-motion is unknown, the best tracking performance is obtained by discarding kinematic cues and instead performing global nearest-neighbor matching on deep correspondence embeddings. These embeddings are produced by fine-tuning the second block of ResNet-18 with an angular loss extended by a margin term. Direct insertion of the same embeddings into the JIPDA filter did not yield significant further gains, suggesting that the geometry of the embedding space for soft data association requires additional study.
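
For reference, the angular loss that the embeddings build on (Wang et al., ICCV 2017) can be written for a triplet of anchor $x_a$, positive $x_p$, and negative $x_n$ as in the first line below, with $\alpha$ the angle bound that is its hyperparameter. The second line is only an assumed illustration of a margin extension: the abstract does not give the authors' exact formulation, so the symbol $m$ and its placement inside the hinge are hypothetical.

```latex
% Angular loss (Wang et al., 2017); x_c is the midpoint of anchor and positive.
\ell_{\mathrm{ang}}
  = \Big[ \lVert x_a - x_p \rVert^2
          - 4 \tan^2\!\alpha \, \lVert x_n - x_c \rVert^2 \Big]_+ ,
\qquad x_c = \tfrac{1}{2}\,(x_a + x_p)

% ASSUMPTION: one possible margin-extended variant, with hypothetical margin m >= 0.
\ell_{\mathrm{ang},m}
  = \Big[ \lVert x_a - x_p \rVert^2
          - 4 \tan^2\!\alpha \, \lVert x_n - x_c \rVert^2 + m \Big]_+
```

Here $[\,\cdot\,]_+ = \max(0, \cdot)$. Under this assumed form, a positive $m$ keeps the loss active until the triplet satisfies the angular constraint with some slack.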

What carries the argument

Global nearest-neighbor tracking of deep correspondence embeddings trained by angular loss with margin on ResNet-18 features.
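
As a rough illustration of what global nearest-neighbor association on embeddings involves, the sketch below picks the one-to-one assignment between track embeddings and detection embeddings that minimises the total cosine distance. It is not the authors' implementation: the function names, the cosine distance, the `max_dist` gate, and the brute-force search are all assumptions made for this example. A practical tracker would likely use a polynomial-time assignment solver instead of enumerating permutations.

```python
from itertools import permutations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def global_nearest_neighbor(track_embs, det_embs, max_dist=0.5):
    """Assign detections to tracks by minimising the total cosine distance.

    Brute force over all one-to-one assignments, so only suitable for a
    handful of targets. Returns {track_index: detection_index}.
    max_dist is a hypothetical gate: pairs in the best assignment that
    are farther apart than this are dropped after the search.
    """
    n_tracks, n_dets = len(track_embs), len(det_embs)
    cost = [[cosine_distance(t, d) for d in det_embs] for t in track_embs]

    best_pairs, best_cost = [], float("inf")
    if n_tracks <= n_dets:
        # Every track gets a distinct detection.
        candidates = (list(enumerate(p))
                      for p in permutations(range(n_dets), n_tracks))
    else:
        # Every detection gets a distinct track.
        candidates = ([(t, d) for d, t in enumerate(p)]
                      for p in permutations(range(n_tracks), n_dets))
    for pairs in candidates:
        total = sum(cost[t][d] for t, d in pairs)
        if total < best_cost:
            best_pairs, best_cost = pairs, total

    return {t: d for t, d in best_pairs if cost[t][d] <= max_dist}


# Toy example with made-up 2-D embeddings (illustrative values only).
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, 0.0]]
print(global_nearest_neighbor(tracks, detections))  # prints {0: 1, 1: 0}
```

The third detection stays unassigned because there are only two tracks; in a full tracker it would be a candidate for starting a new track.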

If this is right

  • A fine-tuned convolutional detector combined with kinematic-only JIPDA produces the top-ranked submission on the fixed-camera 3DMOT2015 benchmark.
  • Appearance embeddings trained with angular loss plus margin enable reliable frame-to-frame matching when ego-motion is unmodeled.
  • Direct use of the embeddings inside the JIPDA association step brings no clear benefit over the nearest-neighbor approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embeddings appear to encode identity information that is more robust to unmodeled camera motion than explicit position-velocity models.
  • The same nearest-neighbor strategy could be tested on other moving-platform tasks such as vehicle or drone tracking.
  • An adaptive system that selects kinematics or embeddings according to estimated ego-motion reliability might combine the strengths of both.

Load-bearing premise

The learned embeddings stay stable enough across viewpoint changes and occlusions that nearest-neighbor distances correctly identify the same pedestrian from one frame to the next.

What would settle it

A moving-camera sequence in which nearest-neighbor matching on these embeddings produces more identity switches and track breaks than a purely kinematic tracker would falsify the superiority claim.
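
Running that comparison requires identity switches to be counted the same way for both trackers. The sketch below is a simplified count, assuming per-frame ground-truth-to-tracker matches have already been computed by some evaluation protocol. The input format and the function name are hypothetical, track breaks are not counted, and this is not the official benchmark evaluation code.

```python
def count_identity_switches(frames):
    """Count identity switches over a sequence of per-frame matches.

    frames: list of dicts, one per frame, mapping ground-truth id to the
    tracker id it was matched to in that frame (hypothetical format).
    A switch is counted whenever a ground-truth object is matched to a
    tracker id different from the one it was last matched to.
    """
    last_match = {}
    switches = 0
    for matches in frames:
        for gt_id, tracker_id in matches.items():
            if gt_id in last_match and last_match[gt_id] != tracker_id:
                switches += 1
            last_match[gt_id] = tracker_id
    return switches


# Made-up matches for two pedestrians over four frames.
frames = [
    {1: 10, 2: 20},
    {1: 10, 2: 21},  # pedestrian 2 switches from track 20 to 21
    {1: 10},         # pedestrian 2 is unmatched in this frame
    {1: 11, 2: 21},  # pedestrian 1 switches from track 10 to 11
]
print(count_identity_switches(frames))  # prints 2
```

Applying the same function to the output of a kinematic tracker and of the embedding-based tracker on one sequence gives the two numbers whose ordering the test depends on.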

Figures

Figures reproduced from arXiv: 1907.07045 by Borna Bićanić, Ivan Marković, Ivan Petrović, Marin Oršić, Siniša Šegvić.

Figure 1: Pedestrian tracking on 3DMOT2015 sequences, PETS09-S2L1
Figure 2: Distribution of scalar products of the deep embeddings

original abstract

This paper studies the interplay between kinematics (position and velocity) and appearance cues for establishing correspondences in multi-target pedestrian tracking. We investigate tracking-by-detection approaches based on a deep learning detector, joint integrated probabilistic data association (JIPDA), and appearance-based tracking of deep correspondence embeddings. We first addressed the fixed-camera setup by fine-tuning a convolutional detector for accurate pedestrian detection and combining it with kinematic-only JIPDA. The resulting submission ranked first on the 3DMOT2015 benchmark. However, in sequences with a moving camera and unknown ego-motion, we achieved the best results by replacing kinematic cues with global nearest neighbor tracking of deep correspondence embeddings. We trained the embeddings by fine-tuning features from the second block of ResNet-18 using angular loss extended by a margin term. We note that integrating deep correspondence embeddings directly in JIPDA did not bring significant improvement. It appears that geometry of deep correspondence embeddings for soft data association needs further investigation in order to obtain the best from both worlds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates the combination of kinematic and appearance cues for multi-target pedestrian tracking in a tracking-by-detection framework. It reports that fine-tuning a convolutional detector and applying kinematic-only JIPDA yields first place on the 3DMOT2015 benchmark for fixed-camera sequences. For moving-camera sequences with unknown ego-motion, the authors claim best results are obtained by replacing kinematics with global nearest-neighbor association on deep correspondence embeddings extracted from the second block of a ResNet-18 fine-tuned with angular loss plus a margin term. Direct insertion of the same embeddings into JIPDA is reported to produce no significant gain, and the authors conclude that further work is needed on the geometry of embeddings for soft data association.

Significance. If the moving-camera results hold under additional validation, the work usefully demonstrates the breakdown of kinematic models under unknown ego-motion and the practical value of learned embeddings for appearance-based association when kinematics are unavailable. The top ranking on the public 3DMOT2015 benchmark is a concrete, reproducible strength that can be directly compared by other researchers. The explicit negative result on embedding integration into JIPDA is also valuable for guiding future work on hybrid association methods.

major comments (2)
  1. [Abstract (moving-camera paragraph)] The claim that global nearest-neighbor tracking of the learned correspondence embeddings outperforms kinematics in moving-camera sequences is load-bearing for the paper's central contribution, yet rests on the untested assumption that embeddings from ResNet-18 block 2 remain sufficiently invariant to viewpoint changes and occlusions. The manuscript itself notes that direct insertion of the same embeddings into JIPDA produced no significant improvement; this observation is consistent with only marginal robustness and requires an explicit ablation on viewpoint-augmented training data or cross-validation across camera-motion regimes to substantiate the generalization claim.
  2. [Abstract and results sections] The reported benchmark rankings that support the performance ordering between kinematic JIPDA and embedding-based NN tracking supply no error bars, statistical significance tests, or ablation tables. Without these, it is impossible to determine whether the observed ordering is robust to random seeds, detector variations, or sequence selection, weakening the evidence for the central claim that embeddings are preferable when ego-motion is unknown.
minor comments (1)
  1. [Methods] The description of the angular loss and margin term would benefit from an explicit equation or pseudocode in the methods section to allow exact reproduction of the embedding training procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract (moving-camera paragraph)] The claim that global nearest-neighbor tracking of the learned correspondence embeddings outperforms kinematics in moving-camera sequences is load-bearing for the paper's central contribution, yet rests on the untested assumption that embeddings from ResNet-18 block 2 remain sufficiently invariant to viewpoint changes and occlusions. The manuscript itself notes that direct insertion of the same embeddings into JIPDA produced no significant improvement; this observation is consistent with only marginal robustness and requires an explicit ablation on viewpoint-augmented training data or cross-validation across camera-motion regimes to substantiate the generalization claim.

    Authors: The performance ordering is supported by the results on the 3DMOT2015 benchmark sequences with moving cameras, which feature real viewpoint changes, ego-motion, and occlusions. These sequences serve as a practical test of the embeddings' utility under the conditions described. We have explicitly noted the lack of improvement when integrating embeddings into JIPDA and concluded that further work on embedding geometry is required. We maintain that the benchmark results substantiate the claim for the evaluated scenarios without necessitating additional viewpoint-augmented ablations, which were not part of the original experimental design. revision: no

  2. Referee: [Abstract and results sections] The reported benchmark rankings that support the performance ordering between kinematic JIPDA and embedding-based NN tracking supply no error bars, statistical significance tests, or ablation tables. Without these, it is impossible to determine whether the observed ordering is robust to random seeds, detector variations, or sequence selection, weakening the evidence for the central claim that embeddings are preferable when ego-motion is unknown.

    Authors: We agree that the absence of error bars and statistical tests limits the assessment of robustness. The 3DMOT2015 benchmark uses a fixed set of sequences and a standardized evaluation, which is the conventional way to report and compare tracking performance. To strengthen the manuscript, we will revise the results section to include a brief discussion of these limitations and the deterministic nature of the reported rankings. This addresses the concern without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks

full rationale

The manuscript is an empirical tracking paper that reports performance on the public 3DMOT2015 benchmark after fine-tuning a convolutional detector and training correspondence embeddings on ResNet-18 features with angular loss. No derivation chain, uniqueness theorem, or fitted parameter is invoked to predict another quantity that is definitionally identical to the input. All load-bearing claims (e.g., superiority of global NN on embeddings when ego-motion is unknown) are evaluated by direct comparison against held-out test sequences rather than by algebraic reduction or self-citation. The single self-citation risk noted by the reader is minor and non-load-bearing.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Abstract supplies almost no explicit free parameters or invented entities; the approach rests on standard assumptions about detector fine-tuning and embedding suitability.

free parameters (1)
  • margin term in angular loss
    Added to angular loss for embedding training; concrete value not reported.
axioms (2)
  • domain assumption: Fine-tuning a convolutional detector yields accurate pedestrian detections on the target domain
    Invoked for the fixed-camera pipeline.
  • domain assumption: Deep features from ResNet-18 block 2 can be turned into identity-preserving correspondence embeddings
    Central premise of the appearance branch.

pith-pipeline@v0.9.0 · 5738 in / 1246 out tokens · 22692 ms · 2026-05-24T20:49:26.682104+00:00 · methodology

