
arxiv: 2606.03694 · v2 · pith:HY3GIGY5new · submitted 2026-06-02 · 💻 cs.RO · cs.CV · cs.HC

Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset

Pith reviewed 2026-06-28 09:45 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.HC
keywords egocentric tracking · human-robot interaction · multi-object tracking · identity switches · face tracking · body tracking · re-identification · temporal memory

The pith

Body tracking with re-identification reduces identity switches by 49 percent in egocentric robot interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard multi-object tracking models, built for surveillance or driving, produce frequent identity switches when a social robot must follow users who move unpredictably, block one another, or leave and return to view. It supplies a new dataset recorded from the robot's own camera and runs a controlled comparison of face-based versus body-based detection, longer temporal memory, and appearance re-identification. Longer memory reduces breaks from occlusions but leaves complex motion events unsolved. Re-identification improves body tracks yet raises face switches because profile views change appearance sharply. The best combination of these components cuts identity switches by 49 percent relative to a plain tracking-by-detection baseline.

Core claim

The paper claims that, on an egocentric dataset of close-range human-robot encounters, body tracking augmented with appearance re-identification and extended temporal memory yields a 49 percent reduction in identity switches compared with a standard tracking-by-detection baseline, while the same re-identification step increases identity switches when applied to face detections because of sensitivity to profile angles.
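
The 49 percent figure is a relative reduction, so it only has meaning next to the baseline count it is measured against. A minimal sketch of that arithmetic follows; the function name `relative_reduction` and the counts are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_idsw: int, optimized_idsw: int) -> float:
    """Percentage drop in identity switches relative to the baseline."""
    if baseline_idsw <= 0:
        raise ValueError("baseline_idsw must be positive")
    return 100.0 * (baseline_idsw - optimized_idsw) / baseline_idsw


# Hypothetical counts chosen only to illustrate the arithmetic;
# they are not results reported in the paper.
print(relative_reduction(baseline_idsw=200, optimized_idsw=102))  # 49.0
```

A negative return value would indicate that the modified pipeline produced more identity switches than the baseline, which is the direction the summary reports for re-identification on face detections.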

What carries the argument

The custom-annotated egocentric dataset together with the modular pipeline that isolates detection from tracking logic and tests the separate contributions of temporal memory and re-identification.
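
To make the temporal-memory component concrete, here is a toy sketch of a lost-track buffer. It is a greedy nearest-neighbour tracker over 1-D positions, not the paper's tracker; `run_tracker`, `max_age`, `gate`, and the sample positions are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    x: float          # last known horizontal position (toy 1-D state)
    missed: int = 0   # consecutive frames without a matching detection


def run_tracker(frames, max_age, gate=50.0):
    """Greedy nearest-neighbour tracker; returns the IDs assigned per frame.

    frames  : list of lists of detection x-positions (empty list = occluded)
    max_age : frames a lost track is kept before deletion (the "memory")
    gate    : maximum distance for a detection to match an existing track
    """
    tracks, next_id, history = [], 1, []
    for detections in frames:
        assigned = []
        unmatched = list(tracks)  # tracks still waiting for a detection
        for x in detections:
            # Pick the closest still-unmatched track, if any.
            best = min(unmatched, key=lambda t: abs(t.x - x), default=None)
            if best is not None and abs(best.x - x) <= gate:
                best.x, best.missed = x, 0
                unmatched.remove(best)
                assigned.append(best.track_id)
            else:
                # No track close enough: start a new identity.
                tracks.append(Track(next_id, x))
                assigned.append(next_id)
                next_id += 1
        # Age the tracks that found no detection and drop the expired ones.
        for t in unmatched:
            t.missed += 1
        tracks = [t for t in tracks if t.missed <= max_age]
        history.append(assigned)
    return history


# One person visible for two frames, hidden for three, then visible again.
frames = [[100.0], [105.0], [], [], [], [110.0]]
print(run_tracker(frames, max_age=1))  # [[1], [1], [], [], [], [2]]
print(run_tracker(frames, max_age=5))  # [[1], [1], [], [], [], [1]]
```

With the short buffer the track is deleted during the occlusion and the returning person receives a new ID; with the longer buffer the original ID survives. The sketch also hints at the limit the summary describes: a longer buffer only helps when the person reappears close to where the track was last seen.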

If this is right

  • Increasing temporal memory mitigates identity switches caused by prolonged occlusions but does not resolve complex dynamic events.
  • Re-identification substantially improves body tracking stability yet causes facial identity switches to rise.
  • The optimized pipeline that combines these elements reduces identity switches by 49 percent over a tracking-by-detection baseline.
  • Standard surveillance or driving benchmarks lack the dense, close-quarter occlusions typical of social-robot scenes.
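
The re-identification bullet above depends on how appearance matching reacts when a person's look changes between frames. The following is a generic sketch of threshold-based matching with cosine similarity, not the ReID model used in the paper; `reid_match`, the threshold of 0.7, and the 3-D vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def reid_match(query, gallery, threshold=0.7):
    """Return the gallery identity most similar to `query`, or None
    when no stored embedding reaches the similarity threshold."""
    best_id, best_sim = None, threshold
    for identity, embedding in gallery.items():
        sim = cosine_similarity(query, embedding)
        if sim >= best_sim:
            best_id, best_sim = identity, sim
    return best_id


# Made-up 3-D embeddings for illustration only. Real ReID features are
# learned and high-dimensional; none of these numbers come from the paper.
gallery = {"person_A": [1.0, 0.0, 0.0], "person_B": [0.0, 1.0, 0.0]}
similar_view = [0.9, 0.1, 0.0]   # appearance close to the stored one
changed_view = [0.3, 0.3, 0.9]   # appearance that has drifted strongly

print(reid_match(similar_view, gallery))  # person_A
print(reid_match(changed_view, gallery))  # None
```

When the query resembles the stored embedding, the identity is recovered. When the appearance has shifted far enough, no match is found and a tracker would typically open a new identity, which is the kind of failure the summary attributes to profile views of faces.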

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots may maintain steadier engagement by relying on body features rather than faces when users turn or move close.
  • Perception models for social interaction require training and test data recorded from the robot's own viewpoint rather than from overhead or side cameras.
  • A hybrid cue that switches between or fuses face and body information might avoid the profile-angle problem observed with faces alone.
  • The same memory-plus-re-identification pattern could be tested in other close-range egocentric settings such as wearable cameras or handheld devices.

Load-bearing premise

The interactions recorded in the dataset capture the nonlinear movements, occlusions, and re-entries that occur in ordinary human-robot conversations.

What would settle it

Applying the identical optimized pipeline to an independent egocentric dataset collected from a different robot platform and measuring whether identity switches still fall by roughly 49 percent would directly test the reported gain.

Figures

Figures reproduced from arXiv: 2606.03694 by Gabriel Skantze, Jessica Wenninger.

Figure 1: The Egocentric HRI Tracking Challenge. Left: The experimental setup with the Furhat robot in a real-world office environment. Right: A representative frame from the robot’s egocentric perspective. The scene highlights the difficulty of maintaining consistent identities for multiple actors despite dynamic background motion and severe occlusions.
Figure 2: Mitigation of qualitative failure modes for body tracking (see …
Figure 3: Impact of Memory and Appearance on Tracking Stability.
Figure 4: Qualitative comparison of tracking stability during a complex “U-Turn” occlusion event.
Figure 5: Tracking during an “ID Takeover” event. Top Row (B-GT-BT-30): Moving Bystander occludes seated overhearer at the back. The heavy body bounding box overlap causes the Kalman filter to swap their identities. Bottom Row (F-GT-BT-30): Smaller, spatially distinct face boxes minimize IoU overlap, allowing the tracker to correctly maintain individual identities through the dynamic occlusion.
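
The Figure 5 caption contrasts heavily overlapping body boxes with spatially distinct face boxes. A small sketch of the intersection-over-union (IoU) computation shows why that matters for box-based association; the pixel coordinates below are hypothetical and are not read from the figure.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical pixel boxes for a bystander passing in front of a seated person.
body_front = (100, 100, 300, 500)
body_back = (200, 120, 380, 480)
face_front = (170, 110, 230, 180)
face_back = (265, 130, 315, 190)

print(round(iou(body_front, body_back), 2))  # 0.33
print(round(iou(face_front, face_back), 2))  # 0.0
```

In this toy layout the two body boxes share a third of their combined area, so an IoU-based matcher has two plausible candidates for each track, while the face boxes do not overlap at all.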
Original abstract

Meaningful human-robot interaction (HRI) requires a robot to continuously assess user engagement through persistent user tracking. However, state-of-the-art Multi-Object Tracking models are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans moving in unpredictable nonlinear patterns, obstructing each other, or leaving and reentering the scene. These dynamics trigger frequent identity switches (IDSW), causing the robot to lose its footing mid-conversation. To address this, we introduce a focused, custom-annotated egocentric dataset collected via the Furhat robot. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended memory and appearance re-identification (ReID). Results indicate that increasing temporal memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49% compared to a standard tracking-by-detection baseline, effectively mitigating interaction breakdowns. As standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a custom-annotated egocentric dataset collected via the Furhat robot for HRI tracking scenarios involving nonlinear movements, occlusions, and re-entries. It conducts a systematic evaluation comparing face versus body tracking, assessing the effects of extended temporal memory and appearance ReID, and reports that an optimized pipeline reduces IDSW by 49% relative to a tracking-by-detection baseline.

Significance. If the isolation of detection from tracking components is rigorously demonstrated and the dataset dynamics are representative, the results on opposing ReID effects for face versus body tracking would usefully inform HRI perception design. The dataset could address gaps in standard benchmarks that lack dense close-quarter social interactions.

major comments (2)
  1. [Abstract] The manuscript states quantitative results including a 49% IDSW reduction and a 'systematic evaluation isolating detection errors from tracking logic' but supplies no dataset size, annotation protocol, statistical tests, or error analysis. This prevents verification of the central quantitative claim.
  2. [Abstract and evaluation] The 49% IDSW reduction is attributed to extended memory and ReID, yet face and body tracking employ distinct detectors with differing error profiles. No explicit ablation is described that holds the detector fixed while varying only the tracking logic (or reports separate detection-only metrics), so the improvement cannot be confidently assigned to the pipeline elements rather than detector differences.
minor comments (1)
  1. [Abstract] 'reentering' should be hyphenated as 're-entering'.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The manuscript states quantitative results including a 49% IDSW reduction and a 'systematic evaluation isolating detection errors from tracking logic' but supplies no dataset size, annotation protocol, statistical tests, or error analysis. This prevents verification of the central quantitative claim.

    Authors: The abstract is a concise summary subject to strict length limits and therefore omits supporting details that appear in the body of the manuscript. Section 3 fully specifies the dataset size, collection protocol, and annotation procedure; Sections 4 and 5 present the statistical tests, error analysis, and detection-only metrics. To improve standalone readability of the abstract, we will add the dataset size and a one-sentence reference to the evaluation protocol in the revised version.
    revision: yes

  2. Referee: [Abstract and evaluation] The 49% IDSW reduction is attributed to extended memory and ReID, yet face and body tracking employ distinct detectors with differing error profiles. No explicit ablation is described that holds the detector fixed while varying only the tracking logic (or reports separate detection-only metrics), so the improvement cannot be confidently assigned to the pipeline elements rather than detector differences.

    Authors: We acknowledge that an ablation keeping the detector identical while varying only the tracker would make the attribution clearer. The manuscript already reports separate detection metrics (precision/recall) for each detector before applying the tracking components; the 49% figure is measured on the combined pipeline. Because face and body tracking are compared as they would actually be deployed in HRI, a full cross-detector swap was not performed. We will revise the evaluation section to state this design choice explicitly, add a dedicated paragraph describing how detection errors are isolated from tracking logic, and include a limited additional ablation if space allows.
    revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical dataset and ablation study

full rationale

The paper introduces a new egocentric dataset and reports empirical results from tracking experiments (face vs. body, memory length, ReID). The central 49% IDSW reduction is presented as an observed outcome of the optimized pipeline versus baseline on this dataset. No equations, parameter fits, derivations, or self-citations appear in the provided text that would reduce any claim to its own inputs by construction. The evaluation is measured against an external reference point (a standard tracking-by-detection baseline) and does not rely on load-bearing prior results from the same authors. This is the expected finding for an empirical robotics dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that standard MOT metrics (IDSW) meaningfully capture interaction breakdowns and that the collected dataset adequately represents the stated egocentric challenges.

axioms (1)
  • domain assumption: Standard multi-object tracking metrics such as IDSW are appropriate measures for HRI engagement failures.
    Invoked when claiming the 49% reduction mitigates interaction breakdowns.
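
Since the axiom rests on IDSW, it helps to see what the metric counts. The sketch below follows one common convention: a switch is recorded whenever a ground-truth person is matched to a different track ID than in their most recent matched frame. It assumes the per-frame ground-truth-to-track matching has already been computed; the function name and the toy sequence are hypothetical.

```python
def count_identity_switches(matches_per_frame):
    """Count ID switches from per-frame {ground_truth_id: track_id} matches.

    A switch is counted when a ground-truth person is matched to a track ID
    different from the one in their most recent matched frame. Frames where
    the person is unmatched (occluded or missed) do not reset that memory.
    """
    last_track = {}
    switches = 0
    for matches in matches_per_frame:
        for gt_id, track_id in matches.items():
            previous = last_track.get(gt_id)
            if previous is not None and previous != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


# Toy sequence: "alice" keeps track 1; "bob" disappears and returns as track 3.
sequence = [
    {"alice": 1, "bob": 2},
    {"alice": 1, "bob": 2},
    {"alice": 1},
    {"alice": 1, "bob": 3},
]
print(count_identity_switches(sequence))  # 1
```

The count says how often an identity changed, not how disruptive each change was for the conversation, which is why treating IDSW as a proxy for interaction breakdowns is listed as an assumption rather than a result.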


