
arxiv: 2606.04480 · v1 · pith:P6EKL5TXnew · submitted 2026-06-03 · 💻 cs.CV · cs.HC

IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation

Pith reviewed 2026-06-28 06:59 UTC · model grok-4.3

classification 💻 cs.CV cs.HC
keywords interactive pose annotation · multi-person tracking · correction propagation · video annotation tool · human pose estimation · trajectory bank · dynamic annotation

The pith

Sparse human corrections on one frame propagate into full multi-person video pose trajectories via dual-level tracking and a trajectory bank.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IMPose as an interactive annotation system that turns limited user inputs into complete, consistent pose data across entire videos containing multiple people. It does this by combining sequential modeling at the keypoint level with instance-level embeddings that use relative positions to keep different individuals distinct over time. A trajectory bank stores past pose and instance information to handle interruptions like occlusions or blur. The central result is that annotation effort drops sharply while accuracy remains high, as shown by needing only 27 clicks for a 1,050-frame video on 3DPW and 3 clicks per tracklet on 84-frame PoseTrack21 sequences. This directly addresses the labor cost of creating large-scale dynamic human motion datasets.

Core claim

IMPose features a dual-level tracking mechanism that propagates one-frame multi-person pose corrections from annotators across entire videos. The keypoint-level ensures corrections temporal propagation via sequential modeling, while the instance-level employs keypoint-aware embedding with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur.

What carries the argument

Dual-level tracking mechanism (keypoint-level sequential modeling plus instance-level keypoint-aware embedding with relative positional encoding) combined with a trajectory bank that stores historical pose and instance cues.
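
The abstract does not say how the trajectory bank is organized, so the following is only a minimal Python sketch of the general idea: keep a bounded history of poses and instance embeddings per identity, and re-associate a reappearing person by embedding similarity. The class `TrajectoryBank`, its methods, the history length, the cosine-similarity matching, and the threshold are all assumptions made for illustration, not the authors' design.

```python
from collections import deque
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TrajectoryBank:
    """Hypothetical per-identity history of poses and instance embeddings."""

    def __init__(self, max_len=30, match_threshold=0.5):
        self.max_len = max_len                  # assumed history length
        self.match_threshold = match_threshold  # assumed similarity cutoff
        self.tracks = {}                        # track_id -> deque of (pose, embedding)

    def update(self, track_id, pose, embedding):
        """Append the latest observation, discarding the oldest if full."""
        history = self.tracks.setdefault(track_id, deque(maxlen=self.max_len))
        history.append((pose, embedding))

    def mean_embedding(self, track_id):
        """Average the stored embeddings of one identity."""
        embeddings = [emb for _, emb in self.tracks[track_id]]
        n = len(embeddings)
        return [sum(vals) / n for vals in zip(*embeddings)]

    def match(self, embedding):
        """Return the best-matching stored identity, or None if none is close."""
        best_id, best_sim = None, self.match_threshold
        for track_id in self.tracks:
            sim = cosine(embedding, self.mean_embedding(track_id))
            if sim > best_sim:
                best_id, best_sim = track_id, sim
        return best_id


bank = TrajectoryBank()
bank.update("person_a", pose=[(10, 20), (12, 40)], embedding=[1.0, 0.0])
bank.update("person_b", pose=[(80, 22), (82, 41)], embedding=[0.0, 1.0])

# A person reappears after an occlusion with a slightly changed embedding.
reappeared = bank.match([0.9, 0.1])
# An embedding unlike anything stored is left unmatched.
unknown = bank.match([-1.0, -1.0])
print(reappeared, unknown)
```

Averaging the stored embeddings is simply the easiest way to let history outweigh a single blurred frame; the paper may well use a learned association instead.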

If this is right

  • High-precision annotation requires only 27 clicks per 1,050-frame video on 3DPW.
  • Annotation needs only 3 clicks per tracklet per 84-frame sequence on PoseTrack21.
  • The system achieves a strong accuracy-efficiency tradeoff under varying interaction budgets.
  • PoseTrack21 can be expanded by 188K pose instances (3.55M keypoints) at minimal annotator cost.
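
For a sense of scale, the reported budgets can be converted into per-click rates. The short Python calculation below uses only the figures quoted in the bullets above; the variable names are illustrative, and "188K" and "3.55M" are treated as rounded values.

```python
# Click budgets and dataset sizes as quoted in the bullets above.
clicks_3dpw, frames_3dpw = 27, 1050
clicks_pt21, frames_pt21 = 3, 84            # per tracklet
instances, keypoints = 188_000, 3_550_000   # rounded figures from the text

frames_per_click_3dpw = frames_3dpw / clicks_3dpw
frames_per_click_pt21 = frames_pt21 / clicks_pt21
keypoints_per_instance = keypoints / instances

print(f"3DPW: one click per {frames_per_click_3dpw:.1f} frames")
print(f"PoseTrack21: one click per {frames_per_click_pt21:.1f} frames per tracklet")
print(f"Extension: about {keypoints_per_instance:.1f} keypoints per instance")
```

This works out to roughly one click per 38.9 frames on 3DPW and one click per 28 frames per tracklet on PoseTrack21.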

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The trajectory bank approach could be adapted to annotate other time-varying signals such as hand gestures or object trajectories in video.
  • Combining the propagation step with an initial automated pose estimator might lower the click count even further in practice.
  • The same correction-propagation logic could support annotation of 3D poses if the underlying 2D tracker is replaced by a depth-aware model.

Load-bearing premise

The dual-level tracking mechanism and trajectory bank maintain cross-frame consistency and robustness under occlusion and motion blur without introducing systematic errors.

What would settle it

Run the system on a held-out video sequence containing repeated occlusions and motion blur, apply the reported low number of corrections, and measure whether the output poses deviate systematically from independent manual ground truth.
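
One way to make "deviate systematically" concrete is to compare the mean signed offset of each keypoint (its bias) with its mean error magnitude: random jitter cancels in the signed mean, while a consistent drift does not. The Python sketch below illustrates that check on toy data; the function name `bias_report` and the data layout are assumptions, and a real test would use the propagated poses and independent manual annotations.

```python
from math import hypot


def bias_report(pred, gt):
    """pred, gt: lists of frames, each a list of (x, y) per keypoint.

    Returns, per keypoint, (mean signed offset, mean error magnitude).
    """
    n_frames = len(pred)
    n_kpts = len(pred[0])
    report = []
    for k in range(n_kpts):
        dx = [pred[f][k][0] - gt[f][k][0] for f in range(n_frames)]
        dy = [pred[f][k][1] - gt[f][k][1] for f in range(n_frames)]
        mean_offset = (sum(dx) / n_frames, sum(dy) / n_frames)
        mean_magnitude = sum(hypot(x, y) for x, y in zip(dx, dy)) / n_frames
        report.append((mean_offset, mean_magnitude))
    return report


# Toy data: keypoint 0 jitters symmetrically, keypoint 1 is always 3 px too far right.
gt = [[(0.0, 0.0), (10.0, 10.0)] for _ in range(4)]
pred = [
    [(1.0, 0.0), (13.0, 10.0)],
    [(-1.0, 0.0), (13.0, 10.0)],
    [(0.0, 1.0), (13.0, 10.0)],
    [(0.0, -1.0), (13.0, 10.0)],
]

report = bias_report(pred, gt)
for k, (offset, magnitude) in enumerate(report):
    bias = hypot(*offset)
    print(f"keypoint {k}: bias {bias:.2f} px, mean error {magnitude:.2f} px")
```

In the toy data, keypoint 0 has a 1 px mean error but zero bias, whereas keypoint 1 has a bias equal to its full 3 px error, which is the signature of a systematic offset.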

Figures

Figures reproduced from arXiv: 2606.04480 by Haoyang Ge, Hongzhi Yu, Jian Ma, Jianqi Fan, Kun Li, Qihe Wang, Xingyu Chen, Ziwen Wang.

Figure 1: Demonstrations of IMPose Annotating Videos Containing Multiple Persons. The first row demonstrates that complete disappearance of a subject in frame 54 leads to detection failure upon reappearance in frame 62. The second row highlights pose estimation errors in multi-athlete scenes caused by motion blur. IMPose effectively corrects these anomalies and propagates temporal corrections, achieving reliable key…
Figure 2: Overview of IMPose. The t-th frame is fed into the Multi-person Pose Estimation and Tracking Module for pose estimation and instance-level tracking. Then, annotators use Annotation Platform to correct keypoints, bounding boxes, and IDs. The corrections are then tracked by a Point Tracker, while the tracked results are then used to the query of box decoder and keypoint decoder for refining subsequent frames…
Figure 3: Demonstration of the Interactive Annotation Tool Labeling an in-the-Wild Video. Panels (a–c) illustrate the interaction workflow: (a) drawing a bounding box (a red dashed bounding box) around the target to add a mis-detected person, (b) dragging the left-hand keypoint (from the red dot to the green dot) for refinement, and (c) clicking “Continue Inference” to automatically propagate the corrected pose to s…
Figure 4: Qualitative Comparison between AlphaPose [7], DSTA [4], Click-Pose [9], X-AnyLabeling [1] and IMPose on the In-the-Wild Cases with Body-part Occlusion and Motion Blur. The green dashed arrow denotes the temporal correction propagation. The uppers of Click-Pose [9] and IMPose show the correcting progress, and the bottoms are the results after correction. Notably, poses of DSTA [4] and Click-Pose [9] are whit…
Figure 5: Correction Propagation Comparison of IMPose and CoTracker [11] under Motion Blur (Left) and Occlusion (Right). The left line chart displays the average number of frames wherein the corrected keypoints maintain OKS thresholds for blurred and clear videos. The right line chart illustrates the successfully propagated proportion of keypoints after occlusion under frequent and rare settings. Across all conditio…
Figure 6: Comparative Capability of Corrected Keypoint Propagation between CoTracker [11] and IMPose under Motion Blur. The corrected keypoint (denoted as red dot) in the first frame of each case is the manually annotated keypoint. The left case is to propagate the left wrist, while the right case is to propagate the left knee.
Figure 7: Comparative Capability of Corrected Keypoint Propagation between CoTracker [11] and IMPose under Occlusion. The corrected keypoint (denoted as red dot) in the first frame of each case is the manually annotated keypoint. The left case is to propagate the right ankle of the lady in blue, while the right case is to propagate the right hip of the man wearing jeans
Figure 8: The Capability of Corrected Keypoint Propagation of IMPose under ID Mismatching. The first row of each case presents the initial poses, while the second row of each case presents the corrected poses via IMPose. The frame covered by a cartoon man means the manual correction appears. The following frames without the cartoon man but with green bounding boxes denote the instance id and poses are automatically …
Original abstract

High-quality dynamic human pose annotation equips AI with precise motion kinematics to enable human behavior mastery, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates one-frame multi-person pose corrections from annotators across entire videos. The keypoint-level ensures corrections temporal propagation via sequential modeling, while the instance-level employs keypoint-aware embedding with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy efficiency trade off under different interaction budgets, demonstrating particular advantages in low click annotation settings. IMPose achieves high precision annotation with high efficiency, requiring only 27 clicks per 1,050 frame video on 3DPW and 3 clicks per tracklet per 84-frame on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators in 10 hours. The annotation tool, codes, and extended dataset will be open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents IMPose, an interactive tool for multi-person dynamic pose annotation in videos. It introduces a dual-level tracking mechanism—keypoint-level sequential modeling to propagate corrections temporally and instance-level keypoint-aware embedding with relative positional encoding to maintain cross-frame consistency—augmented by a trajectory bank that stores historical pose and instance cues for long-range association and robustness under occlusion or motion blur. The central claim is that sparse human corrections are converted into dense, coherent pose trajectories, yielding high efficiency (27 clicks per 1,050-frame video on 3DPW; 3 clicks per tracklet per 84-frame sequence on PoseTrack21) while enabling expansion of PoseTrack21 by 188K pose instances.

Significance. If the propagation mechanism functions without systematic error accumulation, the work would meaningfully reduce annotation labor for multi-person video pose data, supporting larger-scale datasets for human motion analysis with particular value in low-click regimes.

major comments (2)
  1. [Abstract] Abstract: the efficiency claims (27 clicks / 1,050 frames on 3DPW; 3 clicks per tracklet on PoseTrack21) rest on the trajectory bank and dual-level tracking converting one-frame corrections into coherent trajectories without introducing new errors that would require extra clicks, yet no description is given of conflict-resolution rules, bank-update logic under occlusion, or recovery from early mis-association.
  2. [Abstract] Abstract: the instance-level keypoint-aware embedding with relative positional encoding is asserted to maintain multi-person consistency, but without equations, algorithmic pseudocode, or ablation on how embeddings are updated or how the bank resolves conflicting corrections, the robustness claim under motion blur and occlusion cannot be evaluated.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'strong accuracy efficiency trade off' is imprecise; the manuscript should state the concrete accuracy metric (e.g., MPJPE, PCK) and efficiency metric used to support the claim.
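
To illustrate the two metrics the referee names as examples, here is a minimal Python sketch of a mean per-joint position error and PCK for 2D keypoints. The function signatures, the `scale` normalizer, and the `alpha` threshold are assumptions for illustration; the manuscript's actual evaluation protocol is not given in the supplied text.

```python
from math import dist


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and reference joints."""
    errors = [dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)


def pck(pred, gt, scale, alpha=0.5):
    """Fraction of joints within alpha * scale of the reference position."""
    hits = [dist(p, g) <= alpha * scale for p, g in zip(pred, gt)]
    return sum(hits) / len(hits)


# Toy pose with four joints; two are exact, two are off by 3 px and 4 px.
gt_pose = [(0, 0), (10, 0), (10, 10), (0, 10)]
pred_pose = [(0, 0), (10, 3), (10, 10), (4, 10)]

error = mpjpe(pred_pose, gt_pose)
accuracy = pck(pred_pose, gt_pose, scale=7.0)  # threshold = 0.5 * 7 = 3.5 px
print(f"mean joint error: {error:.2f} px, PCK: {accuracy:.2f}")
```

Reporting either number alongside the click count would turn the trade-off claim into something a reader can compare across tools.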

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency on the internal mechanisms supporting our efficiency claims. We agree that the abstract is overly concise and will revise it to include brief descriptions of the requested logic while expanding the methods section with equations, pseudocode, and targeted ablations. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract] Abstract: the efficiency claims (27 clicks / 1,050 frames on 3DPW; 3 clicks per tracklet on PoseTrack21) rest on the trajectory bank and dual-level tracking converting one-frame corrections into coherent trajectories without introducing new errors that would require extra clicks, yet no description is given of conflict-resolution rules, bank-update logic under occlusion, or recovery from early mis-association.

    Authors: We acknowledge that the abstract does not elaborate on these operational details. The full manuscript describes the dual-level propagation and trajectory bank in the methods, but we agree the abstract should be expanded to note the conflict-resolution rules (prioritizing recent corrections) and bank-update logic (keypoint-level sequential updates with instance-level re-association under occlusion). We will add a short clause to the abstract and include pseudocode for mis-association recovery in the revision. revision: yes

  2. Referee: [Abstract] Abstract: the instance-level keypoint-aware embedding with relative positional encoding is asserted to maintain multi-person consistency, but without equations, algorithmic pseudocode, or ablation on how embeddings are updated or how the bank resolves conflicting corrections, the robustness claim under motion blur and occlusion cannot be evaluated.

    Authors: The abstract summarizes the embedding approach without technical specifics. We will revise the abstract to reference the embedding update rule and add the requested equations for keypoint-aware embedding with relative positional encoding, pseudocode for bank conflict resolution, and an ablation study quantifying robustness under motion blur and occlusion to the experiments section. revision: yes
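
The rebuttal mentions a conflict-resolution rule that prioritizes recent corrections but does not spell it out. One possible reading is sketched below in Python: when several corrections exist for the same person and keypoint, the one on the closest preceding frame seeds propagation, and ties go to the latest edit. The `Correction` record, the `resolve` function, and the tie-breaking order are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    track_id: int
    keypoint: str
    frame: int        # frame on which the annotator clicked
    position: tuple   # corrected (x, y)
    order: int        # position in the annotator's edit history


def resolve(corrections, track_id, keypoint, frame):
    """Pick the correction that should seed propagation at `frame`.

    Assumed rule: among corrections at or before `frame`, the closest
    preceding frame wins; ties go to the most recent edit.
    """
    candidates = [
        c for c in corrections
        if c.track_id == track_id and c.keypoint == keypoint and c.frame <= frame
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.frame, c.order))


edits = [
    Correction(1, "left_wrist", 10, (50, 60), order=0),
    Correction(1, "left_wrist", 40, (70, 65), order=1),
    Correction(1, "left_wrist", 40, (72, 66), order=2),  # re-edit of frame 40
]

at_25 = resolve(edits, 1, "left_wrist", 25)
at_45 = resolve(edits, 1, "left_wrist", 45)
at_5 = resolve(edits, 1, "left_wrist", 5)
print(at_25.position, at_45.position, at_5)
```

Under this assumed rule, frame 25 is seeded by the frame-10 correction, frame 45 by the later re-edit of frame 40, and frame 5 has no applicable correction.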

Circularity Check

0 steps flagged

No circularity; empirical results independent of derivation

full rationale

The paper describes an engineering system for interactive pose annotation via dual-level tracking and a trajectory bank. All performance numbers (27 clicks/1050 frames on 3DPW, 3 clicks/tracklet on PoseTrack21) are stated as outcomes of experiments on external datasets rather than any mathematical prediction, fitted parameter, or first-principles derivation. No equations, self-citations, ansatzes, or uniqueness theorems appear in the supplied text, so no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the design implicitly assumes standard computer-vision tracking assumptions hold for the proposed propagation mechanisms.

axioms (1)
  • domain assumption: Corrections on one frame can be reliably propagated to other frames via sequential modeling and embedding without drift
    Core premise of the dual-level tracking and trajectory bank described in the abstract.



Reference graph

Works this paper leans on

42 extracted references · 1 canonical work pages

  1. [1]

    Advanced auto labeling solution with added features,

    W. Wang, “Advanced auto labeling solution with added features,” CVHub, 2023, https://github.com/CVHub520/X-AnyLabeling

  2. [2]

    Effective whole-body pose estimation with two-stages distillation,

    Z. Yang, A. Zeng, C. Yuan, and Y. Li, “Effective whole-body pose estimation with two-stages distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4210–4220

  3. [3]

    Simple online and realtime tracking with a deep association metric,

    N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3645–3649

  4. [4]

    Video-based human pose regression via decoupled space-time aggregation,

    J. He and W. Yang, “Video-based human pose regression via decoupled space-time aggregation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2024, pp. 1022–1031

  5. [5]

    High-resolution spatiotemporal modeling with global-local state space models for video-based human pose estimation,

    R. Feng, H. J. Chang, T. H. E. Tse, B. Kim, Y. Chang, and Y. Gao, “High-resolution spatiotemporal modeling with global-local state space models for video-based human pose estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 8929–8938

  6. [6]

    Causal-inspired multitask learning for video-based human pose estimation,

    H. Chen, S. Wu, Z. Wang, Y. Yin, Y. Jiao, Y. Lyu, and Z. Liu, “Causal-inspired multitask learning for video-based human pose estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 2, 2025, pp. 2052–2060

  7. [7]

    Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,

    H.-S. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y. Xiu, Y.-L. Li, and C. Lu, “Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

  8. [8]

    A gated attention transformer for multi-person pose tracking,

    A. Doering and J. Gall, “A gated attention transformer for multi-person pose tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3189–3198

  9. [9]

    Neural interactive keypoint detection,

    J. Yang, A. Zeng, F. Li, S. Liu, R. Zhang, and L. Zhang, “Neural interactive keypoint detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15122–15132

  10. [10]

    ipose: Interactive human pose reconstruction from video,

    J. Liu, L.-Y. Wei, A. Shamir, and T. Igarashi, “ipose: Interactive human pose reconstruction from video,” in Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–14

  11. [11]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos,

    N. Karaev, Y. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker3: Simpler and better point tracking by pseudo-labelling real videos,” in arXiv:2410.11831, 2024

  12. [12]

    Recovering accurate 3d human pose in the wild using imus and a moving camera,

    T. Von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, “Recovering accurate 3d human pose in the wild using imus and a moving camera,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 601–617

  13. [13]

    Posetrack21: A dataset for person search, multi-object tracking and multi-person pose tracking,

    A. Döring, D. Chen, S. Zhang, B. Schiele, and J. Gall, “Posetrack21: A dataset for person search, multi-object tracking and multi-person pose tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 20963–20972

  14. [14]

    Effortless data labeling,

    N. R. Lab, “Effortless data labeling,” CVHub, 2023, https://github.com/vietanhdev/anylabeling

  15. [15]

    Cvat: Computer vision annotation tool,

    CVAT.ai Corporation and contributors, “Cvat: Computer vision annotation tool,” https://github.com/cvat-ai/cvat, 2020

  16. [16]

    Sam 2: Segment anything in images and videos,

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “Sam 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024

  17. [17]

    Deep high-resolution representation learning for human pose estimation,

    K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703

  18. [18]

    Vitpose: Simple vision transformer baselines for human pose estimation,

    Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “Vitpose: Simple vision transformer baselines for human pose estimation,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 38571–38584

  19. [19]

    Realtime multi-person 2d pose estimation using part affinity fields,

    Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299

  20. [20]

    Associative embedding: End-to-end learning for joint detection and grouping,

    A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-end learning for joint detection and grouping,” in Advances in Neural Information Processing Systems, vol. 30, 2017

  21. [21]

    Pifpaf: Composite fields for human pose estimation,

    S. Kreiss, L. Bertoni, and A. Alahi, “Pifpaf: Composite fields for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11977–11986

  22. [22]

    Posetrack: Joint multi-person pose estimation and tracking,

    U. Iqbal, A. Milan, and J. Gall, “Posetrack: Joint multi-person pose estimation and tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2011–2020

  23. [23]

    Detect-and-track: Efficient pose estimation in videos,

    R. Girdhar and D. Ramanan, “Detect-and-track: Efficient pose estimation in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 350–359

  24. [24]

    Human pose estimation in video with temporal context,

    G. Bertasius and L. Torresani, “Human pose estimation in video with temporal context,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 493–502

  25. [25]

    Posetrack21: A dataset for person search, multi-object tracking and multi-person pose tracking,

    A. Doering, D. Chen, S. Zhang, B. Schiele, and J. Gall, “Posetrack21: A dataset for person search, multi-object tracking and multi-person pose tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20963–20972

  26. [26]

    3d human pose estimation in video with temporal convolutions and semi-supervised training,

    D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, “3d human pose estimation in video with temporal convolutions and semi-supervised training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7753–7762

  27. [27]

    Human pose regression with residual log-likelihood estimation,

    J. Li, S. Bian, A. Zeng, C. Wang, B. Pang, W. Liu, and C. Lu, “Human pose regression with residual log-likelihood estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11025–11034

  28. [28]

    Poseur: Direct human pose regression with transformers,

    W. Mao, Y. Ge, C. Shen, Z. Tian, X. Wang, Z. Wang, and A. Van Den Hengel, “Poseur: Direct human pose regression with transformers,” in European Conference on Computer Vision. Springer, 2022, pp. 72–88

  29. [29]

    Pose flow: Efficient online pose tracking,

    Y. Xiu, J. Li, H. Wang, H.-S. Fang, and C. Lu, “Pose flow: Efficient online pose tracking,” in Proceedings of the British Machine Vision Conference, 2018

  30. [30]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,

    D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,” International Journal of Computer Vision, vol. 130, p. 33–55,

  31. [31]

    Available: https://doi.org/10.1007/s11263-021-01531-2

    [Online]. Available: https://doi.org/10.1007/s11263-021-01531-2

  32. [32]

    Epic-kitchens visor benchmark: Video segmentations and object relations,

    A. Darkhalil, D. Shan, B. Zhu, J. Ma, A. Kar, R. Higgins, S. Fidler, D. Fouhey, and D. Damen, “Epic-kitchens visor benchmark: Video segmentations and object relations,” Advances in Neural Information Processing Systems, pp. 13745–13758, 2022

  33. [33]

    Posefix: Model-agnostic general human pose refinement network,

    G. Moon, J. Y. Chang, and K. M. Lee, “Posefix: Model-agnostic general human pose refinement network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7773–7781

  34. [34]

    Fairmot: On the fairness of detection and re-identification in multiple object tracking,

    Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “Fairmot: On the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, vol. 129, pp. 3069–3087, 2021

  35. [35]

    Bot-sort: Robust associations multi-pedestrian tracking,

    N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “Bot-sort: Robust associations multi-pedestrian tracking,” arXiv preprint arXiv:2206.14651, 2022

  36. [36]

    Deformable detr: Deformable transformers for end-to-end object detection,

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020

  37. [37]

    Dino: Detr with improved denoising anchor boxes for end-to-end object detection,

    H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv preprint arXiv:2203.03605, 2022

  38. [38]

    Multiple object tracking as id prediction,

    R. Gao, J. Qi, and L. Wang, “Multiple object tracking as id prediction,” in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 27883–27893

  39. [39]

    Hota: A higher order metric for evaluating multi-object tracking,

    J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “Hota: A higher order metric for evaluating multi-object tracking,” International Journal of Computer Vision, vol. 129, no. 2, pp. 548–578, 2021

  40. [40]

    Memotr: Long-term memory-augmented transformer for multi-object tracking,

    R. Gao and L. Wang, “Memotr: Long-term memory-augmented transformer for multi-object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9901–9910

  41. [41]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European conference on computer vision (ECCV). Springer, 2014, pp. 740–755

  42. [42]

    Bridging the gap between end-to-end and non-end-to-end multi-object tracking,

    F. Yan, W. Luo, Y. Zhong, Y. Gan, and L. Ma, “Bridging the gap between end-to-end and non-end-to-end multi-object tracking,” arXiv preprint arXiv:2305.12724, 2023