pith. sign in

arxiv: 2509.09946 · v2 · pith:2JESOQ2Fnew · submitted 2025-09-12 · 💻 cs.CV

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

Pith reviewed 2026-05-21 22:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-target multi-camera tracking3D perceptiondepth-based aggregationpoint cloud reconstructiondata associationonline trackingAI City Challenge3D bounding box
0
0 comments X

The pith

Any online 2D multi-camera tracking system extends to 3D by reconstructing targets as point clouds from depth data and recovering boxes via clustering and yaw refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a method to upgrade existing 2D multi-target multi-camera tracking pipelines to full 3D perception without replacing core tracking components. Depth information projects each tracked target into point-cloud space after 2D processing, where clustering and yaw refinement produce accurate 3D bounding boxes. An enhanced data association step maintains local ID consistency to assign global IDs across frames and cameras. This matters for large-scale surveillance because it adds automatic 3D environmental understanding on top of proven 2D systems. The framework achieved third place on the 2025 AI City Challenge 3D MTMC dataset.

Core claim

The approach extends any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. It also introduces an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset.

What carries the argument

Depth-based late aggregation that reconstructs 2D-tracked targets into point clouds and recovers 3D boxes through clustering and yaw refinement.

If this is right

  • Existing 2D MTMC systems gain 3D perception capability without replacing all tracking components.
  • Targets are reconstructed in point-cloud space to support automatic 3D environment perception.
  • Local ID consistency enables improved global ID assignment across frames in online operation.
  • The framework delivers competitive performance on 3D MTMC benchmarks such as the AI City Challenge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular late-aggregation design could pair with multiple different 2D trackers to match specific camera hardware in surveillance networks.
  • Reusing 2D models in this way may reduce development and compute costs compared to building dedicated 3D trackers from scratch.
  • The method could extend to real-time applications in smart-city monitoring or traffic analysis if depth sources vary between sensors and stereo estimation.

Load-bearing premise

Depth information is available and accurate enough to enable reliable point-cloud reconstruction of targets followed by clustering and yaw refinement to recover valid 3D boxes.

What would settle it

Applying the method to test scenes with noisy or missing depth maps and measuring whether recovered 3D boxes show large deviations from ground-truth 3D annotations on the AI City Challenge dataset.

Figures

Figures reproduced from arXiv: 2509.09946 by Duc Huy Do, Hai Tran, Huong Ninh, Thao-Anh Tran, Vu-Minh Le, Xuan Canh Do.

Figure 1
Figure 1. Figure 1: A depiction of the 3D Multi-Camera Tracking problem, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of our Online 3D Multi-Target Multi-Camera Tracking pipeline with Late 3D Bounding Box Aggregation. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An example of our track splitting mechanism. At time [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) A single object is assigned multiple 3D bounding [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduced an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to extend any online 2D multi-camera tracking system to 3D perception by reconstructing tracked targets into point clouds using depth information, then applying clustering and yaw refinement to recover 3D bounding boxes; it further introduces an enhanced online data association step that exploits local ID consistency for global ID assignment across frames. The framework is evaluated on the 2025 AI City Challenge 3D MTMC dataset and reports a 3rd-place ranking.

Significance. If the central claim holds, the approach offers a practical, modular upgrade path for existing 2D MTMC pipelines to 3D without replacing core tracking components, which is relevant for large-scale surveillance. The reported 3rd-place result on the challenge dataset supplies concrete empirical support for the overall pipeline. However, the absence of targeted robustness analysis on the depth-to-3D conversion step limits the strength of the evidence for real-world deployment.

major comments (2)
  1. [Method description of 3D box recovery] The depth-based late aggregation (point-cloud reconstruction followed by clustering and yaw refinement) is the sole mechanism converting 2D tracks into 3D boxes and is therefore load-bearing. No quantitative characterization or ablation is provided on how depth noise, sensor inaccuracies, or partial occlusions affect under-/over-segmentation or yaw stability, leaving the central claim without direct verification.
  2. [Experiments and results] Table or leaderboard results report 3rd place but contain no component ablations isolating the contribution of the proposed clustering/yaw refinement versus the base 2D tracker, nor any error analysis broken down by depth quality or occlusion level. This makes it impossible to confirm that the 3D extension itself drives the ranking rather than upstream 2D performance.
minor comments (2)
  1. [Abstract and §3] The abstract and method section would benefit from explicit naming of the clustering algorithm (e.g., DBSCAN parameters) and the exact yaw refinement procedure to allow reproduction.
  2. [Data association subsection] Notation for local versus global IDs is introduced but not consistently defined in equations or pseudocode, which reduces clarity of the data-association enhancement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional analysis would strengthen the evidence for the proposed 3D extension. We address each major comment below and will incorporate revisions to improve the paper.

read point-by-point responses
  1. Referee: [Method description of 3D box recovery] The depth-based late aggregation (point-cloud reconstruction followed by clustering and yaw refinement) is the sole mechanism converting 2D tracks into 3D boxes and is therefore load-bearing. No quantitative characterization or ablation is provided on how depth noise, sensor inaccuracies, or partial occlusions affect under-/over-segmentation or yaw stability, leaving the central claim without direct verification.

    Authors: We agree that a targeted robustness study on the depth-to-3D conversion would provide stronger verification of the central claim. The current evaluation relies on the real-world variations present in the 2025 AI City Challenge 3D MTMC dataset, which includes diverse depth qualities and occlusion scenarios as reflected in the 3rd-place result. To directly address the concern, we will add a new subsection with controlled experiments injecting synthetic depth noise and simulating partial occlusions to measure effects on clustering stability and yaw estimation. revision: yes

  2. Referee: [Experiments and results] Table or leaderboard results report 3rd place but contain no component ablations isolating the contribution of the proposed clustering/yaw refinement versus the base 2D tracker, nor any error analysis broken down by depth quality or occlusion level. This makes it impossible to confirm that the 3D extension itself drives the ranking rather than upstream 2D performance.

    Authors: We concur that explicit component ablations and stratified error analysis are needed to isolate the contribution of the late-aggregation steps. In the revised manuscript we will include an ablation table comparing the full pipeline against a baseline that projects 2D tracks to 3D without clustering or yaw refinement. We will also add error breakdowns stratified by depth quality (using available sensor metadata) and occlusion level (derived from track visibility annotations) to demonstrate the incremental benefit of the proposed 3D components. revision: yes

Circularity Check

0 steps flagged

No circularity: standard depth-to-3D pipeline after 2D tracking

full rationale

The paper's core chain is: run any existing online 2D MTMC tracker, project tracked targets into point-cloud space using supplied depth, apply clustering plus yaw refinement to obtain 3D boxes, and use local-ID consistency for global association. None of these steps is defined in terms of its own output, fitted to a subset and then re-predicted, or justified solely by a self-citation whose content is unverified. The method is presented as a modular post-processing extension whose validity is checked by leaderboard performance on the 2025 AI City 3D MTMC dataset. No equations or uniqueness theorems reduce the result to the input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; ledger reflects implied assumptions from the described pipeline. No explicit free parameters or invented entities are stated.

axioms (2)
  • domain assumption Depth information from cameras is available and sufficiently accurate for point-cloud reconstruction of tracked targets.
    The entire 3D extension step depends on this input being reliable for clustering and box recovery.
  • domain assumption Existing 2D multi-camera tracking produces sufficiently robust local IDs and trajectories to support subsequent 3D aggregation.
    The method is explicitly designed to extend any online 2D system, assuming the 2D component works well.

pith-pipeline@v0.9.0 · 5720 in / 1450 out tokens · 44390 ms · 2026-05-21T22:08:18.881984+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

  1. [1]

    Simple online and realtime tracking

    Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In2016 IEEE international conference on image processing (ICIP), pages 3464–3468. Ieee, 2016. 2

  2. [2]

    M3d-rpn: Monocular 3d region proposal network for object detection

    Garrick Brazil and Xiaoming Liu. M3d-rpn: Monocular 3d region proposal network for object detection. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 9287–9296, 2019. 2

  3. [3]

    Cascade r-cnn: Delv- ing into high quality object detection

    Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delv- ing into high quality object detection. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 6154–6162, 2018. 2

  4. [4]

    Observation-centric sort: Rethink- ing sort for robust multi-object tracking

    Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khi- rodkar, and Kris Kitani. Observation-centric sort: Rethink- ing sort for robust multi-object tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9686–9696, 2023. 2, 4

  5. [5]

    End-to- end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 2

  6. [6]

    Dsgn: Deep stereo geometry network for 3d object detection

    Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 12536–12545, 2020. 2

  7. [7]

    Monopair: Monocular 3d object detection using pairwise spatial relationships

    Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. Monopair: Monocular 3d object detection using pairwise spatial relationships. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 12093–12102, 2020. 2

  8. [8]

    Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking

    Cheng-Che Cheng, Min-Xuan Qiu, Chen-Kuo Chiang, and Shang-Hong Lai. Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10051–10060, 2023. 2

  9. [9]

    Online multi- camera people tracking with spatial-temporal mechanism and anchor-feature hierarchical clustering

    Riu Cherdchusakulchai, Sasin Phimsiri, Visarut Trairat- tanapa, Suchat Tungjitnob, Wasu Kudisthalert, Pornprom Ki- awjak, Ek Thamwiwatthana, Phawat Borisuitsawat, Teep- akorn Tosawadi, Pakcheera Choppradit, et al. Online multi- camera people tracking with spatial-temporal mechanism and anchor-feature hierarchical clustering. InProceedings of the IEEE/CVF ...

  10. [10]

    Deepstereo: Learning to predict new views from the world’s imagery

    John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5515–5524,

  11. [11]

    Yolox: Exceeding yolo series in 2021, 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021, 2021. 2

  12. [12]

    Fast r-cnn

    Ross Girshick. Fast r-cnn. InProceedings of the IEEE inter- national conference on computer vision, pages 1440–1448,

  13. [13]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InProceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 2

  14. [14]

    Enhancing multi-camera people tracking with anchor-guided clustering and spatio-temporal consistency id re-assignment

    Hsiang-Wei Huang, Cheng-Yen Yang, Zhongyu Jiang, Pyong-Kun Kim, Kyoungoh Lee, Kwangju Kim, Samartha Ramkumar, Chaitanya Mullapudi, In-Su Jang, Chung-I Huang, et al. Enhancing multi-camera people tracking with anchor-guided clustering and spatio-temporal consistency id re-assignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern R...

  15. [15]

    People tracking in multi-camera systems: a re- view.Multimedia Tools and Applications, 78:10773–10793,

    Rabah Iguernaissi, Djamal Merad, Kheireddine Aziz, and Pierre Drap. People tracking in multi-camera systems: a re- view.Multimedia Tools and Applications, 78:10773–10793,

  16. [16]

    Rtmpose: Real-time multi-person pose estimation based on mmpose

    Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. Rtmpose: Real- time multi-person pose estimation based on mmpose.arXiv preprint arXiv:2303.07399, 2023. 4

  17. [17]

    Ultralytics yolo11, 2024

    Glenn Jocher and Jing Qiu. Ultralytics yolo11, 2024. 2

  18. [18]

    Addressing the occlusion problem in multi-camera people tracking with human pose estimation

    Jeongho Kim, Wooksu Shin, Hancheol Park, and Jongwon Baek. Addressing the occlusion problem in multi-camera people tracking with human pose estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5463–5469, 2023. 2, 4

  19. [19]

    Cluster self-refinement for enhanced online multi- camera people tracking

    Jeongho Kim, Wooksu Shin, Hancheol Park, and Donghyuk Choi. Cluster self-refinement for enhanced online multi- camera people tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7190–7197, 2024. 1, 2, 4, 8

  20. [20]

    Pointpillars: Fast encoders for object detection from point clouds

    Alex H Lang, Sourabh V ora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019. 3

  21. [21]

    xformers: A modular and hackable trans- former modelling library.https : / / github

    Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. xformers: A modular and hackable trans- former modelling library.https : / / github . com / facebookresearch/xformers, 2022. 7

  22. [22]

    Crowdpose: Efficient crowded scenes pose estimation and a new benchmark

    Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10863–10872, 2019. 4

  23. [23]

    Stereo r-cnn based 3d object detection for autonomous driving

    Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for autonomous driving. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7644–7652, 2019. 2

  24. [24]

    Rtm3d: Real-time monocular 3d detection from object key- points for autonomous driving

    Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. Rtm3d: Real-time monocular 3d detection from object key- points for autonomous driving. InEuropean Conference on Computer Vision, pages 644–660. Springer, 2020. 2

  25. [25]

    Clip-reid: exploiting vision-language model for image re-identification without concrete text labels

    Siyuan Li, Li Sun, and Qingli Li. Clip-reid: exploiting vision-language model for image re-identification without concrete text labels. InProceedings of the AAAI conference on artificial intelligence, pages 1405–1413, 2023. 2, 4, 7

  26. [26]

    Exploring plain vision transformer backbones for object de- tection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object de- tection. InEuropean conference on computer vision, pages 280–296. Springer, 2022. 2

  27. [27]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 3

  28. [28]

    Monodetrnext: Next-generation accurate and efficient monocular 3d object detector, 2024

    Pan Liao, Feng Yang, Di Wu, Wenhui Zhao, and Jinwen Yu. Monodetrnext: Next-generation accurate and efficient monocular 3d object detector, 2024. 3

  29. [29]

    Smoke: Single- stage monocular 3d object detection via keypoint estimation

    Zechen Liu, Zizhang Wu, and Roland T ´oth. Smoke: Single- stage monocular 3d object detection via keypoint estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 996–997,

  30. [30]

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows . In 2021 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 9992–10002, Los Alamitos, CA, USA,

  31. [31]

    IEEE Computer Society. 7

  32. [32]

    Geometry uncer- tainty projection network for monocular 3d object detection

    Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncer- tainty projection network for monocular 3d object detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3111–3121, 2021. 2

  33. [33]

    Hota: A higher order metric for evaluating multi-object tracking.International journal of computer vision, 129:548– 578, 2021

    Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taix´e, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking.International journal of computer vision, 129:548– 578, 2021. 7

  34. [34]

    Rt-detr: Real-time detection transformer with efficient hybrid encoder, 2024

    Wenyu Lv, Yuxiang Chen, Xinghao Chen, Shangliang Xu, Yifan Xiao, Yizhen Gan, Lei Qi, Jinwei Chen, and Jianfeng He. Rt-detr: Real-time detection transformer with efficient hybrid encoder, 2024. 2

  35. [35]

    Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification

    Gerard Maggiolino, Adnan Ahmad, Jinkun Cao, and Kris Kitani. Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification. In2023 IEEE International conference on image processing (ICIP), pages 3025–3029. IEEE, 2023. 2, 4

  36. [36]

    Lmgp: Lifted mul- ticut meets geometry projections for multi-camera multi- object tracking

    Duy MH Nguyen, Roberto Henschel, Bodo Rosenhahn, Daniel Sonntag, and Paul Swoboda. Lmgp: Lifted mul- ticut meets geometry projections for multi-camera multi- object tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8866– 8875, 2022. 2

  37. [37]

    Multi-camera people tracking with mixture of realistic and synthetic knowledge

    Quang Qui-Vinh Nguyen, Huy Dinh-Anh Le, Truc Thi- Thanh Chau, Duc Trung Luu, Nhat Minh Chung, and Synh Viet-Uyen Ha. Multi-camera people tracking with mixture of realistic and synthetic knowledge. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5496–5506, 2023. 2

  38. [38]

    Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

    Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. InEuropean conference on computer vision, pages 194–210. Springer, 2020. 3

  39. [39]

    Sam 2: Segment anything in images and videos,

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos,

  40. [40]

    Categorical depth distribution network for monocular 3d object detection

    Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical depth distribution network for monocular 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8555–8564, 2021. 2

  41. [41]

    Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015. 2

  42. [42]

    Disen- tangling monocular 3d object detection

    Andrea Simonelli, Samuel Rota Bulo, Lorenzo Porzi, Manuel L ´opez-Antequera, and Peter Kontschieder. Disen- tangling monocular 3d object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 1991–1999, 2019. 2

  43. [43]

    Cameltrack: Context-aware multi-cue exploitation for online multi-object tracking, 2025

    Vladimir Somers, Baptiste Standaert, Victor Joos, Alexan- dre Alahi, and Christophe De Vleeschouwer. Cameltrack: Context-aware multi-cue exploitation for online multi-object tracking, 2025. 2

  44. [44]

    Ocmctrack: Online multi-target multi- camera tracking with corrective matching cascade

    Andreas Specker. Ocmctrack: Online multi-target multi- camera tracking with corrective matching cascade. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7236–7244, 2024. 2

  45. [45]

    Toward accurate on- line multi-target multi-camera tracking in real-time

    Andreas Specker and J ¨urgen Beyerer. Toward accurate on- line multi-target multi-camera tracking in real-time. In2022 30th European Signal Processing Conference (EUSIPCO), pages 533–537. IEEE, 2022. 2

  46. [46]

    Zheng Tang, Shuo Wang, David C. Anastasiu, Ming- Ching Chang, Anuj Sharma, Quan Kong, Norimasa Ko- bori, Munkhjargal Gochoo, Ganzorig Batnasan, Munkh- Erdene Otgonbold, Fady Alnajjar, Jun-Wei Hsieh, Tomasz Kornuta, Xiaolong Li, Yilin Zhao, Han Zhang, Subhashree Radhakrishnan, Arihant Jain, Ratnesh Kumar, Vidya N. Murali, Yuxing Wang, Sameer Satish Pusegao...

  47. [47]

    Earlybird: Early-fusion for multi- view tracking in the bird’s eye view

    Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Her- zog, and Gerhard Rigoll. Earlybird: Early-fusion for multi- view tracking in the bird’s eye view. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 102–111, 2024. 1

  48. [48]

    Advancing thermal multi-object tracking with attention and metric fu- sion, 2024

    Thao-Anh Tran, Vu-Minh Le, Thanh-Tung Phan, Dung Hoang, Duc Phan, Huong Ninh, and Hai Tran. Advancing thermal multi-object tracking with attention and metric fu- sion, 2024. 2

  49. [49]

    Yolov8: A novel object detection algorithm with enhanced performance and robust- ness

    Rejin Varghese and Sambath M. Yolov8: A novel object detection algorithm with enhanced performance and robust- ness. In2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), pages 1–6, 2024. 2

  50. [50]

    Pointpainting: Sequential fusion for 3d object de- tection

    Sourabh V ora, Alex H Lang, Bassam Helou, and Oscar Bei- jbom. Pointpainting: Sequential fusion for 3d object de- tection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4604–4612,

  51. [51]

    Pointaugmenting: Cross-modal augmentation for 3d object detection

    Chunwei Wang, Chao Ma, Ming Zhu, and Xiaokang Yang. Pointaugmenting: Cross-modal augmentation for 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11794– 11803, 2021. 3

  52. [52]

    Db- scan: Optimal rates for density based clustering.arXiv: Statistics Theory, 2017

    Daren Wang, Xin Yang Lu, and Alessandro Rinaldo. Db- scan: Optimal rates for density based clustering.arXiv: Statistics Theory, 2017. 6

  53. [53]

    Anastasiu, Zheng Tang, Ming- Ching Chang, Yue Yao, Liang Zheng, Mohammed Shaiqur Rahman, Meenakshi S

    Shuo Wang, David C. Anastasiu, Zheng Tang, Ming- Ching Chang, Yue Yao, Liang Zheng, Mohammed Shaiqur Rahman, Meenakshi S. Arya, Anuj Sharma, Pranamesh Chakraborty, Sanjita Prajapati, Quan Kong, Norimasa Ko- bori, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Gan- zorig Batnasan, Fady Alnajjar, Ping-Yang Chen, Jun-Wei Hsieh, Xunlei Wu, Sameer Satish Pusegaon...

  54. [54]

    Fcos3d: Fully convolutional one-stage monocular 3d object detection

    Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 913–922, 2021. 2

  55. [55]

    Pseudo- lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving

    Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hari- haran, Mark Campbell, and Kilian Q Weinberger. Pseudo- lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8445–8453, 2019. 2

  56. [56]

    Bev- sushi: Multi-target multi-camera 3d detection and tracking in bird’s-eye view.arXiv preprint arXiv:2412.00692, 2024

    Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng- Yen Yang, Sameer Satish Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, and Laura Leal-Taix ´e. Bev- sushi: Multi-target multi-camera 3d detection and tracking in bird’s-eye view.arXiv preprint arXiv:2412.00692, 2024. 1, 2

  57. [57]

    Simple online and realtime tracking with a deep association metric

    Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017. 2

  58. [58]

    ObjectSeeker: Certifiably Robust Object Detection against Patch Hiding Attacks via Patch-agnostic Masking

    Chong Xiang, Alexander Valtchanov, Saeed Mahloujifar, and Prateek Mittal. ObjectSeeker: Certifiably Robust Object Detection against Patch Hiding Attacks via Patch-agnostic Masking . In2023 IEEE Symposium on Security and Pri- vacy (SP), pages 1329–1347, Los Alamitos, CA, USA, 2023. IEEE Computer Society. 7

  59. [59]

    A robust online multi-camera people tracking system with geometric con- sistency and state-aware re-id correction

    Zhenyu Xie, Zelin Ni, Wenjie Yang, Yuang Zhang, Yi- hang Chen, Yang Zhang, and Xiao Ma. A robust online multi-camera people tracking system with geometric con- sistency and state-aware re-id correction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7007–7016, 2024. 1, 2

  60. [60]

    An online approach and evalua- tion method for tracking people across cameras in extremely long video sequence

    Cheng-Yen Yang, Hsiang-Wei Huang, Pyong-Kun Kim, Zhongyu Jiang, Kwang-Ju Kim, Chung-I Huang, Haiqing Du, and Jenq-Neng Hwang. An online approach and evalua- tion method for tracking people across cameras in extremely long video sequence. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7037–7045, 2024. 2

  61. [61]

    City-scale multi-camera vehicle tracking based on space-time-appearance features

    Hui Yao, Zhizhao Duan, Zhen Xie, Jingbo Chen, Xi Wu, Duo Xu, and Yutao Gao. City-scale multi-camera vehicle tracking based on space-time-appearance features. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3310–3318, 2022. 2

  62. [62]

    Overlap suppression clustering for offline multi-camera people tracking

    Ryuto Yoshida, Junichi Okubo, Junichiro Fujii, Masazumi Amakata, and Takayoshi Yamashita. Overlap suppression clustering for offline multi-camera people tracking. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7153–7162, 2024. 1, 2

  63. [63]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022. 2

  64. [64]

    Monodetr: Depth- guided transformer for monocular 3d object detection

    Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. Monodetr: Depth- guided transformer for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9155–9166, 2023. 2

  65. [65]

    Objects are differ- ent: Flexible monocular 3d object detection

    Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are differ- ent: Flexible monocular 3d object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 3289–3298, 2021. 2

  66. [66]

    Fairmot: On the fairness of detection and re-identification in multiple object tracking.International Journal of Computer Vision, 129(11):3069–3087, 2021

    Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking.International Journal of Computer Vision, 129(11):3069–3087, 2021. 2

  67. [67]

    Bytetrack: Multi-object tracking by associating every detection box, 2022

    Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box, 2022. 2

  68. [68]

    Multi-Target, Multi-Camera Tracking by Hierarchical Clustering: Recent Progress on DukeMTMC Project

    Zhimeng Zhang, Jianan Wu, Xuan Zhang, and Chi Zhang. Multi-target, multi-camera tracking by hierarchical cluster- ing: Recent progress on dukemtmc project.arXiv preprint arXiv:1712.09531, 2017. 2

  69. [69]

    Ob- jects as points, 2019

    Xingyi Zhou, Dequan Wang, and Philipp Kr ¨ahenb¨uhl. Ob- jects as points, 2019. 2

  70. [70]

    Deformable detr: Deformable transformers for end-to-end object detection, 2020

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection, 2020. 2

  71. [71]

    Detrs with col- laborative hybrid assignments training

    Zhuofan Zong, Guanglu Song, and Yu Liu. Detrs with col- laborative hybrid assignments training. InProceedings of the IEEE/CVF international conference on computer vision, pages 6748–6758, 2023. 2, 4