Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

Duc Huy Do; Hai Tran; Huong Ninh; Thao-Anh Tran; Vu-Minh Le; Xuan Canh Do

arxiv: 2509.09946 · v2 · pith:2JESOQ2Fnew · submitted 2025-09-12 · 💻 cs.CV

Pith reviewed 2026-05-21 22:08 UTC · model grok-4.3

keywords multi-target multi-camera tracking · 3D perception · depth-based aggregation · point cloud reconstruction · data association · online tracking · AI City Challenge · 3D bounding box

The pith

Any online 2D multi-camera tracking system extends to 3D by reconstructing targets as point clouds from depth data and recovering boxes via clustering and yaw refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a method to upgrade existing 2D multi-target multi-camera tracking pipelines to full 3D perception without replacing core tracking components. Depth information projects each tracked target into point-cloud space after 2D processing, where clustering and yaw refinement produce accurate 3D bounding boxes. An enhanced data association step maintains local ID consistency to assign global IDs across frames and cameras. This matters for large-scale surveillance because it adds automatic 3D environmental understanding on top of proven 2D systems. The framework achieved third place on the 2025 AI City Challenge 3D MTMC dataset.

Core claim

The approach extends any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. It also introduces an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset.
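As a rough illustration of the reconstruction step, the sketch below lifts the depth pixels inside one tracked 2D box into camera-frame 3D points using a pinhole model. The function name `backproject_box`, the intrinsics `fx, fy, cx, cy`, the `max_depth` cutoff, and the convention that a depth of 0 marks a missing reading are all assumptions made for this example; the paper's own projection code and calibration format are not reproduced on this page.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_box(
    depth: Sequence[Sequence[float]],
    box: Tuple[int, int, int, int],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    max_depth: float = 50.0,
) -> List[Point3D]:
    """Lift depth pixels inside a 2D box into camera-frame 3D points.

    depth is a row-major grid of metric depth values (0 = missing, an
    assumed convention). box is (u_min, v_min, u_max, v_max) in pixels,
    with the max bounds exclusive.
    """
    u_min, v_min, u_max, v_max = box
    points: List[Point3D] = []
    for v in range(v_min, v_max):
        for u in range(u_min, u_max):
            z = depth[v][u]
            if z <= 0.0 or z > max_depth:
                continue  # skip missing or implausibly far readings
            # Standard pinhole back-projection.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 4x4 depth map with one missing pixel (hypothetical values).
depth_map = [[2.0] * 4 for _ in range(4)]
depth_map[1][1] = 0.0
cloud = backproject_box(depth_map, (0, 0, 4, 4), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Moving these points into a shared world frame would additionally need each camera's extrinsics, which this sketch leaves out.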

What carries the argument

Depth-based late aggregation that reconstructs 2D-tracked targets into point clouds and recovers 3D boxes through clustering and yaw refinement.
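To make the clustering step concrete, here is a minimal brute-force DBSCAN sketch that separates a target's points from stray background points and keeps the largest cluster. The paper passage quoted elsewhere on this page mentions DBSCAN with `min_samples=50`; the `eps` value, the much smaller toy parameters, and the helper names `dbscan` and `largest_cluster` are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def dbscan(points: Sequence[Point3D], eps: float, min_samples: int) -> List[int]:
    """Return one label per point: a cluster index, or -1 for noise.

    Brute-force O(n^2) neighbour search; min_samples counts the point itself.
    """
    n = len(points)
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n  # None means unvisited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # only core points expand the cluster
        cluster += 1
    return [label if label is not None else -1 for label in labels]


def largest_cluster(points: Sequence[Point3D], labels: Sequence[int]) -> List[Point3D]:
    """Keep the points of the most populated non-noise cluster."""
    counts = Counter(label for label in labels if label != -1)
    if not counts:
        return []
    best = counts.most_common(1)[0][0]
    return [p for p, label in zip(points, labels) if label == best]


# Five points on a target plus one far background point (hypothetical values).
toy_cloud = [
    (0.0, 0.0, 5.0), (0.1, 0.0, 5.0), (0.0, 0.1, 5.0),
    (0.1, 0.1, 5.0), (0.05, 0.05, 5.1), (10.0, 10.0, 20.0),
]
toy_labels = dbscan(toy_cloud, eps=0.5, min_samples=3)
target_points = largest_cluster(toy_cloud, toy_labels)
```

A 3D box could then be fitted to `target_points`; how the paper fuses clusters and sizes the box is not specified on this page, so that part is omitted.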

If this is right

Existing 2D MTMC systems gain 3D perception capability without replacing all tracking components.
Targets are reconstructed in point-cloud space to support automatic 3D environment perception.
Local ID consistency enables improved global ID assignment across frames in online operation.
The framework delivers competitive performance on 3D MTMC benchmarks such as the AI City Challenge.
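One way to picture how local ID consistency could stabilize global IDs is a running vote: each time cross-camera matching proposes a global ID for a given (camera, local ID) pair, the proposal is counted and the most frequent one is kept. This is only a sketch under that assumption; the class `LocalIdVoter` and its voting rule are hypothetical and are not claimed to be the paper's association mechanism.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Hashable, Tuple


class LocalIdVoter:
    """Keep the global ID most often proposed for each (camera, local ID) pair."""

    def __init__(self) -> None:
        self._votes: DefaultDict[Tuple[Hashable, int], Counter] = defaultdict(Counter)

    def update(self, camera: Hashable, local_id: int, proposed_global_id: int) -> int:
        """Record one per-frame proposal and return the current consensus ID."""
        votes = self._votes[(camera, local_id)]
        votes[proposed_global_id] += 1
        # On ties, most_common keeps the candidate that was seen first.
        return votes.most_common(1)[0][0]


# Local track 12 on camera "cam0" gets one spurious proposal (ID 3) at frame 3.
voter = LocalIdVoter()
stable_ids = [voter.update("cam0", 12, proposal) for proposal in (7, 7, 3, 7)]
```

In this toy run the single spurious proposal is outvoted, so the local track keeps one global ID throughout, which is the kind of stability the bullet above describes.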

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular late-aggregation design could pair with multiple different 2D trackers to match specific camera hardware in surveillance networks.
Reusing 2D models in this way may reduce development and compute costs compared to building dedicated 3D trackers from scratch.
The method could extend to real-time applications in smart-city monitoring or traffic analysis, provided it tolerates depth from different sources, such as dedicated depth sensors or stereo estimation.

Load-bearing premise

Depth information is available and accurate enough to enable reliable point-cloud reconstruction of targets followed by clustering and yaw refinement to recover valid 3D boxes.

What would settle it

Applying the method to test scenes with noisy or missing depth maps and measuring whether recovered 3D boxes show large deviations from ground-truth 3D annotations on the AI City Challenge dataset.
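A test of this kind could start from something like the sketch below, which corrupts a target's depth readings with Gaussian noise and random dropout and reports how far the median depth drifts. The median depth is used here only as a simple stand-in for full 3D box deviation, and the function names, noise model, and parameter values are all assumptions for illustration; no results from the paper are implied.

```python
import random
import statistics
from typing import Dict, List, Optional, Sequence


def perturb_depth(
    depths: Sequence[float], sigma: float, dropout: float, rng: random.Random
) -> List[Optional[float]]:
    """Add zero-mean Gaussian noise and randomly drop readings (None = missing)."""
    noisy: List[Optional[float]] = []
    for z in depths:
        if rng.random() < dropout:
            noisy.append(None)
        else:
            noisy.append(max(z + rng.gauss(0.0, sigma), 0.0))
    return noisy


def median_depth_error(
    depths: Sequence[float], sigma: float, dropout: float, seed: int = 0
) -> float:
    """Absolute shift of the median depth after corruption (inf if nothing survives)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    kept = [z for z in perturb_depth(depths, sigma, dropout, rng) if z is not None]
    if not kept:
        return float("inf")
    return abs(statistics.median(kept) - statistics.median(depths))


# A flat target 10 m from the camera, sampled at 200 pixels (hypothetical).
target_depths = [10.0] * 200
errors: Dict[float, float] = {
    sigma: median_depth_error(target_depths, sigma, dropout=0.2)
    for sigma in (0.0, 0.1, 0.5)
}
```

Replacing the median with the full box-recovery step and comparing against ground-truth 3D annotations would turn this into the experiment described above.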

Figures

Figures reproduced from arXiv: 2509.09946 by Duc Huy Do, Hai Tran, Huong Ninh, Thao-Anh Tran, Vu-Minh Le, Xuan Canh Do.

**Figure 2.** An illustration of our Online 3D Multi-Target Multi-Camera Tracking pipeline with Late 3D Bounding Box Aggregation.

**Figure 3.** An example of our track splitting mechanism. At time ...

**Figure 5.** (a) A single object is assigned multiple 3D bounding ...

read the original abstract

Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduced an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Practical way to bolt 3D onto existing 2D MTMC trackers via depth point clouds and clustering, with a third-place challenge result but thin robustness checks.

The main thing to know is that this paper gives a late-fusion recipe for turning any online 2D multi-camera tracker into a 3D one: project the tracked targets into point clouds using depth, cluster to recover 3D boxes, refine yaw, and use local ID consistency to help with global association across frames. They placed third on the 2025 AI City Challenge 3D MTMC dataset. That is the concrete contribution.

What works is the compatibility angle. By leaving the 2D tracker untouched and adding the depth step afterward, the method can sit on top of systems that are already running and calibrated. The ID consistency tweak is a reasonable, low-overhead way to keep associations stable online without rebuilding the whole matching logic.

The soft spot is exactly the one the stress-test flagged. The 3D box recovery depends on depth data producing clean enough point clouds for clustering and yaw to succeed, yet the paper gives no numbers on how performance drops with typical depth noise, missing values, or partial occlusions. Those are everyday problems in surveillance, and without ablations or error analysis on that step the third-place score is harder to interpret. The rest of the pipeline looks standard.

This is for teams already doing large-scale 2D MTMC who want a quick path to 3D output on calibrated rigs. A practitioner or challenge participant would get usable ideas from the description and the leaderboard placement. It is worth sending to peer review because the method is fully specified, the result is public and competitive, and the practical framing is clear even if reviewers will want more depth-sensitivity experiments.

Referee Report

2 major / 2 minor

Summary. The paper claims to extend any online 2D multi-camera tracking system to 3D perception by reconstructing tracked targets into point clouds using depth information, then applying clustering and yaw refinement to recover 3D bounding boxes; it further introduces an enhanced online data association step that exploits local ID consistency for global ID assignment across frames. The framework is evaluated on the 2025 AI City Challenge 3D MTMC dataset and reports a 3rd-place ranking.

Significance. If the central claim holds, the approach offers a practical, modular upgrade path for existing 2D MTMC pipelines to 3D without replacing core tracking components, which is relevant for large-scale surveillance. The reported 3rd-place result on the challenge dataset supplies concrete empirical support for the overall pipeline. However, the absence of targeted robustness analysis on the depth-to-3D conversion step limits the strength of the evidence for real-world deployment.

major comments (2)

[Method description of 3D box recovery] The depth-based late aggregation (point-cloud reconstruction followed by clustering and yaw refinement) is the sole mechanism converting 2D tracks into 3D boxes and is therefore load-bearing. No quantitative characterization or ablation is provided on how depth noise, sensor inaccuracies, or partial occlusions affect under-/over-segmentation or yaw stability, leaving the central claim without direct verification.
[Experiments and results] Table or leaderboard results report 3rd place but contain no component ablations isolating the contribution of the proposed clustering/yaw refinement versus the base 2D tracker, nor any error analysis broken down by depth quality or occlusion level. This makes it impossible to confirm that the 3D extension itself drives the ranking rather than upstream 2D performance.

minor comments (2)

[Abstract and §3] The abstract and method section would benefit from explicitly naming the clustering algorithm and its parameters (e.g., DBSCAN's epsilon and min_samples) and the exact yaw refinement procedure to allow reproduction.
[Data association subsection] Notation for local versus global IDs is introduced but not consistently defined in equations or pseudocode, which reduces clarity of the data-association enhancement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional analysis would strengthen the evidence for the proposed 3D extension. We address each major comment below and will incorporate revisions to improve the paper.

Referee: [Method description of 3D box recovery] The depth-based late aggregation (point-cloud reconstruction followed by clustering and yaw refinement) is the sole mechanism converting 2D tracks into 3D boxes and is therefore load-bearing. No quantitative characterization or ablation is provided on how depth noise, sensor inaccuracies, or partial occlusions affect under-/over-segmentation or yaw stability, leaving the central claim without direct verification.

Authors: We agree that a targeted robustness study on the depth-to-3D conversion would provide stronger verification of the central claim. The current evaluation relies on the real-world variations present in the 2025 AI City Challenge 3D MTMC dataset, which includes diverse depth qualities and occlusion scenarios as reflected in the 3rd-place result. To directly address the concern, we will add a new subsection with controlled experiments injecting synthetic depth noise and simulating partial occlusions to measure effects on clustering stability and yaw estimation.
Revision: yes

Referee: [Experiments and results] Table or leaderboard results report 3rd place but contain no component ablations isolating the contribution of the proposed clustering/yaw refinement versus the base 2D tracker, nor any error analysis broken down by depth quality or occlusion level. This makes it impossible to confirm that the 3D extension itself drives the ranking rather than upstream 2D performance.

Authors: We concur that explicit component ablations and stratified error analysis are needed to isolate the contribution of the late-aggregation steps. In the revised manuscript we will include an ablation table comparing the full pipeline against a baseline that projects 2D tracks to 3D without clustering or yaw refinement. We will also add error breakdowns stratified by depth quality (using available sensor metadata) and occlusion level (derived from track visibility annotations) to demonstrate the incremental benefit of the proposed 3D components.
Revision: yes

Circularity Check

0 steps flagged

No circularity: standard depth-to-3D pipeline after 2D tracking

full rationale

The paper's core chain is: run any existing online 2D MTMC tracker, project tracked targets into point-cloud space using supplied depth, apply clustering plus yaw refinement to obtain 3D boxes, and use local-ID consistency for global association. None of these steps is defined in terms of its own output, fitted to a subset and then re-predicted, or justified solely by a self-citation whose content is unverified. The method is presented as a modular post-processing extension whose validity is checked by leaderboard performance on the 2025 AI City 3D MTMC dataset. No equations or uniqueness theorems reduce the result to the input by construction.
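The yaw refinement in this chain can be read as estimating heading from how far a target has moved over a short window of frames. The sketch below follows the 10-frame displacement expression quoted from the paper elsewhere on this page, but swaps the plain arctangent of a ratio for `math.atan2`, which agrees with it when the x displacement is positive and also handles zero x displacement and the remaining quadrants. The function name `yaw_from_track` and the `min_motion` threshold for near-stationary targets are assumptions added for this example.

```python
import math
from typing import Optional, Sequence


def yaw_from_track(
    xs: Sequence[float],
    ys: Sequence[float],
    window: int = 10,
    min_motion: float = 1e-6,
) -> Optional[float]:
    """Heading in radians from the displacement over `window` frames.

    Returns None when the track is shorter than the window or the target
    has barely moved, since the heading is then undetermined.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) <= window:
        return None
    dx = xs[-1] - xs[-1 - window]
    dy = ys[-1] - ys[-1 - window]
    if math.hypot(dx, dy) < min_motion:
        return None
    return math.atan2(dy, dx)


# A target moving diagonally for 11 frames (hypothetical ground-plane positions).
track_x = [float(i) for i in range(11)]
track_y = [float(i) for i in range(11)]
yaw = yaw_from_track(track_x, track_y)
```

For the diagonal track above the displacement is equal in x and y, so the heading comes out at 45 degrees (pi/4 radians).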

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; ledger reflects implied assumptions from the described pipeline. No explicit free parameters or invented entities are stated.

axioms (2)

domain assumption: Depth information from cameras is available and sufficiently accurate for point-cloud reconstruction of tracked targets.
The entire 3D extension step depends on this input being reliable for clustering and box recovery.
domain assumption: Existing 2D multi-camera tracking produces sufficiently robust local IDs and trajectories to support subsequent 3D aggregation.
The method is explicitly designed to extend any online 2D system, assuming the 2D component works well.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

DBSCAN clustering ... epsilon ... min_samples=50 ... volume-based fusion ... $\text{yaw} = \arctan\left(\frac{y_t - y_{t-10}}{x_t - x_{t-10}}\right)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

[1]

Simple online and realtime tracking

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In2016 IEEE international conference on image processing (ICIP), pages 3464–3468. Ieee, 2016. 2

work page 2016
[2]

M3d-rpn: Monocular 3d region proposal network for object detection

Garrick Brazil and Xiaoming Liu. M3d-rpn: Monocular 3d region proposal network for object detection. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 9287–9296, 2019. 2

work page 2019
[3]

Cascade r-cnn: Delv- ing into high quality object detection

Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delv- ing into high quality object detection. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 6154–6162, 2018. 2

work page 2018
[4]

Observation-centric sort: Rethink- ing sort for robust multi-object tracking

Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khi- rodkar, and Kris Kitani. Observation-centric sort: Rethink- ing sort for robust multi-object tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9686–9696, 2023. 2, 4

work page 2023
[5]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 2

work page 2020
[6]

Dsgn: Deep stereo geometry network for 3d object detection

Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 12536–12545, 2020. 2

work page 2020
[7]

Monopair: Monocular 3d object detection using pairwise spatial relationships

Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. Monopair: Monocular 3d object detection using pairwise spatial relationships. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 12093–12102, 2020. 2

work page 2020
[8]

Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking

Cheng-Che Cheng, Min-Xuan Qiu, Chen-Kuo Chiang, and Shang-Hong Lai. Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10051–10060, 2023. 2

work page 2023
[9]

Online multi- camera people tracking with spatial-temporal mechanism and anchor-feature hierarchical clustering

Riu Cherdchusakulchai, Sasin Phimsiri, Visarut Trairat- tanapa, Suchat Tungjitnob, Wasu Kudisthalert, Pornprom Ki- awjak, Ek Thamwiwatthana, Phawat Borisuitsawat, Teep- akorn Tosawadi, Pakcheera Choppradit, et al. Online multi- camera people tracking with spatial-temporal mechanism and anchor-feature hierarchical clustering. InProceedings of the IEEE/CVF ...

work page 2024
[10]

Deepstereo: Learning to predict new views from the world’s imagery

John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5515–5524,

work page
[11]

Yolox: Exceeding yolo series in 2021, 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021, 2021. 2

work page 2021
[12]

Fast r-cnn

Ross Girshick. Fast r-cnn. InProceedings of the IEEE inter- national conference on computer vision, pages 1440–1448,

work page
[13]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InProceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 2

work page 2017
[14]

Enhancing multi-camera people tracking with anchor-guided clustering and spatio-temporal consistency id re-assignment

Hsiang-Wei Huang, Cheng-Yen Yang, Zhongyu Jiang, Pyong-Kun Kim, Kyoungoh Lee, Kwangju Kim, Samartha Ramkumar, Chaitanya Mullapudi, In-Su Jang, Chung-I Huang, et al. Enhancing multi-camera people tracking with anchor-guided clustering and spatio-temporal consistency id re-assignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern R...

work page 2023
[15]

People tracking in multi-camera systems: a re- view.Multimedia Tools and Applications, 78:10773–10793,

Rabah Iguernaissi, Djamal Merad, Kheireddine Aziz, and Pierre Drap. People tracking in multi-camera systems: a re- view.Multimedia Tools and Applications, 78:10773–10793,

work page
[16]

Rtmpose: Real-time multi-person pose estimation based on mmpose

Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. Rtmpose: Real- time multi-person pose estimation based on mmpose.arXiv preprint arXiv:2303.07399, 2023. 4

work page arXiv 2023
[17]

Ultralytics yolo11, 2024

Glenn Jocher and Jing Qiu. Ultralytics yolo11, 2024. 2

work page 2024
[18]

Addressing the occlusion problem in multi-camera people tracking with human pose estimation

Jeongho Kim, Wooksu Shin, Hancheol Park, and Jongwon Baek. Addressing the occlusion problem in multi-camera people tracking with human pose estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5463–5469, 2023. 2, 4

work page 2023
[19]

Cluster self-refinement for enhanced online multi- camera people tracking

Jeongho Kim, Wooksu Shin, Hancheol Park, and Donghyuk Choi. Cluster self-refinement for enhanced online multi- camera people tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7190–7197, 2024. 1, 2, 4, 8

work page 2024
[20]

Pointpillars: Fast encoders for object detection from point clouds

Alex H Lang, Sourabh V ora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019. 3

work page 2019
[21]

xformers: A modular and hackable trans- former modelling library.https : / / github

Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. xformers: A modular and hackable trans- former modelling library.https : / / github . com / facebookresearch/xformers, 2022. 7

work page 2022
[22]

Crowdpose: Efficient crowded scenes pose estimation and a new benchmark

Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10863–10872, 2019. 4

work page 2019
[23]

Stereo r-cnn based 3d object detection for autonomous driving

Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for autonomous driving. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7644–7652, 2019. 2

work page 2019
[24]

Rtm3d: Real-time monocular 3d detection from object key- points for autonomous driving

Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. Rtm3d: Real-time monocular 3d detection from object key- points for autonomous driving. InEuropean Conference on Computer Vision, pages 644–660. Springer, 2020. 2

work page 2020
[25]

Clip-reid: exploiting vision-language model for image re-identification without concrete text labels

Siyuan Li, Li Sun, and Qingli Li. Clip-reid: exploiting vision-language model for image re-identification without concrete text labels. InProceedings of the AAAI conference on artificial intelligence, pages 1405–1413, 2023. 2, 4, 7

work page 2023
[26]

Exploring plain vision transformer backbones for object de- tection

Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object de- tection. InEuropean conference on computer vision, pages 280–296. Springer, 2022. 2

work page 2022
[27]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 3

work page 2024
[28]

Monodetrnext: Next-generation accurate and efficient monocular 3d object detector, 2024

Pan Liao, Feng Yang, Di Wu, Wenhui Zhao, and Jinwen Yu. Monodetrnext: Next-generation accurate and efficient monocular 3d object detector, 2024. 3

work page 2024
[29]

Smoke: Single- stage monocular 3d object detection via keypoint estimation

Zechen Liu, Zizhang Wu, and Roland T ´oth. Smoke: Single- stage monocular 3d object detection via keypoint estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 996–997,

work page
[30]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows . In 2021 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 9992–10002, Los Alamitos, CA, USA,

work page 2021
[31]

IEEE Computer Society. 7

work page
[32]

Geometry uncer- tainty projection network for monocular 3d object detection

Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncer- tainty projection network for monocular 3d object detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3111–3121, 2021. 2

work page 2021
[33]

Hota: A higher order metric for evaluating multi-object tracking.International journal of computer vision, 129:548– 578, 2021

Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taix´e, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking.International journal of computer vision, 129:548– 578, 2021. 7

work page 2021
[34]

Rt-detr: Real-time detection transformer with efficient hybrid encoder, 2024

Wenyu Lv, Yuxiang Chen, Xinghao Chen, Shangliang Xu, Yifan Xiao, Yizhen Gan, Lei Qi, Jinwei Chen, and Jianfeng He. Rt-detr: Real-time detection transformer with efficient hybrid encoder, 2024. 2

work page 2024
[35]

Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification

Gerard Maggiolino, Adnan Ahmad, Jinkun Cao, and Kris Kitani. Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification. In2023 IEEE International conference on image processing (ICIP), pages 3025–3029. IEEE, 2023. 2, 4

work page 2023
[36]

Lmgp: Lifted mul- ticut meets geometry projections for multi-camera multi- object tracking

Duy MH Nguyen, Roberto Henschel, Bodo Rosenhahn, Daniel Sonntag, and Paul Swoboda. Lmgp: Lifted mul- ticut meets geometry projections for multi-camera multi- object tracking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8866– 8875, 2022. 2

work page 2022
[37]

Multi-camera people tracking with mixture of realistic and synthetic knowledge

Quang Qui-Vinh Nguyen, Huy Dinh-Anh Le, Truc Thi- Thanh Chau, Duc Trung Luu, Nhat Minh Chung, and Synh Viet-Uyen Ha. Multi-camera people tracking with mixture of realistic and synthetic knowledge. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5496–5506, 2023. 2

work page 2023
[38]

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. InEuropean conference on computer vision, pages 194–210. Springer, 2020. 3

work page 2020
[39]

Sam 2: Segment anything in images and videos,

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos,

work page
[40]

Categorical depth distribution network for monocular 3d object detection

Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical depth distribution network for monocular 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8555–8564, 2021. 2

work page 2021
[41]

Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015. 2

work page 2015
[42]

Disen- tangling monocular 3d object detection

Andrea Simonelli, Samuel Rota Bulo, Lorenzo Porzi, Manuel L ´opez-Antequera, and Peter Kontschieder. Disen- tangling monocular 3d object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 1991–1999, 2019. 2

work page 1991
[43]

Cameltrack: Context-aware multi-cue exploitation for online multi-object tracking, 2025

Vladimir Somers, Baptiste Standaert, Victor Joos, Alexan- dre Alahi, and Christophe De Vleeschouwer. Cameltrack: Context-aware multi-cue exploitation for online multi-object tracking, 2025. 2

work page 2025
[44]

Ocmctrack: Online multi-target multi- camera tracking with corrective matching cascade

Andreas Specker. Ocmctrack: Online multi-target multi- camera tracking with corrective matching cascade. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7236–7244, 2024. 2

work page 2024
[45]

Toward accurate on- line multi-target multi-camera tracking in real-time

Andreas Specker and J ¨urgen Beyerer. Toward accurate on- line multi-target multi-camera tracking in real-time. In2022 30th European Signal Processing Conference (EUSIPCO), pages 533–537. IEEE, 2022. 2

work page 2022
[46]

Zheng Tang, Shuo Wang, David C. Anastasiu, Ming- Ching Chang, Anuj Sharma, Quan Kong, Norimasa Ko- bori, Munkhjargal Gochoo, Ganzorig Batnasan, Munkh- Erdene Otgonbold, Fady Alnajjar, Jun-Wei Hsieh, Tomasz Kornuta, Xiaolong Li, Yilin Zhao, Han Zhang, Subhashree Radhakrishnan, Arihant Jain, Ratnesh Kumar, Vidya N. Murali, Yuxing Wang, Sameer Satish Pusegao...

work page 2025
[47]

Earlybird: Early-fusion for multi- view tracking in the bird’s eye view

Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Her- zog, and Gerhard Rigoll. Earlybird: Early-fusion for multi- view tracking in the bird’s eye view. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 102–111, 2024. 1

work page 2024
[48]

Advancing thermal multi-object tracking with attention and metric fu- sion, 2024

Thao-Anh Tran, Vu-Minh Le, Thanh-Tung Phan, Dung Hoang, Duc Phan, Huong Ninh, and Hai Tran. Advancing thermal multi-object tracking with attention and metric fu- sion, 2024. 2

work page 2024
[49]

Yolov8: A novel object detection algorithm with enhanced performance and robust- ness

Rejin Varghese and Sambath M. Yolov8: A novel object detection algorithm with enhanced performance and robust- ness. In2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), pages 1–6, 2024. 2

work page 2024
[50]

Pointpainting: Sequential fusion for 3d object de- tection

Sourabh V ora, Alex H Lang, Bassam Helou, and Oscar Bei- jbom. Pointpainting: Sequential fusion for 3d object de- tection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4604–4612,

work page
[51]

Pointaugmenting: Cross-modal augmentation for 3d object detection

Chunwei Wang, Chao Ma, Ming Zhu, and Xiaokang Yang. Pointaugmenting: Cross-modal augmentation for 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11794– 11803, 2021. 3

work page 2021
[52]

Db- scan: Optimal rates for density based clustering.arXiv: Statistics Theory, 2017

Daren Wang, Xin Yang Lu, and Alessandro Rinaldo. Db- scan: Optimal rates for density based clustering.arXiv: Statistics Theory, 2017. 6

[53] Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Yue Yao, Liang Zheng, Mohammed Shaiqur Rahman, Meenakshi S. Arya, Anuj Sharma, Pranamesh Chakraborty, Sanjita Prajapati, Quan Kong, Norimasa Kobori, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Ganzorig Batnasan, Fady Alnajjar, Ping-Yang Chen, Jun-Wei Hsieh, Xunlei Wu, Sameer Satish Pusegaon...

[54] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 913–922, 2021.

[55] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8445–8453, 2019.

[56] Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng-Yen Yang, Sameer Satish Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, and Laura Leal-Taixé. Bev-sushi: Multi-target multi-camera 3d detection and tracking in bird’s-eye view. arXiv preprint arXiv:2412.00692, 2024.

[57] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017.

[58] Chong Xiang, Alexander Valtchanov, Saeed Mahloujifar, and Prateek Mittal. ObjectSeeker: Certifiably Robust Object Detection against Patch Hiding Attacks via Patch-agnostic Masking. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1329–1347, Los Alamitos, CA, USA, 2023. IEEE Computer Society.

[59] Zhenyu Xie, Zelin Ni, Wenjie Yang, Yuang Zhang, Yihang Chen, Yang Zhang, and Xiao Ma. A robust online multi-camera people tracking system with geometric consistency and state-aware re-id correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7007–7016, 2024.

[60] Cheng-Yen Yang, Hsiang-Wei Huang, Pyong-Kun Kim, Zhongyu Jiang, Kwang-Ju Kim, Chung-I Huang, Haiqing Du, and Jenq-Neng Hwang. An online approach and evaluation method for tracking people across cameras in extremely long video sequence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7037–7045, 2024.

[61] Hui Yao, Zhizhao Duan, Zhen Xie, Jingbo Chen, Xi Wu, Duo Xu, and Yutao Gao. City-scale multi-camera vehicle tracking based on space-time-appearance features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3310–3318, 2022.

[62] Ryuto Yoshida, Junichi Okubo, Junichiro Fujii, Masazumi Amakata, and Takayoshi Yamashita. Overlap suppression clustering for offline multi-camera people tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7153–7162, 2024.

[63] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.

[64] Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. Monodetr: Depth-guided transformer for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9155–9166, 2023.

[65] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3289–3298, 2021.

[66] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11):3069–3087, 2021.

[67] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box, 2022.

[68] Zhimeng Zhang, Jianan Wu, Xuan Zhang, and Chi Zhang. Multi-target, multi-camera tracking by hierarchical clustering: Recent progress on dukemtmc project. arXiv preprint arXiv:1712.09531, 2017.

[69] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points, 2019.

[70] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection, 2020.

[71] Zhuofan Zong, Guanglu Song, and Yu Liu. Detrs with collaborative hybrid assignments training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6748–6758, 2023.

