Fast Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation
Pith reviewed 2026-05-10 12:09 UTC · model grok-4.3
The pith
An efficient Bayes-optimal filter enables fast 3D multi-object tracking and pose estimation from multiple monocular cameras using only 2D detections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that joint 3D multi-object tracking and pose estimation can be performed in real time across multiple cameras by an efficient implementation of a Bayes-optimal multi-object tracking filter operating solely on 2D bounding-box and pose detections from publicly available models. The method is claimed to be faster than state-of-the-art approaches without accuracy loss, and to remain robust as cameras are disconnected and reconnected during operation.
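The Bayes-optimal filter at the heart of this claim generalizes the familiar single-object predict/update recursion to sets of labeled objects. As a minimal, hypothetical sketch of that underlying recursion (a plain linear Kalman filter, not the paper's multi-object formulation):

```python
import numpy as np

def predict(x, P, F, Q):
    """Bayes prediction under a linear-Gaussian motion model x' = F x + noise(Q)."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R):
    """Bayes update with a linear-Gaussian measurement z = H x + noise(R)."""
    S = H @ P @ H.T + R             # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

# Constant-velocity demo in 1D: state [position, velocity], position measured.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x, P = np.array([0.0, 0.0]), np.eye(2)
for k in range(1, 20):
    x, P = predict(x, P, F, Q)
    x, P = update(x, P, np.array([float(k)]), H, R)  # target moves 1 unit/step
```

Multi-object filters such as the generalized labeled multi-Bernoulli filter the paper builds on extend this recursion with data association and with track birth and death.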
What carries the argument
efficient implementation of a Bayes-optimal multi-object tracking filter
If this is right
- The algorithm is significantly faster than state-of-the-art methods.
- Accuracy is maintained without compromise.
- Only publicly available pre-trained 2D detection models are required.
- The system remains robust to intermittent camera disconnections and reconnections.
- It performs online joint 3D multi-object tracking and pose estimation.
Where Pith is reading between the lines
- This suggests 3D capabilities can be added to existing 2D vision systems with minimal overhead.
- The approach may support applications in environments with variable camera availability, such as moving platforms.
- Similar efficiency techniques could benefit other Bayesian tracking problems in computer vision.
Load-bearing premise
Reliable 2D bounding box and pose detections from off-the-shelf models are sufficient to drive accurate 3D multi-object tracking and pose estimation across cameras without additional 3D-specific training or calibration.
What would settle it
Testing the reported speed and accuracy on benchmark multi-camera datasets; failing to exceed state-of-the-art speed, or to match state-of-the-art accuracy, would falsify the claims.
Original abstract
This paper proposes a fast and online method for jointly performing 3D multi-object tracking and pose estimation using multiple monocular cameras. Our algorithm requires only 2D bounding box and pose detections, eliminating the need for costly 3D training data or computationally expensive deep learning models. Our solution is an efficient implementation of a Bayes-optimal multi-object tracking filter, enhancing computational efficiency while maintaining accuracy. We demonstrate that our algorithm is significantly faster than state-of-the-art methods without compromising accuracy, using only publicly available pre-trained 2D detection models. We also illustrate the robust performance of our algorithm in scenarios where multiple cameras are intermittently disconnected or reconnected during operation.
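The abstract's pipeline lifts 2D detections into 3D using known camera geometry. A minimal sketch of that lifting step, assuming calibrated cameras with known 3x4 projection matrices (linear DLT triangulation; the paper's filter fuses views probabilistically rather than by direct triangulation):

```python
import numpy as np

def triangulate(points_2d, projections):
    """Linear (DLT) triangulation of a single 3D point from its 2D image
    coordinates in several calibrated cameras.

    points_2d  : list of (u, v) normalized image coordinates
    projections: list of 3x4 camera matrices P = K [R | t]
    """
    rows = []
    for (u, v), P in zip(points_2d, projections):
        # Each view contributes two linear constraints on the homogeneous point X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)
    # Solution: right singular vector for the smallest singular value of A.
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

# Two identity-intrinsics cameras one unit apart along x, observing (1, 2, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
point = triangulate([(0.2, 0.4), (0.0, 0.4)], [P1, P2])
```

With exact detections the recovered point matches the true position; noisy 2D boxes and poses instead yield a least-squares estimate, which is why the paper wraps this geometry in a Bayesian filter.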
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a fast online algorithm for joint 3D multi-object tracking and pose estimation across multiple monocular cameras. It implements an efficient Bayes-optimal multi-object tracking filter that operates exclusively on 2D bounding-box and pose detections from publicly available pre-trained models, eliminating the need for 3D training data or heavy deep-learning inference. The authors claim substantial runtime improvements over prior art while preserving accuracy and demonstrate robustness to intermittent camera disconnections.
Significance. If the performance claims are substantiated with quantitative evidence, the work would offer a practical, calibration-light solution for real-time 3D perception in multi-camera surveillance and robotics settings. The emphasis on using only off-the-shelf 2D detectors and an efficient filter implementation could reduce deployment costs, but the absence of any reported error metrics, runtime numbers, or dataset details in the abstract leaves the significance difficult to evaluate at present.
major comments (3)
- [Abstract] Abstract: The central claim that the method is 'significantly faster than state-of-the-art methods without compromising accuracy' is unsupported by any quantitative results, error metrics (e.g., MOTA, MOTP, pose error), runtime benchmarks, or experimental protocol. This absence directly undermines evaluation of the 'no accuracy loss' assertion.
- [Method] Method section (presumed §3–4): The transition from 2D detections to metric 3D states via the Bayes filter presupposes multi-view geometry. No explicit treatment of camera intrinsics, extrinsics, or online calibration is described, yet the abstract asserts operation 'without ... calibration.' This geometric precondition is load-bearing for the accuracy claim and must be clarified with either a stated assumption or an auxiliary estimation procedure.
- [Experiments] Experimental evaluation (presumed §5): The robustness claim for 'intermittently disconnected or reconnected' cameras requires concrete metrics on tracking continuity and pose drift during camera loss events; without such results the claim remains unverified.
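For reference, the MOTA metric named in the first comment has a standard CLEAR-MOT definition (ref. [54]); a minimal sketch, with the counts as hypothetical inputs:

```python
def mota(misses, false_positives, id_switches, num_gt_objects):
    """CLEAR-MOT tracking accuracy: 1 minus the normalized total error count.

    num_gt_objects is the total number of ground-truth objects
    summed over all frames of the sequence.
    """
    return 1.0 - (misses + false_positives + id_switches) / num_gt_objects

# e.g. 10 misses, 5 false positives, 2 identity switches over 100 GT objects
score = mota(10, 5, 2, 100)  # -> 0.83
```

MOTP, the companion metric, is simply the mean localization error over all matched object-hypothesis pairs.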
minor comments (2)
- [Abstract] The abstract is unusually long and contains redundant phrasing ('fast and online', 'enhancing computational efficiency while maintaining accuracy'); condensing it would improve readability.
- [Method] Notation for the Bayes filter state vector and measurement model should be introduced with a clear table or diagram early in the method section to aid readers unfamiliar with the specific filter formulation.
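One plausible shape for such a notation summary, hedged since the paper's exact symbols are not reproduced in this review: a per-object kinematic state paired with a per-camera pinhole measurement model.

```latex
% Hypothetical notation sketch (not the paper's actual symbols):
% per-object kinematic state: 3D position p_k and its velocity
x_k = \begin{bmatrix} p_k \\ \dot p_k \end{bmatrix} \in \mathbb{R}^{6},
\qquad
% measurement from camera c: pinhole projection of p_k plus noise w
z_k^{(c)} = \pi\!\left( K_c \, [\, R_c \mid t_c \,]
            \begin{bmatrix} p_k \\ 1 \end{bmatrix} \right) + w_k^{(c)},
\qquad
\pi\!\left([a,\, b,\, c]^{\top}\right) = [\,a/c,\; b/c\,]^{\top}.
```

Here $K_c$, $R_c$, $t_c$ are camera $c$'s intrinsics and extrinsics, assumed known and fixed.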
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment in detail below, providing clarifications from the manuscript and indicating where revisions will be made to improve clarity and completeness.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the method is 'significantly faster than state-of-the-art methods without compromising accuracy' is unsupported by any quantitative results, error metrics (e.g., MOTA, MOTP, pose error), runtime benchmarks, or experimental protocol. This absence directly undermines evaluation of the 'no accuracy loss' assertion.
Authors: The abstract provides a high-level summary of the contributions. Quantitative support for the speed and accuracy claims appears in Section 5, which reports runtime benchmarks against prior methods, MOTA/MOTP scores, and pose estimation errors on public multi-camera datasets using only off-the-shelf 2D detectors. To make the abstract self-contained for readers, we will revise it to include one or two key numerical highlights (e.g., average FPS improvement and accuracy parity) while preserving its brevity. revision: yes
-
Referee: [Method] Method section (presumed §3–4): The transition from 2D detections to metric 3D states via the Bayes filter presupposes multi-view geometry. No explicit treatment of camera intrinsics, extrinsics, or online calibration is described, yet the abstract asserts operation 'without ... calibration.' This geometric precondition is load-bearing for the accuracy claim and must be clarified with either a stated assumption or an auxiliary estimation procedure.
Authors: The method relies on standard multi-view geometry with known, fixed camera intrinsics and extrinsics; these are treated as given inputs, consistent with the majority of multi-camera tracking literature. The phrase 'without calibration' in the manuscript refers specifically to the absence of any online or dynamic calibration step and to the elimination of 3D-specific training data or heavy 3D inference. We will add an explicit statement in the revised method section clarifying this assumption and noting that no auxiliary online calibration procedure is required or performed. revision: yes
-
Referee: [Experiments] Experimental evaluation (presumed §5): The robustness claim for 'intermittently disconnected or reconnected' cameras requires concrete metrics on tracking continuity and pose drift during camera loss events; without such results the claim remains unverified.
Authors: Section 5 already includes timing and qualitative tracking continuity results under simulated camera disconnections. To strengthen the claim, we will augment the experimental section with quantitative metrics such as ID-switch rates, track-fragmentation counts, and average pose-error increase during disconnection intervals, computed on the same datasets used for the main evaluation. revision: yes
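The ID-switch counts the authors promise can be computed from per-frame ground-truth-to-track matchings; a minimal, hypothetical sketch following the usual CLEAR-MOT convention of comparing each object's current match against its most recent previous match:

```python
def count_id_switches(frame_assignments):
    """Count identity switches across a sequence of frames.

    frame_assignments: one dict per frame mapping ground-truth object id ->
    matched track id (omit an id when the object is missed in that frame).
    A switch is counted when an object's matched track id differs from the
    track id of its most recent previous match.
    """
    last_match = {}
    switches = 0
    for frame in frame_assignments:
        for gt_id, track_id in frame.items():
            if gt_id in last_match and last_match[gt_id] != track_id:
                switches += 1
            last_match[gt_id] = track_id
    return switches

# One object tracked as 'a', lost for a frame (e.g. a camera dropout), then
# re-acquired under a new track id 'b': one identity switch.
n = count_id_switches([{1: "a"}, {1: "a"}, {}, {1: "b"}])  # -> 1
```

Track fragmentations can be counted analogously from the gaps in each object's match sequence, which would make the promised disconnection experiments directly reproducible.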
Circularity Check
No significant circularity; the derivation is a self-contained implementation of a standard filter.
Full rationale
The paper presents its core contribution as an efficient implementation of a Bayes-optimal multi-object tracking filter that takes 2D bounding box and pose detections from publicly available pre-trained models as input. No equations, predictions, or first-principles results in the abstract or described approach reduce by construction to fitted parameters, self-definitions, or self-citation chains. The accuracy and speed claims are framed as empirical outcomes of applying the standard filter to external detections, without renaming known results or smuggling ansatzes via self-citation. The method remains open to external validation on calibration and geometry assumptions, but these do not create circularity within the derivation itself.
Reference graph
Works this paper leans on
- [1] L. Bridgeman, M. Volino, J.-Y. Guillemaut, and A. Hilton, "Multi-person 3D pose estimation and tracking in sports," in IEEE Conf. Comput. Vis. Pattern Recog. Workshops, 2019, pp. 2487–2496.
- [2] H. Bradler, A. Kretz, and R. Mester, "Urban traffic surveillance (UTS): A fully probabilistic 3D tracking approach based on 2D detections," in IEEE Intell. Vehicles Symp., 2021, pp. 1198–1205.
- [3] Z. Liao, J. Zhu, C. Wang, H. Hu, and S. L. Waslander, "Multiple view geometry transformers for 3D human pose estimation," in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 708–717.
- [4] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," arXiv preprint arXiv:2107.08430, 2021.
- [5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 7291–7299.
- [6] H. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y. Xiu, Y.-L. Li, and C. Lu, "AlphaPose: Whole-body regional multi-person pose estimation and tracking in real-time," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 7157–7173, 2022.
- [7] J. Ong, B.-T. Vo, B.-N. Vo, D. Y. Kim, and S. E. Nordholm, "A Bayesian filter for multi-view 3D multi-object tracking with occlusion handling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2246–2263, 2022.
- [8] L. V. Ma, T. T. D. Nguyen, B.-N. Vo, H. Jang, and M. Jeon, "Track initialization and re-identification for 3D multi-view multi-object tracking," Inf. Fusion, p. 102496, 2024.
- [9] B.-N. Vo, B.-T. Vo, and M. Beard, "Multi-sensor multi-object tracking with the generalized labeled multi-Bernoulli filter," IEEE Trans. Signal Process., vol. 67, no. 23, pp. 5952–5967, 2019.
- [10] S. M. Khan and M. Shah, "A multiview approach to tracking people in crowded scenes using a planar homography constraint," in Eur. Conf. Comput. Vis. Springer, 2006, pp. 133–146.
- [11] R. Eshel and Y. Moses, "Homography based multiple camera detection and tracking of people in a dense crowd," in IEEE Conf. Comput. Vis. Pattern Recog., 2008, pp. 1–8.
- [12] T. Chavdarova and F. Fleuret, "Deep multi-camera people detection," in IEEE Int. Conf. Mach. Learning and Appl., 2017, pp. 848–853.
- [13] P. Baqué, F. Fleuret, and P. V. Fua, "Deep occlusion reasoning for multi-camera multi-target detection," in IEEE Int. Conf. Comput. Vis., 2017, pp. 271–279.
- [14] D. M. H. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag, and P. Swoboda, "LMGP: Lifted multicut meets geometry projections for multi-camera multi-object tracking," in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 8866–8875.
- [15] T. Teepe, P. Wolters, J. Gilg, F. Herzog, and G. Rigoll, "EarlyBird: Early-fusion for multi-view tracking in the bird's eye view," in IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024, pp. 102–111.
- [16] T. Gao, Z. Jia, W. Lin, and Y. Li, "Delving into monocular 3D vehicle tracking: a decoupled framework and a dedicated metric," Appl. Intell., vol. 53, no. 1, pp. 746–756, 2023.
- [17] K. Shim, K. Ko, J. Hwang, H. Jang, and C. Kim, "Fast online multi-target multi-camera tracking for vehicles," Appl. Intell., vol. 53, no. 23, pp. 28994–29004, 2023.
- [18] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic, "3D pictorial structures for multiple human pose estimation," in IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 1669–1676.
- [19] H. Tu, C. Wang, and W. Zeng, "VoxelPose: Towards multi-camera 3D human pose estimation in wild environment," in Eur. Conf. Comput. Vis. Springer, 2020, pp. 197–212.
- [20] H. Ye, W. Zhu, C. Wang, R. Wu, and Y. Wang, "Faster VoxelPose: Real-time 3D human pose estimation by orthographic projection," in Eur. Conf. Comput. Vis. Springer, 2022, pp. 142–159.
- [21] N. Reddy, L. Guigues, L. Pischulini, J. Eledath, and S. G. Narasimhan, "TesseTrack: End-to-end learnable multi-person articulated 3D pose tracking," in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 15190–15200.
- [22] S. Wu, S. Jin, W. Liu, L. Bai, C. Qian, D. Liu, and W. Ouyang, "Graph-based 3D multi-person pose estimation using multi-view images," in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 11148–11157.
- [23] T. Wang, J. Zhang, Y. Cai, S. Yan, and J. Feng, "Direct multi-view multi-person 3D pose estimation," Adv. Neural Inf. Process. Syst., vol. 34, pp. 13153–13164, 2021.
- [24] V. K. Srivastav, K. Chen, and N. Padoy, "SelfPose3d: Self-supervised multi-person multi-view 3D pose estimation," in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 2502–2512.
- [25] Z. Wang, X. Nie, X. Qu, Y. Chen, and S. Liu, "Distribution-aware single-stage models for multi-person 3D pose estimation," in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 13096–13105.
- [26] J. Lin and G. H. Lee, "Multi-view multi-person 3D pose estimation with plane sweep stereo," in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 11886–11895.
- [27] J. Dong, W. B. Jiang, Q.-X. Huang, H. Bao, and X. Zhou, "Fast and robust multi-person 3D pose estimation from multiple views," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 7792–7801.
- [28] H. Chen, P. Guo, P. Li, G. H. Lee, and G. S. Chirikjian, "Multi-person 3D pose estimation in crowded scenes based on multi-view geometry," in Eur. Conf. Comput. Vis. Springer, 2020, pp. 541–557.
- [29] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, pp. 91–110, 2004.
- [30] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conf. Comput. Vis. Pattern Recog., vol. 1, 2005, pp. 886–893.
- [31] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 580–587.
- [32] R. Girshick, "Fast R-CNN," in IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
- [33] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Adv. Neural Inf. Process. Syst., vol. 28, 2015.
- [34] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 779–788.
- [35] X. Chen and A. L. Yuille, "Parsing occluded people by flexible compositions," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3945–3954.
- [36] S. Kreiss, L. Bertoni, and A. Alahi, "OpenPifPaf: Composite fields for semantic keypoint detection and spatio-temporal association," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 8, pp. 13498–13511.
- [37] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
- [38] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, "Cascaded pyramid network for multi-person pose estimation," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7103–7112.
- [39] B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in Eur. Conf. Comput. Vis., 2018, pp. 466–481.
- [40] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 5693–5703.
- [41] T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Eur. Conf. Comput. Vis. Springer, 2014, pp. 740–755.
- [42] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 3686–3693.
- [43] Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, 2000.
- [44] S. J. Julier and J. K. Uhlmann, "Unscented filtering and nonlinear estimation," Proc. IEEE, vol. 92, no. 3, pp. 401–422, 2004.
- [45] R. van der Merwe and E. Wan, Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. Oregon Health & Science University, 2004.
- [46] H. W. Kuhn, "The Hungarian method for the assignment problem," Nav. Res. Logist. Q., vol. 2, no. 1-2, pp. 83–97, 1955.
- [47] R. Jonker and A. Volgenant, "A shortest augmenting path algorithm for dense and sparse linear assignment problems," Computing, vol. 38, no. 4, pp. 325–340, 1987.
- [48] Y. Zhang, P. Sun, Y. Jiang, D. Yu, Z. Yuan, P. Luo, W. Liu, and X. Wang, "ByteTrack: Multi-object tracking by associating every detection box," in Eur. Conf. Comput. Vis., 2022, pp. 1–21.
- [49] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, "FairMOT: On the fairness of detection and re-identification in multiple object tracking," Int. J. Comput. Vis., vol. 129, no. 11, pp. 3069–3087, 2021.
- [50] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
- [51] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. M. Bagautdinov, L. Lettry, P. V. Fua, L. V. Gool, and F. Fleuret, "Wildtrack: A multi-camera HD dataset for dense unscripted pedestrian detection," IEEE Conf. Comput. Vis. Pattern Recog., pp. 5030–5039, 2018.
- [52] Y. Hou, L. Zheng, and S. Gould, "Multiview detection with feature perspective transformation," in Eur. Conf. Comput. Vis. Springer, 2020, pp. 1–18.
- [53] H. Joo, H. Liu, L. Tan, L. Gui, B. C. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh, "Panoptic Studio: A massively multiview system for social motion capture," in IEEE Int. Conf. Comput. Vis., 2015.
- [54] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," EURASIP J. on Image and Video Process., vol. 2008, pp. 1–10, 2008.
- [55] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in Eur. Conf. Comput. Vis. Springer, 2016, pp. 17–35.
- [56] M. Beard, B.-T. Vo, and B.-N. Vo, "A solution for large-scale multi-object tracking," IEEE Trans. Signal Process., vol. 68, pp. 2754–2769, 2020.
- [57] T. T. D. Nguyen, H. Rezatofighi, B.-N. Vo, B.-T. Vo, S. Savarese, and I. Reid, "How trustworthy are the existing performance evaluations for basic vision tasks?" IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8538–8552, 2023.
- [58] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 658–666.
- [59] V. Belagiannis, X. Wang, B. Schiele, P. V. Fua, S. Ilic, and N. Navab, "Multiple human pose estimation with temporally consistent 3D pictorial structures," in Eur. Conf. Comput. Vis. Workshops. Springer, 2015, pp. 742–754.
- [60] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic, "3D pictorial structures revisited: Multiple human pose estimation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 1929–1942, 2015.
- [61] S. Ershadi-Nasab, E. Noury, S. Kasaei, and E. Sanaei, "Multiple human 3D pose estimation from multiview images," Multimed. Tools Appl., vol. 77, pp. 15573–15601, 2018.
- [62] B.-T. Vo and B.-N. Vo, "Labeled random finite sets and multi-object conjugate priors," IEEE Trans. Signal Process., vol. 61, no. 13, pp. 3460–3475, 2013.
- [63] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems. Norwood, MA: Artech House, 1999.
- [64] R. Qiu, M. Xu, Y. Yan, J. S. Smith, and X. Yang, "3D random occlusion and multi-layer projection for deep multi-camera pedestrian localization," in Eur. Conf. Comput. Vis., 2022.