Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

Antoni B. Chan; Hui Huang; Jixuan Chen; Kaiyi Zhang; Qi Zhang; Xinquan Yu

arxiv: 2604.19318 · v1 · submitted 2026-04-21 · 💻 cs.CV

Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

Qi Zhang , Jixuan Chen , Kaiyi Zhang , Xinquan Yu , Antoni B. Chan , Hui Huang This is my paper

Pith reviewed 2026-05-10 03:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-view crowd trackingTransformerview-ground interactionslarge-scale datasetsMVCrowdTrackCityTrackmulti-camera trackingperson re-identification

0 comments

The pith

MVTrackTrans uses Transformer view-ground interactions to track crowds more accurately in large real-world scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MVTrackTrans, a Transformer architecture for multi-view crowd tracking that explicitly models interactions between camera views and the ground plane to maintain consistent person identities across perspectives. Prior CNN-based approaches were mainly tested on small scenes with short sequences, limiting their usefulness where occlusions and scale become severe. The authors release two new datasets, MVCrowdTrack and CityTrack, that contain far larger areas and longer durations, and show their model delivers higher tracking performance than previous methods on these benchmarks. A sympathetic reader would care because reliable ground-plane tracking in expansive environments directly supports applications like urban monitoring and event safety that current small-dataset systems cannot handle.

Core claim

MVTrackTrans adopts a Transformer to perform interactions between multiple camera views and the ground plane, enabling enhanced multi-view tracking performance. The model is evaluated on two newly collected large real-world datasets, MVCrowdTrack and CityTrack, which feature much larger scene sizes and longer time periods than prior benchmarks such as Wildtrack and MultiviewX. On these datasets the proposed model achieves better performance than existing CNN-based methods, demonstrating the advantages of the Transformer and view-ground interaction design for dealing with large scenes.

What carries the argument

The view-ground interaction module inside the Transformer that projects and fuses information across camera views onto a shared ground plane for consistent identity maintenance.

If this is right

Multi-view tracking systems can now scale to scenes with greater spatial extent and longer temporal sequences without losing identity consistency.
Ground-plane projections become a standard mechanism for resolving perspective distortions across cameras in crowded environments.
Future work can adopt the same Transformer interaction pattern to incorporate additional modalities such as depth or motion cues.
Real-world deployment of multi-view tracking becomes feasible for applications requiring coverage of city-scale areas over hours rather than minutes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The new datasets could serve as a testbed for hybrid CNN-Transformer trackers or for evaluating the effect of camera calibration accuracy on ground-plane fusion.
If view-ground interactions prove robust, the same fusion idea might transfer to single-camera tracking by treating the ground plane as an implicit regularizer.
Extending the model to handle moving cameras or dynamic ground surfaces would be a direct next step for outdoor urban scenarios.

Load-bearing premise

The reported performance gains arise specifically from the view-ground interactions and Transformer design rather than from dataset-specific tuning or training differences.

What would settle it

An ablation study on MVCrowdTrack or CityTrack that removes the view-ground interaction module while keeping all other architecture and training choices identical and shows no drop in tracking metrics.

Figures

Figures reproduced from arXiv: 2604.19318 by Antoni B. Chan, Hui Huang, Jixuan Chen, Kaiyi Zhang, Qi Zhang, Xinquan Yu.

**Figure 2.** Figure 2: The overall pipeline of MVTrackTrans consists of Feature Extraction and Multi-view Fusion, Multi-view Tracking Encoding, [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: The view-ground interaction module details. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Multi-camera views and corresponding ground-plane [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The predicted trajectory visualizations. Our method can accurately track more people for a long time (see red boxes). [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Multi-view crowd tracking estimates each person's tracking trajectories on the ground of the scene. Recent research works mainly rely on CNNs-based multi-view crowd tracking architectures, and most of them are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and only contain tens of frames in the evaluation stage, it is difficult for the current methods to be applied to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, \textit{MVTrackTrans}, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which contain a much larger scene size over a longer time period. Compared with existing methods on the two large and new datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: https://github.com/zqyq/MVTrackTrans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MVTrackTrans scales multi-view tracking to larger scenes with new datasets, but needs stronger proof that the architecture drives the gains.

read the letter

The main thing here is that MVTrackTrans brings a Transformer with view-ground interactions to multi-view crowd tracking in big scenes, backed by two new large datasets that actually match real-world sizes. Previous work stayed on small controlled sets with short sequences, so this tries to fix that by scaling up both the data and the model. The datasets, MVCrowdTrack and CityTrack, are the standout part because they have larger scenes and longer durations. Releasing them with code is helpful for the community. The model design focuses on interactions between views and the ground, which could address occlusions better in complex environments. That direction makes sense for practical applications like surveillance. Still, the performance story has gaps. It claims better results than existing methods on these datasets, but the abstract gives no metrics, no ablation on what the interactions add, and no info on whether baselines got the same training treatment. If the improvements trace back to tuning or implementation rather than the new components, the attribution doesn't hold. That matches the stress-test worry, and without stats or multiple runs, it's tough to trust the deltas. This is for computer vision researchers focused on tracking and multi-view problems. The datasets will likely see use regardless. It deserves peer review because the scale-up is important and the work is grounded enough to warrant checking the details. Recommendation: Send it for review, asking referees to verify the experimental fairness.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MVTrackTrans, a Transformer-based multi-view crowd tracking model that incorporates explicit interactions between camera views and the ground plane. To address the limitations of prior small-scale datasets (e.g., Wildtrack, MultiviewX), the authors introduce two new large real-world datasets, MVCrowdTrack and CityTrack, featuring larger scene sizes and longer sequences. The central claim is that MVTrackTrans outperforms existing CNN-based methods on these datasets due to the view-ground interaction modules and Transformer design, with datasets and code released publicly.

Significance. If the reported gains are shown to derive specifically from the architectural contributions rather than training or implementation differences, the work would meaningfully extend multi-view tracking to more realistic large-scale scenarios. The public release of the two new datasets and the code repository constitutes a concrete, reusable contribution that can support future benchmarking and development in the field.

major comments (2)

[§4.2] §4.2 (Quantitative Comparison): The tables reporting performance on MVCrowdTrack and CityTrack do not document whether baseline re-implementations used identical training schedules, data augmentations, optimizer settings, and evaluation protocols as MVTrackTrans. Because the central claim attributes improvements to the view-ground interactions and Transformer backbone, this omission leaves open the possibility that observed deltas arise from uncontrolled experimental factors rather than the proposed design.
[§3] §3 (Model Architecture) and §4.3 (Ablation Studies): No ablation isolates the contribution of the view-ground interaction modules from the Transformer backbone or from standard multi-view fusion. Without such controlled experiments, it is not possible to confirm that the claimed advantages for large scenes stem from the specific design choices highlighted in the abstract.

minor comments (2)

[Abstract] The abstract states that MVTrackTrans 'achieves better performance' without any numerical values, specific metrics, or named baselines; adding at least the key quantitative deltas would improve the standalone readability of the abstract.
[Figures] Figure captions and axis labels in the qualitative results (e.g., Figure 5) use inconsistent notation for camera indices and ground-plane coordinates; harmonizing these with the notation in §3 would reduce reader effort.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and experimental rigor.

read point-by-point responses

Referee: [§4.2] §4.2 (Quantitative Comparison): The tables reporting performance on MVCrowdTrack and CityTrack do not document whether baseline re-implementations used identical training schedules, data augmentations, optimizer settings, and evaluation protocols as MVTrackTrans. Because the central claim attributes improvements to the view-ground interactions and Transformer backbone, this omission leaves open the possibility that observed deltas arise from uncontrolled experimental factors rather than the proposed design.

Authors: We agree that explicit documentation of training and evaluation protocols is essential for reproducible comparisons. In the revised manuscript we will add a dedicated subsection (or table) detailing the exact training schedules, data augmentations, optimizer settings, and evaluation protocols applied to both MVTrackTrans and all re-implemented baselines. Where we followed the original authors’ recommended settings we will state this explicitly; any deviations will be justified and reported. revision: yes
Referee: [§3] §3 (Model Architecture) and §4.3 (Ablation Studies): No ablation isolates the contribution of the view-ground interaction modules from the Transformer backbone or from standard multi-view fusion. Without such controlled experiments, it is not possible to confirm that the claimed advantages for large scenes stem from the specific design choices highlighted in the abstract.

Authors: We acknowledge that the current ablation studies do not fully isolate the view-ground interaction modules from the Transformer backbone and from generic multi-view fusion. To address this directly we will add new controlled experiments in the revised §4.3 that compare (i) the full MVTrackTrans, (ii) the Transformer backbone without view-ground interaction modules, and (iii) a standard multi-view fusion baseline using the same Transformer encoder. These results will be reported on both MVCrowdTrack and CityTrack to quantify the specific contribution of the view-ground interactions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model proposal on new datasets with no derivation chain or fitted predictions.

full rationale

The paper introduces MVTrackTrans (Transformer with view-ground interaction modules) and two new large-scale datasets (MVCrowdTrack, CityTrack), then reports superior tracking performance versus prior methods. No equations, parameter-fitting steps, or first-principles derivations appear in the abstract or described content. Claims rest on standard empirical comparison rather than any self-referential reduction, self-citation load-bearing premise, or renaming of known results. Results are independently falsifiable via the released code and data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no model equations, training details, or assumptions are provided, so the ledger cannot be populated with specific free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5538 in / 981 out tokens · 36208 ms · 2026-05-10T03:09:57.404640+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

[1]

Memot: Multi-object tracking with memory

Jiarui Cai, Mingze Xu, Wei Li, Yuanjun Xiong, Wei Xia, Zhuowen Tu, and Stefano Soatto. Memot: Multi-object tracking with memory. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8090–8100, 2022. 3

work page 2022
[2]

Refergpt: Towards zero-shot referring multi-object tracking

Tzoulio Chamiti, Leandro Di Bella, Adrian Munteanu, and Nikos Deligiannis. Refergpt: Towards zero-shot referring multi-object tracking. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 3849–3858,

work page
[3]

Wild- track: A multi-camera hd dataset for dense unscripted pedes- trian detection

Tatjana Chavdarova, Pierre Baqu ´e, St ´ephane Bouquet, An- drii Maksai, Cijo Jose, Timur Bagautdinov, Louis Lettry, Pascal Fua, Luc Van Gool, and Franc ¸ois Fleuret. Wild- track: A multi-camera hd dataset for dense unscripted pedes- trian detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5030– 5039, 2018. 1, 3, 8

work page 2018
[4]

Cross-view referring multi-object tracking

Sijia Chen, En Yu, and Wenbing Tao. Cross-view referring multi-object tracking. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 2204–2211, 2025. 3

work page 2025
[5]

Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking

Cheng-Che Cheng, Min-Xuan Qiu, Chen-Kuo Chiang, and Shang-Hong Lai. Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10051–10060, 2023. 6, 8

work page 2023
[6]

Cal- heiros, and Teng Joon Lim

Kosta Dakic, Kanchana Thilakarathna, Rodrigo N. Cal- heiros, and Teng Joon Lim. Resource-efficient multiview perception: Integrating semantic masking with masked au- toencoders. In2025 IEEE International Conference on Per- vasive Computing and Communications (PerCom), pages 145–151, 2025. 6, 8

work page 2025
[7]

Usman Maqbool Bhutta, Byron Hernandez, and Henry Medeiros

Amir Etefaghi Daryani, M. Usman Maqbool Bhutta, Byron Hernandez, and Henry Medeiros. Camuvid: Calibration- free multi-view detection. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1220–1229, 2025. 3

work page 2025
[8]

Multi-view tracking using weakly supervised human motion prediction

Martin Engilberge, Weizhe Liu, and Pascal Fua. Multi-view tracking using weakly supervised human motion prediction. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1582–1592, 2023. 3, 6, 7, 8

work page 2023
[9]

Multiple object track- ing as id prediction

Ruopeng Gao, Ji Qi, and Limin Wang. Multiple object track- ing as id prediction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27883–27893,

work page
[10]

Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki

Adam W. Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-bev: What really matters for multi-sensor bev perception? InProceedings of the IEEE In- ternational Conference on Robotics and Automation (ICRA),

work page
[11]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 4, 6

work page 2016
[12]

Multiview detection with shadow transformer (and view-coherent data augmentation)

Yunzhong Hou and Liang Zheng. Multiview detection with shadow transformer (and view-coherent data augmentation). InProceedings of the 29th ACM International Conference on Multimedia, pages 1673–1682, 2021. 3

work page 2021
[13]

Multi- view detection with feature perspective transformation

Yunzhong Hou, Liang Zheng, and Stephen Gould. Multi- view detection with feature perspective transformation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 1–18. Springer, 2020. 1, 3

work page 2020
[14]

Stdformer: Spatial-temporal mo- tion transformer for multiple object tracking.IEEE Transac- tions on Circuits and Systems for Video Technology, 33(11): 6571–6594, 2023

Mengjie Hu, Xiaotong Zhu, Haotian Wang, Shixiang Cao, Chun Liu, and Qing Song. Stdformer: Spatial-temporal mo- tion transformer for multiple object tracking.IEEE Transac- tions on Circuits and Systems for Video Technology, 33(11): 6571–6594, 2023. 3

work page 2023
[15]

A new approach to linear filter- ing and prediction problems.Journal of Basic Engineering,

Rudolph Emil Kalman. A new approach to linear filter- ing and prediction problems.Journal of Basic Engineering,

work page
[16]

Multi-task learning using uncertainty to weigh losses for scene geom- etry and semantics

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geom- etry and semantics. InProceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 7482–7491, 2018. 5

work page 2018
[17]

Lamot: Language-guided multi-object tracking

Yunhao Li, Xiaoqiong Liu, Luke Liu, Heng Fan, and Libo Zhang. Lamot: Language-guided multi-object tracking. In 2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 6816–6822. IEEE, 2025. 3

work page 2025
[18]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 1

work page 2024
[19]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 5

work page 2017
[20]

Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth.IEEE Transac- tions on Circuits and Systems for Video Technology, 2025

Zelin Liu, Xinggang Wang, Cheng Wang, Wenyu Liu, and Xiang Bai. Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth.IEEE Transac- tions on Circuits and Systems for Video Technology, 2025. 3

work page 2025
[21]

Omnidirectional multi-object tracking

Kai Luo, Hao Shi, Sheng Wu, Fei Teng, Mengfei Duan, Chang Huang, Yuhang Wang, Kaiwei Wang, and Kailun Yang. Omnidirectional multi-object tracking. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21959–21969, 2025. 3

work page 2025
[22]

Trackformer: Multi-object track- ing with transformers

Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object track- ing with transformers. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8844–8854, 2022. 8

work page 2022
[23]

Countformer: Multi-view crowd counting transformer

Hong Mo, Xiong Zhang, Jianchao Tan, Cheng Yang, Qiong Gu, Bo Hang, and Wenqi Ren. Countformer: Multi-view crowd counting transformer. InComputer Vision – ECCV 2024, pages 20–40, Cham, 2025. Springer Nature Switzer- land. 1

work page 2024
[24]

Mctr: Multi camera tracking transformer

Alexandru Niculescu-Mizil, Deep Patel, and Iain Melvin. Mctr: Multi camera tracking transformer. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 816–826, 2025. 3

work page 2025
[25]

A bayesian filter for multi-view 3d multi- object tracking with occlusion handling.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2246– 2263, 2020

Jonah Ong, Ba-Tuong V o, Ba-Ngu V o, Du Yong Kim, and Sven Nordholm. A bayesian filter for multi-view 3d multi- object tracking with occlusion handling.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2246– 2263, 2020. 8

work page 2020
[26]

3d random occlusion and multi-layer projection for deep multi-camera pedestrian localization

Rui Qiu, Ming Xu, Yuyao Yan, Jeremy S Smith, and Xi Yang. 3d random occlusion and multi-layer projection for deep multi-camera pedestrian localization. InComputer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Is- rael, October 23–27, 2022, Proceedings, Part X, pages 695–

work page 2022
[27]

Focusing on tracks for online multi-object tracking

Kyujin Shim, Kangwook Ko, Yujin Yang, and Changick Kim. Focusing on tracks for online multi-object tracking. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 11687–11696, 2025. 3

work page 2025
[28]

Stacked homography transformations for multi-view pedestrian detection

Liangchen Song, Jialian Wu, Ming Yang, Qian Zhang, Yuan Li, and Junsong Yuan. Stacked homography transformations for multi-view pedestrian detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6049–6057, 2021. 3

work page 2021
[29]

Lifting multi-view detection and tracking to the bird’s eye view

Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Her- zog, and Gerhard Rigoll. Lifting multi-view detection and tracking to the bird’s eye view. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 667–676, 2024. 1, 2, 3, 6, 7, 8

work page 2024
[30]

Earlybird: Early-fusion for multi- view tracking in the bird’s eye view

Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Her- zog, and Gerhard Rigoll. Earlybird: Early-fusion for multi- view tracking in the bird’s eye view. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 102–111, 2024. 1, 2, 3, 6, 7, 8

work page 2024
[31]

Depthtrack: Cluster meets bev for multi-camera multi-target 3d tracking

Tai Huu-Phuong Tran, Duong Nguyen-Ngoc Tran, Ngoc Doan-Minh Huynh, Chi Dai Tran, Long Hoang Pham, Quoc Pham-Nam Ho, Huy-Hung Nguyen, Duong Khac Vu, Hyung-Min Jeon, Hyung-Joon Jeon, Son Hong Phan, Trinh Le Ba Khanh, and Jae Wook Jeon. Depthtrack: Cluster meets bev for multi-camera multi-target 3d tracking. InProceed- ings of the IEEE/CVF International Conf...

work page 2025
[32]

Brea, and Manuel Mucientes

Lorenzo Vaquero, Yihong Xu, Xavier Alameda-Pineda, V´ıctor M. Brea, and Manuel Mucientes. Lost and found: Overcoming detector failures in online multi-object track- ing. InEuropean Conf. Comput. Vis. (ECCV), pages 448–

work page
[33]

Bringing generalization to deep multi-view pedestrian detection

Jeet V ora, Swetanjal Dutta, Kanishk Jain, Shyamgopal Karthik, and Vineet Gandhi. Bringing generalization to deep multi-view pedestrian detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 110–119, 2023. 3

work page 2023
[34]

La-motr: End-to-end multi-object tracking by learnable association

Peng Wang, Yongcai Wang, Hualong Cao, Wang Chen, and Deying Li. La-motr: End-to-end multi-object tracking by learnable association. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 12438–12448, 2025. 3

work page 2025
[35]

Mcblt: Multi- camera multi-object 3d tracking in long videos

Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng- Yen Yang, Sameer Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, and Laura Leal-Taixe. Mcblt: Multi- camera multi-object 3d tracking in long videos. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) Workshops, pages 5245–5254, 2025. 3, 6, 8

work page 2025
[36]

Transcenter: Transformers with dense representations for multiple-object tracking.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(6):7820–7835, 2023

Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, and Xavier Alameda-Pineda. Transcenter: Transformers with dense representations for multiple-object tracking.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(6):7820–7835, 2023. 3, 8

work page 2023
[37]

Mvtrajecter: Multi-view pedestrian tracking with trajectory motion cost and trajectory appearance cost

Taiga Yamane, Ryo Masumura, Satoshi Suzuki, and Shota Orihashi. Mvtrajecter: Multi-view pedestrian tracking with trajectory motion cost and trajectory appearance cost. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13270–13280, 2025. 3, 6, 8

work page 2025
[38]

An end-to-end tracking framework via multi- view and temporal feature aggregation.Computer Vision and Image Understanding, 249:104203, 2024

Yihan Yang, Ming Xu, Jason F Ralph, Yuchen Ling, and Xi- aonan Pan. An end-to-end tracking framework via multi- view and temporal feature aggregation.Computer Vision and Image Understanding, 249:104203, 2024. 6, 8

work page 2024
[39]

arXiv preprint arXiv:2003.11753 (2020)

Quanzeng You and Hao Jiang. Real-time 3d deep multi- camera tracking.arXiv preprint arXiv:2003.11753, 2020. 8

work page arXiv 2003
[40]

Tgformer: Transformer with track query group for multi-object track- ing.Proceedings of the AAAI Conference on Artificial Intel- ligence, 39(9):9824–9832, 2025

Rui Zeng, Yuanzhou Huang, and Songwei Pei. Tgformer: Transformer with track query group for multi-object track- ing.Proceedings of the AAAI Conference on Artificial Intel- ligence, 39(9):9824–9832, 2025. 3

work page 2025
[41]

Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns

Qi Zhang and Antoni B Chan. Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8297–8306, 2019. 1, 6

work page 2019
[42]

Mahalanobis distance-based multi-view optimal transport for multi-view crowd localization

Qi Zhang, Kaiyi Zhang, Antoni B Chan, and Hui Huang. Mahalanobis distance-based multi-view optimal transport for multi-view crowd localization. InEuropean Conference on Computer Vision, pages 19–36. Springer, 2024. 3

work page 2024
[43]

Juanjuan Zhao, Liutao Zhang, Jiexia Ye, and Chengzhong Xu. Mdlf: A multi-view-based deep learning framework for individual trip destination prediction in public transportation systems.IEEE Transactions on Intelligent Transportation Systems, 23(8):13316–13329, 2021. 1

work page 2021
[44]

Tracking objects as pixel-wise distributions

Zelin Zhao, Ze Wu, Yueqing Zhuang, Boxun Li, and Ji- aya Jia. Tracking objects as pixel-wise distributions. In European Conference on Computer Vision, pages 76–94. Springer, 2022. 3

work page 2022
[45]

Hff-tracker: A hierarchical fine-grained fusion tracker for referring multi- object tracking

Zeyong Zhao, Yanchao Hao, Minghao Zhang, Qingbin Liu, Bo Li, Dianbo Sui, Shizhu He, and Xi Chen. Hff-tracker: A hierarchical fine-grained fusion tracker for referring multi- object tracking. InProceedings of the AAAI Conference on Artificial Intelligence, pages 10528–10536, 2025. 3

work page 2025
[46]

Global tracking transformers

Xingyi Zhou, Tianwei Yin, Vladlen Koltun, and Philipp Kr¨ahenb¨uhl. Global tracking transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8771–8780, 2022. 3

work page 2022
[47]

Deformable detr: Deformable transformers for end-to-end object detection

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. InProceedings of the In- ternational Conference on Learning Representations (ICLR),

work page

[1] [1]

Memot: Multi-object tracking with memory

Jiarui Cai, Mingze Xu, Wei Li, Yuanjun Xiong, Wei Xia, Zhuowen Tu, and Stefano Soatto. Memot: Multi-object tracking with memory. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8090–8100, 2022. 3

work page 2022

[2] [2]

Refergpt: Towards zero-shot referring multi-object tracking

Tzoulio Chamiti, Leandro Di Bella, Adrian Munteanu, and Nikos Deligiannis. Refergpt: Towards zero-shot referring multi-object tracking. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 3849–3858,

work page

[3] [3]

Wild- track: A multi-camera hd dataset for dense unscripted pedes- trian detection

Tatjana Chavdarova, Pierre Baqu ´e, St ´ephane Bouquet, An- drii Maksai, Cijo Jose, Timur Bagautdinov, Louis Lettry, Pascal Fua, Luc Van Gool, and Franc ¸ois Fleuret. Wild- track: A multi-camera hd dataset for dense unscripted pedes- trian detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5030– 5039, 2018. 1, 3, 8

work page 2018

[4] [4]

Cross-view referring multi-object tracking

Sijia Chen, En Yu, and Wenbing Tao. Cross-view referring multi-object tracking. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 2204–2211, 2025. 3

work page 2025

[5] [5]

Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking

Cheng-Che Cheng, Min-Xuan Qiu, Chen-Kuo Chiang, and Shang-Hong Lai. Rest: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10051–10060, 2023. 6, 8

work page 2023

[6] [6]

Cal- heiros, and Teng Joon Lim

Kosta Dakic, Kanchana Thilakarathna, Rodrigo N. Cal- heiros, and Teng Joon Lim. Resource-efficient multiview perception: Integrating semantic masking with masked au- toencoders. In2025 IEEE International Conference on Per- vasive Computing and Communications (PerCom), pages 145–151, 2025. 6, 8

work page 2025

[7] [7]

Usman Maqbool Bhutta, Byron Hernandez, and Henry Medeiros

Amir Etefaghi Daryani, M. Usman Maqbool Bhutta, Byron Hernandez, and Henry Medeiros. Camuvid: Calibration- free multi-view detection. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1220–1229, 2025. 3

work page 2025

[8] [8]

Multi-view tracking using weakly supervised human motion prediction

Martin Engilberge, Weizhe Liu, and Pascal Fua. Multi-view tracking using weakly supervised human motion prediction. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1582–1592, 2023. 3, 6, 7, 8

work page 2023

[9] [9]

Multiple object track- ing as id prediction

Ruopeng Gao, Ji Qi, and Limin Wang. Multiple object track- ing as id prediction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27883–27893,

work page

[10] [10]

Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki

Adam W. Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-bev: What really matters for multi-sensor bev perception? InProceedings of the IEEE In- ternational Conference on Robotics and Automation (ICRA),

work page

[11] [11]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 4, 6

work page 2016

[12] [12]

Multiview detection with shadow transformer (and view-coherent data augmentation)

Yunzhong Hou and Liang Zheng. Multiview detection with shadow transformer (and view-coherent data augmentation). InProceedings of the 29th ACM International Conference on Multimedia, pages 1673–1682, 2021. 3

work page 2021

[13] [13]

Multi- view detection with feature perspective transformation

Yunzhong Hou, Liang Zheng, and Stephen Gould. Multi- view detection with feature perspective transformation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 1–18. Springer, 2020. 1, 3

work page 2020

[14] [14]

Stdformer: Spatial-temporal mo- tion transformer for multiple object tracking.IEEE Transac- tions on Circuits and Systems for Video Technology, 33(11): 6571–6594, 2023

Mengjie Hu, Xiaotong Zhu, Haotian Wang, Shixiang Cao, Chun Liu, and Qing Song. Stdformer: Spatial-temporal mo- tion transformer for multiple object tracking.IEEE Transac- tions on Circuits and Systems for Video Technology, 33(11): 6571–6594, 2023. 3

work page 2023

[15] [15]

A new approach to linear filter- ing and prediction problems.Journal of Basic Engineering,

Rudolph Emil Kalman. A new approach to linear filter- ing and prediction problems.Journal of Basic Engineering,

work page

[16] [16]

Multi-task learning using uncertainty to weigh losses for scene geom- etry and semantics

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geom- etry and semantics. InProceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 7482–7491, 2018. 5

work page 2018

[17] [17]

Lamot: Language-guided multi-object tracking

Yunhao Li, Xiaoqiong Liu, Luke Liu, Heng Fan, and Libo Zhang. Lamot: Language-guided multi-object tracking. In 2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 6816–6822. IEEE, 2025. 3

work page 2025

[18] [18]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 1

work page 2024

[19] [19]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 5

work page 2017

[20] [20]

Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth.IEEE Transac- tions on Circuits and Systems for Video Technology, 2025

Zelin Liu, Xinggang Wang, Cheng Wang, Wenyu Liu, and Xiang Bai. Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth.IEEE Transac- tions on Circuits and Systems for Video Technology, 2025. 3

work page 2025

[21] [21]

Omnidirectional multi-object tracking

Kai Luo, Hao Shi, Sheng Wu, Fei Teng, Mengfei Duan, Chang Huang, Yuhang Wang, Kaiwei Wang, and Kailun Yang. Omnidirectional multi-object tracking. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21959–21969, 2025. 3

work page 2025

[22] [22]

Trackformer: Multi-object track- ing with transformers

Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object track- ing with transformers. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8844–8854, 2022. 8

work page 2022

[23] [23]

Countformer: Multi-view crowd counting transformer

Hong Mo, Xiong Zhang, Jianchao Tan, Cheng Yang, Qiong Gu, Bo Hang, and Wenqi Ren. Countformer: Multi-view crowd counting transformer. InComputer Vision – ECCV 2024, pages 20–40, Cham, 2025. Springer Nature Switzer- land. 1

work page 2024

[24] [24]

Mctr: Multi camera tracking transformer

Alexandru Niculescu-Mizil, Deep Patel, and Iain Melvin. Mctr: Multi camera tracking transformer. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 816–826, 2025. 3

work page 2025

[25] [25]

A bayesian filter for multi-view 3d multi- object tracking with occlusion handling.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2246– 2263, 2020

Jonah Ong, Ba-Tuong V o, Ba-Ngu V o, Du Yong Kim, and Sven Nordholm. A bayesian filter for multi-view 3d multi- object tracking with occlusion handling.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2246– 2263, 2020. 8

work page 2020

[26] [26]

3d random occlusion and multi-layer projection for deep multi-camera pedestrian localization

Rui Qiu, Ming Xu, Yuyao Yan, Jeremy S Smith, and Xi Yang. 3d random occlusion and multi-layer projection for deep multi-camera pedestrian localization. InComputer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Is- rael, October 23–27, 2022, Proceedings, Part X, pages 695–

work page 2022

[27] [27]

Focusing on tracks for online multi-object tracking

Kyujin Shim, Kangwook Ko, Yujin Yang, and Changick Kim. Focusing on tracks for online multi-object tracking. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 11687–11696, 2025. 3

work page 2025

[28] [28]

Stacked homography transformations for multi-view pedestrian detection

Liangchen Song, Jialian Wu, Ming Yang, Qian Zhang, Yuan Li, and Junsong Yuan. Stacked homography transformations for multi-view pedestrian detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6049–6057, 2021. 3

work page 2021

[29] [29]

Lifting multi-view detection and tracking to the bird’s eye view

Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Her- zog, and Gerhard Rigoll. Lifting multi-view detection and tracking to the bird’s eye view. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 667–676, 2024. 1, 2, 3, 6, 7, 8

work page 2024

[30] [30]

Earlybird: Early-fusion for multi- view tracking in the bird’s eye view

Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Her- zog, and Gerhard Rigoll. Earlybird: Early-fusion for multi- view tracking in the bird’s eye view. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 102–111, 2024. 1, 2, 3, 6, 7, 8

work page 2024

[31] [31]

Depthtrack: Cluster meets bev for multi-camera multi-target 3d tracking

Tai Huu-Phuong Tran, Duong Nguyen-Ngoc Tran, Ngoc Doan-Minh Huynh, Chi Dai Tran, Long Hoang Pham, Quoc Pham-Nam Ho, Huy-Hung Nguyen, Duong Khac Vu, Hyung-Min Jeon, Hyung-Joon Jeon, Son Hong Phan, Trinh Le Ba Khanh, and Jae Wook Jeon. Depthtrack: Cluster meets bev for multi-camera multi-target 3d tracking. InProceed- ings of the IEEE/CVF International Conf...

work page 2025

[32] [32]

Brea, and Manuel Mucientes

Lorenzo Vaquero, Yihong Xu, Xavier Alameda-Pineda, V´ıctor M. Brea, and Manuel Mucientes. Lost and found: Overcoming detector failures in online multi-object track- ing. InEuropean Conf. Comput. Vis. (ECCV), pages 448–

work page

[33] [33]

Bringing generalization to deep multi-view pedestrian detection

Jeet V ora, Swetanjal Dutta, Kanishk Jain, Shyamgopal Karthik, and Vineet Gandhi. Bringing generalization to deep multi-view pedestrian detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 110–119, 2023. 3

work page 2023

[34] [34]

La-motr: End-to-end multi-object tracking by learnable association

Peng Wang, Yongcai Wang, Hualong Cao, Wang Chen, and Deying Li. La-motr: End-to-end multi-object tracking by learnable association. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 12438–12448, 2025. 3

work page 2025

[35] [35]

Mcblt: Multi- camera multi-object 3d tracking in long videos

Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng- Yen Yang, Sameer Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, and Laura Leal-Taixe. Mcblt: Multi- camera multi-object 3d tracking in long videos. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) Workshops, pages 5245–5254, 2025. 3, 6, 8

work page 2025

[36] [36]

Transcenter: Transformers with dense representations for multiple-object tracking.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(6):7820–7835, 2023

Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, and Xavier Alameda-Pineda. Transcenter: Transformers with dense representations for multiple-object tracking.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(6):7820–7835, 2023. 3, 8

work page 2023

[37] [37]

Mvtrajecter: Multi-view pedestrian tracking with trajectory motion cost and trajectory appearance cost

Taiga Yamane, Ryo Masumura, Satoshi Suzuki, and Shota Orihashi. Mvtrajecter: Multi-view pedestrian tracking with trajectory motion cost and trajectory appearance cost. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13270–13280, 2025. 3, 6, 8

work page 2025

[38] [38]

An end-to-end tracking framework via multi- view and temporal feature aggregation.Computer Vision and Image Understanding, 249:104203, 2024

Yihan Yang, Ming Xu, Jason F Ralph, Yuchen Ling, and Xi- aonan Pan. An end-to-end tracking framework via multi- view and temporal feature aggregation.Computer Vision and Image Understanding, 249:104203, 2024. 6, 8

work page 2024

[39] [39]

arXiv preprint arXiv:2003.11753 (2020)

Quanzeng You and Hao Jiang. Real-time 3d deep multi- camera tracking.arXiv preprint arXiv:2003.11753, 2020. 8

work page arXiv 2003

[40] [40]

Tgformer: Transformer with track query group for multi-object track- ing.Proceedings of the AAAI Conference on Artificial Intel- ligence, 39(9):9824–9832, 2025

Rui Zeng, Yuanzhou Huang, and Songwei Pei. Tgformer: Transformer with track query group for multi-object track- ing.Proceedings of the AAAI Conference on Artificial Intel- ligence, 39(9):9824–9832, 2025. 3

work page 2025

[41] [41]

Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns

Qi Zhang and Antoni B Chan. Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8297–8306, 2019. 1, 6

work page 2019

[42] [42]

Mahalanobis distance-based multi-view optimal transport for multi-view crowd localization

Qi Zhang, Kaiyi Zhang, Antoni B Chan, and Hui Huang. Mahalanobis distance-based multi-view optimal transport for multi-view crowd localization. InEuropean Conference on Computer Vision, pages 19–36. Springer, 2024. 3

work page 2024

[43] [43]

Juanjuan Zhao, Liutao Zhang, Jiexia Ye, and Chengzhong Xu. Mdlf: A multi-view-based deep learning framework for individual trip destination prediction in public transportation systems.IEEE Transactions on Intelligent Transportation Systems, 23(8):13316–13329, 2021. 1

work page 2021

[44] [44]

Tracking objects as pixel-wise distributions

Zelin Zhao, Ze Wu, Yueqing Zhuang, Boxun Li, and Ji- aya Jia. Tracking objects as pixel-wise distributions. In European Conference on Computer Vision, pages 76–94. Springer, 2022. 3

work page 2022

[45] [45]

Hff-tracker: A hierarchical fine-grained fusion tracker for referring multi- object tracking

Zeyong Zhao, Yanchao Hao, Minghao Zhang, Qingbin Liu, Bo Li, Dianbo Sui, Shizhu He, and Xi Chen. Hff-tracker: A hierarchical fine-grained fusion tracker for referring multi- object tracking. InProceedings of the AAAI Conference on Artificial Intelligence, pages 10528–10536, 2025. 3

work page 2025

[46] [46]

Global tracking transformers

Xingyi Zhou, Tianwei Yin, Vladlen Koltun, and Philipp Kr¨ahenb¨uhl. Global tracking transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8771–8780, 2022. 3

work page 2022

[47] [47]

Deformable detr: Deformable transformers for end-to-end object detection

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. InProceedings of the In- ternational Conference on Learning Representations (ICLR),

work page