Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
Pith reviewed 2026-05-10 03:09 UTC · model grok-4.3
The pith
MVTrackTrans uses Transformer view-ground interactions to track crowds more accurately in large real-world scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MVTrackTrans adopts a Transformer to perform interactions between multiple camera views and the ground plane, enabling enhanced multi-view tracking performance. The model is evaluated on two newly collected large real-world datasets, MVCrowdTrack and CityTrack, which feature much larger scene sizes and longer time periods than prior benchmarks such as Wildtrack and MultiviewX. On these datasets the proposed model achieves better performance than existing CNN-based methods, demonstrating the advantages of the Transformer and view-ground interaction design for dealing with large scenes.
What carries the argument
A view-ground interaction module inside the Transformer projects information from each camera view onto a shared ground plane and fuses it there, so that identities stay consistent across views.
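To make the idea of a view-ground interaction concrete, here is a minimal sketch of cross-attention in which ground-plane cells act as queries and camera-view features act as keys and values. It is an illustration only, not the MVTrackTrans implementation: the function name `ground_attends_to_views`, the single attention head, and the absence of learned projection matrices are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ground_attends_to_views(ground_queries, view_tokens):
    """Hypothetical single-head cross-attention (no learned weights).

    ground_queries: list of d-dimensional vectors, one per ground-plane cell.
    view_tokens:    list of d-dimensional vectors gathered from all camera views.
    Returns one fused d-dimensional vector per ground-plane cell.
    """
    d = len(view_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in ground_queries:
        # Similarity of this ground cell to every view token.
        weights = softmax([dot(q, k) * scale for k in view_tokens])
        # Weighted average of the view tokens (keys double as values here).
        fused.append(
            [sum(w * v[i] for w, v in zip(weights, view_tokens)) for i in range(d)]
        )
    return fused
```

Each ground cell ends up with a convex combination of view features, weighted toward the tokens it matches best. A real model would add learned query, key, and value projections and, presumably, attention in the reverse direction as well.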
If this is right
- Multi-view tracking systems can now scale to scenes with greater spatial extent and longer temporal sequences without losing identity consistency.
- Ground-plane projections become a standard mechanism for resolving perspective distortions across cameras in crowded environments.
- Future work can adopt the same Transformer interaction pattern to incorporate additional modalities such as depth or motion cues.
- Real-world deployment of multi-view tracking becomes feasible for applications requiring coverage of city-scale areas over hours rather than minutes.
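The ground-plane projection mentioned in the list above is commonly done with a planar homography when cameras are calibrated. The sketch below assumes a 3x3 matrix `H` per camera that maps image pixels to ground coordinates; the helper names `project_to_ground` and `fuse_ground_points`, and the simple averaging used for fusion, are hypothetical and are not taken from the paper.

```python
def project_to_ground(H, pixel):
    """Map an image pixel (u, v) to ground coordinates using homography H.

    H is a 3x3 nested list, assumed to come from camera calibration.
    """
    u, v = pixel
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity on the ground plane")
    # Divide by the homogeneous coordinate to get Euclidean ground coordinates.
    return (x / w, y / w)


def fuse_ground_points(points):
    """Average ground-plane estimates of the same person from several cameras."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
```

For example, the foot point of one person seen in two cameras can be projected with each camera's `H` and the two ground estimates averaged. Averaging is only a placeholder for whatever learned fusion a model actually uses.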
Where Pith is reading between the lines
- The new datasets could serve as a testbed for hybrid CNN-Transformer trackers or for evaluating the effect of camera calibration accuracy on ground-plane fusion.
- If view-ground interactions prove robust, the same fusion idea might transfer to single-camera tracking by treating the ground plane as an implicit regularizer.
- Extending the model to handle moving cameras or dynamic ground surfaces would be a direct next step for outdoor urban scenarios.
Load-bearing premise
The reported performance gains arise specifically from the view-ground interactions and Transformer design rather than from dataset-specific tuning or training differences.
What would settle it
An ablation study on MVCrowdTrack or CityTrack that removes the view-ground interaction module while keeping all other architecture and training choices identical and shows no drop in tracking metrics.
Original abstract
Multi-view crowd tracking estimates each person's tracking trajectories on the ground of the scene. Recent research works mainly rely on CNNs-based multi-view crowd tracking architectures, and most of them are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and only contain tens of frames in the evaluation stage, it is difficult for the current methods to be applied to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which contain a much larger scene size over a longer time period. Compared with existing methods on the two large and new datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: https://github.com/zqyq/MVTrackTrans.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MVTrackTrans, a Transformer-based multi-view crowd tracking model that incorporates explicit interactions between camera views and the ground plane. To address the limitations of prior small-scale datasets (e.g., Wildtrack, MultiviewX), the authors introduce two new large real-world datasets, MVCrowdTrack and CityTrack, featuring larger scene sizes and longer sequences. The central claim is that MVTrackTrans outperforms existing CNN-based methods on these datasets due to the view-ground interaction modules and Transformer design, with datasets and code released publicly.
Significance. If the reported gains are shown to derive specifically from the architectural contributions rather than training or implementation differences, the work would meaningfully extend multi-view tracking to more realistic large-scale scenarios. The public release of the two new datasets and the code repository constitutes a concrete, reusable contribution that can support future benchmarking and development in the field.
major comments (2)
- [§4.2] §4.2 (Quantitative Comparison): The tables reporting performance on MVCrowdTrack and CityTrack do not document whether baseline re-implementations used identical training schedules, data augmentations, optimizer settings, and evaluation protocols as MVTrackTrans. Because the central claim attributes improvements to the view-ground interactions and Transformer backbone, this omission leaves open the possibility that observed deltas arise from uncontrolled experimental factors rather than the proposed design.
- [§3] §3 (Model Architecture) and §4.3 (Ablation Studies): No ablation isolates the contribution of the view-ground interaction modules from the Transformer backbone or from standard multi-view fusion. Without such controlled experiments, it is not possible to confirm that the claimed advantages for large scenes stem from the specific design choices highlighted in the abstract.
minor comments (2)
- [Abstract] The abstract states that MVTrackTrans 'achieves better performance' without any numerical values, specific metrics, or named baselines; adding at least the key quantitative deltas would improve the standalone readability of the abstract.
- [Figures] Figure captions and axis labels in the qualitative results (e.g., Figure 5) use inconsistent notation for camera indices and ground-plane coordinates; harmonizing these with the notation in §3 would reduce reader effort.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and experimental rigor.
Point-by-point responses
- Referee: [§4.2] §4.2 (Quantitative Comparison): The tables reporting performance on MVCrowdTrack and CityTrack do not document whether baseline re-implementations used identical training schedules, data augmentations, optimizer settings, and evaluation protocols as MVTrackTrans. Because the central claim attributes improvements to the view-ground interactions and Transformer backbone, this omission leaves open the possibility that observed deltas arise from uncontrolled experimental factors rather than the proposed design.
  Authors: We agree that explicit documentation of training and evaluation protocols is essential for reproducible comparisons. In the revised manuscript we will add a dedicated subsection (or table) detailing the exact training schedules, data augmentations, optimizer settings, and evaluation protocols applied to both MVTrackTrans and all re-implemented baselines. Where we followed the original authors’ recommended settings we will state this explicitly; any deviations will be justified and reported.
  Revision: yes
- Referee: [§3] §3 (Model Architecture) and §4.3 (Ablation Studies): No ablation isolates the contribution of the view-ground interaction modules from the Transformer backbone or from standard multi-view fusion. Without such controlled experiments, it is not possible to confirm that the claimed advantages for large scenes stem from the specific design choices highlighted in the abstract.
  Authors: We acknowledge that the current ablation studies do not fully isolate the view-ground interaction modules from the Transformer backbone and from generic multi-view fusion. To address this directly we will add new controlled experiments in the revised §4.3 that compare (i) the full MVTrackTrans, (ii) the Transformer backbone without view-ground interaction modules, and (iii) a standard multi-view fusion baseline using the same Transformer encoder. These results will be reported on both MVCrowdTrack and CityTrack to quantify the specific contribution of the view-ground interactions.
  Revision: yes
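The controlled comparison promised in the second response can be laid out as a small experiment grid: three model variants on two datasets, all sharing one training profile so that only the fusion mechanism changes. The sketch below is one hypothetical way to enumerate those runs. The class `AblationRun`, the variant labels, and the `training_profile` field are illustrative names and do not come from the paper or its code.

```python
from dataclasses import dataclass
from itertools import product

DATASETS = ("MVCrowdTrack", "CityTrack")

# (variant label, fusion mode); labels are hypothetical shorthand for (i)-(iii).
VARIANTS = (
    ("full_mvtracktrans", "view_ground_interaction"),
    ("transformer_no_interaction", "none"),
    ("transformer_standard_fusion", "standard_multiview_fusion"),
)


@dataclass(frozen=True)
class AblationRun:
    dataset: str
    variant: str
    fusion: str
    # One shared profile keeps schedule, augmentation, and optimizer identical.
    training_profile: str = "shared"

    @property
    def uses_view_ground_interaction(self) -> bool:
        return self.fusion == "view_ground_interaction"


def build_ablation_grid():
    """Return every dataset/variant combination with identical training settings."""
    return [
        AblationRun(dataset=d, variant=name, fusion=fusion)
        for d, (name, fusion) in product(DATASETS, VARIANTS)
    ]


if __name__ == "__main__":
    for run in build_ablation_grid():
        print(run.dataset, run.variant, run.fusion, run.training_profile)
```

Keeping the training profile as a single shared value also speaks to the first referee comment: any difference in metrics between runs on the same dataset can then be attributed to the fusion mode alone.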
Circularity Check
No circularity: empirical model proposal on new datasets with no derivation chain or fitted predictions.
Full rationale
The paper introduces MVTrackTrans (Transformer with view-ground interaction modules) and two new large-scale datasets (MVCrowdTrack, CityTrack), then reports superior tracking performance versus prior methods. No equations, parameter-fitting steps, or first-principles derivations appear in the abstract or described content. Claims rest on standard empirical comparison rather than any self-referential reduction, self-citation load-bearing premise, or renaming of known results. Results are independently falsifiable via the released code and data.