Edge Assisted Multi-Camera Vehicle Tracking Framework for Real-Time and Scalable Deployment

Adrian Evans; Florian Stanek; Markus Zarbock; Nic Zhang; Sam Lockyer; Shucheng Zhang; Wenbin Li; Yinhai Wang; Yuqiang Lin

arxiv: 2511.13904 · v2 · pith:MGHD6I4Onew · submitted 2025-11-17 · 💻 cs.CV

Pith reviewed 2026-05-25 07:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-camera vehicle tracking · edge computing · real-time systems · scalable deployment · intelligent transportation systems · cross-camera association · distributed tracking

The pith

A distributed edge-server framework tracks vehicles across multiple cameras in real time by sending only lightweight metadata for cross-camera association.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EASE-MCVT as a framework that splits multi-camera vehicle tracking between edge devices and a central server to meet real-time and scalability needs. Each edge processes its camera feed locally through detection, single-camera tracking, geo-mapping, and feature extraction, then forwards only vehicle locations and appearance features. The server performs cross-camera association using a re-match module for fragmented tracklets and a self-supervised camera link model to learn spatio-temporal constraints. This design contrasts with prior accuracy-focused methods by prioritizing throughput and large-scale operation. Experiments on the RoundaboutHD and CityFlow datasets confirm real-time performance alongside competitive tracking accuracy.

Core claim

EASE-MCVT is the first MCVT framework explicitly designed to address both real-time performance and scalability in a distributed edge-server setting. On the edge side, each camera stream is processed through object detection, single-camera tracking, geo-mapping and feature extraction, while only lightweight metadata, including vehicle locations and appearance features, is sent to the central server for cross-camera association. Algorithmic optimizations include a dynamic workload scheme for tracklet-level feature extraction, a server-side re-match module to reconnect fragmented tracklets, and a self-supervised camera link model that learns spatio-temporal constraints. System components for (

What carries the argument

The edge-server split: each edge node processes its camera stream locally and forwards only vehicle locations and appearance features, while a server-side re-match module and a self-supervised camera link model handle cross-camera association.
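To make the metadata-only exchange concrete, here is a minimal sketch of what one edge-to-server message could look like. The paper's actual schema is not given on this page, so the class name `TrackletUpdate`, every field name, and the choice of JSON as the wire format are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class TrackletUpdate:
    """Hypothetical edge-to-server message; all field names are illustrative."""
    camera_id: str
    local_track_id: int    # ID assigned by the single-camera tracker on the edge
    timestamp: float       # seconds, assumed synchronised across cameras
    lat: float             # geo-mapped vehicle position
    lon: float
    feature: List[float]   # appearance embedding (dimension is an assumption)


def encode(update: TrackletUpdate) -> bytes:
    """Serialise one update to compact JSON bytes."""
    return json.dumps(asdict(update), separators=(",", ":")).encode("utf-8")


def decode(payload: bytes) -> TrackletUpdate:
    """Rebuild the update on the server side."""
    return TrackletUpdate(**json.loads(payload.decode("utf-8")))


update = TrackletUpdate("cam_03", 17, 1700000000.5, 51.3781, -2.3597, [0.12, -0.40, 0.88])
payload = encode(update)
restored = decode(payload)
print(len(payload), "bytes on the wire")
```

No image data appears in the message, which is the property the edge-server split relies on. A production system would more likely use a binary encoding for the embedding; JSON is used here only to keep the sketch dependency-free.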

If this is right

Enables city-wide real-time traffic management by supporting scalable operation across large camera networks.
Reduces data transmission load through metadata-only exchange between edges and server.
Improves association stability via the self-supervised camera link model that incorporates spatio-temporal constraints.
Standardizes deployment and data exchange for production-scale intelligent transportation systems.
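One way to picture how learned spatio-temporal constraints can stabilise association is a per-camera-pair transit-time window. The sketch below is a stand-in, not the paper's camera link model: it assumes transit times are collected from matches the system already trusts, and it summarises them with a simple mean plus or minus k standard deviations. The class name `CameraLinkModel` and its methods are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


class CameraLinkModel:
    """Illustrative transit-time gate between camera pairs (assumed design)."""

    def __init__(self, k: float = 3.0):
        self.k = k                          # window half-width in standard deviations
        self.samples = defaultdict(list)    # (src, dst) -> observed transit times

    def observe(self, src: str, dst: str, transit_s: float) -> None:
        """Record the transit time of a match the system already trusts."""
        self.samples[(src, dst)].append(transit_s)

    def window(self, src: str, dst: str):
        """Return (low, high) in seconds, or None if too little data was seen."""
        times = self.samples.get((src, dst), [])
        if len(times) < 2:
            return None
        mu, sigma = mean(times), stdev(times)
        return (max(0.0, mu - self.k * sigma), mu + self.k * sigma)

    def is_plausible(self, src: str, dst: str, transit_s: float) -> bool:
        """Reject candidate matches whose transit time falls outside the window."""
        w = self.window(src, dst)
        if w is None:
            return True  # no constraint learned yet, so do not filter
        return w[0] <= transit_s <= w[1]


model = CameraLinkModel(k=3.0)
for t in [10.0, 12.0, 11.0, 13.0, 9.0]:
    model.observe("cam_a", "cam_b", t)

print(model.window("cam_a", "cam_b"))
print(model.is_plausible("cam_a", "cam_b", 12.0))
print(model.is_plausible("cam_a", "cam_b", 60.0))
```

Gating candidates this way shrinks the set of tracklet pairs that need an appearance comparison, which is one plausible reading of how such constraints could both accelerate and stabilise association.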

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The metadata-only approach could lower network bandwidth demands enough to support denser camera deployments than video-streaming methods allow.
Keeping raw video on the edge devices may reduce privacy exposure compared to centralized full-video processing.
The same edge-server split with learned camera links could apply to multi-camera tracking of other road users such as pedestrians or cyclists.
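The bandwidth point can be checked with back-of-envelope arithmetic. The sketch below is purely illustrative: the vehicle count, update rate, embedding size, per-message overhead, and the video bitrate used for comparison are all placeholder assumptions, not figures from the paper.

```python
def metadata_bitrate_bps(vehicles: int, updates_per_s: int, feature_dim: int,
                         bytes_per_float: int = 4, overhead_bytes: int = 32) -> int:
    """Uplink bitrate for one camera that sends only per-vehicle metadata.

    overhead_bytes stands in for IDs, timestamp and position (assumed size).
    """
    bytes_per_update = feature_dim * bytes_per_float + overhead_bytes
    return vehicles * updates_per_s * bytes_per_update * 8


# Placeholder inputs: 20 visible vehicles, 1 update per second each,
# a 512-dimensional float32 embedding.
meta_bps = metadata_bitrate_bps(vehicles=20, updates_per_s=1, feature_dim=512)

# Assumed bitrate of a compressed video stream, for comparison only.
video_bps = 4_000_000

print(f"metadata: {meta_bps / 1e6:.3f} Mbit/s")
print(f"video:    {video_bps / 1e6:.3f} Mbit/s")
print(f"ratio:    {video_bps / meta_bps:.1f}x")
```

The ratio moves with every input, so the comparison only indicates the direction of the saving. Busy scenes with many vehicles or large embeddings narrow the gap.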

Load-bearing premise

Lightweight metadata of vehicle locations and appearance features suffices for the server modules to reconnect tracklets and learn effective constraints without critical loss of visual information.
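The premise can be made tangible with a sketch of reconnecting fragmented tracklets from metadata alone. This is not the paper's re-match module: it assumes a greedy matcher that links a tracklet that ended to one that started shortly afterwards when their appearance embeddings are similar. The function names, dictionary keys, and both thresholds are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def rematch(ended, started, min_sim: float = 0.8, max_gap_s: float = 5.0) -> dict:
    """Greedily link started tracklets to ended ones; returns {started_id: ended_id}."""
    candidates = []
    for e in ended:
        for s in started:
            gap = s["t_start"] - e["t_end"]
            if 0.0 <= gap <= max_gap_s:               # temporal gate
                sim = cosine_similarity(e["feature"], s["feature"])
                if sim >= min_sim:                    # appearance gate
                    candidates.append((sim, e["id"], s["id"]))
    candidates.sort(reverse=True)                     # best similarity first
    links, used_ended, used_started = {}, set(), set()
    for _sim, e_id, s_id in candidates:
        if e_id in used_ended or s_id in used_started:
            continue
        links[s_id] = e_id
        used_ended.add(e_id)
        used_started.add(s_id)
    return links


ended = [
    {"id": 1, "t_end": 10.0, "feature": [1.0, 0.0]},
    {"id": 2, "t_end": 11.0, "feature": [0.0, 1.0]},
]
started = [
    {"id": 7, "t_start": 12.0, "feature": [0.9, 0.1]},
    {"id": 8, "t_start": 30.0, "feature": [0.0, 1.0]},
]
links = rematch(ended, started)
print(links)  # {7: 1}
```

Tracklet 8 matches tracklet 2 perfectly in appearance but is left unlinked because it starts 19 seconds later. Whether embeddings and positions carry enough information for this kind of decision in harder scenes is exactly what the premise asserts and what the experiments need to show.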

What would settle it

Deploy the system on a network of over 100 cameras and check whether cross-camera tracking accuracy drops below the levels reported on RoundaboutHD and CityFlow or whether end-to-end latency exceeds real-time thresholds.

Figures

Figures reproduced from arXiv: 2511.13904 by Adrian Evans, Florian Stanek, Markus Zarbock, Nic Zhang, Sam Lockyer, Shucheng Zhang, Wenbin Li, Yinhai Wang, Yuqiang Lin.

**Figure 1.** The workflow of SAE-MCVT framework. It contains two main blocks. (I) N edge node in the road network. Each edge node received

**Figure 3.** Comparison between ground-truth and estimated transition-time distributions for the self-supervised camera link model.

Original abstract

Cameras are a core sensing modality in modern intelligent transportation systems (ITS), providing rich visual information on road-user activities. Multi-Camera Vehicle Tracking (MCVT) uses this data to reconstruct vehicle trajectories across camera networks, supporting applications such as traffic flow prediction and optimisation. However, most existing MCVT studies emphasise tracking accuracy while paying limited attention to real-time performance and scalability, both essential for real-world and city-scale deployment. To address this gap, we propose Edge-Assisted, Scalable and Efficient MCVT (EASE-MCVT), a distributed edge-server framework designed for real-time throughput and scalable operation. On the edge side, each camera stream is processed through object detection, single-camera tracking, geo-mapping and feature extraction, while only lightweight metadata, including vehicle locations and appearance features, is sent to the central server for cross-camera association. To improve both tracking accuracy and system efficiency, EASE-MCVT is optimised from algorithmic and system perspectives. Algorithmically, it introduces a dynamic workload scheme for tracklet-level feature extraction, a server-side re-match module to reconnect fragmented tracklets, and a self-supervised camera link model that learns spatio-temporal constraints to accelerate and stabilise cross-camera association. Systemically, it integrates production-oriented data engineering components to standardise deployment and data exchange for large-scale operation. To the best of our knowledge, EASE-MCVT is the first MCVT framework explicitly designed to address both real-time performance and scalability in a distributed edge-server setting. Experiments on the RoundaboutHD and CityFlow datasets demonstrate real-time throughput with competitive tracking accuracy, paving the way for city-wide real-time traffic management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EASE-MCVT offers a practical edge-server architecture for real-time multi-camera tracking with targeted optimizations, but the evidence for competitive accuracy rests on unshown metrics.

The paper's main point is a distributed edge-server system that runs detection, single-camera tracking, geo-mapping and feature extraction locally, then ships only locations and appearance features to a central server for cross-camera work. It adds a dynamic workload scheme for feature extraction, a server-side re-match module for fragmented tracklets, and a self-supervised camera link model for spatio-temporal constraints, plus some production data engineering pieces. This combination is positioned as the first to target both real-time throughput and scalability explicitly in this setting, and the architecture description is straightforward and deployment-oriented. The use of public datasets RoundaboutHD and CityFlow is a reasonable choice for testing. The work does a decent job integrating existing components into a system that could actually run at city scale without sending full video streams.

The soft spots are in the results. The abstract supplies no frame rates, no MOTA or IDF1 numbers, no baseline comparisons, and no ablations on the metadata-only approach or the individual modules. Without those, it is difficult to judge whether the re-match and camera-link pieces actually recover enough accuracy when richer visual information is dropped at the edge. The stress-test concern about information loss is reasonable and should be checked against the full experiments.

The citation pattern is standard and draws from prior MCVT literature without obvious gaps. This paper is for engineers and practitioners building intelligent transportation systems who need a deployable blueprint more than a new algorithm. It has enough concrete system detail to merit peer review, though the experimental section will need close scrutiny for the missing numbers and controls.

Referee Report

2 major / 0 minor

Summary. The paper proposes EASE-MCVT, a distributed edge-server framework for multi-camera vehicle tracking (MCVT) in intelligent transportation systems. On the edge, each camera performs detection, single-camera tracking, geo-mapping and feature extraction, sending only lightweight metadata (vehicle locations and appearance features) to a central server for cross-camera association. Algorithmic optimizations include a dynamic workload scheme for tracklet-level feature extraction, a server-side re-match module for fragmented tracklets, and a self-supervised camera link model for spatio-temporal constraints. System components standardize large-scale deployment. The authors claim it is the first MCVT framework explicitly designed for both real-time performance and scalability in an edge-server setting, with experiments on RoundaboutHD and CityFlow demonstrating real-time throughput and competitive tracking accuracy.

Significance. If the performance claims hold with supporting quantitative evidence, this work would address a practical gap in MCVT by enabling city-scale, real-time deployment through edge distribution and metadata-only transmission, which could reduce bandwidth demands while supporting applications like traffic optimization. The integration of algorithmic modules (re-match, camera-link) with production data engineering components strengthens its potential for scalable ITS.

major comments (2)

[Abstract] The central claim that 'Experiments on the RoundaboutHD and CityFlow datasets demonstrate real-time throughput with competitive tracking accuracy' supplies no quantitative metrics (e.g., FPS, MOTA, IDF1), baselines, ablation results, or error analysis. This absence makes it impossible to verify the headline assertions of real-time performance and competitive accuracy that underpin the framework's contribution.
[Abstract] The framework's core premise, that transmitting only lightweight metadata (locations and appearance features) suffices for competitive cross-camera accuracy, rests on the untested assumption that the server-side re-match module and self-supervised camera link model can fully recover information lost from edge-side feature extraction. No ablation isolating the effect of metadata-only transmission versus full visual cues is referenced, which directly bears on whether the accuracy claims generalize beyond the reported datasets.
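For readers unfamiliar with the metrics the referee asks for, the sketch below computes MOTA and IDF1 from raw counts using their standard definitions: MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The counts in the usage lines are invented placeholders, not results from the paper.

```python
def mota(fn: int, fp: int, idsw: int, gt: int) -> float:
    """Multiple Object Tracking Accuracy.

    fn: missed detections, fp: false positives, idsw: identity switches,
    gt: total number of ground-truth objects over all frames.
    """
    if gt <= 0:
        raise ValueError("gt must be positive")
    return 1.0 - (fn + fp + idsw) / gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """ID F1 score: harmonic mean of ID precision and ID recall."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Placeholder counts for illustration only.
print(f"MOTA = {mota(fn=50, fp=30, idsw=20, gt=1000):.3f}")
print(f"IDF1 = {idf1(idtp=800, idfp=100, idfn=100):.3f}")
```

MOTA is driven mostly by detection errors, while IDF1 rewards keeping the same identity on a vehicle over time, so IDF1 is usually the more telling number for cross-camera association.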

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to strengthen the presentation of results.

Referee: [Abstract] The central claim that 'Experiments on the RoundaboutHD and CityFlow datasets demonstrate real-time throughput with competitive tracking accuracy' supplies no quantitative metrics (e.g., FPS, MOTA, IDF1), baselines, ablation results, or error analysis. This absence makes it impossible to verify the headline assertions of real-time performance and competitive accuracy that underpin the framework's contribution.

Authors: We agree that the abstract would be strengthened by including specific quantitative metrics. In the revised version we will update the abstract to report key results such as achieved FPS, MOTA and IDF1 scores on both datasets together with explicit baseline comparisons drawn from the experimental section. revision: yes

Referee: [Abstract] The framework's core premise, that transmitting only lightweight metadata (locations and appearance features) suffices for competitive cross-camera accuracy, rests on the untested assumption that the server-side re-match module and self-supervised camera link model can fully recover information lost from edge-side feature extraction. No ablation isolating the effect of metadata-only transmission versus full visual cues is referenced, which directly bears on whether the accuracy claims generalize beyond the reported datasets.

Authors: The experiments on RoundaboutHD and CityFlow already demonstrate that the metadata-only pipeline, augmented by the re-match module and self-supervised camera link model, yields competitive cross-camera accuracy. We acknowledge that an explicit ablation isolating metadata-only transmission from full visual cues is not currently referenced. We will add this ablation study in the revision to quantify the contribution of the server-side components and to further support generalizability. revision: yes

Circularity Check

0 steps flagged

No circularity: framework assembles standard CV modules with new optimizations, validated on independent public datasets

The paper describes an engineering integration of object detection, single-camera tracking, feature extraction, and cross-camera association, augmented by three new modules (dynamic workload, re-match, self-supervised camera link). All performance claims are tied to experiments on the external RoundaboutHD and CityFlow datasets rather than any fitted parameter or self-referential definition. No equations, uniqueness theorems, or self-citations are invoked to force the central result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, new entities, or ad-hoc axioms are stated. The work implicitly relies on standard domain assumptions in computer vision and distributed systems.

axioms (1)

domain assumption: Cameras are calibrated and time-synchronized sufficiently for geo-mapping and cross-camera association.
Required for any multi-camera tracking system that performs geo-mapping and spatio-temporal linking.
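To show why calibration matters for geo-mapping, the sketch below maps a pixel to ground-plane coordinates with a planar homography. This page does not say which geo-mapping method the paper uses, so the homography approach, the function name `pixel_to_ground`, and the matrix values are all assumptions for illustration.

```python
def pixel_to_ground(H, u: float, v: float):
    """Apply a 3x3 homography H (row-major nested lists) to pixel (u, v).

    Returns (x, y) on the ground plane in whatever units H was calibrated in.
    """
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w


# Made-up calibration: 0.05 m per pixel with a fixed offset, no perspective term.
H = [
    [0.05, 0.0, -10.0],
    [0.0, 0.05, -5.0],
    [0.0, 0.0, 1.0],
]

x, y = pixel_to_ground(H, 400.0, 300.0)
print(f"ground position: ({x:.2f}, {y:.2f}) m")
```

An error in `H` shifts every mapped position from that camera, and a clock offset between cameras skews every transit time, which is why the ledger treats calibration and synchronisation as a precondition for cross-camera association.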

