MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

Daniele Palossi; Francesco Conti; Luca Benini; Luca Bompani; Manuele Rusci

arXiv: 2605.15423 · v1 · pith:4FRWERRYnew · submitted 2026-05-14 · cs.CV · cs.AI · eess.IV

Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

Pith reviewed 2026-05-19 15:20 UTC · model grok-4.3

classification: cs.CV · cs.AI · eess.IV

keywords: video object detection · embedded vision · microcontroller · multi-resolution inference · transformer · CNN · ByteTrack · energy efficiency

The pith

MR2-ByteTrack enables video object detection with up to 55% energy savings on microcontroller-based vision sensors by alternating resolutions and rescoring detections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a video object detection method designed for ultra-low-power microcontrollers that cannot handle standard approaches requiring lots of memory or buffering. It alternates between processing full-resolution and low-resolution frames to cut computation, uses ByteTrack to associate detections across frames, and applies a Rescore step that combines confidence scores from multiple frames using probability rules to fix mistakes made on low-res inputs. This keeps detection accuracy close to full-resolution baselines while slashing the number of operations and energy used. A sympathetic reader would care because it makes on-device AI possible for smart cameras where sending data to the cloud is not feasible due to privacy, power, or connectivity limits.

Core claim

MR2-ByteTrack reduces multiply-accumulate operations by up to 53% for CNN detectors and 32% for Transformer detectors on the ImageNetVID dataset while maintaining mAP scores of up to 49.0 and 48.7, respectively. When run on the GAP9 MCU, it achieves up to 55% energy savings over full-resolution processing and supports real-time Transformer-based video object detection for the first time on such hardware.

What carries the argument

The Multi-Resolution Rescored ByteTrack (MR2-ByteTrack) pipeline that switches between full- and low-resolution inference passes and corrects low-resolution errors via ByteTrack association combined with the Rescore algorithm's probability union aggregation of per-frame confidences.
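
The aggregation step can be illustrated with a minimal Python sketch. It assumes that "probability union" means the standard union of independent events, 1 - prod(1 - p_i), applied per class to the confidences of detections that the tracker has already linked into one track. The exact rule is not stated in this summary, so the formula, the function names `union_score` and `rescore_track`, and the sample track are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from math import prod


def union_score(confidences):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in confidences)


def rescore_track(detections):
    """Aggregate the (label, confidence) pairs of one linked track.

    Hypothetical helper: groups confidences by class label, combines each
    group with the union rule, and returns the best label with its score.
    """
    per_class = defaultdict(list)
    for label, confidence in detections:
        per_class[label].append(confidence)
    scores = {label: union_score(ps) for label, ps in per_class.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy track: one frame is labelled "cat", two frames are labelled "dog".
track = [("dog", 0.6), ("cat", 0.7), ("dog", 0.5)]
label, score = rescore_track(track)
```

In this toy track the single "cat" detection (0.7) is outweighed by the two "dog" detections, whose union score is 1 - 0.4 × 0.5 = 0.8. That is the kind of correction the Rescore step is described as providing.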

If this is right

Reduces computational cost measured in multiply-accumulate operations by as much as 53% for CNN models and 32% for Transformer models.
Achieves up to 55% energy savings on the GAP9 ultra-low-power RISC-V MCU compared to full-resolution processing.
Enables real-time Transformer-based video object detection on MCU-class embedded vision nodes for the first time.
Preserves detection accuracy with mAP values up to 49.0 for CNN and 48.7 for Transformer on ImageNetVID.
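
The link between the share of low-resolution frames and the compute saving can be sketched with simple arithmetic. The snippet below is a hedged illustration: the function names, the 50% low-resolution share, and the 0.25 cost ratio are hypothetical values chosen for the example, not the schedule or cost figures used in the paper.

```python
def average_relative_cost(low_res_fraction, low_res_cost_ratio):
    """Mean per-frame cost relative to running every frame at full resolution.

    low_res_fraction: share of frames processed at low resolution (0..1).
    low_res_cost_ratio: cost of one low-resolution pass divided by the cost
    of one full-resolution pass (0..1).
    """
    return (1.0 - low_res_fraction) + low_res_fraction * low_res_cost_ratio


def relative_saving(low_res_fraction, low_res_cost_ratio):
    """Fraction of operations saved versus the full-resolution baseline."""
    return 1.0 - average_relative_cost(low_res_fraction, low_res_cost_ratio)


# Hypothetical setting: half of the frames at a resolution costing 25% of a
# full-resolution pass.
example_saving = relative_saving(0.5, 0.25)
```

Under these assumed numbers the saving is 1 - (0.5 + 0.5 × 0.25) = 0.375. The saving is bounded above by `1 - low_res_cost_ratio`, which is reached only if every frame runs at low resolution.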

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar multi-resolution strategies could be tested on other tracking or detection architectures beyond the ones evaluated here.
The approach might reduce bandwidth needs in distributed vision systems by keeping more processing local.
Extending the Rescore logic to longer sequences or different confidence aggregation rules could further improve robustness on very low-power hardware.

Load-bearing premise

The Rescore algorithm reliably fixes misclassifications introduced by low-resolution frames using probability union rules without lowering overall detection performance.

What would settle it

Measuring the mAP on ImageNetVID when running the full pipeline but disabling the Rescore step and seeing if accuracy drops below the reported levels or below a full-resolution baseline.

Figures

Figures reproduced from arXiv: 2605.15423 by Daniele Palossi, Francesco Conti, Luca Benini, Luca Bompani, Manuele Rusci.

**Figure 2.** [image: figures/full_fig_p008_2.png]

**Figure 3.** [image: figures/full_fig_p011_3.png]

**Figure 4.** [image: figures/full_fig_p011_4.png]

**Figure 5.** [image: figures/full_fig_p012_5.png]

Original abstract

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, unfeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MR2-ByteTrack shows workable energy cuts on real MCU hardware for both CNN and transformer video detection by mixing resolutions and rescoring tracks, but the accuracy claim rests on how well the Rescore step actually fixes low-res mistakes.

The main takeaway is that this method delivers up to 55% energy savings on GAP9 hardware while holding mAP at 49.0 for the CNN detector and 48.7 for the transformer on ImageNetVID. They achieve the savings by alternating full- and low-resolution inference, linking detections with ByteTrack, and applying a Rescore step that uses probability union rules to combine scores across frames. The approach works for two very different model families, which is useful because CNNs and transformers handle spatial information differently.

The hardware measurements and the public code are concrete positives that let others check the numbers directly. The paper applies existing tracking and multi-resolution ideas to the tight memory and power limits of MCUs, and the reported MAC reductions of 53% and 32% line up with the energy results.

The soft spot sits in the Rescore component. The energy gains come from increasing the share of low-resolution frames, and the maintained accuracy is credited to Rescore correcting misclassifications. If low-resolution detections produce high-confidence errors on small or fast-moving objects, the union rule may not catch them all and could introduce new ones. The abstract does not give quantitative bounds on how much low-resolution error the method can absorb before mAP drops below the full-resolution baseline, so that link needs clear ablation results in the full text.

This work is aimed at engineers and researchers building on-device video pipelines for power-constrained sensors and IoT nodes. Anyone looking for measured trade-offs between compute and accuracy on actual embedded hardware will find the deployment numbers and code release helpful.

I would send it to peer review. The combination of software method, accuracy numbers, and real hardware energy data is solid enough to merit referee time, even if the robustness of the rescoring step needs tighter evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MR2-ByteTrack, a video object detection method for MCU-based embedded vision nodes. It alternates full- and low-resolution inference on CNN and Transformer detectors, links detections with ByteTrack, and applies a Rescore algorithm using probability union rules to aggregate per-frame confidence scores and correct misclassifications. On ImageNetVID it reports maintained mAP of 49.0 (CNN) and 48.7 (Transformer) with MAC reductions of 53% and 32%, respectively; on GAP9 hardware it claims up to 55% energy savings versus full-resolution processing, enabling the first real-time Transformer VOD on an MCU-class node. Code is released.

Significance. If the accuracy-maintenance claim holds, the work is significant for practical on-device video intelligence under severe memory and power constraints. It demonstrates cross-architecture generality (CNN and Transformer), reports concrete hardware energy measurements on GAP9, and provides reproducible code. These elements directly address the gap between high-accuracy VOD models and ultra-low-power embedded deployment.

major comments (2)

[Method description of Rescore algorithm] The central claim that mAP is preserved while increasing the fraction of low-resolution frames (thereby achieving the reported 53%/32% MAC and 55% energy reductions) rests on the Rescore step. The manuscript states that probability-union aggregation corrects low-resolution misclassifications, yet provides no ablation, error-tolerance bound, or quantitative analysis of how many high-confidence false positives or missed small/fast objects the union rule can absorb before mAP falls below the full-resolution baseline. This is load-bearing for the energy-savings result.
[Experiments on ImageNetVID and GAP9] The experimental section reports mAP values and MAC counts but does not specify the exact alternating schedule (e.g., fraction of low-resolution frames per sequence), the precise definition of the probability-union rule, or comparisons against other multi-resolution or frame-skipping baselines. Without these details the optimality and robustness of the 55% energy figure cannot be fully assessed.

minor comments (2)

[Abstract and Method] The abstract and method sections use “probability union rules” without a short inline formula or pseudocode; adding one would improve clarity for readers unfamiliar with the exact aggregation.
[Results tables/figures] Table or figure captions should explicitly state the resolution schedule and the number of low-resolution frames used to obtain the reported MAC and energy numbers.
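
As an illustration of the kind of inline formula the first minor comment asks for, the block below writes out the textbook union of independent events and compares it with a simple maximum over the same scores. It is a sketch under the assumption that per-frame confidences $p_i$ of linked detections are treated as independent probabilities; the paper's exact rule is not given in this report.

```latex
% p_i: confidence of the i-th detection linked into one track
% (independence between frames is an assumption of this sketch)
\begin{align}
P_{\cup} &= 1 - \prod_{i=1}^{n} (1 - p_i), \\
P_{\cup} &= p_1 + p_2 - p_1 p_2 \qquad (n = 2), \\
\max_{i} p_i &\le P_{\cup} \le \min\Bigl(1, \sum_{i=1}^{n} p_i\Bigr).
\end{align}
% Worked example: p_1 = 0.6, p_2 = 0.5 gives
% P_\cup = 0.6 + 0.5 - 0.3 = 0.8, while \max_i p_i = 0.6.
```

The inequality shows why the choice of rule matters: the union never scores a track lower than its best single detection, so repeated moderate-confidence detections accumulate, whereas a maximum ignores repetition.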

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of MR2-ByteTrack for energy-constrained embedded vision. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the Rescore algorithm and experimental details.

Referee: [Method description of Rescore algorithm] The central claim that mAP is preserved while increasing the fraction of low-resolution frames (thereby achieving the reported 53%/32% MAC and 55% energy reductions) rests on the Rescore step. The manuscript states that probability-union aggregation corrects low-resolution misclassifications, yet provides no ablation, error-tolerance bound, or quantitative analysis of how many high-confidence false positives or missed small/fast objects the union rule can absorb before mAP falls below the full-resolution baseline. This is load-bearing for the energy-savings result.

Authors: We agree that additional analysis is required to fully support the central claim. In the revised manuscript we will add an ablation study that varies the fraction of low-resolution frames and reports mAP both with and without the Rescore step. We will also include a quantitative error-tolerance analysis, with concrete examples of how the union rule handles high-confidence false positives and recovers missed small or fast objects across linked tracks. The probability-union rule will be defined precisely (either as the maximum probability across linked detections or as 1 - product(1 - p_i), where p_i is the confidence of the i-th linked detection). These additions will directly address the load-bearing nature of the result.

revision: yes

Referee: [Experiments on ImageNetVID and GAP9] The experimental section reports mAP values and MAC counts but does not specify the exact alternating schedule (e.g., fraction of low-resolution frames per sequence), the precise definition of the probability-union rule, or comparisons against other multi-resolution or frame-skipping baselines. Without these details the optimality and robustness of the 55% energy figure cannot be fully assessed.

Authors: We acknowledge that the current experimental description lacks sufficient detail. In the revision we will explicitly state the alternating schedule (e.g., full-resolution every third frame with the resulting fraction of low-resolution frames per sequence), provide the exact mathematical formulation of the probability-union rule, and add direct comparisons against simple frame-skipping and other multi-resolution baselines. These changes will allow readers to assess the optimality and robustness of the reported 55% energy savings on GAP9.

revision: yes
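
To make the notion of an alternating schedule concrete, here is a small Python sketch of a fixed-period schedule and the fraction of low-resolution frames it implies. The period of three mirrors the "every third frame" example in the response above; the function name `resolution_schedule` and the choice to start with a full-resolution frame are assumptions for illustration, since the paper's actual schedule is not specified here.

```python
def resolution_schedule(num_frames, period):
    """Label each frame 'full' or 'low' for a fixed-period schedule.

    Assumption of this sketch: frame 0 and every `period`-th frame after it
    run at full resolution; all other frames run at low resolution.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    return ["full" if i % period == 0 else "low" for i in range(num_frames)]


# Example: nine frames with a full-resolution pass every third frame.
schedule = resolution_schedule(9, 3)
low_fraction = schedule.count("low") / len(schedule)
```

With these inputs, six of the nine frames are low resolution, so `low_fraction` is 2/3. Reporting this fraction alongside the mAP and MAC numbers is what the referee's second major comment asks for.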

Circularity Check

0 steps flagged

No circularity: empirical evaluation of resolution-alternating VOD with tracking and rescoring

The paper introduces MR2-ByteTrack as an algorithmic combination of alternating full/low-resolution inference, ByteTrack linking, and a Rescore step that aggregates scores via probability-union rules. All performance claims (mAP 49.0/48.7, 53%/32% MAC reduction, 55% energy savings on GAP9) are presented as direct experimental outcomes on ImageNetVID, compared against full-resolution baselines. No equations, first-principles derivations, or fitted parameters are shown that reduce to the method's own inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify the core approach. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper builds on established computer vision techniques like ByteTrack and adapts them for resource-constrained environments without introducing new fundamental entities or many free parameters beyond standard model hyperparameters.

axioms (2)

domain assumption: Detections can be reliably linked across frames using ByteTrack. Central to maintaining temporal consistency in video.
domain assumption: Probability union rules can aggregate confidence scores to correct errors. Basis of the Rescore algorithm.

pith-pipeline@v0.9.0 · 5857 in / 1399 out tokens · 73685 ms · 2026-05-19T15:20:15.327680+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 7 internal anchors

[1]

S. C. Mukhopadhyay, S. K. S. Tyagi, N. K. Suryadevara, V . Piuri, F. Scotti, and S. Zeadally, ‘‘Artificial intelligence-based sensors for next generation iot applications: A review,’’IEEE Sensors Journal, vol. 21, no. 22, pp. 24 920–24 932, 2021

work page 2021
[2]

W. Su, L. Li, F. Liu, M. He, and X. Liang, ‘‘Ai on the edge: a comprehensive review,’’Artif. Intell. Rev., vol. 55, no. 8, p. 6125–6183, Dec. 2022. [Online]. Available: https://doi.org/10.1007/s10462-022-10141-4

work page doi:10.1007/s10462-022-10141-4 2022
[3]

W. Y u, F. Liang, X. He, W. G. Hatcher, C. Lu, J. Lin, and X. Y ang, ‘‘A survey on the edge computing for the internet of things,’’IEEE Access, vol. 6, pp. 6900–6919, 2018

work page 2018
[4]

K. S. Patle, R. Saini, A. Kumar, and V . S. Palaparthy, ‘‘Field evaluation of smart sensor system for plant disease prediction using lstm network,’’ IEEE Sensors Journal, vol. 22, no. 4, pp. 3715–3725, 2022

work page 2022
[5]

Sabato, S

A. Sabato, S. Dabetwar, N. N. Kulkarni, and G. Fortino, ‘‘Noncontact sensing techniques for ai-aided structural health monitoring: A systematic review,’’IEEE Sensors Journal, vol. 23, no. 5, pp. 4672–4684, 2023

work page 2023
[6]

Sameer, P

S. Sameer, P . Madan, S. Kannan, V . J. Upadhye, H. Patil, and S. Rajkumar, ‘‘AI-based Object Detection for Assisting the Visually Impaired People,’’ in2024 5th International Conference on Mobile Computing and Sustain- able Informatics (ICMCSI). IEEE, 2024, pp. 512–518

work page 2024
[7]

Lamberti, L

L. Lamberti, L. Bompani, V . J. Kartsch, M. Rusci, D. Palossi, and L. Benini, ‘‘Bio-inspired autonomous exploration policies with cnn-based object de- tection on nano-drones,’’ in2023 Design, Automation & Testin Europe Conference & Exhibition (DATE). IEEE, 2023, pp. 1–6

work page 2023
[8]

AlNuaimi, E

E. AlNuaimi, E. Cereda, R. Psiakis, S. Sugumar, A. Giusti, and D. Palossi, ‘‘A Deep Learning-Based Face Mask Detector for Autonomous Nano- Drones (Student Abstract),’’ inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 11, 2022, pp. 12 903–12 904

work page 2022
[9]

Rossi, F

D. Rossi, F. Conti, M. Eggiman, A. D. Mauro, G. Tagliavini, S. Mach, M. Guermandi, A. Pullini, I. Loi, J. Chen, E. Flamand, and L. Benini, ‘‘V ega: A Ten-Core SoC for IoT Endnodes With DNN Acceleration and Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode,’’ IEEE Journal of Solid-State Circuits, vol. 57, no. 1, pp. 127–139, 2022

work page 2022
[10]

Lamberti, M

L. Lamberti, M. Rusci, M. Fariselli, F. Paci, and L. Benini, ‘‘Low-power license plate detection and recognition on a risc-v multi-core mcu-based vision system,’’ in2021 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2021, pp. 1–5

work page 2021
[11]

Bompani, M

L. Bompani, M. Rusci, D. Palossi, F. Conti, and L. Benini, ‘‘ Multi- resolution Rescored ByteTrack for Video Object Detection on Ultra-low- power Embedded Systems ,’’ in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Los Alamitos, CA, USA: IEEE Computer Society, Jun. 2024, pp. 2182–2190. VOLUME 14, 2026 13

work page 2024
[12]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ‘‘An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,’’ inInternational Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy

work page 2021
[13]

Y . Wang, Y . Deng, Y . Zheng, P . Chattopadhyay, and L. Wang, ‘‘Vision transformers for image classification: A comparative survey,’’ Technologies, vol. 13, no. 1, 2025. [Online]. Available: https://www.mdpi. com/2227-7080/13/1/32

work page 2025
[14]

A. Khan, Z. Rauf, A. Sohail, A. R. Khan, H. Asif, A. Asif, and U. Farooq, ‘‘A survey of the vision transformers and their cnn-transformer based variants,’’Artificial Intelligence Review, vol. 56, no. 3, pp. 2917–2970, Dec

work page
[15]

Available: https://doi.org/10.1007/s10462-023-10595-0

[Online]. Available: https://doi.org/10.1007/s10462-023-10595-0

work page doi:10.1007/s10462-023-10595-0
[16]

H. Cai, J. Li, M. Hu, C. Gan, and S. Han, ‘‘EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction,’’ in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17 256–17 267

work page 2023
[17]

Russakovsky, J

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ‘‘Ima- geNet Large Scale Visual Recognition Challenge,’’International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015

work page 2015
[18]

[Online]

RangiLyu, ‘‘Nanodet-plus Superfast and high accuracy lightweight anchor-free object detection model,’’ 2021. [Online]. Available: https: //github.com/RangiLyu/nanodet

work page 2021
[19]

Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, ‘‘Y olox: Exceeding yolo series in 2021,’’arXivpreprintarXiv:2107.08430, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[20]

B. Liu, M. Cai, and J. Li, ‘‘Video Object Detection Based on 3D Con- volution,’’ in2022 IEEE International Conference on Unmanned Systems (ICUS), 2022, pp. 177–183

work page 2022
[21]

X. Zhu, Y . Wang, J. Dai, L. Y uan, and Y . Wei, ‘‘Flow-Guided Feature Aggregation for Video Object Detection,’’ in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 408–417

work page 2017
[22]

H. Wu, Y . Chen, N. Wang, and Z.-X. Zhang, ‘‘Sequence Level Semantics Aggregation for Video Object Detection,’’ in2019 IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), 2019, pp. 9216–9224

work page 2019
[23]

Y . Chen, Y . Cao, H. Hu, and L. Wang, ‘‘Memory Enhanced Global-Local Aggregation for Video Object Detection,’’ in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 06 2020, pp. 10 334– 10 343

work page 2020
[24]

Q. Zhou, X. Li, L. He, Y . Y ang, G. Cheng, Y . Tong, L. Ma, and D. Tao, ‘‘TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers,’’IEEE Transactions on Pattern Analysis and Machine Intel- ligence, vol. 45, no. 6, pp. 7853–7869, 2023

work page 2023
[25]

Y . Shi, N. Wang, and X. Guo, ‘‘YOLOV: Making Still Image Object Detectors Great at Video Object Detection,’’Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, pp. 2254–2262, Jun. 2023

work page 2023
[26]

Belhassen, H

H. Belhassen, H. Zhang, V . Fresse, and E.-B. Bourennane, ‘‘Im- proving Video Object Detection by Seq-BboxMatching.’’ inVISI- GRAPP(5:VISAPP), 2019, pp. 226–233

work page 2019
[27]

M. Li, L. Li, R. Bai, J. Ren, B. Meng, and Y . Y ang, ‘‘A Motion-based Seq-bbox Matching Method for Video Object Detection,’’ in2021 IEEE Symposium on Computers and Communications (ISCC), 2021, pp. 1–7

work page 2021
[28]

X. Liu, F. K. Nejadasl, J. C. van Gemert, O. Booij, and S. L. Pintea, ‘‘ Objects do not disappear: Video object detection by single-frame object location anticipation ,’’ in2023 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2023, pp. 6927–6938

work page 2023
[29]

V erelst and T

T. V erelst and T. Tuytelaars, ‘‘BlockCopy: High-Resolution Video Process- ing with Block-Sparse Feature Propagation and Online Policies,’’ in2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 5138–5147

work page 2021
[30]

Q. Zhou, S. Guo, J. Pan, J. Liang, J. Guo, Z. Xu, and J. Zhou, ‘‘Pass: Patch automatic skip scheme for efficient on-device video perception,’’IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 3938–3954, 2024

work page 2024
[31]

M. Liu, M. Zhu, M. White, Y . Li, and D. Kalenichenko, ‘‘Looking fast and slow: Memory-guided mobile video object detection,’’arXiv preprint arXiv:1903.10172, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903
[32]

Boyle, J

L. Boyle, J. Moosmann, N. Baumann, S. Heo, and M. Magno, ‘‘DSORT- MCU: Detecting Small Objects in Real Time on Microcontroller Units,’’ IEEE Sensors Journal, vol. 24, no. 24, pp. 40 231–40 239, 2024

work page 2024
[33]

W. Han, P . Khorrami, T. L. Paine, P . Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Y an, and T. S. Huang, ‘‘Seq-NMS for Video Object Detection.’’CoRR, vol. abs/1602.08465, 2016. [Online]. Available: http: //dblp.uni-trier.de/db/journals/corr/corr1602.html#HanKPRBSL YH16

work page internal anchor Pith review Pith/arXiv arXiv 2016
[34]

S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster r-cnn: towards real-time object detection with region proposal networks,’’ inProceedings of the 29th International Conference on Neural Information Processing Systems - V olume 1, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, p. 91–99

work page 2015
[35]

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y . Fu, and A. C. Berg, ‘‘SSD: Single Shot MultiBox Detector,’’ inComputer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 21–37

work page 2016
[36]

Sandler, A

M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, ‘‘Mobilenetv2: Inverted residuals and linear bottlenecks,’’2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018

work page 2018
[37]

Y aseen, ‘‘What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector,’’ 08 2024

M. Y aseen, ‘‘What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector,’’ 08 2024

work page 2024
[38]

A. Wang, H. Chen, L. Liu, K. CHEN, Z. Lin, J. Han, and G. Ding, ‘‘YOLOv10: Real-Time End-to-End Object Detection,’’ inThe Thirty- eighth Annual Conference on Neural Information Processing Systems,

work page
[39]

Available: https://openreview.net/forum?id=tz83Nyb71l

[Online]. Available: https://openreview.net/forum?id=tz83Nyb71l

work page
[40]

YOLOv11: An Overview of the Key Architectural Enhancements

R. Khanam and M. Hussain, ‘‘YOLOv11: An Overview of the Key Architectural Enhancements,’’ 2024. [Online]. Available: https://arxiv.org/ abs/2410.17725

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Carion, F

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ inCom- puter Vision – ECCV 2020, A. V edaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, pp. 213– 229

work page 2020
[42]

Mehta and M

S. Mehta and M. Rastegari, ‘‘MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer,’’ inInternational Conference on Learning Representations, 2022. [Online]. Available: https://openreview. net/forum?id=vh-0sUt8HlG

work page 2022
[43]

S. N. Wadekar and A. Chaurasia, ‘‘MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features,’’ 2022. [Online]. Available: https://arxiv.org/abs/2209.15159

work page arXiv 2022
[44]

Mehta and M

S. Mehta and M. Rastegari, ‘‘Separable Self-attention for Mobile Vision Transformers,’’ 2022. [Online]. Available: https://arxiv.org/abs/ 2206.02680

work page arXiv 2022
[45]

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, ‘‘Learning Spatiotemporal Features with 3D Convolutional Networks,’’ in2015 IEEE International Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, dec 2015, pp. 4489–4497

work page 2015
[46]

Y . Lyu, M. Y . Y ang, G. V osselman, and G.-S. Xia, ‘‘Video object detection with a convolutional regression tracker,’’ISPRS Journal of Photogramme- try and Remote Sensing, vol. 176, pp. 139–150, 2021

work page 2021
[47]

Integrated Object Detection and Tracking with Tracklet-Conditioned Detection

Z. Zhang, D. Cheng, X. Z. S. Lin, and J. Dai, ‘‘Integrated Object De- tection and Tracking with Tracklet-Conditioned Detection,’’ArXiv, vol. abs/1811.11167, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[48]

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘‘Deformable DETR: Deformable Transformers for End-to-End Object Detection,’’ArXiv, vol. abs/2010.04159, 2020. [Online]. Available: https://api.semanticscholar. org/CorpusID:222208633

work page internal anchor Pith review Pith/arXiv arXiv 2010
[49]

Bewley, Z

A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, ‘‘Simple online and realtime tracking,’’ in2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 3464–3468

work page 2016
[50]

Zhang, P

Y . Zhang, P . Sun, Y . Jiang, D. Y u, Z. Y uan, P . Luo, W. Liu, and X. Wang, ‘‘ByteTrack: Multi-Object Tracking by Associating Every Detection Box,’’ inEuropean Conference on Computer Vision, 2021

work page 2021
[51]

B. A. Motetti, L. Crupi, M. O. M. E. Elshaigi, M. Risso, D. J. Pagliari, D. Palossi, and A. Burrello, ‘‘Adaptive Deep Learning for Efficient Visual Pose Estimation aboard Ultra-low-power Nano-drones,’’ArXiv, vol. abs/2401.15236, 2024. [Online]. Available: https://api.semanticscholar. org/CorpusID:267312457

work page arXiv 2024
[52]

Moosmann, H

J. Moosmann, H. Müller, N. Zimmerman, G. Rutishauser, L. Benini, and M. Magno, ‘‘Flexible and Fully Quantized Lightweight TinyissimoYOLO for Ultra-Low-Power Edge Systems,’’IEEE Access, vol. 12, pp. 75 093– 75 107, 2024

work page 2024
[53]

Moosmann, P

J. Moosmann, P . Bonazzi, Y . Li, S. Bian, P . Mayer, L. Benini, and M. Magno, ‘‘Ultra-efficient on-device object detection on ai-integrated smart glasses with tinyissimoyolo,’’ inComputer Vision – ECCV 2024 14 VOLUME 14, 2026 Workshops, A. Del Bue, C. Canton, J. Pont-Tuset, and T. Tommasi, Eds. Cham: Springer Nature Switzerland, 2025, pp. 262–280

work page 2024
[54]

H. H. Y . Shalby, M. Pavan, and M. Roveri, ‘‘StreamTinyNet: video stream- ing analysis with spatial-temporal TinyML,’’ in2024 International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–8

work page 2024
[55]

El Zeinaty, W

C. El Zeinaty, W. Hamidouche, G. Herrou, and D. Menard, ‘‘Designing object detection models for tinyml: Foundations, comparative analysis, challenges, and emerging solutions,’’ACM Comput. Surv., vol. 58, no. 2, Sep. 2025. [Online]. Available: https://doi.org/10.1145/3744339

work page doi:10.1145/3744339 2025
[56]

Burrello, M

A. Burrello, M. Scherer, M. Zanghieri, F. Conti, and L. Benini, ‘‘A Mi- crocontroller is All Y ou Need: Enabling Transformer Execution on Low- Power IoT Endnodes,’’ in2021 IEEE International Conference on Omni- Layer Intelligent Systems (COINS), 2021, pp. 1–6

work page 2021
[57]

V . J.-B. Jung, A. Burrello, M. Scherer, F. Conti, and L. Benini, ‘‘Optimiz- ing the Deployment of Tiny Transformers on Low-Power MCUs,’’IEEE Transactions on Computers, vol. 74, no. 2, pp. 526–541, 2025

[58]

A. Dequino, L. Bompani, L. Benini, and F. Conti, “Optimizing BFloat16 Deployment of Tiny Transformers on Ultra-Low Power Extreme Edge SoCs,” Journal of Low Power Electronics and Applications, vol. 15, no. 1. [Online]. Available: https://www.mdpi.com/2079-9268/15/1/8

[60]

X. Lu, C. Bai, A. Zhu, Y. Zhu, and K. Wang, “Mcformer: A transformer-based detector for molecular communication with accelerated particle-based solution,” IEEE Communications Letters, vol. 27, no. 10, pp. 2837–2841, 2023

[61]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000–6010

[62]

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020

[63]

L. Scarciglia, A. Paolillo, and D. Palossi, “A map-free deep learning-based framework for gate-to-gate monocular visual navigation aboard miniaturized aerial vehicles,” 2025. [Online]. Available: https://arxiv.org/abs/2503.05251

[64]

L. Bompani, L. Crupi, D. Palossi, O. Baldoni, D. Brunelli, F. Conti, M. Rusci, and L. Benini, “Accelerating image-based pest detection on a heterogeneous multicore microcontroller,” IEEE Transactions on AgriFood Electronics, vol. 2, no. 2, pp. 170–180, 2024

[65]

L. Crupi, L. Butera, A. Ferrante, A. Giusti, and D. Palossi, “An efficient ground-aerial transportation system for pest control enabled by ai-based autonomous nano-uavs,” ACM J. Auton. Transport. Syst., vol. 2, no. 4, Jun. 2025. [Online]. Available: https://doi.org/10.1145/3719210

[66]

A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” ArXiv, vol. abs/2004.10934, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:216080778

[67]

P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection and tracking meet drones challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, 2021

[68]

Z. Tang, M. Naphade, M.-Y. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J.-N. Hwang, “CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

[69]

L. Wen, D. Du, Z. Cai, Z. Lei, M.-C. Chang, H. Qi, J. Lim, M.-H. Yang, and S. Lyu, “UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking,” Computer Vision and Image Understanding, vol. 193, p. 102907, 2020

[70]

Nanocopter AI Challenge

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 740–755.

LUCA BOMPANI Ph.D. graduate in Electronic Engineering at the U...

