arXiv: 2605.15423 · v1 · pith:4FRWERRYnew · submitted 2026-05-14 · 💻 cs.CV · cs.AI · eess.IV

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

Pith reviewed 2026-05-19 15:20 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI · eess.IV
keywords: video object detection, embedded vision, microcontroller, multi-resolution inference, transformer, CNN, ByteTrack, energy efficiency

The pith

MR2-ByteTrack enables video object detection with up to 55% energy savings on microcontroller-based vision sensors by alternating resolutions and rescoring detections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a video object detection method designed for ultra-low-power microcontrollers, which lack the memory for the feature storage or multi-frame buffering that standard approaches require. It alternates between full-resolution and low-resolution frames to cut computation, uses ByteTrack to associate detections across frames, and applies a Rescore step that aggregates confidence scores across frames with probability union rules to correct misclassifications made on low-resolution inputs. This keeps detection accuracy close to full-resolution baselines while reducing both the number of operations and the energy used. A sympathetic reader would care because it makes on-device AI practical for smart cameras where sending data to the cloud is not feasible due to privacy, power, or connectivity limits.

Core claim

MR2-ByteTrack reduces multiply-accumulate (MAC) operations by up to 53% for CNN detectors and 32% for Transformer detectors on the ImageNetVID dataset while maintaining mAP scores of up to 49.0 and 48.7, respectively. When run on the GAP9 MCU, it achieves up to 55% energy savings over full-resolution-only processing and supports real-time Transformer-based video object detection for the first time on such hardware.

What carries the argument

The Multi-Resolution Rescored ByteTrack (MR2-ByteTrack) pipeline that switches between full- and low-resolution inference passes and corrects low-resolution errors via ByteTrack association combined with the Rescore algorithm's probability union aggregation of per-frame confidences.
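
The aggregation idea can be illustrated with a minimal Python sketch. It assumes that "probability union" means the standard union of independent events (noisy-OR) applied per class along one ByteTrack track; the paper's exact rule is not reproduced on this page, and the names `union_score` and `rescore_track` are hypothetical.

```python
from math import prod


def union_score(confidences):
    """Probability that at least one detection is correct, assuming independence.

    Assumption: this noisy-OR form stands in for the paper's "probability
    union rules"; the exact rule is not given in the summary.
    """
    return 1.0 - prod(1.0 - p for p in confidences)


def rescore_track(detections):
    """Pick the class with the highest aggregated score along one track.

    detections: list of (class_label, confidence) pairs, one per frame in
    which the tracker linked a detection to this track.
    Returns (best_label, aggregated_score).
    """
    per_class = {}
    for label, conf in detections:
        per_class.setdefault(label, []).append(conf)
    scores = {label: union_score(confs) for label, confs in per_class.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical track: one low-resolution frame mislabels the object as "dog".
track = [("cat", 0.6), ("dog", 0.5), ("cat", 0.7)]
label, score = rescore_track(track)
print(label, round(score, 2))
```

In this made-up track the two "cat" detections combine to 1 - 0.4 * 0.3 = 0.88, which outweighs the single 0.5 "dog" detection, so the track keeps the "cat" label.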

If this is right

  • Reduces computational cost measured in multiply-accumulate operations by as much as 53% for CNN models and 32% for Transformer models.
  • Achieves up to 55% energy savings on the GAP9 ultra-low-power RISC-V MCU compared to full-resolution processing.
  • Enables real-time Transformer-based video object detection on MCU-class embedded vision nodes for the first time.
  • Preserves detection accuracy with mAP values up to 49.0 for CNN and 48.7 for Transformer on ImageNetVID.
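
To see how a MAC reduction of this size relates to the share of low-resolution frames, here is a back-of-the-envelope sketch in Python. It assumes that only detector inference counts toward the MAC budget (tracker and Rescore overhead ignored) and that every low-resolution pass costs a fixed fraction of a full-resolution pass. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def mac_reduction(low_res_fraction, low_res_cost_ratio):
    """Fractional MAC saving relative to running every frame at full resolution.

    low_res_fraction: share of frames processed at low resolution (0..1).
    low_res_cost_ratio: MACs of one low-res pass divided by MACs of one
        full-res pass (0..1).
    """
    if not 0.0 <= low_res_fraction <= 1.0:
        raise ValueError("low_res_fraction must be in [0, 1]")
    if not 0.0 <= low_res_cost_ratio <= 1.0:
        raise ValueError("low_res_cost_ratio must be in [0, 1]")
    # Average cost per frame is (1 - f) * 1 + f * c, so the saving is f * (1 - c).
    return low_res_fraction * (1.0 - low_res_cost_ratio)


def required_low_res_fraction(target_reduction, low_res_cost_ratio):
    """Share of low-res frames needed to reach a target MAC reduction."""
    return target_reduction / (1.0 - low_res_cost_ratio)


# Illustrative only: halving each image side for a CNN whose MACs scale with
# pixel count gives a cost ratio of about 0.25.
print(mac_reduction(2 / 3, 0.25))
print(required_low_res_fraction(0.53, 0.25))
```

Under this simple model the saving can never exceed the low-resolution fraction itself, so a 53% reduction implies that more than half of the frames run at low resolution.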

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-resolution strategies could be tested on other tracking or detection architectures beyond the ones evaluated here.
  • The approach might reduce bandwidth needs in distributed vision systems by keeping more processing local.
  • Extending the Rescore logic to longer sequences or different confidence aggregation rules could further improve robustness on very low-power hardware.

Load-bearing premise

The Rescore algorithm reliably fixes misclassifications introduced by low-resolution frames using probability union rules without lowering overall detection performance.

What would settle it

Measuring the mAP on ImageNetVID when running the full pipeline but disabling the Rescore step and seeing if accuracy drops below the reported levels or below a full-resolution baseline.

Figures

Figures reproduced from arXiv: 2605.15423 by Daniele Palossi, Francesco Conti, Luca Benini, Luca Bompani, Manuele Rusci.

[Figures 1 to 5: images not reproduced]

Abstract

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, unfeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MR2-ByteTrack, a video object detection method for MCU-based embedded vision nodes. It alternates full- and low-resolution inference on CNN and Transformer detectors, links detections with ByteTrack, and applies a Rescore algorithm using probability union rules to aggregate per-frame confidence scores and correct misclassifications. On ImageNetVID it reports maintained mAP of 49.0 (CNN) and 48.7 (Transformer) with MAC reductions of 53% and 32%, respectively; on GAP9 hardware it claims up to 55% energy savings versus full-resolution processing, enabling the first real-time Transformer VOD on an MCU-class node. Code is released.

Significance. If the accuracy-maintenance claim holds, the work is significant for practical on-device video intelligence under severe memory and power constraints. It demonstrates cross-architecture generality (CNN and Transformer), reports concrete hardware energy measurements on GAP9, and provides reproducible code. These elements directly address the gap between high-accuracy VOD models and ultra-low-power embedded deployment.

major comments (2)
  1. [Method description of Rescore algorithm] The central claim that mAP is preserved while increasing the fraction of low-resolution frames (thereby achieving the reported 53%/32% MAC and 55% energy reductions) rests on the Rescore step. The manuscript states that probability-union aggregation corrects low-resolution misclassifications, yet provides no ablation, error-tolerance bound, or quantitative analysis of how many high-confidence false positives or missed small/fast objects the union rule can absorb before mAP falls below the full-resolution baseline. This is load-bearing for the energy-savings result.
  2. [Experiments on ImageNetVID and GAP9] The experimental section reports mAP values and MAC counts but does not specify the exact alternating schedule (e.g., fraction of low-resolution frames per sequence), the precise definition of the probability-union rule, or comparisons against other multi-resolution or frame-skipping baselines. Without these details the optimality and robustness of the 55% energy figure cannot be fully assessed.
minor comments (2)
  1. [Abstract and Method] The abstract and method sections use “probability union rules” without a short inline formula or pseudocode; adding one would improve clarity for readers unfamiliar with the exact aggregation.
  2. [Results tables/figures] Table or figure captions should explicitly state the resolution schedule and the number of low-resolution frames used to obtain the reported MAC and energy numbers.
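
As an illustration of the inline formula requested in the first minor comment, the rule could be stated as follows. This assumes "probability union" denotes the union of independent events; the paper's exact definition is not reproduced on this page. Here $p_1, \dots, p_n$ are the per-frame confidences assigned to one class along a single track, and $s_n$ is the aggregated score after $n$ linked detections.

```latex
\[
  s_n \;=\; 1 - \prod_{i=1}^{n} \left(1 - p_i\right),
  \qquad\text{equivalently}\qquad
  s_n \;=\; s_{n-1} + p_n - s_{n-1}\,p_n, \quad s_0 = 0 .
\]
```

The recursive form follows from $1 - (1 - s_{n-1})(1 - p_n)$ and needs only one stored value per track and class, which matches the constraint that MCU-class nodes cannot buffer multiple frames.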

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of MR2-ByteTrack for energy-constrained embedded vision. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the Rescore algorithm and experimental details.

Point-by-point responses
  1. Referee: [Method description of Rescore algorithm] The central claim that mAP is preserved while increasing the fraction of low-resolution frames (thereby achieving the reported 53%/32% MAC and 55% energy reductions) rests on the Rescore step. The manuscript states that probability-union aggregation corrects low-resolution misclassifications, yet provides no ablation, error-tolerance bound, or quantitative analysis of how many high-confidence false positives or missed small/fast objects the union rule can absorb before mAP falls below the full-resolution baseline. This is load-bearing for the energy-savings result.

    Authors: We agree that additional analysis is required to fully support the central claim. In the revised manuscript we will add an ablation study that varies the fraction of low-resolution frames and reports mAP both with and without the Rescore step. We will also include a quantitative error-tolerance analysis, with concrete examples of how the union rule handles high-confidence false positives and recovers missed small or fast objects across linked tracks. The probability-union rule will be defined precisely, stating whether it is the maximum probability across linked detections or 1 - product(1 - p_i). These additions will directly address the load-bearing nature of the result. revision: yes

  2. Referee: [Experiments on ImageNetVID and GAP9] The experimental section reports mAP values and MAC counts but does not specify the exact alternating schedule (e.g., fraction of low-resolution frames per sequence), the precise definition of the probability-union rule, or comparisons against other multi-resolution or frame-skipping baselines. Without these details the optimality and robustness of the 55% energy figure cannot be fully assessed.

    Authors: We acknowledge that the current experimental description lacks sufficient detail. In the revision we will explicitly state the alternating schedule (e.g., full-resolution every third frame with the resulting fraction of low-resolution frames per sequence), provide the exact mathematical formulation of the probability-union rule, and add direct comparisons against simple frame-skipping and other multi-resolution baselines. These changes will allow readers to assess the optimality and robustness of the reported 55% energy savings on GAP9. revision: yes
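
A fixed alternating schedule of the kind mentioned in the response can be sketched in a few lines of Python. The period of three is taken from the rebuttal's own example ("full-resolution every third frame") and is not a confirmed setting from the paper; the function names are hypothetical.

```python
def resolution_schedule(num_frames, period=3):
    """Return a per-frame list of 'full' or 'low' labels.

    Frame 0 and every `period`-th frame after it run at full resolution;
    all other frames run at low resolution. The default period is only the
    example given in the rebuttal, not a value confirmed by the paper.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    return ["full" if i % period == 0 else "low" for i in range(num_frames)]


def low_res_fraction(schedule):
    """Share of frames in the schedule that run at low resolution."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    return schedule.count("low") / len(schedule)


schedule = resolution_schedule(6)
print(schedule)
print(low_res_fraction(schedule))
```

Reporting the schedule in this explicit form, together with the resulting low-resolution fraction, would make it easier to reproduce the MAC and energy figures.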

Circularity Check

0 steps flagged

No circularity: empirical evaluation of resolution-alternating VOD with tracking and rescoring

full rationale

The paper introduces MR2-ByteTrack as an algorithmic combination of alternating full/low-resolution inference, ByteTrack linking, and a Rescore step that aggregates scores via probability-union rules. All performance claims (mAP 49.0/48.7, 53%/32% MAC reduction, 55% energy savings on GAP9) are presented as direct experimental outcomes on ImageNetVID, compared against full-resolution baselines. No equations, first-principles derivations, or fitted parameters are shown that reduce to the method's own inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify the core approach. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper builds on established computer vision techniques like ByteTrack and adapts them for resource-constrained environments without introducing new fundamental entities or many free parameters beyond standard model hyperparameters.

axioms (2)
  • domain assumption: Detections can be reliably linked across frames using ByteTrack.
    Central to maintaining temporal consistency in video.
  • domain assumption: Probability union rules can aggregate confidence scores to correct errors.
    Basis of the Rescore algorithm.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear
    Passage: "MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
