pith. machine review for the scientific record.

arxiv: 2604.14789 · v1 · submitted 2026-04-16 · 💻 cs.AI


A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits


Pith reviewed 2026-05-10 10:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords CNN optimization · edge AI · early exits · pruning · quantization · inference latency · ONNX · model compression

The pith

Static compression and early-exit mechanisms offer different trade-offs on edge devices, with their combination reducing latency and memory while preserving most accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares static compression techniques like pruning and quantization with dynamic early-exit mechanisms for CNNs on edge devices. It tests them individually and in combination using real hardware and ONNX pipelines under identical conditions. A reader would care because edge deployment requires trading off accuracy for speed and resources, and understanding how these methods work together can unlock better performance. The central finding is that static methods cut memory consistently, early exits save computation adaptively, and combining them reduces both latency and memory with little accuracy loss.

Core claim

Static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.
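The input-adaptive behavior that the claim attributes to early exits can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the confidence rule (top-class softmax probability against a threshold), and the 0.9 threshold are all assumptions made here for concreteness.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_inference(x, stages, exit_heads, threshold=0.9):
    """Run backbone stages in order; after each stage, ask its exit head
    for logits and stop as soon as the top-class confidence clears the
    threshold. The last exit always fires, so every input gets an answer.
    Returns (predicted_class, index_of_exit_taken)."""
    h = x
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        h = stage(h)
        probs = softmax(head(h))
        conf = max(probs)
        if conf >= threshold or i == len(stages) - 1:
            return probs.index(conf), i
```

Easy inputs exit early and skip the remaining stages entirely; hard inputs pay for the full network. Static compression cannot reproduce this per-input behavior, which is why the two families compose rather than compete.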

What carries the argument

A unified comparison of static compression (pruning and quantization) and dynamic early-exit mechanisms, evaluated through ONNX-based inference pipelines on physical edge devices.
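"Identical conditions" in a deployment-oriented comparison like this typically means a shared measurement harness: the same warmup, the same timer, the same summary statistics for every model variant. The sketch below shows one such harness; it is engine-agnostic, and the ONNX Runtime usage in the trailing comment (session and input names) is illustrative, not taken from the paper.

```python
import time
import statistics

def benchmark(run_fn, n_warmup=10, n_runs=100):
    """Time a single-inference callable under identical conditions:
    warm up first (caches, allocators, lazy initialization), then record
    per-run wall-clock latency and report summary statistics in ms."""
    for _ in range(n_warmup):
        run_fn()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": samples[len(samples) // 2],
        "p95_ms": samples[int(len(samples) * 0.95)],
    }

# With ONNX Runtime, the callable would wrap a session (hypothetical names):
#   sess = onnxruntime.InferenceSession("model.onnx")
#   stats = benchmark(lambda: sess.run(None, {"input": x}))
```

Reporting a tail percentile alongside the mean matters especially for early-exit models, whose latency distribution is bimodal by design.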

If this is right

  • Pruning and quantization deliver consistent reductions in model memory footprint.
  • Early-exit mechanisms provide input-dependent savings in computation that static methods cannot achieve.
  • The combination of both approaches reduces inference latency and memory usage simultaneously with only minimal accuracy loss.
  • This hybrid strategy expands the range of feasible CNN deployments on resource-constrained edge hardware.
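The memory side of these bullets can be made concrete with back-of-the-envelope arithmetic. The sketch below combines unstructured magnitude pruning with symmetric int8 quantization and estimates the resulting footprint; all function names, the sparse-storage cost model (1-byte value plus 4-byte index per nonzero), and the 50% sparsity figure are assumptions for illustration, not numbers from the paper.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(len(weights) * sparsity)
    cutoff = sorted(abs(w) for w in weights)[k] if k else 0.0
    return [0.0 if abs(w) < cutoff else w for w in weights]

def quantize_int8(weights):
    """Symmetric linear quantization: int8 codes plus one fp32 scale."""
    max_abs = max((abs(w) for w in weights), default=0.0) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def footprint_bytes(weights, bits_per_weight=32, store_sparse=False):
    """Rough memory estimate: dense storage, or values + 32-bit indices
    for nonzeros only when a sparse format is assumed."""
    if store_sparse:
        nnz = sum(1 for w in weights if w != 0)
        return nnz * (bits_per_weight // 8 + 4)
    return len(weights) * bits_per_weight // 8
```

Quantization alone already cuts a dense fp32 tensor to a quarter of its size; whether pruning helps further depends on whether the runtime can store and execute the sparse format, which is exactly the kind of platform effect an on-device comparison surfaces.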

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Edge AI system designers may achieve tighter resource budgets by prioritizing hybrid static-dynamic optimizations from the start.
  • Future implementations could tie quantization levels to specific exit points to gain additional efficiency.
  • The approach suggests potential reductions in energy use for battery-powered or IoT devices beyond the tested cases.

Load-bearing premise

The specific CNN models, datasets, and edge devices tested represent broader edge AI workloads, and ONNX inference pipelines capture all relevant runtime overheads without hidden platform-specific effects.

What would settle it

Applying the combined optimizations to a different CNN architecture or edge device and observing no simultaneous reduction in latency and memory beyond what either technique achieves alone.

read the original abstract

Deploying deep neural networks on edge devices requires balancing accuracy, latency, and resource constraints under realistic execution conditions. To fit models within these constraints, two broad strategies have emerged: static compression techniques such as pruning and quantization, which permanently reduce model size, and dynamic approaches such as early-exit mechanisms, which adapt computational cost at runtime. While both families are widely studied in isolation, they are rarely compared under identical conditions on physical hardware. This paper presents a unified deployment-oriented comparison of static compression and dynamic early-exit mechanisms, evaluated on real edge devices using ONNX based inference pipelines. Our results show that static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript performs a comparative empirical study of static compression techniques (pruning and quantization) versus dynamic early-exit mechanisms for CNN optimization on edge devices. It evaluates both families and their combination on physical hardware via ONNX inference pipelines, reporting that static methods provide consistent memory reduction while early exits enable input-adaptive latency savings, and that the hybrid approach simultaneously reduces latency and memory footprint with only minimal accuracy loss.

Significance. If the reported trade-offs hold under broader conditions, the work supplies practical guidance on hybrid static-dynamic optimization for edge deployment, a topic of direct engineering relevance. The use of real hardware and ONNX pipelines is a positive methodological choice that moves beyond simulation-only evaluations.

major comments (2)
  1. [Abstract] The central claim that the combination 'proves highly effective' by simultaneously reducing inference latency and memory usage with minimal accuracy loss is load-bearing on the representativeness of the chosen CNN architectures, datasets, and physical edge devices. No justification or diversity analysis for these choices is supplied, leaving open whether the observed trade-offs generalize beyond the specific experimental setup.
  2. [Results] The abstract states quantitative outcomes, but the provided text supplies no model architectures, dataset names, device specifications, baseline comparisons, effect sizes, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.
minor comments (2)
  1. [Methods] Specify the exact ONNX Runtime version, execution providers, and any platform-specific scheduling or memory-hierarchy interactions that could affect early-exit branching overhead.
  2. [Abstract] Notation: define 'minimal accuracy loss' quantitatively (e.g., absolute or relative drop threshold) rather than qualitatively.
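The second minor comment asks for a quantitative definition rather than a slogan. One concrete way to operationalize it is a pair of thresholds on the accuracy drop; the helper below uses 1 percentage point absolute or 2% relative as example cutoffs, which are illustrative choices made here, not thresholds from the paper.

```python
def accuracy_drop(base_acc, optimized_acc):
    """Absolute and relative accuracy drop of an optimized model vs. its baseline."""
    absolute = base_acc - optimized_acc
    relative = absolute / base_acc if base_acc else float("inf")
    return absolute, relative

def is_minimal_loss(base_acc, optimized_acc, abs_tol=0.01, rel_tol=0.02):
    """One concrete reading of 'minimal accuracy loss': at most 1 point
    absolute OR 2% relative drop. Thresholds are illustrative, not the
    paper's; the point is that the criterion is stated up front."""
    absolute, relative = accuracy_drop(base_acc, optimized_acc)
    return absolute <= abs_tol or relative <= rel_tol
```

Declaring such a criterion before running experiments turns "minimal" from a rhetorical claim into a falsifiable one.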

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. Where revisions are warranted to improve clarity and address concerns about experimental context and detail, we have outlined specific changes that will be incorporated in the revised version.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the combination 'proves highly effective' by simultaneously reducing inference latency and memory usage with minimal accuracy loss is load-bearing on the representativeness of the chosen CNN architectures, datasets, and physical edge devices. No justification or diversity analysis for these choices is supplied, leaving open whether the observed trade-offs generalize beyond the specific experimental setup.

    Authors: We agree that additional context on the selection of architectures, datasets, and devices would strengthen the abstract's central claim. In the revised manuscript, we will expand the abstract to briefly justify these choices as representative of common edge AI scenarios (e.g., lightweight CNNs for resource-constrained hardware, standard image classification benchmarks, and physical ONNX-compatible edge platforms). We will also add a short discussion of diversity considerations and generalization limits in the introduction and conclusions sections. revision: yes

  2. Referee: [Results] The abstract states quantitative outcomes, but the provided text supplies no model architectures, dataset names, device specifications, baseline comparisons, effect sizes, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.

    Authors: We acknowledge that the results presentation would benefit from greater explicitness to allow readers to fully assess the quantitative claims. Although the methodology and results sections describe the experimental setup, we will revise the results section to include a consolidated summary table listing model architectures, dataset names, device specifications, baselines, effect sizes, and statistical tests where applicable. We will also cross-reference these details more clearly from the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity; this is an empirical comparison study with no derivations or self-referential predictions.

full rationale

This is an empirical comparison paper evaluating static compression (pruning/quantization) versus dynamic early-exit mechanisms on physical edge devices via ONNX pipelines. The abstract and described content contain no equations, no fitted parameters, no predictions derived from inputs, and no load-bearing self-citations or uniqueness theorems. Central claims rest on experimental trade-off observations rather than any derivation chain that reduces to its own inputs by construction. The study is self-contained against external benchmarks, with generalizability concerns belonging to correctness rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities are present; the work is a hardware-oriented empirical comparison.

pith-pipeline@v0.9.0 · 5475 in / 1048 out tokens · 48111 ms · 2026-05-10T10:37:46.826802+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 17 canonical work pages
