pith. machine review for the scientific record.

arxiv: 2604.14789 · v1 · submitted 2026-04-16 · 💻 cs.AI


A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits


Pith reviewed 2026-05-10 10:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords CNN optimization · edge AI · early exits · pruning · quantization · inference latency · ONNX · model compression

The pith

Static compression and early-exit mechanisms offer different trade-offs on edge devices, with their combination reducing latency and memory while preserving most accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares static compression techniques like pruning and quantization with dynamic early-exit mechanisms for CNNs on edge devices. It tests them individually and in combination using real hardware and ONNX pipelines under identical conditions. A reader would care because edge deployment requires trading off accuracy for speed and resources, and understanding how these methods work together can unlock better performance. The central finding is that static methods cut memory consistently, early exits save computation adaptively, and combining them reduces both latency and memory with little accuracy loss.

Core claim

Static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.
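The input-adaptive behavior that the claim attributes to early exits can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the confidence rule (top-class softmax probability against a threshold), and the 0.9 threshold are all assumptions made here for concreteness.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_inference(x, stages, exit_heads, threshold=0.9):
    """Run backbone stages in order; after each stage, ask its exit head
    for logits and stop as soon as the top-class confidence clears the
    threshold. The last exit always fires, so every input gets an answer.
    Returns (predicted_class, index_of_exit_taken)."""
    h = x
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        h = stage(h)
        probs = softmax(head(h))
        conf = max(probs)
        if conf >= threshold or i == len(stages) - 1:
            return probs.index(conf), i
```

Easy inputs exit early and skip the remaining stages entirely; hard inputs pay for the full network. Static compression cannot reproduce this per-input behavior, which is why the two families compose rather than compete.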

What carries the argument

A unified comparison of static compression (pruning and quantization) and dynamic early-exit mechanisms, evaluated through ONNX-based inference pipelines on physical edge devices.
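"Identical conditions" in a deployment-oriented comparison like this typically means a shared measurement harness: the same warmup, the same timer, the same summary statistics for every model variant. The sketch below shows one such harness; it is engine-agnostic, and the ONNX Runtime usage in the trailing comment (session and input names) is illustrative, not taken from the paper.

```python
import time
import statistics

def benchmark(run_fn, n_warmup=10, n_runs=100):
    """Time a single-inference callable under identical conditions:
    warm up first (caches, allocators, lazy initialization), then record
    per-run wall-clock latency and report summary statistics in ms."""
    for _ in range(n_warmup):
        run_fn()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": samples[len(samples) // 2],
        "p95_ms": samples[int(len(samples) * 0.95)],
    }

# With ONNX Runtime, the callable would wrap a session (hypothetical names):
#   sess = onnxruntime.InferenceSession("model.onnx")
#   stats = benchmark(lambda: sess.run(None, {"input": x}))
```

Reporting a tail percentile alongside the mean matters especially for early-exit models, whose latency distribution is bimodal by design.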

If this is right

  • Pruning and quantization deliver consistent reductions in model memory footprint.
  • Early-exit mechanisms provide input-dependent savings in computation that static methods cannot achieve.
  • The combination of both approaches reduces inference latency and memory usage simultaneously with only minimal accuracy loss.
  • This hybrid strategy expands the range of feasible CNN deployments on resource-constrained edge hardware.
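The memory side of these bullets can be made concrete with back-of-the-envelope arithmetic. The sketch below combines unstructured magnitude pruning with symmetric int8 quantization and estimates the resulting footprint; all function names, the sparse-storage cost model (1-byte value plus 4-byte index per nonzero), and the 50% sparsity figure are assumptions for illustration, not numbers from the paper.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(len(weights) * sparsity)
    cutoff = sorted(abs(w) for w in weights)[k] if k else 0.0
    return [0.0 if abs(w) < cutoff else w for w in weights]

def quantize_int8(weights):
    """Symmetric linear quantization: int8 codes plus one fp32 scale."""
    max_abs = max((abs(w) for w in weights), default=0.0) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def footprint_bytes(weights, bits_per_weight=32, store_sparse=False):
    """Rough memory estimate: dense storage, or values + 32-bit indices
    for nonzeros only when a sparse format is assumed."""
    if store_sparse:
        nnz = sum(1 for w in weights if w != 0)
        return nnz * (bits_per_weight // 8 + 4)
    return len(weights) * bits_per_weight // 8
```

Quantization alone already cuts a dense fp32 tensor to a quarter of its size; whether pruning helps further depends on whether the runtime can store and execute the sparse format, which is exactly the kind of platform effect an on-device comparison surfaces.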

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Edge AI system designers may achieve tighter resource budgets by prioritizing hybrid static-dynamic optimizations from the start.
  • Future implementations could tie quantization levels to specific exit points to gain additional efficiency.
  • The approach suggests potential reductions in energy use for battery-powered or IoT devices beyond the tested cases.

Load-bearing premise

The specific CNN models, datasets, and edge devices tested represent broader edge AI workloads, and ONNX inference pipelines capture all relevant runtime overheads without hidden platform-specific effects.

What would settle it

Applying the combined optimizations to a different CNN architecture or edge device and observing no simultaneous reduction in latency and memory beyond what either technique achieves alone.

read the original abstract

Deploying deep neural networks on edge devices requires balancing accuracy, latency, and resource constraints under realistic execution conditions. To fit models within these constraints, two broad strategies have emerged: static compression techniques such as pruning and quantization, which permanently reduce model size, and dynamic approaches such as early-exit mechanisms, which adapt computational cost at runtime. While both families are widely studied in isolation, they are rarely compared under identical conditions on physical hardware. This paper presents a unified deployment-oriented comparison of static compression and dynamic early-exit mechanisms, evaluated on real edge devices using ONNX based inference pipelines. Our results show that static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript performs a comparative empirical study of static compression techniques (pruning and quantization) versus dynamic early-exit mechanisms for CNN optimization on edge devices. It evaluates both families and their combination on physical hardware via ONNX inference pipelines, reporting that static methods provide consistent memory reduction while early exits enable input-adaptive latency savings, and that the hybrid approach simultaneously reduces latency and memory footprint with only minimal accuracy loss.

Significance. If the reported trade-offs hold under broader conditions, the work supplies practical guidance on hybrid static-dynamic optimization for edge deployment, a topic of direct engineering relevance. The use of real hardware and ONNX pipelines is a positive methodological choice that moves beyond simulation-only evaluations.

major comments (2)
  1. [Abstract] The central claim that the combination 'proves highly effective' by simultaneously reducing inference latency and memory usage with minimal accuracy loss is load-bearing on the representativeness of the chosen CNN architectures, datasets, and physical edge devices. No justification or diversity analysis for these choices is supplied, leaving open whether the observed trade-offs generalize beyond the specific experimental setup.
  2. [Results] The abstract states quantitative outcomes, but the provided text supplies no model architectures, dataset names, device specifications, baseline comparisons, effect sizes, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.
minor comments (2)
  1. [Methods] Specify the exact ONNX Runtime version, execution providers, and any platform-specific scheduling or memory-hierarchy interactions that could affect early-exit branching overhead.
  2. [Abstract] Notation: define 'minimal accuracy loss' quantitatively (e.g., absolute or relative drop threshold) rather than qualitatively.
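The second minor comment asks for a quantitative definition rather than a slogan. One concrete way to operationalize it is a pair of thresholds on the accuracy drop; the helper below uses 1 percentage point absolute or 2% relative as example cutoffs, which are illustrative choices made here, not thresholds from the paper.

```python
def accuracy_drop(base_acc, optimized_acc):
    """Absolute and relative accuracy drop of an optimized model vs. its baseline."""
    absolute = base_acc - optimized_acc
    relative = absolute / base_acc if base_acc else float("inf")
    return absolute, relative

def is_minimal_loss(base_acc, optimized_acc, abs_tol=0.01, rel_tol=0.02):
    """One concrete reading of 'minimal accuracy loss': at most 1 point
    absolute OR 2% relative drop. Thresholds are illustrative, not the
    paper's; the point is that the criterion is stated up front."""
    absolute, relative = accuracy_drop(base_acc, optimized_acc)
    return absolute <= abs_tol or relative <= rel_tol
```

Declaring such a criterion before running experiments turns "minimal" from a rhetorical claim into a falsifiable one.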

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. Where revisions are warranted to improve clarity and address concerns about experimental context and detail, we have outlined specific changes that will be incorporated in the revised version.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the combination 'proves highly effective' by simultaneously reducing inference latency and memory usage with minimal accuracy loss is load-bearing on the representativeness of the chosen CNN architectures, datasets, and physical edge devices. No justification or diversity analysis for these choices is supplied, leaving open whether the observed trade-offs generalize beyond the specific experimental setup.

    Authors: We agree that additional context on the selection of architectures, datasets, and devices would strengthen the abstract's central claim. In the revised manuscript, we will expand the abstract to briefly justify these choices as representative of common edge AI scenarios (e.g., lightweight CNNs for resource-constrained hardware, standard image classification benchmarks, and physical ONNX-compatible edge platforms). We will also add a short discussion of diversity considerations and generalization limits in the introduction and conclusions sections. revision: yes

  2. Referee: [Results] The abstract states quantitative outcomes, but the provided text supplies no model architectures, dataset names, device specifications, baseline comparisons, effect sizes, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.

    Authors: We acknowledge that the results presentation would benefit from greater explicitness to allow readers to fully assess the quantitative claims. Although the methodology and results sections describe the experimental setup, we will revise the results section to include a consolidated summary table listing model architectures, dataset names, device specifications, baselines, effect sizes, and statistical tests where applicable. We will also cross-reference these details more clearly from the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity; this is an empirical comparison study with no derivations or self-referential predictions.

full rationale

This is an empirical comparison paper evaluating static compression (pruning/quantization) versus dynamic early-exit mechanisms on physical edge devices via ONNX pipelines. The abstract and described content contain no equations, no fitted parameters, no predictions derived from inputs, and no load-bearing self-citations or uniqueness theorems. Central claims rest on experimental trade-off observations rather than any derivation chain that reduces to its own inputs by construction. The study is self-contained against external benchmarks, with generalizability concerns belonging to correctness rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities are present; the work is a hardware-oriented empirical comparison.

pith-pipeline@v0.9.0 · 5475 in / 1048 out tokens · 48111 ms · 2026-05-10T10:37:46.826802+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 17 canonical work pages
