Recognition: unknown
A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits
Pith reviewed 2026-05-10 10:37 UTC · model grok-4.3
The pith
Static compression and early-exit mechanisms offer different trade-offs on edge devices, with their combination reducing latency and memory while preserving most accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.
What carries the argument
A unified comparison of static compression (pruning and quantization) and dynamic early-exit mechanisms, evaluated through ONNX-based inference pipelines on physical edge devices.
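For readers unfamiliar with this kind of measurement machinery, a minimal sketch of an ONNX-based latency pipeline follows; the model path, input shape, execution provider, and run counts are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a deployment-style ONNX inference timing loop (assumed setup:
# a "model.onnx" file with a single NCHW image input).
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Dummy batch shaped like a typical CNN classification input.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm-up runs so one-time allocation costs do not pollute the timings.
for _ in range(10):
    session.run(None, {input_name: x})

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    session.run(None, {input_name: x})
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

print(f"median latency: {np.median(latencies_ms):.2f} ms")
```

On-device loops in this style capture runtime overheads that FLOP counts alone miss, which is the point of evaluating on physical hardware.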
If this is right
- Pruning and quantization deliver consistent reductions in model memory footprint.
- Early-exit mechanisms provide input-dependent savings in computation that static methods cannot achieve (a minimal sketch of the mechanism follows this list).
- The combination of both approaches reduces inference latency and memory usage simultaneously with only minimal accuracy loss.
- This hybrid strategy expands the range of feasible CNN deployments on resource-constrained edge hardware.
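To make the input-adaptive mechanism concrete, here is a minimal early-exit forward pass in PyTorch, in the spirit of the BranchyNet-style multi-exit CNNs the paper builds on; the two-exit layout, layer sizes, and confidence threshold are illustrative assumptions, not the paper's models.

```python
# Sketch of a two-exit CNN: easy inputs return from the cheap early head
# and never pay for the deeper stage. Assumes batch size 1, as is typical
# for edge inference; a batched version would need per-sample routing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoExitCNN(nn.Module):
    def __init__(self, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.threshold = threshold
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(8)
        )
        self.exit1 = nn.Linear(16 * 8 * 8, num_classes)  # cheap early head
        self.stage2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4)
        )
        self.exit2 = nn.Linear(32 * 4 * 4, num_classes)  # full-depth head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.stage1(x)
        logits1 = self.exit1(h.flatten(1))
        # Input-dependent saving: skip stage2 when the early head is confident.
        if F.softmax(logits1, dim=1).max() >= self.threshold:
            return logits1
        return self.exit2(self.stage2(h).flatten(1))
```

Note that in this sketch the memory footprint is unchanged (both heads stay resident), which is why early exits pair naturally with static compression rather than substituting for it.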
Where Pith is reading between the lines
- Edge AI system designers may achieve tighter resource budgets by prioritizing hybrid static-dynamic optimizations from the start.
- Future implementations could tie quantization levels to specific exit points to gain additional efficiency (sketched after this list).
- The approach suggests potential reductions in energy use for battery-powered or IoT devices beyond the tested cases.
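The exit-specific quantization idea above could be expressed as a configuration object; the sketch below is purely hypothetical, with exit names and bit-widths invented for illustration rather than taken from the paper.

```python
# Hypothetical per-exit quantization plan: more aggressive bit-widths on the
# cheap early exit, conservative ones on the final head.
EXIT_QUANT_PLAN = {
    "exit1": {"weight_bits": 4, "activation_bits": 8},  # early head: aggressive
    "exit2": {"weight_bits": 8, "activation_bits": 8},  # final head: conservative
}


def bits_for_exit(exit_name: str) -> dict:
    """Return the (hypothetical) quantization recipe for a given exit head."""
    return EXIT_QUANT_PLAN[exit_name]
```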
Load-bearing premise
The specific CNN models, datasets, and edge devices tested represent broader edge AI workloads, and ONNX inference pipelines capture all relevant runtime overheads without hidden platform-specific effects.
What would settle it
Applying the combined optimizations to a different CNN architecture or edge device and observing no simultaneous reduction in latency and memory beyond what either technique achieves alone.
read the original abstract
Deploying deep neural networks on edge devices requires balancing accuracy, latency, and resource constraints under realistic execution conditions. To fit models within these constraints, two broad strategies have emerged: static compression techniques such as pruning and quantization, which permanently reduce model size, and dynamic approaches such as early-exit mechanisms, which adapt computational cost at runtime. While both families are widely studied in isolation, they are rarely compared under identical conditions on physical hardware. This paper presents a unified deployment-oriented comparison of static compression and dynamic early-exit mechanisms, evaluated on real edge devices using ONNX-based inference pipelines. Our results show that static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.
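As a concrete illustration of the static half of this comparison, a post-training dynamic-range quantization pass with onnxruntime's quantization tooling might look like the sketch below; the file names are placeholders, and the paper's own tooling may differ (its references include Brevitas for quantization-aware training).

```python
# Sketch: compress an exported FP32 ONNX model by storing weights as INT8.
# This permanently shrinks the model file, the hallmark of static compression.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",   # placeholder: full-precision export
    model_output="model_int8.onnx",  # placeholder: quantized artifact
    weight_type=QuantType.QInt8,
)
```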
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript performs a comparative empirical study of static compression techniques (pruning and quantization) versus dynamic early-exit mechanisms for CNN optimization on edge devices. It evaluates both families and their combination on physical hardware via ONNX inference pipelines, reporting that static methods provide consistent memory reduction while early exits enable input-adaptive latency savings, and that the hybrid approach simultaneously reduces latency and memory footprint with only minimal accuracy loss.
Significance. If the reported trade-offs hold under broader conditions, the work supplies practical guidance on hybrid static-dynamic optimization for edge deployment, a topic of direct engineering relevance. The use of real hardware and ONNX pipelines is a positive methodological choice that moves beyond simulation-only evaluations.
major comments (2)
- [Abstract] The central claim that the combination 'proves highly effective' by simultaneously reducing inference latency and memory usage with minimal accuracy loss rests on the representativeness of the chosen CNN architectures, datasets, and physical edge devices. No justification or diversity analysis for these choices is supplied, leaving open whether the observed trade-offs generalize beyond the specific experimental setup.
- [Results] The abstract asserts quantitative outcomes, but the provided text supplies no model architectures, dataset names, device specifications, baseline comparisons, effect sizes, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.
minor comments (2)
- [Methods] Specify the exact ONNX runtime version, execution providers, and any platform-specific scheduling or memory-hierarchy interactions that could affect early-exit branching overhead (see the sketch after these comments).
- [Abstract] Define 'minimal accuracy loss' quantitatively (e.g., an absolute or relative accuracy-drop threshold) rather than qualitatively.
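One lightweight way to address the first minor comment is to record the runtime version and pin the execution providers explicitly when building the session, rather than relying on device-specific defaults. A minimal sketch (the model path is a placeholder):

```python
# Log the runtime environment and pin execution providers for reproducibility.
import onnxruntime as ort

print("onnxruntime version:", ort.__version__)
print("available providers:", ort.get_available_providers())

session = ort.InferenceSession(
    "model.onnx",                        # placeholder path
    providers=["CPUExecutionProvider"],  # explicit, reproducible choice
)
print("session providers:", session.get_providers())
```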
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. Where revisions are warranted to improve clarity and address concerns about experimental context and detail, we have outlined specific changes that will be incorporated in the revised version.
read point-by-point responses
-
Referee: [Abstract] The central claim that the combination 'proves highly effective' by simultaneously reducing inference latency and memory usage with minimal accuracy loss rests on the representativeness of the chosen CNN architectures, datasets, and physical edge devices. No justification or diversity analysis for these choices is supplied, leaving open whether the observed trade-offs generalize beyond the specific experimental setup.
Authors: We agree that additional context on the selection of architectures, datasets, and devices would strengthen the abstract's central claim. In the revised manuscript, we will expand the abstract to briefly justify these choices as representative of common edge AI scenarios (e.g., lightweight CNNs for resource-constrained hardware, standard image classification benchmarks, and physical ONNX-compatible edge platforms). We will also add a short discussion of diversity considerations and generalization limits in the introduction and conclusions sections. revision: yes
-
Referee: [Results] The abstract asserts quantitative outcomes, but the provided text supplies no model architectures, dataset names, device specifications, baseline comparisons, effect sizes, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.
Authors: We acknowledge that the results presentation would benefit from greater explicitness to allow readers to fully assess the quantitative claims. Although the methodology and results sections describe the experimental setup, we will revise the results section to include a consolidated summary table listing model architectures, dataset names, device specifications, baselines, effect sizes, and statistical tests where applicable. We will also cross-reference these details more clearly from the abstract. revision: yes
Circularity Check
No significant circularity; empirical comparison study with no derivations or self-referential predictions
full rationale
This is an empirical comparison paper evaluating static compression (pruning/quantization) versus dynamic early-exit mechanisms on physical edge devices via ONNX pipelines. The abstract and described content contain no equations, no fitted parameters, no predictions derived from inputs, and no load-bearing self-citations or uniqueness theorems. Central claims rest on experimental trade-off observations rather than any derivation chain that reduces to its own inputs by construction. The study is self-contained against external benchmarks, with generalizability concerns belonging to correctness rather than circularity.
Reference graph
Works this paper leans on
-
[1]
AI-powered IoT: A survey on integrating artificial intelligence with IoT for enhanced security, efficiency, and smart applications,
V. M. U, V. Babu Kumaravelu, V. K. C, R. A, S. Chinnadurai, R. Venkatesan, H. Hai, and P. Selvaprabhu, “AI-powered IoT: A survey on integrating artificial intelligence with IoT for enhanced security, efficiency, and smart applications,” IEEE Access, vol. 13, pp. 50296–50339, 2025
2025
-
[2]
Integration of deep learning into the iot: A survey of techniques and challenges for real-world applications,
A. Elhanashi, P. Dini, S. Saponara, and Q. Zheng, “Integration of deep learning into the iot: A survey of techniques and challenges for real-world applications,” Electronics, 2023
2023
-
[3]
Deep learning on computational-resource-limited platforms: A survey,
C. Chen, P. Zhang, H. Zhang, J. Dai, Y. Yi, H. Zhang, and Y. Zhang, “Deep learning on computational-resource-limited platforms: A survey,” Mob. Inf. Syst., vol. 2020, pp. 8454327:1–8454327:19, 2020
2020
-
[4]
Chapter eight - energy-efficient deep learning inference on edge devices,
F. Daghero, D. J. Pagliari, and M. Poncino, “Chapter eight - energy-efficient deep learning inference on edge devices,” in Hardware Accelerator Systems for Artificial Intelligence and Machine Learning, ser. Advances in Computers, S. Kim and G. C. Deka, Eds. Elsevier, 2021, vol. 122, pp. 247–301. [Online]. Available: https://www.sciencedirect.com/science/ar...
2021
-
[5]
EdgeAI: A vision for deep learning in IoT era,
K. Bhardwaj, N. Suda, and R. Marculescu, “EdgeAI: A vision for deep learning in IoT era,” CoRR, vol. abs/1910.10356, 2019. [Online]. Available: http://arxiv.org/abs/1910.10356
2019
-
[6]
Advancements in accelerating deep neural network inference on aiot devices: A survey,
L. Cheng, Y. Gu, Q. Liu, L. Yang, C. Liu, and Y. Wang, “Advancements in accelerating deep neural network inference on aiot devices: A survey,” IEEE Transactions on Sustainable Computing, vol. 9, no. 6, pp. 830–847, 2024
2024
-
[7]
The emergence of edge computing,
M. Satyanarayanan, “The emergence of edge computing,” Computer, vol. 50, no. 1, pp. 30–39, 2017
2017
-
[8]
Empowering edge intelligence: A comprehensive survey on on-device ai models,
X. Wang, Z. Tang, J. Guo, T. Meng, C. Wang, T. Wang, and W. Jia, “Empowering edge intelligence: A comprehensive survey on on-device ai models,” ACM Comput. Surv., vol. 57, no. 9, Apr. 2025. [Online]. Available: https://doi.org/10.1145/3724420
2025
-
[9]
Edge intelligence: Paving the last mile of artificial intelligence with edge computing,
Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019
2019
-
[10]
Edge intelligence unleashed: a survey on deploying large language models in resource-constrained environments,
S. Semerikov, T. Vakaliuk, O. Kanevska, O. Ostroushko, and A. O. Kolhatin, “Edge intelligence unleashed: a survey on deploying large language models in resource-constrained environments,” J. Edge Comput., vol. 4, pp. 179–233, 2025
2025
-
[11]
The internet of things, fog and cloud continuum: Integration and challenges,
L. Bittencourt, R. Immich, R. Sakellariou, N. Fonseca, E. Madeira, M. Curado, L. Villas, L. DaSilva, C. Lee, and O. Rana, “The internet of things, fog and cloud continuum: Integration and challenges,” Internet of Things, vol. 3-4, pp. 134–155, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2542660518300635
2018
-
[12]
Embedded artificial intelligence: A comprehensive literature review,
X. Huang, H. Wang, S. Qin, and S.-K. Tang, “Embedded artificial intelligence: A comprehensive literature review,” Electronics, vol. 14, no. 17, 2025. [Online]. Available: https://www.mdpi.com/2079-9292/14/17/3468
2025
-
[13]
Usdc: Unified static and dynamic compression for visual transformer,
H. Yuan, C. Liao, J. Tan, P. Yao, J. Jia, B. Chen, C. Song, and D. Zhang, “Usdc: Unified static and dynamic compression for visual transformer,” arXiv preprint arXiv:2310.11117, 2023
2023
-
[14]
Dynamic neural networks: A survey,
Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, “Dynamic neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7436–7456, 2022
2022
-
[15]
Optimization Approaches for Distributed AI Models on Edge Devices,
N. Fernandez, A. Amurrio, and S. Van Vaerenbergh, “Optimization Approaches for Distributed AI Models on Edge Devices,” in Novel Deep Learning Methodologies in Industrial and Applied Mathematics, S. Xambó-Descamps, Ed. Springer Nature, 2025, proceedings of the ICIAM MS 02515 organized within the ICIAM-2023 Conference, in press
2025
-
[16]
Shallow-deep networks: Understanding and mitigating network overthinking,
Y. Kaya and T. Dumitras, “How to stop off-the-shelf deep neural networks from overthinking,” CoRR, vol. abs/1810.07052, 2018. [Online]. Available: http://arxiv.org/abs/1810.07052
2018
-
[17]
Pruning techniques for artificial intelligence networks: a deeper look at their engineering design and bias: the first review of its kind,
L. Mohanty, A. Kumar, V. Mehta, M. Agarwal, and J. S. Suri, “Pruning techniques for artificial intelligence networks: a deeper look at their engineering design and bias: the first review of its kind,” Multimedia Tools and Applications, vol. 84, no. 11, pp. 9591–9665, 2025
2025
-
[18]
Learning efficient convolutional networks through network slimming,
Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:5993328
2017
-
[19]
A comprehensive review of model compression techniques in machine learning,
P. V. Dantas, W. Sabino da Silva, L. C. Cordeiro, and C. B. Carvalho, “A comprehensive review of model compression techniques in machine learning,” Applied Intelligence, vol. 54, no. 22, pp. 11804–11844, Nov. 2024
2024
-
[20]
Pruning vs quantization: Which is better?
A. Kuzmin, M. Nagel, M. Van Baalen, A. Behboodi, and T. Blankevoort, “Pruning vs quantization: Which is better?” in Advances in Neural Information Processing Systems, 2024
2024
-
[21]
Sparsert: Accelerating unstructured sparsity on gpus for deep learning inference,
Z. Wang, “Sparsert: Accelerating unstructured sparsity on gpus for deep learning inference,” in Proceedings of the ACM international conference on parallel architectures and compilation techniques, 2020, pp. 31–42
2020
-
[22]
Balanced sparsity for efficient dnn inference on gpu,
Z. Yao, S. Cao, W. Xiao, C. Zhang, and L. Nie, “Balanced sparsity for efficient dnn inference on gpu,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 5676–5683
2019
-
[23]
Pruning filters with l1-norm and capped l1-norm for cnn compression,
A. Kumar, A. M. Shaikh, Y. Li, H. Bilal, and B. Yin, “Pruning filters with l1-norm and capped l1-norm for cnn compression,” Applied Intelligence, vol. 51, no. 2, pp. 1152–1160, 2021
2021
-
[24]
Lightprune: Latency-aware structured pruning for efficient deep inference on embedded devices,
A. Belhadi, Y. Djenouri, and A. N. Belbachir, “Lightprune: Latency-aware structured pruning for efficient deep inference on embedded devices,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 1688–1697
2025
-
[25]
Efficient LLMs for Edge Devices: Pruning, Quantization, and Distillation Techniques,
R. Agrawal, H. Kumar, and S. R. Lnu, “Efficient LLMs for Edge Devices: Pruning, Quantization, and Distillation Techniques,” in 2025 International Conference on Machine Learning and Autonomous Systems (ICMLAS), Mar. 2025, pp. 1413–1418
2025
-
[26]
Quantunev2: Compiler-based local metric-driven mixed precision quantization for practical embedded ai applications,
J. Kim, J. Lee, Y. Kwon, and D. Kim, “Quantunev2: Compiler-based local metric-driven mixed precision quantization for practical embedded ai applications,” Future Generation Computer Systems, vol. 166, p. 107718, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X25000135
2025
-
[27]
Quantizing deep convolutional networks for efficient inference: A whitepaper
R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018
2018
-
[28]
Branchynet: Fast inference via early exiting from deep neural networks,
S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” in 2016 23rd international conference on pattern recognition (ICPR). IEEE, 2016, pp. 2464–2469
2016
-
[29]
A comparative analysis of compression and transfer learning techniques in deepfake detection models,
A. Karathanasis, J. Violos, and I. Kompatsiaris, “A comparative analysis of compression and transfer learning techniques in deepfake detection models,” Mathematics, vol. 13, no. 5, 2025. [Online]. Available: https://www.mdpi.com/2227-7390/13/5/887
2025
-
[30]
Optimized convolutional neural network at the IoT edge for image detection using pruning and quantization,
S. Naveen and M. R. Kounte, “Optimized convolutional neural network at the IoT edge for image detection using pruning and quantization,” Multim. Tools Appl., vol. 84, pp. 5435–5455, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:275070866
2024
-
[31]
Edge ai: Evaluation of model compression techniques for convolutional neural networks,
S. Francy and R. Singh, “Edge ai: Evaluation of model compression techniques for convolutional neural networks,” 2024. [Online]. Available: https://arxiv.org/abs/2409.02134
2024
-
[32]
Model compression for deep neural networks: A survey,
Z. Li, H. Li, and L. Meng, “Model compression for deep neural networks: A survey,” Computers, vol. 12, no. 3, 2023. [Online]. Available: https://www.mdpi.com/2073-431X/12/3/60
2023
-
[33]
A comparative study of preprocessing and model compression techniques in deep learning for forest sound classification,
T. Paranayapa, P. Ranasinghe, D. Ranmal, D. Meedeniya, and C. Perera, “A comparative study of preprocessing and model compression techniques in deep learning for forest sound classification,” Sensors, vol. 24, no. 4, p. 1149, Feb. 2024
2024
-
[34]
Iot-edge splitting with pruned early-exit cnns for adaptive inference,
G. Korol and A. C. S. Beck, “Iot-edge splitting with pruned early-exit cnns for adaptive inference,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 9, pp. 2382–2394, 2025
2025
-
[35]
Pruning and early-exit co-optimization for cnn acceleration on fpgas,
G. Korol, M. G. Jordan, M. B. Rutzig, J. Castrillon, and A. C. S. Beck, “Pruning and early-exit co-optimization for cnn acceleration on fpgas,” in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2023, pp. 1–6
2023
-
[36]
Mcqueen: Mixed precision quantization of early exit networks
U. Saxena and K. Roy, “Mcqueen: Mixed precision quantization of early exit networks,” in BMVC, 2023, pp. 511–513
2023
-
[37]
Recent trends in edge AI: Efficient design, training and deployment of machine learning models,
M. Deutel, M. Mallah, J. Wissing, and S. Scheele, “Recent trends in edge AI: Efficient design, training and deployment of machine learning models,” in Charting the Intelligence Frontiers–Edge AI Systems Nexus. River Publishers, 2026, pp. 181–220
2026
-
[38]
Polythrottle: Energy-efficient neural network inference on edge devices,
M. Yan, H. Wang, and S. Venkataraman, “Polythrottle: Energy-efficient neural network inference on edge devices,” 2023
2023
-
[39]
SUQ-3: A three stage coarse-to-fine compression framework for sustainable edge AI in smart farming,
T. Vaiyapuri and H. Aldosari, “SUQ-3: A three stage coarse-to-fine compression framework for sustainable edge AI in smart farming,” Sustainability, vol. 17, no. 12, p. 5230, Jun. 2025
2025
-
[40]
Efficient hardware implementation of cellular neural networks with incremental quantization and early exit,
X. Xu, Q. Lu, T. Wang, Y. Hu, C. Zhuo, J. Liu, and Y. Shi, “Efficient hardware implementation of cellular neural networks with incremental quantization and early exit,” J. Emerg. Technol. Comput. Syst., vol. 14, no. 4, Dec. 2018
2018
-
[41]
Skipnet: Learning dynamic routing in convolutional networks,
X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez, “Skipnet: Learning dynamic routing in convolutional networks,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018
2018
-
[42]
Dynamicvit: efficient vision transformers with dynamic token sparsification,
Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “Dynamicvit: efficient vision transformers with dynamic token sparsification,” in Proceedings of the 35th International Conference on Neural Information Processing Systems, ser. NIPS ’21. Red Hook, NY, USA: Curran Associates Inc., 2021
2021
-
[43]
Performance aware convolutional neural network channel pruning for embedded gpus,
V. Radu, K. Kaszyk, Y. Wen, J. Turner, J. Cano, E. J. Crowley, B. Franke, A. Storkey, and M. O’Boyle, “Performance aware convolutional neural network channel pruning for embedded gpus,” in 2019 IEEE International Symposium on Workload Characterization (IISWC), 2019, pp. 24–34
2019
-
[44]
Latency-aware automatic cnn channel pruning with gpu runtime analysis,
J. Liu, J. Sun, Z. Xu, and G. Sun, “Latency-aware automatic cnn channel pruning with gpu runtime analysis,” BenchCouncil Transactions on Benchmarks, Standards and Evaluations, vol. 1, no. 1, p. 100009, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2772485921000090
2021
-
[45]
Streamlining speech enhancement dnns: an automated pruning method based on dependency graph with advanced regularized loss strategies,
Z. Zhao, J. Zhang, Y. Liu, J. Liu, K. Niu, and Z. He, “Streamlining speech enhancement dnns: an automated pruning method based on dependency graph with advanced regularized loss strategies,” in Proc. Interspeech 2024, 2024, pp. 662–666
2024
-
[46]
Probability-based channel pruning for depthwise separable convolutional networks,
H. L. Zhao, K. J. Shi, X. G. Jin et al., “Probability-based channel pruning for depthwise separable convolutional networks,” Journal of Computer Science and Technology, vol. 37, no. 3, pp. 584–600, 2022. [Online]. Available: https://doi.org/10.1007/s11390-022-2131-8
2022
-
[48]
Dynamic shuffle: An efficient channel mixture method,
K. Gong, Z. Yin, Y. Li, K. Guo, and X. Xu, “Dynamic shuffle: An efficient channel mixture method,” ArXiv, vol. abs/2310.02776, 2023
2023
-
[49]
A white paper on neural network quantization,
M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. Van Baalen, and T. Blankevoort, “A white paper on neural network quantization,” arXiv preprint arXiv:2106.08295, 2021. [Online]. Available: https://arxiv.org/abs/2106.08295
2021
-
[50]
Xilinx/Brevitas: v0.7.1,
A. Pappalardo et al., “Xilinx/Brevitas: v0.7.1,” 2021. [Online]. Available: https://github.com/Xilinx/brevitas
2021
-
[51]
How to train your multi-exit model? analyzing the impact of training strategies,
P. Kubaty, B. Wójcik, B. T. Krzepkowski, M. Michaluk, T. Trzcinski, J. Pomponi, and K. Adamczewski, “How to train your multi-exit model? analyzing the impact of training strategies,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=vhTPfOdwyQ
2025
-
[52]
Multi-exit vision transformer for dynamic inference,
A. Bakhtiarnia, Q. Zhang, and A. Iosifidis, “Multi-exit vision transformer for dynamic inference,” arXiv preprint arXiv:2106.15183, 2021
2021
-
[53]
T-recx: Tiny-resource efficient convolutional neural networks with early-exit,
N. P. Ghanathe and S. Wilton, “T-recx: Tiny-resource efficient convolutional neural networks with early-exit,” in Proceedings of the 20th ACM International Conference on Computing Frontiers, ser. CF ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 123–133. [Online]. Available: https://doi.org/10.1145/3587135.3592204
2023
-
[54]
torch.cond,
PyTorch Team, “torch.cond,” PyTorch Documentation, n.d. [Online]. Available: https://docs.pytorch.org/docs/stable/generated/torch.cond.html
-
[55]
Interoperability in deep learning: A user survey and failure analysis of onnx model converters,
P. Jajal, W. Jiang, A. Tewari, E. Kocinare, J. Woo, A. Sarraf, Y.-H. Lu, G. K. Thiruvathukal, and J. C. Davis, “Interoperability in deep learning: A user survey and failure analysis of onnx model converters,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2024. New York, NY, USA: Association for ...
2024
-
[56]
DEEP-CWS: Distilling efficient pre-trained models with early exit and pruning for scalable chinese word segmentation,
X. Shiting, “DEEP-CWS: Distilling efficient pre-trained models with early exit and pruning for scalable chinese word segmentation,” Information Sciences, vol. 719, p. 122470, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0020025525006024
2025
-
[57]
Memory architecture and cuda programming on jetson orin: Differences from x86 GPUs - help docs for errors/issues on nvidia jetson dev boards,
Piveral, “Memory architecture and cuda programming on jetson orin: Differences from x86 GPUs - help docs for errors/issues on nvidia jetson dev boards,” https://nvidia-jetson.piveral.com/jetson-orin-nano/memory-architecture-and-cuda-programming-on-jetson-orin-differences-from-x86-gpus, 2024, accessed April 17, 2026
2024
-
[58]
psutil.cpu_percent — psutil 7.2.2 documentation,
psutil development team, “psutil.cpu_percent — psutil 7.2.2 documentation,” 2026. [Online]. Available: https://psutil.readthedocs.io/en/latest/#psutil.cpu_percent
2026
-
[59]
jtop.core.gpu.gpu — jetson-stats 4.5.4 api reference,
jetson-stats development team, “jtop.core.gpu.gpu — jetson-stats 4.5.4 api reference,” 2025. [Online]. Available: https://rnext.it/jetson_stats/reference/gpu.html#jtop.core.gpu.GPU
2025
-
[60]
jtop.jtop.memory — jetson-stats 4.5.4 api reference,
jetson-stats development team, “jtop.jtop.memory — jetson-stats 4.5.4 api reference,” 2025. [Online]. Available: https://rnext.it/jetson_stats/reference/jtop.html#jtop.jtop.memory
2025
-
[61]
LSQ+: improving low-bit quantization through learnable offsets and better initialization,
Y. Bhalgat, J. Lee, M. Nagel, T. Blankevoort, and N. Kwak, “LSQ+: improving low-bit quantization through learnable offsets and better initialization,” CoRR, vol. abs/2004.09576, 2020. [Online]. Available: https://arxiv.org/abs/2004.09576
2020
-
[63]
[Online]. Available: http://arxiv.org/abs/1802.10280
-
[64]
Qserve: W4a8kv4 quantization and system co-design for efficient llm serving
Y. Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,” ArXiv, vol. abs/2405.04532, 2024
2024
-
[65]
Quick: Quantization-aware interleaving and conflict-free kernel for efficient llm inference,
T. Kim, J. Lee, D. Ahn, S. Kim, J. Choi, M. Kim, and H. Kim, “Quick: Quantization-aware interleaving and conflict-free kernel for efficient llm inference,” ArXiv, vol. abs/2402.10076, 2024
2024
-
[66]
Efficient execution of quantized deep learning models: A compiler approach,
A. Jain, S. Bhattacharya, M. Masuda, V. Sharma, and Y. Wang, “Efficient execution of quantized deep learning models: A compiler approach,” 2020. [Online]. Available: https://arxiv.org/abs/2006.10226
2020