Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
Pith reviewed 2026-05-08 09:19 UTC · model grok-4.3
The pith
Hardware accelerators on single-board computers improve LLM inference, and the paper quantifies the resulting trade-offs among token throughput, power efficiency, and device size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a multi-dimensional benchmarking methodology, applied to four IoT-suitable edge platform configurations, demonstrates the benefits of hardware accelerators such as NPUs and GPUs for LLM inference on single-board computers. It quantifies trade-offs among power efficiency, physical device size, and token throughput, thereby providing practical guidance for deploying generative AI in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and portable, ruggedised operations.
What carries the argument
A multi-dimensional benchmarking methodology that jointly evaluates inference performance, power efficiency, physical size, and token throughput across edge platforms with hardware accelerators.
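As a rough illustration of how these dimensions can be combined, the sketch below derives two efficiency figures from raw measurements: tokens per joule (tokens generated divided by energy consumed) and throughput per cubic centimetre. The class name, field names, and sample numbers are hypothetical placeholders chosen for easy arithmetic; they are not definitions or values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One benchmark run on one platform (all fields are illustrative)."""
    platform: str
    tokens_generated: int   # tokens produced during the run
    duration_s: float       # wall-clock generation time in seconds
    avg_power_w: float      # mean power draw during the run in watts
    volume_cm3: float       # physical volume of the device

    @property
    def tokens_per_second(self) -> float:
        return self.tokens_generated / self.duration_s

    @property
    def tokens_per_joule(self) -> float:
        # Energy in joules is average power (W) multiplied by time (s).
        return self.tokens_generated / (self.avg_power_w * self.duration_s)

    @property
    def throughput_per_cm3(self) -> float:
        return self.tokens_per_second / self.volume_cm3


# Placeholder numbers, not measured data.
run = Measurement("example-sbc", tokens_generated=600, duration_s=60.0,
                  avg_power_w=5.0, volume_cm3=100.0)
print(run.tokens_per_second, run.tokens_per_joule, run.throughput_per_cm3)
```

Keeping the raw quantities (tokens, seconds, watts, volume) and deriving ratios from them makes it easier to add further dimensions later without re-running the measurements.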
If this is right
- Hardware accelerators such as NPUs and GPUs increase token throughput relative to CPU-only inference on the tested platforms.
- Joint measurement of power, size, and speed allows concrete selection of hardware for given deployment constraints.
- The resulting data supports local LLM use in unmanned vehicles and portable rugged operations where cloud access is restricted.
- Local inference reduces data transmission, lowering both latency and privacy exposure compared with cloud-centric approaches.
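One way to turn joint measurements into a hardware choice is to keep only the Pareto-optimal platforms and then filter them by a deployment budget. The sketch below assumes three criteria (throughput to maximise, power and volume to minimise); the platform names and figures are invented for illustration, and the paper may use a different selection procedure.

```python
from typing import NamedTuple, Optional


class Platform(NamedTuple):
    name: str
    tokens_per_s: float   # higher is better
    power_w: float        # lower is better
    volume_cm3: float     # lower is better


def dominates(a: Platform, b: Platform) -> bool:
    """True if a is no worse than b on every criterion and better on one."""
    no_worse = (a.tokens_per_s >= b.tokens_per_s
                and a.power_w <= b.power_w
                and a.volume_cm3 <= b.volume_cm3)
    strictly_better = (a.tokens_per_s > b.tokens_per_s
                       or a.power_w < b.power_w
                       or a.volume_cm3 < b.volume_cm3)
    return no_worse and strictly_better


def pareto_front(candidates: list[Platform]) -> list[Platform]:
    """Drop every platform that some other platform dominates."""
    return [p for p in candidates
            if not any(dominates(q, p) for q in candidates)]


def select(candidates: list[Platform], max_power_w: float,
           max_volume_cm3: float) -> Optional[Platform]:
    """Fastest Pareto-optimal platform within the budget, or None."""
    options = [p for p in pareto_front(candidates)
               if p.power_w <= max_power_w and p.volume_cm3 <= max_volume_cm3]
    return max(options, key=lambda p: p.tokens_per_s, default=None)


# Hypothetical platforms with made-up figures, not results from the paper.
platforms = [
    Platform("cpu-only", 4.0, 6.0, 90.0),
    Platform("npu-board", 12.0, 8.0, 120.0),
    Platform("gpu-board", 30.0, 20.0, 400.0),
    Platform("older-npu", 10.0, 9.0, 130.0),  # dominated by npu-board
]
front = pareto_front(platforms)
choice = select(platforms, max_power_w=10.0, max_volume_cm3=200.0)
print([p.name for p in front], choice.name if choice else None)
```

A tighter budget simply shrinks the feasible set; when nothing fits, `select` returns `None` rather than a compromise candidate.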
Where Pith is reading between the lines
- The same benchmarking approach could be reused on newer accelerator chips or different model sizes to keep recommendations current.
- Adding metrics such as sustained operation under battery limits or thermal constraints would strengthen applicability to field deployments.
- The quantified trade-offs suggest a path toward standardized test suites for edge LLM hardware in industrial and defence contexts.
Load-bearing premise
The four selected IoT-suitable edge platform configurations and the chosen evaluation tasks adequately represent real-world operational technology and defence use cases.
What would settle it
Repeating the benchmarks on additional single-board computers, or with tasks drawn directly from unmanned vehicle operations, would settle it: finding either no consistent throughput gains from accelerators or markedly different trade-off patterns would undermine the offered guidance.
Original abstract
Large language models (LLMs) are becoming increasingly capable at small parameter scales. At the same time, conventional cloud-centric deployment introduces challenges around data privacy, latency, and cost that are acute in operational technology and defence environments. Advances in model distillation, quantisation, and affordable edge accelerators now make local LLM inference on single-board computers feasible, but the high dimensionality of the configuration space makes identifying optimal deployments difficult without structured evaluation. Existing LLM-specific edge benchmarking efforts rely on CPU-only inference, poor coverage of genuine single-board computers, and generic evaluation tasks that lack multi-dimensional assessment of hardware effectiveness. This paper proposes a multi-dimensional benchmarking methodology that jointly evaluates inference performance and hardware efficiency across four IoT-suitable edge platform configurations testing single-board computers with the latest available hardware accelerators. Our results reveal the benefits of using hardware accelerators such as NPUs and GPUs, along with multi-dimensional evaluations quantifying the trade-offs between power efficiency, physical device size and token throughput; offering practical guidance for deploying generative AI in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and portable, ruggedised operations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a multi-dimensional benchmarking methodology for LLM inference on four IoT-suitable single-board computer configurations equipped with hardware accelerators (NPUs and GPUs). It evaluates trade-offs among token throughput, power efficiency, and physical device size, claiming that the results demonstrate benefits of accelerators and provide practical guidance for deploying generative AI in privacy-sensitive, connectivity-limited environments such as unmanned vehicles and ruggedised operations.
Significance. If the measurements are reproducible and the platforms/tasks representative, the work would address a genuine gap in edge LLM benchmarking by extending beyond CPU-only studies and supplying joint performance-efficiency metrics. The emphasis on multi-dimensional evaluation is a positive contribution that could inform hardware selection for constrained deployments.
major comments (1)
- Abstract: The headline claim that the benchmarking 'offer[s] practical guidance for deploying generative AI in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and portable, ruggedised operations' is load-bearing on the assumption that the four chosen edge platforms and evaluation tasks capture the relevant constraints (power budgets, thermal envelopes, form-factor limits, and realistic workloads) of those target domains. The manuscript supplies no explicit selection criteria for the platforms, no mapping of their specifications to OT/defence requirements, and no domain-specific task suite, leaving the generalisation unsupported.
minor comments (1)
- Abstract: The summary of results mentions benefits and trade-offs but does not list concrete metrics, error bars, or exclusion criteria, reducing immediate verifiability of the empirical claims.
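To make the request for error bars and exclusion criteria concrete, the sketch below summarises repeated throughput runs as a mean, a standard error, and an approximate 95% interval, after discarding a stated number of warm-up runs. The function name, the warm-up rule, and the sample values are assumptions for illustration; the interval uses a normal approximation, which is loose for very small run counts.

```python
import math
import statistics


def summarise(samples: list[float], warmup: int = 0,
              confidence: float = 0.95) -> dict:
    """Mean, standard error, and normal-approximation confidence interval.

    The first `warmup` samples are excluded (an explicit exclusion rule).
    """
    kept = samples[warmup:]
    if len(kept) < 2:
        raise ValueError("need at least two samples after warm-up")
    mean = statistics.mean(kept)
    stderr = statistics.stdev(kept) / math.sqrt(len(kept))
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "n": len(kept),
        "mean": mean,
        "stderr": stderr,
        "ci_low": mean - z * stderr,
        "ci_high": mean + z * stderr,
    }


# Made-up tokens-per-second readings; the first is treated as a warm-up run.
readings = [5.0, 10.0, 12.0, 11.0, 9.0, 13.0]
summary = summarise(readings, warmup=1)
print(summary)
```

Reporting `n` and the warm-up count alongside the interval lets a reader see exactly which runs were excluded.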
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the single major comment below, acknowledging where clarification and revision are warranted to strengthen the manuscript.
Referee: Abstract: The headline claim that the benchmarking 'offer[s] practical guidance for deploying generative AI in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and portable, ruggedised operations' is load-bearing on the assumption that the four chosen edge platforms and evaluation tasks capture the relevant constraints (power budgets, thermal envelopes, form-factor limits, and realistic workloads) of those target domains. The manuscript supplies no explicit selection criteria for the platforms, no mapping of their specifications to OT/defence requirements, and no domain-specific task suite, leaving the generalisation unsupported.
Authors: We agree that the abstract claim would benefit from stronger grounding. The four platforms were chosen as commercially available single-board computers equipped with NPUs or GPUs that are representative of current edge hardware used in IoT and constrained deployments; the tasks consist of standard LLM inference workloads measuring token throughput under varying batch sizes and quantisation levels. However, the manuscript does not include explicit selection criteria or a direct mapping of specifications (e.g., power envelopes or thermal limits) to OT/defence use cases. We will revise the abstract to qualify the scope of the practical guidance and add a short subsection in the methodology explaining the platform selection rationale together with references to typical power budgets and form-factor constraints encountered in unmanned and ruggedised settings. This will make the generalisation explicit and supported by the text.
Revision: yes
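A sweep over batch sizes and quantisation levels of the kind the rebuttal describes could be organised roughly as follows. This is a minimal sketch: `fake_generate` is a stand-in that returns a made-up token count, the quantisation labels are hypothetical, and a real harness would call an actual inference runtime and record power alongside time.

```python
import itertools
import time
from typing import Callable


def sweep(batch_sizes: list[int], quant_levels: list[str],
          generate: Callable[[int, str], int]) -> list[dict]:
    """Time generate() for every (batch size, quantisation) combination.

    `generate` must return the number of tokens it produced.
    """
    rows = []
    for batch, quant in itertools.product(batch_sizes, quant_levels):
        start = time.perf_counter()
        tokens = generate(batch, quant)
        elapsed = time.perf_counter() - start
        rows.append({
            "batch_size": batch,
            "quant": quant,
            "tokens": tokens,
            "seconds": elapsed,
            # Guard against a zero reading from a trivially fast stub.
            "tokens_per_s": tokens / elapsed if elapsed > 0 else float("inf"),
        })
    return rows


def fake_generate(batch_size: int, quant: str) -> int:
    """Hypothetical placeholder for a real inference call."""
    return 16 * batch_size


results = sweep([1, 4], ["q4", "q8"], fake_generate)
for row in results:
    print(row["batch_size"], row["quant"], row["tokens"])
```

Passing the inference call in as a function keeps the sweep logic independent of any particular runtime or accelerator backend.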
Circularity Check
Empirical benchmarking study with direct measurements; no derivations or self-referential predictions
The paper is a hardware benchmarking study that reports direct measurements of token throughput, power efficiency, and device size on four single-board computer configurations. No equations, fitted parameters, predictions, or first-principles derivations are present in the abstract or described methodology. Claims rest on observed experimental data rather than any reduction to inputs by construction. Existing benchmarking efforts are cited only for context, not as load-bearing self-citations. The representativeness concern raised by the referee is a question of external validity, not circularity in the derivation chain.