LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

Kautuk Kundan; Pranay Tummalapalli; Ritam Pal; Sahil Arayakandy

arxiv: 2603.23640 · v2 · pith:E2CX6AJVnew · submitted 2026-03-24 · 💻 cs.DC · cs.LG

keywords: hardware, hailo-10h, inference, power, thermal, throughput, iphone, iterations

Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone.
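The efficiency comparison in the abstract can be checked with a few lines of arithmetic. The sketch below is not from the paper: it takes the throughput and power figures quoted above and computes tokens per joule and the throughput ratio. The abstract gives the Hailo-10H power only as "under 2 W", so 2.0 W is an assumed upper bound here and the resulting Hailo-10H efficiency should be read as a lower bound. The names `RTX_4050`, `HAILO_10H`, and `tokens_per_joule` are illustrative.

```python
# Figures quoted in the abstract. The Hailo-10H power is only stated as
# "under 2 W", so 2.0 W is an assumed upper bound, not a measured value.
RTX_4050 = {"tok_per_s": 131.7, "watts": 34.1}
HAILO_10H = {"tok_per_s": 6.9, "watts": 2.0}


def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    """Energy efficiency: (tokens / second) divided by (joules / second)."""
    return tok_per_s / watts


rtx_eff = tokens_per_joule(**RTX_4050)
hailo_eff = tokens_per_joule(**HAILO_10H)  # lower bound, since power < 2 W
throughput_ratio = RTX_4050["tok_per_s"] / HAILO_10H["tok_per_s"]

print(f"RTX 4050:  {rtx_eff:.2f} tok/J")
print(f"Hailo-10H: at least {hailo_eff:.2f} tok/J")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```

Under these assumptions the two platforms land within roughly 10 to 15 percent of each other in tokens per joule, while the throughput ratio comes out at about 19x, which is consistent with the abstract's claim of matching energy proportionality at 19x lower throughput.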

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
cs.AR 2026-04 unverdicted novelty 6.0

Benchmarking on four edge platform configurations shows hardware accelerators improve LLM inference efficiency and reveals trade-offs in power use, device size, and token throughput for constrained deployments.