NPUsper: Eliminating Redundant Computation for Real-Time Whisper on Mobile NPUs
Pith reviewed 2026-07-02 05:55 UTC · model grok-4.3
The pith
NPUsper reduces per-word latency up to 4.84 times and power use up to 88.64 percent for real-time Whisper on mobile NPUs by skipping redundant padded computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NPUsper achieves up to 4.84x lower per-word latency, up to 33.2x lower time-to-first-token, and up to 88.64% lower average power consumption compared with baselines, while maintaining comparable transcription accuracy, by detecting hallucinated tokens online from temporal patterns in decoder cross-attention and executing autoregressive decoding as K-step chunk graphs.
What carries the argument
Online detection of hallucinated tokens from temporal patterns in decoder cross-attention combined with controlled unrolling of autoregressive decoding into K-step chunk graphs
If this is right
- Per-word latency for live transcription drops by up to 4.84 times.
- Time-to-first-token for the first output falls by up to 33.2 times.
- Average power draw during inference decreases by up to 88.64 percent.
- Transcription accuracy remains comparable to conventional padded streaming systems.
- Mobile NPU execution avoids both padding overhead and repeated KV-cache work.
Where Pith is reading between the lines
- The same cross-attention pattern check could be tested on other autoregressive audio models running on edge NPUs.
- Chunk-graph unrolling may reduce dispatch costs when applied to non-speech sequence tasks on the same hardware.
- Lower sustained power could support longer always-on listening sessions on battery-powered devices.
Load-bearing premise
Detection of hallucinated tokens from temporal patterns in decoder cross-attention is reliable enough to permit short audio inputs with minimal carryover without meaningful accuracy loss.
What would settle it
A side-by-side test on streaming audio where the hallucination detector permits short inputs but the resulting word error rate rises noticeably above the padded baseline.
Figures
read the original abstract
We present NPUsper, a live transcription system that makes Whisper efficient on mobile NPUs by eliminating redundant computation. To avoid the heavy padding used by prior streaming systems, NPUsper detects hallucinated tokens online from temporal patterns in decoder cross-attention, allowing each inference round to process short audio inputs with minimal carryover. For efficient mobile-NPU execution, we propose controlled unrolling, which executes autoregressive decoding as K-step chunk graphs, removing unnecessary KV-cache computation and reducing graph-dispatch overhead. NPUsper achieves up to 4.84x lower per-word latency, up to 33.2x lower time-to-first-token (TTFT), and up to 88.64% lower average power consumption compared with baselines, while maintaining comparable transcription accuracy. The code is available at https://github.com/npusper/NPUsper.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents NPUsper, a live transcription system for making Whisper efficient on mobile NPUs. It eliminates redundant computation by detecting hallucinated tokens online from temporal patterns in decoder cross-attention, allowing short audio inputs with minimal carryover instead of heavy padding. It further proposes controlled unrolling to execute autoregressive decoding as K-step chunk graphs, removing unnecessary KV-cache computation and reducing graph-dispatch overhead. The system reports up to 4.84x lower per-word latency, up to 33.2x lower TTFT, and up to 88.64% lower average power consumption versus baselines while maintaining comparable transcription accuracy. Reproducible code is released at https://github.com/npusper/NPUsper.
Significance. If the performance claims and detection reliability hold under rigorous testing, the work would be significant for real-time ASR deployment on mobile NPUs, as it directly targets padding and KV-cache overheads that dominate streaming inference. The combination of attention-based hallucination detection with NPU-specific graph optimizations is a practical contribution. Credit is due for releasing code, which supports reproducibility.
major comments (2)
- [Method (hallucination detection) and Experiments] The central latency and TTFT claims (abstract) rest on the assumption that hallucinated-token detection from temporal patterns in decoder cross-attention is reliable enough to truncate inputs with only minimal carry-over and no meaningful accuracy loss. The manuscript supplies no precision/recall figures, confusion matrices, or failure-mode analysis on noisy, accented, or long-context speech; without these the 4.84x and 33.2x speedups cannot be considered load-bearing.
- [Experiments / Results] Table or figure reporting the numeric gains (abstract and results section) states specific maxima but provides no description of the exact baselines, datasets, Whisper model sizes, hardware platforms, error bars, or statistical tests used. This absence prevents assessment of whether the controlled-unrolling and detection components actually deliver the claimed improvements.
minor comments (1)
- [Abstract] The abstract uses unqualified 'up to' phrasing for the three performance numbers without stating the conditions or model variants under which the maxima occur.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made.
read point-by-point responses
-
Referee: [Method (hallucination detection) and Experiments] The central latency and TTFT claims (abstract) rest on the assumption that hallucinated-token detection from temporal patterns in decoder cross-attention is reliable enough to truncate inputs with only minimal carry-over and no meaningful accuracy loss. The manuscript supplies no precision/recall figures, confusion matrices, or failure-mode analysis on noisy, accented, or long-context speech; without these the 4.84x and 33.2x speedups cannot be considered load-bearing.
Authors: We agree that explicit reliability metrics for the online hallucination detector would strengthen the manuscript. The current results show end-to-end accuracy is preserved, but dedicated precision/recall, confusion matrices, and failure-mode analysis on noisy, accented, and long-context inputs were not included. We will add these evaluations in the revised version to directly support the latency and TTFT claims. revision: yes
-
Referee: [Experiments / Results] Table or figure reporting the numeric gains (abstract and results section) states specific maxima but provides no description of the exact baselines, datasets, Whisper model sizes, hardware platforms, error bars, or statistical tests used. This absence prevents assessment of whether the controlled-unrolling and detection components actually deliver the claimed improvements.
Authors: We will expand the Experiments section with complete descriptions of the baselines, datasets, Whisper model sizes, and hardware platforms. Error bars and statistical significance tests will also be added to the reported numeric gains so that the contributions of hallucination detection and controlled unrolling can be rigorously assessed. revision: yes
Circularity Check
No circularity: empirical systems paper with no derivations or self-referential claims
full rationale
The paper presents an empirical optimization for Whisper on mobile NPUs, describing a detection heuristic for hallucinated tokens and a controlled unrolling technique. No equations, fitted parameters renamed as predictions, or derivation chains appear. Performance numbers are reported from measurements against baselines. The hallucination detection method is introduced as a practical heuristic without any claim that it is derived from first principles or reduces to its own inputs. No self-citation load-bearing steps are present. This is a standard non-circular empirical systems contribution.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, and Jian Li. Simul-whisper: Attention- guided streaming whisper with truncation detection.arXiv preprint arXiv:2406.10052,
-
[2]
InProceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 389–398,
2025
-
[3]
Danni Liu, Gerasimos Spanakis, and Jan Niehues. Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection.arXiv preprint arXiv:2005.11185,
-
[4]
Sara Papi, Marco Turchi, and Matteo Negri. Alignatt: Using attention-based audio-translation alignments as a guide for simultaneous speech translation.arXiv preprint arXiv:2305.11408,
-
[5]
Efficient execution of deep neural networks on mobile devices with npu
Tianxiang Tan and Guohong Cao. Efficient execution of deep neural networks on mobile devices with npu. InProceedings of the 20th International Conference on Information Processing in Sensor Networks (Co-Located with CPS-IoT Week 2021), pages 283–298,
2021
-
[6]
Accelerating recurrent neural network training using sequence bucketing and multi-gpu data parallelization
Viacheslav Khomenko, Oleg Shyshkov, Olga Radyvonenko, and Kostiantyn Bokhan. Accelerating recurrent neural network training using sequence bucketing and multi-gpu data parallelization. In 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), pages 100–103. IEEE,
2016
-
[7]
Why Language Models Hallucinate
10 Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate.arXiv preprint arXiv:2509.04664,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Investigation of whisper asr hallucinations induced by non-speech audio
Mateusz Bara´nski, Jan Jasi´nski, Julitta Bartolewska, Stanisław Kacprzak, Marcin Witkowski, and Konrad Kowalczyk. Investigation of whisper asr hallucinations induced by non-speech audio. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,
2025
-
[9]
Careless whisper: Speech-to-text hallucination harms
Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X Mei, Hilke Schellmann, and Mona Sloane. Careless whisper: Speech-to-text hallucination harms. InProceedings of the 2024 ACM conference on fairness, accountability, and transparency, pages 1672–1681,
2024
-
[10]
Cif: Continuous integrate-and-fire for end-to-end speech recognition
Linhao Dong and Bo Xu. Cif: Continuous integrate-and-fire for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6079–6083. IEEE,
2020
-
[11]
Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss
Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. InICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7829–7833. IEEE,
2020
-
[12]
Manjunath Kudlur, Evan King, James Wang, and Pete Warden. Moonshine v2: Ergodic streaming encoder asr for latency-critical speech applications.arXiv preprint arXiv:2602.12241,
-
[13]
WhisperRT -- Turning Whisper into a Causal Streaming Model
Monsoon Solutions, Inc. High V oltage Power Monitor. https://www.msoon.com/ high-voltage-power-monitor. Qualcomm Technologies, Inc. Qualcomm Profiler. https://www.qualcomm.com/developer/ software/qualcomm-profiler. Tomer Krichli, Bhiksha Raj, and Joseph Keshet. Carelesswhisper: Turning whisper into a causal streaming model.arXiv preprint arXiv:2508.12301,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, and Boris Ginsburg. Canary-1b-v2 & parakeet-tdt-0.6 b-v3: Efficient and high-performance models for multilingual asr and ast.arXiv preprint arXiv:2509.14128,
-
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova
Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t: Massively multilingual & multimodal machine translation.arXiv preprint arXiv:2308.11596,
-
[16]
The raw attention-difference signal is defined as δi(f) = ˜ai(f)−˜ap(i)(f),(6) where f denotes the audio frame index
12 A Hallucination Detection Analysis A.1 Filtered Attention-Difference Signal Filtered attention.For each generated token i, we compare its final-layer cross-attention with that of the most recent preceding content tokenp(i). The raw attention-difference signal is defined as δi(f) = ˜ai(f)−˜ap(i)(f),(6) where f denotes the audio frame index. To reduce sp...
2015
-
[17]
Our method
1.2 51.6 67.4 A.4 Model Selection and Generalizability Model selection.Our end-to-end system evaluation focuses on the Whisper base model. This choice reflects the memory and compute constraints of mobile devices, where the goal is to build a complete real-time on-device transcription system rather than to maximize transcription accuracy with a larger mod...
2025
-
[18]
Across decoding lengths.We also evaluate whether this trend remains consistent as the number of generated tokens changes
The X Plus results show the same overall trend as S25: the fixed full-length graph has the highest decoding overhead, the separate-graph approach reduces redundant KV-cache computation but still incurs graph-dispatch overhead, and our controlled-unrolling design achieves the lowest overall decoding overhead. Across decoding lengths.We also evaluate whethe...
2015
-
[19]
At the first failure point, the masked token is absent from the unmasked top-3 candidates; in the following steps, it is demoted below an incorrect unmasked top-1 token. D.5 Self-Attention Cache Handling Static cache interface.Controlled unrolling executes multiple autoregressive decoding steps inside one NPU graph, but the first chunk would normally star...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.