pith. sign in

arxiv: 2607.01108 · v1 · pith:OREIKYB4new · submitted 2026-07-01 · 💻 cs.SD

NPUsper: Eliminating Redundant Computation for Real-Time Whisper on Mobile NPUs

Pith reviewed 2026-07-02 05:55 UTC · model grok-4.3

classification 💻 cs.SD
keywords real-time speech transcriptionWhisper modelmobile NPUsstreaming inferencehallucination detectionKV cache optimizationpower efficiencyautoregressive decoding
0
0 comments X

The pith

NPUsper reduces per-word latency up to 4.84 times and power use up to 88.64 percent for real-time Whisper on mobile NPUs by skipping redundant padded computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NPUsper as a live transcription system that makes the Whisper model efficient on mobile NPUs. It detects hallucinated tokens online from temporal patterns in decoder cross-attention so each inference round can use short audio inputs with minimal carryover instead of heavy padding. Controlled unrolling executes autoregressive decoding as K-step chunk graphs to remove unnecessary KV-cache computation and reduce graph-dispatch overhead. These steps produce the reported gains in latency, time-to-first-token, and power while transcription accuracy stays comparable to baselines. The approach targets practical real-time use on resource-constrained mobile hardware.

Core claim

NPUsper achieves up to 4.84x lower per-word latency, up to 33.2x lower time-to-first-token, and up to 88.64% lower average power consumption compared with baselines, while maintaining comparable transcription accuracy, by detecting hallucinated tokens online from temporal patterns in decoder cross-attention and executing autoregressive decoding as K-step chunk graphs.

What carries the argument

Online detection of hallucinated tokens from temporal patterns in decoder cross-attention combined with controlled unrolling of autoregressive decoding into K-step chunk graphs

If this is right

  • Per-word latency for live transcription drops by up to 4.84 times.
  • Time-to-first-token for the first output falls by up to 33.2 times.
  • Average power draw during inference decreases by up to 88.64 percent.
  • Transcription accuracy remains comparable to conventional padded streaming systems.
  • Mobile NPU execution avoids both padding overhead and repeated KV-cache work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cross-attention pattern check could be tested on other autoregressive audio models running on edge NPUs.
  • Chunk-graph unrolling may reduce dispatch costs when applied to non-speech sequence tasks on the same hardware.
  • Lower sustained power could support longer always-on listening sessions on battery-powered devices.

Load-bearing premise

Detection of hallucinated tokens from temporal patterns in decoder cross-attention is reliable enough to permit short audio inputs with minimal carryover without meaningful accuracy loss.

What would settle it

A side-by-side test on streaming audio where the hallucination detector permits short inputs but the resulting word error rate rises noticeably above the padded baseline.

Figures

Figures reproduced from arXiv: 2607.01108 by Chengpo Yan, Hojeong Lee, Seyeon Kim, Sihyeon Lee, Suman Banerjee, Sungwon Woo.

Figure 1
Figure 1. Figure 1: Redundant computation in existing systems. Existing systems construct each Whisper [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Padding reduces hallucination and im￾proves accuracy, but increases processing over￾head. 0 100 200 300 Time (s) 0 5 10 15 Power (W) NPU ends NPU GPU (a) Power and throughput Qualcomm Separate 20 40 60 80 100 Time (ms) 51.1 41.2 86.8 76.9 Execution Overhead (b) Decoding latency breakdown on NPU [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: NPUsper overview. preserving transcription accuracy. To realize this opportunity, we analyze cross-attention patterns to detect hallucinations and avoid unnecessary computation in NPUsper, detailed in § 4.3. 3.2 Dynamic Speech Processing on Static-graph NPUs NPU execution mismatch. NPUs offer high inference throughput and energy efficiency, making them attractive for on-device live transcription systems. A… view at source ↗
Figure 5
Figure 5. Figure 5: Layer-wise cross-attention dynamics in Whisper over time for selected tokens from the [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of decoding strategies on NPUs. Fixed full-length graphs reuse the same graph [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-word latency and WER comparison. NPUsper achieves the lowest per-word latency on both devices with comparable transcription accuracy. Note that a 1 pp WER gap corresponds to only one additional word error per 100 reference words. 0 10 20 30 Whisper input (s) Ours-N Ours SW SS WF WS 3.7 3.7 30 30 13.8 30 (a) Average Whisper input length 0 500 1000 1500 2000 Time per iteration (ms) Ours-N Ours SW SS WF W… view at source ↗
Figure 8
Figure 8. Figure 8: Latency breakdown. NPUsper delivers the best latency performance by avoiding redundant computation through hallucination detection during decoding (§ 4.3). Zhang et al., 2020]. Time-to-first-token (TTFT, ↓) is the delay from the start of inference to the first text output visible to the user [Kudlur et al., 2026]. 5.2 Latency and Inference Efficiency Low-latency transcription. As shown in [PITH_FULL_IMAGE… view at source ↗
Figure 9
Figure 9. Figure 9: Decoding overhead analysis on S25. Energy efficiency [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Additional analysis of hallucination detection. The stacked bar plot summarizes the [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Power consumption. NPUsper achieves the lowest power consumption by reducing redundant computation on both GPU and NPU. The × markers indicate when each system terminates. B.3 Controlled Unrolling on NPUs Cross-device consistency. In addition to the S25 results in [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Decoding overhead on X Plus. QCOM Sep. Ours 20 40 60 80 100 120 Time (ms) 51.1 41.2 6.7 86.8 76.9 42.4 Execution Overhead (a) 16 tokens QCOM Sep. Ours 20 40 60 80 100 120 Time (ms) 44.3 36.2 18.7 94.5 86.4 68.9 Execution Overhead (b) 23 tokens QCOM Sep. Ours 20 40 60 80 100 120 Time (ms) 44.0 25.5 6.8 121.6 103.1 84.4 Execution Overhead (c) 30 tokens [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Effect of removing the cross-attention source-axis mask on LJ001-0021, decoder steps [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗
read the original abstract

We present NPUsper, a live transcription system that makes Whisper efficient on mobile NPUs by eliminating redundant computation. To avoid the heavy padding used by prior streaming systems, NPUsper detects hallucinated tokens online from temporal patterns in decoder cross-attention, allowing each inference round to process short audio inputs with minimal carryover. For efficient mobile-NPU execution, we propose controlled unrolling, which executes autoregressive decoding as K-step chunk graphs, removing unnecessary KV-cache computation and reducing graph-dispatch overhead. NPUsper achieves up to 4.84x lower per-word latency, up to 33.2x lower time-to-first-token (TTFT), and up to 88.64% lower average power consumption compared with baselines, while maintaining comparable transcription accuracy. The code is available at https://github.com/npusper/NPUsper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents NPUsper, a live transcription system for making Whisper efficient on mobile NPUs. It eliminates redundant computation by detecting hallucinated tokens online from temporal patterns in decoder cross-attention, allowing short audio inputs with minimal carryover instead of heavy padding. It further proposes controlled unrolling to execute autoregressive decoding as K-step chunk graphs, removing unnecessary KV-cache computation and reducing graph-dispatch overhead. The system reports up to 4.84x lower per-word latency, up to 33.2x lower TTFT, and up to 88.64% lower average power consumption versus baselines while maintaining comparable transcription accuracy. Reproducible code is released at https://github.com/npusper/NPUsper.

Significance. If the performance claims and detection reliability hold under rigorous testing, the work would be significant for real-time ASR deployment on mobile NPUs, as it directly targets padding and KV-cache overheads that dominate streaming inference. The combination of attention-based hallucination detection with NPU-specific graph optimizations is a practical contribution. Credit is due for releasing code, which supports reproducibility.

major comments (2)
  1. [Method (hallucination detection) and Experiments] The central latency and TTFT claims (abstract) rest on the assumption that hallucinated-token detection from temporal patterns in decoder cross-attention is reliable enough to truncate inputs with only minimal carry-over and no meaningful accuracy loss. The manuscript supplies no precision/recall figures, confusion matrices, or failure-mode analysis on noisy, accented, or long-context speech; without these the 4.84x and 33.2x speedups cannot be considered load-bearing.
  2. [Experiments / Results] Table or figure reporting the numeric gains (abstract and results section) states specific maxima but provides no description of the exact baselines, datasets, Whisper model sizes, hardware platforms, error bars, or statistical tests used. This absence prevents assessment of whether the controlled-unrolling and detection components actually deliver the claimed improvements.
minor comments (1)
  1. [Abstract] The abstract uses unqualified 'up to' phrasing for the three performance numbers without stating the conditions or model variants under which the maxima occur.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Method (hallucination detection) and Experiments] The central latency and TTFT claims (abstract) rest on the assumption that hallucinated-token detection from temporal patterns in decoder cross-attention is reliable enough to truncate inputs with only minimal carry-over and no meaningful accuracy loss. The manuscript supplies no precision/recall figures, confusion matrices, or failure-mode analysis on noisy, accented, or long-context speech; without these the 4.84x and 33.2x speedups cannot be considered load-bearing.

    Authors: We agree that explicit reliability metrics for the online hallucination detector would strengthen the manuscript. The current results show end-to-end accuracy is preserved, but dedicated precision/recall, confusion matrices, and failure-mode analysis on noisy, accented, and long-context inputs were not included. We will add these evaluations in the revised version to directly support the latency and TTFT claims. revision: yes

  2. Referee: [Experiments / Results] Table or figure reporting the numeric gains (abstract and results section) states specific maxima but provides no description of the exact baselines, datasets, Whisper model sizes, hardware platforms, error bars, or statistical tests used. This absence prevents assessment of whether the controlled-unrolling and detection components actually deliver the claimed improvements.

    Authors: We will expand the Experiments section with complete descriptions of the baselines, datasets, Whisper model sizes, and hardware platforms. Error bars and statistical significance tests will also be added to the reported numeric gains so that the contributions of hallucination detection and controlled unrolling can be rigorously assessed. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with no derivations or self-referential claims

full rationale

The paper presents an empirical optimization for Whisper on mobile NPUs, describing a detection heuristic for hallucinated tokens and a controlled unrolling technique. No equations, fitted parameters renamed as predictions, or derivation chains appear. Performance numbers are reported from measurements against baselines. The hallucination detection method is introduced as a practical heuristic without any claim that it is derived from first principles or reduces to its own inputs. No self-citation load-bearing steps are present. This is a standard non-circular empirical systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; the abstract contains no mathematical derivations, free parameters, axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5695 in / 1193 out tokens · 36866 ms · 2026-07-02T05:55:49.523651+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    Simul-whisper: Attention- guided streaming whisper with truncation detection.arXiv preprint arXiv:2406.10052,

    Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, and Jian Li. Simul-whisper: Attention- guided streaming whisper with truncation detection.arXiv preprint arXiv:2406.10052,

  2. [2]

    InProceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 389–398,

  3. [3]

    Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection.arXiv preprint arXiv:2005.11185,

    Danni Liu, Gerasimos Spanakis, and Jan Niehues. Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection.arXiv preprint arXiv:2005.11185,

  4. [4]

    Alignatt: Using attention-based audio-translation alignments as a guide for simultaneous speech translation.arXiv preprint arXiv:2305.11408,

    Sara Papi, Marco Turchi, and Matteo Negri. Alignatt: Using attention-based audio-translation alignments as a guide for simultaneous speech translation.arXiv preprint arXiv:2305.11408,

  5. [5]

    Efficient execution of deep neural networks on mobile devices with npu

    Tianxiang Tan and Guohong Cao. Efficient execution of deep neural networks on mobile devices with npu. InProceedings of the 20th International Conference on Information Processing in Sensor Networks (Co-Located with CPS-IoT Week 2021), pages 283–298,

  6. [6]

    Accelerating recurrent neural network training using sequence bucketing and multi-gpu data parallelization

    Viacheslav Khomenko, Oleg Shyshkov, Olga Radyvonenko, and Kostiantyn Bokhan. Accelerating recurrent neural network training using sequence bucketing and multi-gpu data parallelization. In 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), pages 100–103. IEEE,

  7. [7]

    Why Language Models Hallucinate

    10 Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate.arXiv preprint arXiv:2509.04664,

  8. [8]

    Investigation of whisper asr hallucinations induced by non-speech audio

    Mateusz Bara´nski, Jan Jasi´nski, Julitta Bartolewska, Stanisław Kacprzak, Marcin Witkowski, and Konrad Kowalczyk. Investigation of whisper asr hallucinations induced by non-speech audio. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,

  9. [9]

    Careless whisper: Speech-to-text hallucination harms

    Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X Mei, Hilke Schellmann, and Mona Sloane. Careless whisper: Speech-to-text hallucination harms. InProceedings of the 2024 ACM conference on fairness, accountability, and transparency, pages 1672–1681,

  10. [10]

    Cif: Continuous integrate-and-fire for end-to-end speech recognition

    Linhao Dong and Bo Xu. Cif: Continuous integrate-and-fire for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6079–6083. IEEE,

  11. [11]

    Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss

    Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. InICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7829–7833. IEEE,

  12. [12]

    Moonshine v2: Ergodic streaming encoder asr for latency-critical speech applications.arXiv preprint arXiv:2602.12241,

    Manjunath Kudlur, Evan King, James Wang, and Pete Warden. Moonshine v2: Ergodic streaming encoder asr for latency-critical speech applications.arXiv preprint arXiv:2602.12241,

  13. [13]

    WhisperRT -- Turning Whisper into a Causal Streaming Model

    Monsoon Solutions, Inc. High V oltage Power Monitor. https://www.msoon.com/ high-voltage-power-monitor. Qualcomm Technologies, Inc. Qualcomm Profiler. https://www.qualcomm.com/developer/ software/qualcomm-profiler. Tomer Krichli, Bhiksha Raj, and Joseph Keshet. Carelesswhisper: Turning whisper into a causal streaming model.arXiv preprint arXiv:2508.12301,

  14. [14]

    Canary-1b-v2 & parakeet-tdt-0.6 b-v3: Efficient and high-performance models for multilingual asr and ast.arXiv preprint arXiv:2509.14128,

    Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, and Boris Ginsburg. Canary-1b-v2 & parakeet-tdt-0.6 b-v3: Efficient and high-performance models for multilingual asr and ast.arXiv preprint arXiv:2509.14128,

  15. [15]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

    Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t: Massively multilingual & multimodal machine translation.arXiv preprint arXiv:2308.11596,

  16. [16]

    The raw attention-difference signal is defined as δi(f) = ˜ai(f)−˜ap(i)(f),(6) where f denotes the audio frame index

    12 A Hallucination Detection Analysis A.1 Filtered Attention-Difference Signal Filtered attention.For each generated token i, we compare its final-layer cross-attention with that of the most recent preceding content tokenp(i). The raw attention-difference signal is defined as δi(f) = ˜ai(f)−˜ap(i)(f),(6) where f denotes the audio frame index. To reduce sp...

  17. [17]

    Our method

    1.2 51.6 67.4 A.4 Model Selection and Generalizability Model selection.Our end-to-end system evaluation focuses on the Whisper base model. This choice reflects the memory and compute constraints of mobile devices, where the goal is to build a complete real-time on-device transcription system rather than to maximize transcription accuracy with a larger mod...

  18. [18]

    Across decoding lengths.We also evaluate whether this trend remains consistent as the number of generated tokens changes

    The X Plus results show the same overall trend as S25: the fixed full-length graph has the highest decoding overhead, the separate-graph approach reduces redundant KV-cache computation but still incurs graph-dispatch overhead, and our controlled-unrolling design achieves the lowest overall decoding overhead. Across decoding lengths.We also evaluate whethe...

  19. [19]

    At the first failure point, the masked token is absent from the unmasked top-3 candidates; in the following steps, it is demoted below an incorrect unmasked top-1 token. D.5 Self-Attention Cache Handling Static cache interface.Controlled unrolling executes multiple autoregressive decoding steps inside one NPU graph, but the first chunk would normally star...