arXiv: 2604.08412 · v1 · submitted 2026-04-09 · 💻 cs.SD · cs.AI · eess.AS

Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

Pith reviewed 2026-05-10 17:06 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS
keywords device-addressed speech detection · sequential routing · on-device voice AI · interaction history · multi-speaker environments · edge deployment · selective attention

The pith

Modeling device-addressed speech detection as sequential routing over interaction history substantially improves accuracy over local classification in multi-speaker settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that device-addressed speech detection in multi-speaker environments with ambiguous timing is better treated as a sequential decision process drawing on recent interaction history than as an isolated classification of each utterance. It introduces the Selective Attention System to implement this approach on edge hardware. This framing matters for real-time voice AI because it allows systems to decide whether to process audio before full transcription, respecting tight latency and memory budgets. Ablation tests show that dropping the history component causes the largest performance drop, from 0.95 to 0.57 F1 when video is available.

Core claim

The central discovery is that short-horizon causal interaction history carries substantial decision-relevant information for determining whether speech is device-addressed. By formalizing the task as Sequential Device-Addressed Routing and building the Selective Attention System around it, the system reaches F1 scores of 0.86 with audio alone and 0.95 with audio-video fusion on a 60-hour held-out test set, while running fully on-device with latency under 150 milliseconds and a footprint under 20 megabytes.

What carries the argument

Sequential Device-Addressed Routing (SDAR): a formulation that routes decisions by attending to short causal interaction history rather than classifying each utterance in isolation.
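
To make the contrast between the two framings concrete, here is a minimal sketch, not the paper's implementation. It assumes each utterance already has a local "device-addressed" score in [0, 1]. The function names, the `horizon` of three past utterances, and the `history_weight` of 0.4 are hypothetical choices for illustration; only the threshold τ = 0.70 comes from the paper's figures.

```python
from collections import deque
from typing import Deque, Iterable, List


def route_local(scores: Iterable[float], tau: float = 0.70) -> List[bool]:
    """Utterance-local baseline: threshold each score in isolation."""
    return [s >= tau for s in scores]


def route_sequential(
    scores: Iterable[float],
    tau: float = 0.70,
    horizon: int = 3,            # hypothetical: number of past utterances kept
    history_weight: float = 0.4, # hypothetical: weight given to the history
) -> List[bool]:
    """Sequential routing sketch: blend the current score with the mean of
    the last `horizon` scores, then threshold the blended value."""
    history: Deque[float] = deque(maxlen=horizon)
    decisions: List[bool] = []
    for s in scores:
        # With no history yet, fall back to the current score.
        context = sum(history) / len(history) if history else s
        fused = (1.0 - history_weight) * s + history_weight * context
        decisions.append(fused >= tau)
        # Causal: the current score only influences later decisions.
        history.append(s)
    return decisions


# A borderline follow-up (0.65) after two clearly addressed utterances.
scores = [0.9, 0.9, 0.65, 0.2]
print(route_local(scores))
print(route_sequential(scores))
```

With these toy scores the local router rejects the borderline third utterance, while the sequential router forwards it because the preceding turns were clearly device-addressed. The actual SAS Stage 3 may use a very different mechanism; the abstract does not describe it.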

If this is right

  • The system achieves real-time on-device performance with latency below 150 milliseconds and memory footprint below 20 megabytes.
  • Audio-plus-video fusion raises both precision and recall compared with audio alone.
  • The interaction-history stage produces the largest ablation effect among tested components.
  • The approach supports pre-ASR decisions without requiring cloud resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice interfaces could maintain a brief rolling context of recent turns to resolve who is being addressed without full transcription.
  • Similar sequential attention over short history might apply to other edge tasks such as intent or gesture disambiguation.
  • Fully on-device execution reduces the need to transmit raw audio, which may ease some privacy constraints in always-listening devices.
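
The "brief rolling context of recent turns" in the first extension could be held in a small time-bounded buffer. The sketch below is purely illustrative: the `Turn` fields, the class name `RollingContext`, the 10-second window, and the 8-turn cap are assumptions, not details from the paper.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Turn:
    t_end: float            # end time of the turn, in seconds
    speaker: str            # hypothetical speaker label
    device_addressed: bool  # routing decision made for this turn


class RollingContext:
    """Keeps only recent turns, bounded by both age and count."""

    def __init__(self, window_s: float = 10.0, max_turns: int = 8) -> None:
        self.window_s = window_s
        self._turns: Deque[Turn] = deque(maxlen=max_turns)

    def _evict(self, now: float) -> None:
        # Drop turns that ended more than `window_s` seconds before `now`.
        while self._turns and now - self._turns[0].t_end > self.window_s:
            self._turns.popleft()

    def add(self, turn: Turn) -> None:
        self._turns.append(turn)
        self._evict(turn.t_end)

    def recent(self, now: float) -> List[Turn]:
        self._evict(now)
        return list(self._turns)


ctx = RollingContext(window_s=10.0, max_turns=3)
ctx.add(Turn(0.0, "a", True))
ctx.add(Turn(4.0, "b", False))
ctx.add(Turn(12.0, "a", True))
print([t.t_end for t in ctx.recent(now=12.0)])
```

Bounding the buffer by both age and count keeps memory use fixed, which matters under the kind of footprint budget the paper targets.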

Load-bearing premise

The 60-hour held-out multi-speaker English test set and the internal evaluation protocol are representative of real-world temporally ambiguous utterances.

What would settle it

Evaluating the Selective Attention System on an independent public dataset containing multi-speaker conversations with temporally ambiguous address references and measuring whether the F1 drop from removing the interaction-history stage remains near 0.38 points.

Figures

Figures reproduced from arXiv: 2604.08412 by Bonny Banerjee, Daniyal Anjum, David Joohun Kim, Omar Abbasi.

Figure 1
Precision, F1, and recall as a function of the number of speakers present at τ = 0.70. Shaded region shows the precision-recall gap, which widens with speaker count: cross-talk degrades recall faster than precision.
Figure 2
Precision-recall curve across operating thresholds τ ∈ [0.56, 0.85]. Filled square (■): τ = 0.70 (standard). Filled diamond (♦): τ = 0.82 (high-media). Shaded area: AP ≈ 0.88.
Figure 3
F1 heatmap: noise floor × speakers present at τ = 0.70. Each cell reports macro-averaged F1 across held-out sessions under that condition.
Figure 4
Ablation study: F1 when one stage is removed at τ = 0.70. Rows ordered by ascending impact. Error bars show the reported range for Stage 2 (0.74 ± 0.02) and Stage 3 (0.57 ± 0.03). Stage 3 (temporal context) is the dominant contributor; its removal reduces F1 by 0.38 (∆F1 = −0.38). Bar values: full model (SAS) 0.95; Stage 1 removed (no beamforming) 0.81 (−0.14); Stage 2 removed (no classifier) 0.74 (−0.21); Stage 3 removed (no temp. context) 0.57 (−0.38). The reported results characterise performance under the evaluated distribution of speaker counts, noise conditions, and interaction p…
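
The ablation deltas can be recomputed from the bar values that accompany the ablation figure (full model 0.95; stage removals 0.81, 0.74, 0.57). The short sketch below does only that arithmetic; the dictionary keys and function name are illustrative labels, and nothing here reproduces the evaluation itself.

```python
from typing import Dict

FULL_MODEL_F1 = 0.95

# F1 with one stage removed, as read from the ablation figure.
ABLATED_F1: Dict[str, float] = {
    "Stage 1 removed (no beamforming)": 0.81,
    "Stage 2 removed (no classifier)": 0.74,
    "Stage 3 removed (no temporal context)": 0.57,
}


def ablation_deltas(full: float, ablated: Dict[str, float]) -> Dict[str, float]:
    """Return the change in F1 for each ablation, rounded to 2 decimals."""
    return {name: round(f1 - full, 2) for name, f1 in ablated.items()}


deltas = ablation_deltas(FULL_MODEL_F1, ABLATED_F1)
# The most negative delta marks the stage whose removal hurts most.
largest = min(deltas, key=deltas.get)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("largest effect:", largest)
```

The recomputed deltas (−0.14, −0.21, −0.38) agree with the figure, and the ordering confirms that removing Stage 3 has the largest effect among the tested components.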
Original abstract

We study device-addressed speech detection under pre-ASR edge deployment constraints, where systems must decide whether to forward audio before transcription under strict latency and compute limits. We show that, in multi-speaker environments with temporally ambiguous utterances, this task is more effectively modelled as a sequential routing problem over interaction history than as an utterance-local classification task. We formalize this as Sequential Device-Addressed Routing (SDAR) and present the Selective Attention System (SAS), an on-device implementation that instantiates this formulation. On a held-out 60-hour multi-speaker English test set, the primary audio-only configuration achieves F1=0.86 (precision=0.89, recall=0.83); with an optional camera, audio+video fusion raises F1 to 0.95 (precision=0.97, recall=0.93). Removing causal interaction history (Stage 3) reduced F1 from 0.95 to 0.57+/-0.03 in the audio+video configuration under our evaluation protocol. Among the tested components, this was the largest observed ablation effect, indicating that short-horizon interaction history carries substantial decision-relevant information in the evaluated setting. SAS runs fully on-device on ARM Cortex-A class hardware (<150 ms latency, <20 MB footprint). All results are from internal evaluation on a proprietary dataset evaluated primarily in English; a 5-hour evaluation subset may be shared for independent verification (Section 8.8).
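
As a quick consistency check on the headline numbers, the reported precision and recall can be plugged into the standard F1 formula, F1 = 2PR / (P + R). This assumes the reported F1 is the harmonic mean of the reported aggregate precision and recall; the abstract does not state how the scores were averaged.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


audio_only = f1_score(0.89, 0.83)
audio_video = f1_score(0.97, 0.93)
print(f"audio-only F1:  {audio_only:.2f}")
print(f"audio+video F1: {audio_video:.2f}")
```

Both values round to the figures quoted in the abstract (0.86 and 0.95), so the three reported metrics are mutually consistent under that assumption.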

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes modeling device-addressed speech detection as a sequential routing problem (SDAR) over interaction history rather than local utterance classification, and presents the Selective Attention System (SAS) as an on-device implementation. It reports F1=0.86 (audio-only) and F1=0.95 (audio+video) on a 60-hour proprietary multi-speaker English test set, with the largest ablation effect being a drop to 0.57+/-0.03 when removing the causal interaction history stage (Stage 3). The system meets on-device constraints (<150 ms latency, <20 MB) on ARM hardware, with a 5-hour subset offered for verification in Section 8.8.

Significance. If the ablation result generalizes, the work provides evidence that short-horizon interaction history carries substantial decision-relevant information for resolving temporal ambiguity in multi-speaker voice AI, supporting more accurate on-device routing under latency and compute limits. The concrete F1, precision, recall, and error-barred ablation numbers, plus the on-device footprint, are strengths; the offer of a shareable subset aids reproducibility.

major comments (2)
  1. [Results section and Section 8.8] Results section and Section 8.8: The central claim that SDAR is more effective than utterance-local classification rests on the ablation where removing Stage 3 (causal interaction history) drops F1 from 0.95 to 0.57+/-0.03 in the audio+video case. However, this is evaluated only on a proprietary 60-hour dataset using an internal protocol for labeling temporally ambiguous utterances and constructing speaker-turn history; full details of segmentation rules and filtering are not public, with only a 5-hour subset offered. This limits assessment of whether the effect reflects general sequential structure or dataset-specific artifacts.
  2. [Evaluation] Evaluation and baselines: The paper provides ablations across internal stages but does not report comparisons to external utterance-local classifiers, standard VAD pipelines, or other published device-addressed detection methods. Without these, the magnitude of improvement attributable to the sequential formulation versus conventional approaches remains unclear.
minor comments (2)
  1. [Section 8.8] Ensure Section 8.8 explicitly lists what components of the 5-hour subset (e.g., raw audio, labels, history features) will be shared and any restrictions on use.
  2. [Introduction/Methods] Clarify notation for the stages (e.g., Stage 3) and any formal definition of SDAR early in the paper to aid readers in following the sequential routing formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each of the major comments point by point below, proposing revisions to enhance clarity and reproducibility where feasible.

Point-by-point responses
  1. Referee: Results section and Section 8.8: The central claim that SDAR is more effective than utterance-local classification rests on the ablation where removing Stage 3 (causal interaction history) drops F1 from 0.95 to 0.57+/-0.03 in the audio+video case. However, this is evaluated only on a proprietary 60-hour dataset using an internal protocol for labeling temporally ambiguous utterances and constructing speaker-turn history; full details of segmentation rules and filtering are not public, with only a 5-hour subset offered. This limits assessment of whether the effect reflects general sequential structure or dataset-specific artifacts.

    Authors: We recognize the concern regarding the proprietary dataset and the limited public details on the labeling protocol. Due to privacy considerations, we cannot release the full 60-hour dataset. However, we will revise Section 8.8 to provide more comprehensive descriptions of the segmentation rules, filtering criteria, and how temporally ambiguous utterances are labeled, while maintaining confidentiality. Additionally, the offered 5-hour subset will allow external verification of the reported metrics. We believe this addresses the core issue of assessing generalizability. revision: partial

  2. Referee: Evaluation and baselines: The paper provides ablations across internal stages but does not report comparisons to external utterance-local classifiers, standard VAD pipelines, or other published device-addressed detection methods. Without these, the magnitude of improvement attributable to the sequential formulation versus conventional approaches remains unclear.

    Authors: We agree that including comparisons to established baselines would better contextualize our results. In the revised manuscript, we will add evaluations against a standard utterance-local audio classifier and a VAD pipeline, using the shareable 5-hour subset to ensure reproducibility. This will help quantify the benefits of the sequential SDAR approach over conventional methods. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ablations on proprietary data with no reducing equations or self-referential derivations

Full rationale

The paper's central claims rest on empirical F1 scores and ablations (e.g., audio+video F1 dropping from 0.95 to 0.57 when removing Stage 3 causal history) evaluated on a held-out 60-hour internal multi-speaker set. No equations, first-principles derivations, or modeling steps are presented that reduce any reported quantity to a fitted parameter or prior self-result by construction. The formalization of SDAR and the SAS implementation are introduced as modeling choices justified by the ablation outcomes rather than derived from self-citations or ansatzes that loop back to the inputs. Results are self-contained experimental findings; the proprietary nature of the data affects verifiability but does not create circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The modeling choice of sequential routing over history is presented as an empirical finding rather than a derived axiom.

pith-pipeline@v0.9.0 · 5586 in / 1179 out tokens · 66767 ms · 2026-05-10T17:06:44.049610+00:00 · methodology
