Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI
Pith reviewed 2026-05-10 17:06 UTC · model grok-4.3
The pith
Modeling device-addressed speech detection as sequential routing over interaction history substantially improves accuracy over local classification in multi-speaker settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that short-horizon causal interaction history carries substantial decision-relevant information for determining whether speech is device-addressed. By formalizing the task as Sequential Device-Addressed Routing and building the Selective Attention System around it, the system reaches F1 scores of 0.86 with audio alone and 0.95 with audio-video fusion on a 60-hour held-out test set, while running fully on-device with under 150 milliseconds latency and a 20 megabyte footprint.
What carries the argument
Sequential Device-Addressed Routing (SDAR): a formulation that routes decisions by attending to short causal interaction history rather than classifying each utterance in isolation.
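The contrast between routing over history and classifying each utterance in isolation can be illustrated with a toy sketch. This is not the paper's SAS implementation, whose stages are not described on this page. The `Turn` and `HistoryRouter` names, the "same speaker was recently routed" rule, the `threshold` and `bonus` values, and the per-utterance `local_score` are all hypothetical. The 8-second default horizon is taken from the paper passage quoted further down this page.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Turn:
    t: float              # start time in seconds
    speaker: str          # hypothetical speaker ID
    local_score: float    # utterance-local "device-addressed" score in [0, 1]
    routed: bool = False  # filled in by the router


class HistoryRouter:
    """Toy sequential router: local score plus a bonus from recent routed turns."""

    def __init__(self, horizon_s=8.0, threshold=0.5, bonus=0.3):
        self.horizon_s = horizon_s  # length of the causal history window
        self.threshold = threshold  # hypothetical decision threshold
        self.bonus = bonus          # hypothetical weight of history evidence
        self.history = deque()      # past Turn objects, oldest first

    def route(self, turn):
        # Drop turns that fall outside the causal history window.
        while self.history and turn.t - self.history[0].t > self.horizon_s:
            self.history.popleft()
        # History evidence: the same speaker was recently routed to the device.
        follow_up = any(h.routed and h.speaker == turn.speaker for h in self.history)
        score = turn.local_score + (self.bonus if follow_up else 0.0)
        turn.routed = score >= self.threshold
        self.history.append(turn)
        return turn.routed


router = HistoryRouter()
decisions = [
    router.route(Turn(0.0, "A", 0.9)),   # clear device-addressed request
    router.route(Turn(2.0, "A", 0.3)),   # ambiguous follow-up from the same speaker
    router.route(Turn(3.0, "B", 0.3)),   # ambiguous speech from another speaker
    router.route(Turn(20.0, "A", 0.3)),  # same speaker, but history has expired
]
```

In this sketch an utterance-local classifier with the same threshold would reject the second turn (0.3 is below 0.5), while the history-aware router accepts it because it follows a routed turn from the same speaker inside the window.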
If this is right
- The system achieves real-time on-device performance with latency below 150 milliseconds and memory footprint below 20 megabytes.
- Audio-plus-video fusion raises both precision and recall compared with audio alone.
- The interaction-history stage produces the largest ablation effect among tested components.
- The approach supports pre-ASR decisions without requiring cloud resources.
Where Pith is reading between the lines
- Voice interfaces could maintain a brief rolling context of recent turns to resolve who is being addressed without full transcription.
- Similar sequential attention over short history might apply to other edge tasks such as intent or gesture disambiguation.
- Fully on-device execution reduces the need to transmit raw audio, which may ease some privacy constraints in always-listening devices.
Load-bearing premise
The 60-hour held-out multi-speaker English test set and the internal evaluation protocol are representative of real-world temporally ambiguous utterances.
What would settle it
Evaluating the Selective Attention System on an independent public dataset containing multi-speaker conversations with temporally ambiguous address references, and measuring whether the F1 drop from removing the interaction-history stage remains near 0.38 (the reported fall from 0.95 to 0.57).
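A replication check of this kind could be scripted roughly as follows. This is a minimal sketch, assuming binary per-utterance labels (1 = device-addressed) and the standard F1 definition. The function names and the `tolerance` of 0.05 are illustrative assumptions, not values from the paper; only the reference drop of 0.38 comes from the reported numbers.

```python
def f1_score(labels, preds):
    """F1 for binary labels and predictions (1 = device-addressed)."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ablation_drop_replicates(f1_full, f1_ablated, reported_drop=0.38, tolerance=0.05):
    """Return the observed drop and whether it lies within `tolerance`
    (a hypothetical acceptance band) of the reported drop."""
    drop = f1_full - f1_ablated
    return drop, abs(drop - reported_drop) <= tolerance


# Toy data standing in for an independent test set.
labels = [1, 1, 1, 0, 0, 1]
full_preds = [1, 1, 1, 0, 0, 1]     # system with interaction history
ablated_preds = [1, 0, 0, 1, 0, 0]  # system with history removed

drop, replicated = ablation_drop_replicates(
    f1_score(labels, full_preds), f1_score(labels, ablated_preds)
)
```

On real data the same two calls would be run once with the full system's predictions and once with the history-ablated system's predictions on the public dataset.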
Original abstract
We study device-addressed speech detection under pre-ASR edge deployment constraints, where systems must decide whether to forward audio before transcription under strict latency and compute limits. We show that, in multi-speaker environments with temporally ambiguous utterances, this task is more effectively modelled as a sequential routing problem over interaction history than as an utterance-local classification task. We formalize this as Sequential Device-Addressed Routing (SDAR) and present the Selective Attention System (SAS), an on-device implementation that instantiates this formulation. On a held-out 60-hour multi-speaker English test set, the primary audio-only configuration achieves F1=0.86 (precision=0.89, recall=0.83); with an optional camera, audio+video fusion raises F1 to 0.95 (precision=0.97, recall=0.93). Removing causal interaction history (Stage 3) reduced F1 from 0.95 to 0.57+/-0.03 in the audio+video configuration under our evaluation protocol. Among the tested components, this was the largest observed ablation effect, indicating that short-horizon interaction history carries substantial decision-relevant information in the evaluated setting. SAS runs fully on-device on ARM Cortex-A class hardware (<150 ms latency, <20 MB footprint). All results are from internal evaluation on a proprietary dataset evaluated primarily in English; a 5-hour evaluation subset may be shared for independent verification (Section 8.8).
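As a quick consistency check, the reported F1 values can be recomputed from the reported precision and recall. This assumes the standard harmonic-mean definition of F1, which the abstract does not state explicitly; $P$ and $R$ denote precision and recall.

```latex
F_1 = \frac{2PR}{P + R}

% Audio-only: P = 0.89, R = 0.83
F_1 = \frac{2(0.89)(0.83)}{0.89 + 0.83} = \frac{1.4774}{1.72} \approx 0.859

% Audio+video: P = 0.97, R = 0.93
F_1 = \frac{2(0.97)(0.93)}{0.97 + 0.93} = \frac{1.8042}{1.90} \approx 0.950
```

Both values round to the figures in the abstract (0.86 and 0.95), so the reported precision, recall, and F1 are mutually consistent under that definition.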
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes modeling device-addressed speech detection as a sequential routing problem (SDAR) over interaction history rather than local utterance classification, and presents the Selective Attention System (SAS) as an on-device implementation. It reports F1=0.86 (audio-only) and F1=0.95 (audio+video) on a 60-hour proprietary multi-speaker English test set, with the largest ablation effect being a drop to 0.57+/-0.03 when removing the causal interaction history stage (Stage 3). The system meets on-device constraints (<150 ms latency, <20 MB) on ARM hardware, with a 5-hour subset offered for verification in Section 8.8.
Significance. If the ablation result generalizes, the work provides evidence that short-horizon interaction history carries substantial decision-relevant information for resolving temporal ambiguity in multi-speaker voice AI, supporting more accurate on-device routing under latency and compute limits. The concrete F1, precision, recall, and error-barred ablation numbers, plus the on-device footprint, are strengths; the offer of a shareable subset aids reproducibility.
major comments (2)
- [Results section and Section 8.8] Results section and Section 8.8: The central claim that SDAR is more effective than utterance-local classification rests on the ablation where removing Stage 3 (causal interaction history) drops F1 from 0.95 to 0.57+/-0.03 in the audio+video case. However, this is evaluated only on a proprietary 60-hour dataset using an internal protocol for labeling temporally ambiguous utterances and constructing speaker-turn history; full details of segmentation rules and filtering are not public, with only a 5-hour subset offered. This limits assessment of whether the effect reflects general sequential structure or dataset-specific artifacts.
- [Evaluation] Evaluation and baselines: The paper provides ablations across internal stages but does not report comparisons to external utterance-local classifiers, standard VAD pipelines, or other published device-addressed detection methods. Without these, the magnitude of improvement attributable to the sequential formulation versus conventional approaches remains unclear.
minor comments (2)
- [Section 8.8] Ensure Section 8.8 explicitly lists what components of the 5-hour subset (e.g., raw audio, labels, history features) will be shared and any restrictions on use.
- [Introduction/Methods] Clarify notation for the stages (e.g., Stage 3) and any formal definition of SDAR early in the paper to aid readers in following the sequential routing formulation.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each of the major comments point by point below, proposing revisions to enhance clarity and reproducibility where feasible.
Referee: Results section and Section 8.8: The central claim that SDAR is more effective than utterance-local classification rests on the ablation where removing Stage 3 (causal interaction history) drops F1 from 0.95 to 0.57+/-0.03 in the audio+video case. However, this is evaluated only on a proprietary 60-hour dataset using an internal protocol for labeling temporally ambiguous utterances and constructing speaker-turn history; full details of segmentation rules and filtering are not public, with only a 5-hour subset offered. This limits assessment of whether the effect reflects general sequential structure or dataset-specific artifacts.
Authors: We recognize the concern regarding the proprietary dataset and the limited public details on the labeling protocol. Due to privacy considerations, we cannot release the full 60-hour dataset. However, we will revise Section 8.8 to provide more comprehensive descriptions of the segmentation rules, filtering criteria, and how temporally ambiguous utterances are labeled, while maintaining confidentiality. Additionally, the offered 5-hour subset will allow external verification of the reported metrics. We believe this addresses the core issue of assessing generalizability.
revision: partial
Referee: Evaluation and baselines: The paper provides ablations across internal stages but does not report comparisons to external utterance-local classifiers, standard VAD pipelines, or other published device-addressed detection methods. Without these, the magnitude of improvement attributable to the sequential formulation versus conventional approaches remains unclear.
Authors: We agree that including comparisons to established baselines would better contextualize our results. In the revised manuscript, we will add evaluations against a standard utterance-local audio classifier and a VAD pipeline, using the shareable 5-hour subset to ensure reproducibility. This will help quantify the benefits of the sequential SDAR approach over conventional methods.
revision: yes
Circularity Check
No circularity: empirical ablations on proprietary data with no reducing equations or self-referential derivations
The paper's central claims rest on empirical F1 scores and ablations (e.g., audio+video F1 dropping from 0.95 to 0.57 when removing Stage 3 causal history) evaluated on a held-out 60-hour internal multi-speaker set. No equations, first-principles derivations, or modeling steps are presented that reduce any reported quantity to a fitted parameter or prior self-result by construction. The formalization of SDAR and the SAS implementation are introduced as modeling choices justified by the ablation outcomes rather than derived from self-citations or ansatzes that loop back to the inputs. Results are self-contained experimental findings; the proprietary nature of the data affects verifiability but does not create circularity in any derivation chain.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction (8-tick period) · tag: echoes
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "Performance peaks at approximately 8 seconds; beyond 12 seconds, irrelevant history degrades Stage 3 accuracy... context window is fixed at 8 seconds of interaction history"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.