Two-pass Endpoint Detection for Speech Recognition

Anirudh Raju; Aparna Khare; Ariya Rastrow; Colin Vaz; Di He; Ilya Sklyar; Long Chen; Roland Maas; Sam Alptekin; Venkatesh Ravichandran

arxiv: 2401.08916 · v1 · pith:IML3FMAHnew · submitted 2024-01-17 · 📡 eess.AS · cs.SD

Anirudh Raju, Aparna Khare, Di He, Ilya Sklyar, Long Chen, Sam Alptekin, Viet Anh Trinh, Zhe Zhang, Colin Vaz, Venkatesh Ravichandran, Roland Maas, Ariya Rastrow

keywords endpoint · speech · detection · early · endpointer · latency · method · model

Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off accuracy against latency, since waiting longer reduces the number of users who are cut off early but increases latency. We propose a novel two-pass solution for endpointing, in which the utterance endpoint detected by a first-pass endpointer is verified by a second-pass model termed the EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method yields improvements regardless of the first-pass EP model used.
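
The two-pass idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the first pass is a toy trailing-silence rule over per-frame speech flags, the arbitrator is a stand-in callable returning a confidence score, and the names first_pass_candidates, two_pass_endpoint, min_silence, and accept_threshold are hypothetical.

```python
from typing import Callable, Iterator, Optional, Sequence


def first_pass_candidates(is_speech: Sequence[bool], min_silence: int) -> Iterator[int]:
    """Toy first-pass endpointer (assumption, not the paper's model).

    Yields the frame index at which trailing silence after speech
    first reaches `min_silence` consecutive frames.
    """
    silence = 0
    seen_speech = False
    for t, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            silence = 0
        else:
            silence += 1
            if seen_speech and silence == min_silence:
                yield t


def two_pass_endpoint(
    is_speech: Sequence[bool],
    min_silence: int,
    arbitrator: Callable[[int], float],
    accept_threshold: float = 0.5,
) -> Optional[int]:
    """Return the first candidate endpoint that the second pass accepts.

    `arbitrator` is a hypothetical stand-in for a second-pass model:
    it maps a candidate frame index to a confidence in [0, 1].
    """
    for t in first_pass_candidates(is_speech, min_silence):
        if arbitrator(t) >= accept_threshold:
            return t  # second pass confirms the endpoint
        # otherwise keep listening for the next candidate
    return None


# Toy utterance: speech, a short pause, more speech, then final silence.
frames = [True, True, True, False, False, True, True, False, False, False, False]
candidates = list(first_pass_candidates(frames, min_silence=2))

# Hypothetical arbitrator scores: low confidence at the mid-utterance pause.
scores = {4: 0.2, 8: 0.9}
endpoint = two_pass_endpoint(frames, min_silence=2, arbitrator=lambda t: scores[t])
```

In this toy case the first pass alone would stop at the mid-utterance pause (frame 4) and cut the user off early, while the second-pass check rejects that candidate and the endpoint is declared at frame 8 instead.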

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction
cs.SD 2026-06 unverdicted novelty 6.0

Next-Turn introduces time-to-next-speech-onset prediction for duration-aware streaming endpoint detection, reporting a 25.9% improvement in accuracy within 320 ms.