Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

Bing'er Jiang; Erik Ekstedt; Gabriel Skantze; Koji Inoue; Tatsuya Kawahara

arxiv: 2401.04868 · v1 · pith:K24N7WMUnew · submitted 2024-01-10 · cs.CL · cs.HC · cs.SD · eess.AS

Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

classification: cs.CL, cs.HC, cs.SD, eess.AS

keywords: real-time, system, voice, activity, audio, continuous, model, prediction

Abstract

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.
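The abstract's idea of running continuously over a limited input context can be illustrated with a sliding stereo buffer. The following is a minimal sketch under stated assumptions, not the authors' implementation: `ContextBuffer`, `placeholder_predictor`, the 16 kHz default sample rate, and the frame sizes are all hypothetical names and values chosen for illustration, and the predictor is a simple energy ratio standing in for the VAP model, which is not reproduced here.

```python
from collections import deque


class ContextBuffer:
    """Hypothetical helper: keeps only the most recent `context_sec` of stereo audio."""

    def __init__(self, context_sec: float, sample_rate: int = 16000):
        # Assumed sample rate; the source does not state one.
        self.max_samples = int(context_sec * sample_rate)
        # One deque per channel (one channel per speaker).
        # With maxlen set, the oldest samples are discarded automatically.
        self.channels = (
            deque(maxlen=self.max_samples),
            deque(maxlen=self.max_samples),
        )

    def push(self, frame_a, frame_b):
        """Append one new audio frame for each speaker."""
        self.channels[0].extend(frame_a)
        self.channels[1].extend(frame_b)

    def window(self):
        """Return the current context as two lists of samples."""
        return [list(self.channels[0]), list(self.channels[1])]


def placeholder_predictor(window):
    """Stand-in for a trained model (NOT the VAP model).

    Returns each channel's share of the total signal energy in the window.
    """
    energy = [sum(x * x for x in channel) for channel in window]
    total = sum(energy)
    if total == 0:
        return (0.5, 0.5)
    return (energy[0] / total, energy[1] / total)


if __name__ == "__main__":
    # Toy values: a 1-second context at 100 Hz, fed in 40-sample frames.
    buf = ContextBuffer(context_sec=1.0, sample_rate=100)
    for level in (1.0, 2.0, 3.0):
        buf.push([level] * 40, [0.0] * 40)
        print(len(buf.window()[0]), placeholder_predictor(buf.window()))
```

A shorter `context_sec` means less audio to process per prediction step, which is the trade-off between input context length and real-time operation that the abstract refers to.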

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TurnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue
cs.CL · 2026-07 · unverdicted · novelty 6.0

TurnNat introduces a likelihood-based automatic evaluation method for turn-taking naturalness in dyadic spoken dialogues using a causal prediction model and a human-validated perturbation benchmark.

Toward Signing Activity Projection in Sign Language Interaction
cs.CL · 2026-06 · unverdicted · novelty 6.0

Initial adaptation of Voice Activity Projection to dyadic sign language interaction on the Public DGS Corpus shows SHIFT/HOLD prediction is feasible with hand cues while SHIFT prediction remains difficult.

Endpoint Anticipation for Low-Latency Spoken Dialogue
eess.AS · 2026-06 · unverdicted · novelty 5.0

A speech-based model forecasts conversation turn endpoints up to 2.56 seconds ahead to enable lower-latency spoken dialogue via speculative LLM and TTS execution.