Voice Activity Projection: Self-supervised Learning of Turn-taking Events

Erik Ekstedt; Gabriel Skantze

arxiv: 2205.09812 · v1 · pith:FAU6UTFNnew · submitted 2022-05-19 · 📡 eess.AS · cs.SD

keywords activity · voice · modeling · prior · projection · turn-taking · events · need

The modeling of turn-taking in dialog can be viewed as the modeling of the dynamics of voice activity of the interlocutors. We extend prior work and define the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need for labeled data. We highlight a theoretical weakness with prior approaches, arguing for the need to model the dependency of voice activity events in the projection window. We propose four zero-shot tasks, related to the prediction of upcoming turn-shifts and backchannels, and show that the proposed model outperforms prior work.
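
As a rough illustration of what modeling the dependency of voice activity events in the projection window could look like, the sketch below turns the future voice activity of two speakers into a single discrete label, so that a classifier over these labels treats all bins jointly instead of predicting each one independently. The frame rate, the bin lengths, the "at least half active" rule, and the names `bin_edges` and `projection_label` are assumptions made for this example; none of them are stated in the abstract. The labels come from the voice activity itself, which is what makes the objective self-supervised.

```python
from typing import List, Sequence

# Hypothetical settings (assumptions for illustration, not from the abstract).
FRAME_HZ = 50                       # voice-activity frames per second
BIN_SECONDS = (0.2, 0.4, 0.6, 0.8)  # projection window split into 4 bins per speaker


def bin_edges(frame_hz: int = FRAME_HZ,
              bin_seconds: Sequence[float] = BIN_SECONDS) -> List[int]:
    """Return cumulative frame offsets that delimit each bin of the window."""
    edges = [0]
    for seconds in bin_seconds:
        edges.append(edges[-1] + round(seconds * frame_hz))
    return edges


def projection_label(va: Sequence[Sequence[int]], t: int) -> int:
    """Encode the voice activity following frame t as one discrete class.

    va[s][i] is 1 if speaker s is active in frame i, else 0.
    A bin counts as active when at least half of its frames are active.
    The bins of all speakers are packed into one integer, so the label
    describes the whole window jointly.
    """
    edges = bin_edges()
    label = 0
    for speaker in va:
        for start, end in zip(edges[:-1], edges[1:]):
            frames = speaker[t + start:t + end]
            if len(frames) < end - start:
                raise ValueError("not enough future frames for the window")
            active = 1 if 2 * sum(frames) >= len(frames) else 0
            label = (label << 1) | active
    return label


# Speaker 0 is silent and speaker 1 talks through the whole window.
va = [[0] * 100, [1] * 100]
print(projection_label(va, 0))  # bits 0000 1111
```

With two speakers and four bins each, this encoding yields 2^8 = 256 possible labels, so the prediction target becomes an ordinary classification problem over window states.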

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction
cs.SD 2026-06 unverdicted novelty 6.0

Next-Turn introduces time-to-next-speech-onset prediction for duration-aware streaming endpoint detection, reporting a 25.9% improvement in accuracy within 320 ms.