ProsAudit, a prosodic benchmark for self-supervised speech models

Andrea Santos Revilla; Arthur Thomas; Bogdan Ludusan; Emmanuel Dupoux; Guillaume Wisniewski; Gwendal Virlet; Hadrien Titeux; Marvin Lavechin; Maureen de Seyssel

arxiv: 2302.12057 · v3 · pith:LPD2GMKNnew · submitted 2023-02-23 · 💻 cs.CL · cs.SD · eess.AS


Maureen de Seyssel, Marvin Lavechin, Hadrien Titeux, Arthur Thomas, Gwendal Virlet, Andrea Santos Revilla, Guillaume Wisniewski, Bogdan Ludusan, Emmanuel Dupoux



keywords: models, task, benchmark, lexical, prosodic, correctly, evaluated, evaluation


We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of size, with models trained on more data performing better on both subtasks.
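The abstract does not spell out how the two subtasks are scored. One plausible reading, assumed here rather than taken from the paper, is a pairwise comparison: each item pairs a stimulus with a natural pause placement against one with an unnatural placement, the model assigns a score to each, and accuracy is the share of pairs where the natural stimulus scores higher, so that chance level is 0.5. The function name `pairwise_accuracy`, the tie-handling rule, and the toy scores below are all hypothetical and only illustrate that reading.

```python
from typing import Callable, Hashable, Iterable, Tuple


def pairwise_accuracy(
    pairs: Iterable[Tuple[Hashable, Hashable]],
    score: Callable[[Hashable], float],
) -> float:
    """Share of (natural, unnatural) pairs where the natural item scores higher.

    Assumption: ties count as half a correct answer, so a model that
    scores every stimulus identically lands exactly at chance (0.5).
    """
    total = 0
    correct = 0.0
    for natural, unnatural in pairs:
        s_natural = score(natural)
        s_unnatural = score(unnatural)
        if s_natural > s_unnatural:
            correct += 1.0
        elif s_natural == s_unnatural:
            correct += 0.5
        total += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return correct / total


# Hypothetical scores (e.g. log-probabilities) for three stimulus pairs.
toy_scores = {
    "a_natural": -1.0, "a_unnatural": -2.0,  # natural preferred
    "b_natural": -3.0, "b_unnatural": -2.5,  # unnatural preferred
    "c_natural": -1.5, "c_unnatural": -1.5,  # tie
}
toy_pairs = [
    ("a_natural", "a_unnatural"),
    ("b_natural", "b_unnatural"),
    ("c_natural", "c_unnatural"),
]
toy_accuracy = pairwise_accuracy(toy_pairs, toy_scores.__getitem__)
```

Under this reading, "above chance" in the abstract would correspond to an accuracy above 0.5 on the protosyntax and lexical pairs alike; the actual scoring procedure should be taken from the paper itself.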


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark
cs.CL 2026-04 unverdicted novelty 7.0

CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.