CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
V ox-profile: A speech foundation model benchmark for characterizing diverse speaker and speech traits
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
Smiles during intense negative affect in Holocaust survivor testimonies improve emotional valence trajectories across audio, eye gaze, and text modalities while reducing eye dynamics.
Cross-lifespan evaluation shows adult-trained speech foundation models degrade on child and older-adult data, with joint multi-age training and targeted adaptation improving robustness especially using Whisper encoder.
citing papers explorer
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
Smiling Regulates Emotion During Traumatic Recollection
Smiles during intense negative affect in Holocaust survivor testimonies improve emotional valence trajectories across audio, eye gaze, and text modalities while reducing eye dynamics.
-
Exploring Speech Foundation Models for Speaker Diarization Across Lifespan
Cross-lifespan evaluation shows adult-trained speech foundation models degrade on child and older-adult data, with joint multi-age training and targeted adaptation improving robustness especially using Whisper encoder.