VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
A self-supervised prosody encoder with speaker disentanglement strategies outperforms raw prosody and HuBERT baselines on pitch reconstruction and prosodic event detection while achieving strong speaker separation.
citing papers explorer
-
Privacy-preserving Prosody Representation Learning
A self-supervised prosody encoder with speaker disentanglement strategies outperforms raw prosody and HuBERT baselines on pitch reconstruction and prosodic event detection while achieving strong speaker separation.