Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
citing papers explorer
- Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox
  Audio LLMs fail to use paralinguistic audio information and default to transcript content; a new adversarial benchmark plus PCLM and DPO training raise accuracy on VoxParadox from 17.4% to 65.2%.
- EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions
  EchoChain benchmark shows no evaluated real-time voice model exceeds 50% success on state updates after mid-speech interruptions, with a 40.2% failure reduction in non-interrupted controls.
- Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English
  Layer-wise probing of wav2vec2-base and Whisper-small shows both models distinguish reduced vs. canonical consonant clusters in AAE with high accuracy and retain cues to underlying stops, encoding CCR as gradient variation.