Spoken dialogsum: An emotion-rich conversational dataset for spoken dialogue summarization
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
dataset 1
citation-polarity summary
years
2026 2
verdicts
UNVERDICTED 2
roles
dataset 1
polarities
use dataset 1
representative citing papers
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
citing papers explorer
- End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users
  An end-to-end SLU architecture with frozen SSL acoustic encoder, LSTM classification head, and cross-modal distillation achieves 93% accuracy on simple commands and 82% on spontaneous speech at 7 ms latency on the new VoiceStick corpus, outperforming cascade baselines.
- A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
  A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.