LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users
  An end-to-end SLU architecture with frozen SSL acoustic encoder, LSTM classification head, and cross-modal distillation achieves 93% accuracy on simple commands and 82% on spontaneous speech at 7 ms latency on the new VoiceStick corpus, outperforming cascade baselines.
- Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference
  Empirical study finds diminishing accuracy returns against steep energy growth for deeper and wider ResNet speaker verification models on VoxCeleb2.