ESPnet: End-to-end speech processing toolkit
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 5
roles: background 2
polarities: background 2
citing papers explorer
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
  Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
- SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
  SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
- MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
  MLAAD provides a large-scale multi-language synthetic audio dataset for training and evaluating audio anti-spoofing models, showing better training performance than InTheWild and FakeOrReal and alternating superiority with ASVspoof 2019 across eight test sets.
- Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
  Gated fusion of fastText and BERT embeddings into an end-to-end ASR model captures multi-sentence conversational context and lowers word error rate on the Switchboard corpus.
- WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition
  WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.