ESPnet: End-to-end speech processing toolkit

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin , Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai · 2018 · DOI 10.21437/interspeech.2018-1456

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open at publisher browse 5 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

cs.SD · 2024-01-17 · unverdicted · novelty 6.0

MLAAD provides a large-scale multi-language synthetic audio dataset for training and evaluating audio anti-spoofing models, showing better training performance than InTheWild and FakeOrReal and alternating superiority with ASVspoof 2019 across eight test sets.

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

cs.CL · 2019-06-27 · unverdicted · novelty 6.0

Gated fusion of fastText and BERT embeddings into an end-to-end ASR model captures multi-sentence conversational context and lowers word error rate on the Switchboard corpus.

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.

citing papers explorer

Showing 5 of 5 citing papers.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs cs.CL · 2025-12-18 · unverdicted · none · ref 102
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning cs.CV · 2026-05-12 · unverdicted · none · ref 46
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset cs.SD · 2024-01-17 · unverdicted · none · ref 35
MLAAD provides a large-scale multi-language synthetic audio dataset for training and evaluating audio anti-spoofing models, showing better training performance than InTheWild and FakeOrReal and alternating superiority with ASVspoof 2019 across eight test sets.
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion cs.CL · 2019-06-27 · unverdicted · none · ref 46
Gated fusion of fastText and BERT embeddings into an end-to-end ASR model captures multi-sentence conversational context and lowers word error rate on the Switchboard corpus.
WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition cs.CL · 2026-04-28 · unverdicted · none · ref 41
WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.

ESPnet: End-to-end speech processing toolkit

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer