arXiv: 1906.01199 · v1 · pith:AN7HAAIKnew · submitted 2019-06-04 · cs.CL · cs.SD · eess.AS

Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation

classification: cs.CL · cs.SD · eess.AS
keywords: speech, translation, representations, create, end-to-end, features, frame-level, frames

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high- and low-resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
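
The averaging step described in the abstract can be illustrated with a minimal sketch. It assumes that each frame is a plain list of floats and that one phoneme label per frame is already available; the abstract does not say how the labels are produced, so that part is left out. The function name `average_by_phoneme_runs` and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import groupby


def average_by_phoneme_runs(frames, labels):
    """Average each run of consecutive frames that share a phoneme label.

    frames: list of equal-length feature vectors (lists of floats), one per frame.
    labels: list of phoneme labels, one per frame (assumed to be given).
    Returns (compressed_frames, run_labels), one entry per run.
    """
    if len(frames) != len(labels):
        raise ValueError("need exactly one label per frame")

    compressed, run_labels = [], []
    # groupby only merges *adjacent* equal labels, so a phoneme that
    # recurs later in the utterance starts a new run.
    for label, run in groupby(zip(labels, frames), key=lambda pair: pair[0]):
        vectors = [vec for _, vec in run]
        n = len(vectors)
        # Element-wise mean over the frames in this run.
        compressed.append([sum(column) / n for column in zip(*vectors)])
        run_labels.append(label)
    return compressed, run_labels


# Toy example: six 2-dimensional frames with made-up labels.
frames = [[1.0, 1.0], [3.0, 3.0], [10.0, 10.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
labels = ["AH", "AH", "T", "AH", "AH", "AH"]
compressed, run_labels = average_by_phoneme_runs(frames, labels)
```

In this toy input the six frames collapse into three vectors, one per run of identical labels, which is the sense in which the source sequence becomes shorter.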
