Fitting New Speakers Based on a Short Untranscribed Sample

Adam Polyak; Eliya Nachmani; Lior Wolf; Yaniv Taigman

arXiv 1802.06984 v1 · pith:X5FILVWM · submitted 2018-02-20 · cs.LG cs.SD eess.AS

classification cs.LG cs.SD eess.AS

keywords sample, short, speaker, audio, fitting, network, speakers, speech

Abstract

Learning-based text-to-speech systems have the potential to generalize from one speaker to the next and thus to require only a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that, given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate greatly improved performance both on the dataset speakers and, more importantly, when fitting new voices, even from very short samples.
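The abstract does not say how the speaker network is built or which consistency losses are used, so the following is only a toy sketch of the general idea, under stated assumptions: a hypothetical `embed_speaker` function pools audio feature frames over time (so no transcript is needed) and projects them into an embedding space, and a hypothetical `consistency_loss` asks two excerpts of the same recording to land at the same point. The pooling, the projection, the squared-error loss, and every name here are illustrative choices, not the authors' implementation.

```python
import math
import random


def embed_speaker(frames, weights):
    """Map a list of feature frames to a unit-length speaker embedding.

    Assumption: mean pooling plus one tanh projection stands in for the
    (unspecified) speaker network described in the abstract.
    """
    num_features = len(frames[0])
    # Average over time, so neither a transcript nor an alignment is needed.
    pooled = [sum(f[i] for f in frames) / len(frames) for i in range(num_features)]
    # One row of `weights` per embedding dimension.
    projected = [math.tanh(sum(p * w for p, w in zip(pooled, row))) for row in weights]
    norm = math.sqrt(sum(v * v for v in projected)) + 1e-8
    return [v / norm for v in projected]


def consistency_loss(frames_a, frames_b, weights):
    """Mean squared distance between the embeddings of two excerpts.

    Assumption: this is one plausible consistency term, chosen for
    illustration only.
    """
    emb_a = embed_speaker(frames_a, weights)
    emb_b = embed_speaker(frames_b, weights)
    return sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)) / len(emb_a)


# Synthetic stand-ins for real data: 200 frames of 40 features, 16-dim embedding.
rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(16)]
sample = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(200)]

loss_split = consistency_loss(sample[:100], sample[100:], weights)
loss_identical = consistency_loss(sample, sample, weights)
```

In a trained system the weights would be learned so that a term like `loss_split` is driven down for excerpts of the same speaker; here the weights are random, so only `loss_identical` is guaranteed to be zero.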

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hierarchical Sequence to Sequence Voice Conversion with Limited Data
eess.AS · 2019-07 · unverdicted · novelty 4.0

A hierarchical seq2seq model for parallel voice conversion, pretrained as an autoencoder on single-speaker data and then adapted to limited multi-speaker data. It operates on mel spectrograms, which a WaveNet vocoder converts to audio.