Any-speaker adaptive text-to-speech synthesis with diffusion models

Minki Kang, Dongchan Min, Sung Ju Hwang · 2022 · arXiv 2211.09383

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

cs.CL · 2023-01-05 · unverdicted · novelty 7.0 · 2 refs

VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

cs.CV · 2023-06-25 · conditional · novelty 5.0

MobileSAM is a distilled version of SAM that is 60x smaller, matches the original's performance, and runs 5x faster than the concurrent FastSAM while supporting CPU inference.

citing papers explorer

Showing 2 of 2 citing papers.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers · cs.CL · 2023-01-05 · unverdicted · none · ref 9 · 2 links
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications · cs.CV · 2023-06-25 · conditional · none · ref 12