arXiv: 2301.02111 · v1 · submitted 2023-01-05 · cs.CL · cs.SD · eess.AS

Recognition: 2 theorem links

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

Pith reviewed 2026-05-13 01:25 UTC · model grok-4.3

classification: cs.CL · cs.SD · eess.AS

keywords: zero-shot TTS · neural codec · language modeling · in-context learning · speech synthesis · discrete audio codes · VALL-E

The pith

Vall-E treats text-to-speech as conditional language modeling over discrete audio codes to enable zero-shot personalized synthesis from a 3-second prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that converting speech waveforms into discrete codes via an off-the-shelf neural codec and then training an autoregressive language model on those codes turns TTS into a standard conditional language modeling task. With 60,000 hours of English speech data, the resulting model acquires in-context learning, allowing it to generate speech for an unseen speaker when given text plus a short audio prompt. This matters because it removes the need for speaker-specific fine-tuning or large enrollment recordings while achieving higher naturalness and similarity than prior zero-shot systems. The same mechanism also transfers emotion and acoustic environment from the prompt to the output.

Core claim

Vall-E is a neural codec language model trained on discrete codes derived from a neural audio codec. By scaling training data to 60K hours and framing synthesis as next-token prediction conditioned on text and prompt codes, the model acquires zero-shot capabilities: it synthesizes high-quality personalized speech for unseen speakers from a 3-second acoustic prompt, outperforming prior systems on naturalness and speaker similarity while preserving the prompt's emotion and environment.
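The next-token framing can be written as a generic conditional factorization. This is a sketch under assumed notation, not a formula taken from the paper: x stands for the text token sequence, \tilde{c} for the codes of the acoustic prompt, c_t for the t-th generated audio code, and \theta for the model parameters. The paper's exact factorization may differ.

```latex
p(c_{1:T} \mid x, \tilde{c}; \theta) \;=\; \prod_{t=1}^{T} p\left(c_t \mid c_{<t},\, x,\, \tilde{c};\, \theta\right)
```

Read this way, zero-shot synthesis needs no parameter update: the unseen speaker enters only through \tilde{c} on the conditioning side.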

What carries the argument

The neural codec language model that performs autoregressive prediction over discrete audio codes produced by an off-the-shelf codec, conditioned on text tokens and prompt codes.
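A minimal sketch of that conditioning, not the authors' implementation: text tokens and prompt codes form a prefix, and new codes are sampled one at a time until an end marker appears. The function `toy_logits`, the codebook size, and the `EOS` id are hypothetical stand-ins for a trained model, and text tokens and audio codes share one integer space here purely for brevity.

```python
import math
import random

CODEBOOK_SIZE = 8      # hypothetical; real codecs use much larger codebooks
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence id


def toy_logits(prefix):
    """Hypothetical stand-in for a trained model: one logit per code, plus EOS."""
    rng = random.Random(sum(prefix) * 31 + len(prefix))
    logits = [rng.uniform(-1.0, 1.0) for _ in range(CODEBOOK_SIZE + 1)]
    # Make EOS more likely as the sequence grows so generation tends to stop.
    logits[EOS] += 0.2 * len(prefix) - 4.0
    return logits


def sample(logits, rng):
    """Draw one index from softmax(logits)."""
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(text_tokens, prompt_codes, max_new=50, seed=0):
    """Autoregressively extend the prefix [text ; prompt] with sampled codes."""
    rng = random.Random(seed)
    prefix = list(text_tokens) + list(prompt_codes)
    generated = []
    for _ in range(max_new):
        next_code = sample(toy_logits(prefix + generated), rng)
        if next_code == EOS:
            break
        generated.append(next_code)
    return generated


codes = generate(text_tokens=[3, 1, 4], prompt_codes=[2, 7, 1, 8])
print(len(codes), codes)
```

In a real system the returned codes would be passed to the codec's decoder to produce a waveform; that step is omitted here.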

Load-bearing premise

The discrete codes from the neural audio codec retain enough speaker identity, prosody, emotion, and environmental detail for the language model to reconstruct natural speech without any continuous-signal modeling.
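To make the premise concrete, the sketch below assumes the codec discretizes each frame with residual vector quantization, a common choice in neural audio codecs; this page does not specify the quantizer. The codebooks are random and untrained, the sizes are hypothetical, and each codebook includes the zero vector so that adding a level can never increase the error. It only illustrates how stacked codebooks trade token count for fidelity.

```python
import random


def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    codes, errors = [], []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
        codes.append(idx)
        errors.append(sum(r * r for r in residual))  # squared error so far
    return codes, errors


rng = random.Random(0)
DIM, LEVELS, SIZE = 4, 8, 16  # hypothetical toy sizes
# Untrained random codebooks with shrinking scale; entry 0 is the zero vector.
codebooks = [
    [[0.0] * DIM]
    + [[rng.gauss(0, 0.5 ** level) for _ in range(DIM)] for _ in range(SIZE - 1)]
    for level in range(LEVELS)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes, errors = rvq_encode(frame, codebooks)
print(codes)
print([round(e, 4) for e in errors])
```

Whether the residual that remains after the last level still carries speaker, emotion, or room cues is exactly what the premise leaves open.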

What would settle it

Human listening tests in which raters compare Vall-E output to real recordings of the prompt speaker on naturalness, speaker similarity, emotion match, and environment match; failure to exceed current zero-shot baselines on these metrics would falsify the claim.
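Listening-test ratings of this kind are usually summarized as a mean opinion score (MOS) with a confidence interval. The sketch below shows one simple way to compute that, assuming 1-to-5 ratings and a normal approximation. The ratings are invented for illustration and are not results from the paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and a normal-approximation 95% confidence half-width."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 1-5 ratings, invented purely for illustration.
system_a = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
system_b = [3, 4, 3, 3, 4, 3, 2, 4, 3, 3]

for name, ratings in (("A", system_a), ("B", system_b)):
    mean, half_width = mos_with_ci(ratings)
    print(f"system {name}: MOS = {mean:.2f} ± {half_width:.2f}")
```

With few raters the normal approximation is rough; a paired test over the same utterances would be a stricter comparison.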

Original abstract

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VALL-E shows that a language model scaled up on off-the-shelf codec tokens can perform zero-shot TTS from a 3-second prompt and beat prior systems on naturalness and similarity, but the result depends heavily on how much information the codec retains.

The main point is that treating TTS as next-token prediction over discrete neural codec tokens, then scaling the training data to 60k hours, produces a model that can synthesize speech for unseen speakers using only a short acoustic prompt. The paper reports better naturalness and speaker similarity than the previous zero-shot baseline, along with some ability to carry over emotion and acoustic environment from the prompt.

Referee Report

2 major / 1 minor

Summary. The paper introduces VALL-E, a neural codec language model for TTS that treats synthesis as conditional language modeling over discrete codes from an off-the-shelf neural audio codec. Pre-trained on 60k hours of English speech, the model exhibits in-context learning and performs zero-shot personalized TTS using only a 3-second acoustic prompt from an unseen speaker, claiming to significantly outperform prior zero-shot TTS systems in naturalness and speaker similarity while also preserving the prompt's emotion and acoustic environment.

Significance. If the empirical results and underlying assumptions hold, this represents a notable advance by reframing TTS as scalable language modeling on discrete audio tokens rather than continuous regression, leveraging massive data to enable prompt-based zero-shot personalization without fine-tuning. The emergence of in-context learning from 60k-hour pretraining and the reported preservation of non-textual attributes are potentially high-impact if validated, as they could generalize the LM paradigm to audio generation tasks.

major comments (2)

[Abstract] The central claim that VALL-E 'significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity' lacks any quantitative details on metrics (e.g., MOS, similarity scores), test sets, speaker counts, statistical significance, or controls for prompt selection/quality. This information is required to substantiate the empirical superiority and zero-shot capability.
[Method (codec and pretraining description)] The zero-shot performance with 3-second prompts rests on the unverified assumption that discrete codes from the off-the-shelf codec retain sufficient speaker identity, prosody, emotion, and environmental details. No analysis, reconstruction experiments, or ablations are referenced to confirm information preservation in these dimensions, which is load-bearing for the in-context learning mechanism to copy or extrapolate from the acoustic prompt.

minor comments (1)

[Abstract] The abstract and introduction would benefit from explicit comparison of training data scale (60k hours) against prior zero-shot TTS systems to contextualize the scaling contribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of VALL-E's significance and for the constructive major comments. We address each point below and have revised the manuscript to incorporate additional details and analyses where feasible.

Referee: [Abstract] The central claim that VALL-E 'significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity' lacks any quantitative details on metrics (e.g., MOS, similarity scores), test sets, speaker counts, statistical significance, or controls for prompt selection/quality. This information is required to substantiate the empirical superiority and zero-shot capability.

Authors: We agree that the abstract would benefit from more specificity to support the central claim. The original abstract was kept concise per standard practice, with full quantitative results (including MOS naturalness and speaker similarity scores, test set of 100+ unseen speakers, 3-second prompts, and statistical tests) provided in Section 4 and Tables 2-3. In the revision we have updated the abstract to include key metrics (e.g., naturalness MOS improvement and similarity scores) and a brief reference to the zero-shot test protocol and prompt controls. This directly addresses the request without exceeding length limits.
Revision: yes
Referee: [Method (codec and pretraining description)] The zero-shot performance with 3-second prompts rests on the unverified assumption that discrete codes from the off-the-shelf codec retain sufficient speaker identity, prosody, emotion, and environmental details. No analysis, reconstruction experiments, or ablations are referenced to confirm information preservation in these dimensions, which is load-bearing for the in-context learning mechanism to copy or extrapolate from the acoustic prompt.

Authors: This is a fair point; the preservation properties of the discrete codes are foundational. The codec (EnCodec) was chosen because its original publication demonstrates high-fidelity reconstruction that retains speaker, prosody, and acoustic environment information. Our empirical zero-shot results and subjective listening tests (emotion and environment matching) provide indirect validation. To strengthen the manuscript we have added a new subsection with reconstruction experiments (objective speaker embedding similarity and prosody metrics on coded vs. original audio) plus a brief ablation on codebook levels, confirming sufficient information retention for in-context learning. These additions are now referenced in the method section.
Revision: yes
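An "objective speaker embedding similarity" check of the kind mentioned in this response is commonly a cosine similarity between embeddings of the original and the codec-reconstructed audio. The sketch below shows only that final comparison. The embedding vectors are hypothetical placeholders; a real pipeline would obtain them from a speaker verification model, which is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
original_emb = [0.12, -0.40, 0.88, 0.05]
reconstructed_emb = [0.10, -0.35, 0.90, 0.07]

similarity = cosine_similarity(original_emb, reconstructed_emb)
print(round(similarity, 4))
```

A score near 1 would suggest the codec round trip preserved speaker identity as that embedding model measures it; it says nothing by itself about emotion or acoustic environment.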

Circularity Check

0 steps flagged

No circularity; empirical training and evaluation chain is self-contained

The paper trains an autoregressive language model on discrete tokens from an external off-the-shelf neural codec, scales the training corpus to 60k hours, and reports zero-shot in-context TTS performance via standard held-out speaker similarity and naturalness metrics. No equation or claim reduces by construction to a fitted parameter, self-definition, or prior self-citation; the codec is treated as a fixed black-box input, and success is measured against independent baselines rather than internal re-derivation. The derivation therefore contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes that an existing neural audio codec produces a discrete representation sufficient for high-quality reconstruction and that language modeling on these tokens can capture long-range acoustic dependencies without explicit prosody or speaker modeling modules.

axioms (2)

domain assumption: Discrete codes from an off-the-shelf neural audio codec retain all information needed for natural speech synthesis and speaker similarity.
  Invoked when the paper states it uses codes derived from an off-the-shelf codec and treats TTS as language modeling over those codes.
domain assumption: Scaling training data to 60K hours enables emergent in-context learning for zero-shot TTS.
  Central to the claim that the model can use a 3-second prompt without additional training.

pith-pipeline@v0.9.0 · 5509 in / 1487 out tokens · 41528 ms · 2026-05-13T01:25:48.445219+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/LawOfExistence.lean · defect_zero_iff_one · unclear

Paper passage: "VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
cs.SD 2026-05 unverdicted novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
eess.AS 2026-05 unverdicted novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
cs.SD 2026-05 unverdicted novelty 7.0

Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.
MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
cs.SD 2026-05 unverdicted novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
eess.AS 2026-04 unverdicted novelty 7.0

Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data
cs.CR 2026-04 unverdicted novelty 7.0

V.O.I.C.E is a new taxonomy that organizes synthetic voice risks into five categories and shows how they interact with exposure, visibility, and legal context using empirical incident data.
PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
cs.AI 2026-04 unverdicted novelty 7.0

PhySE combines VLM pre-training for fast social context profiling with a dynamic psychological agent to overcome delays and static tactics in AR-LLM social engineering attacks, tested in a 60-person user study.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
eess.AS 2026-04 unverdicted novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
X-VC: Zero-shot Streaming Voice Conversion in Codec Space
eess.AS 2026-04 unverdicted novelty 7.0

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
cs.SD 2026-05 unverdicted novelty 6.0

Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.
Exploring Token-Space Manipulation in Latent Audio Tokenizers
cs.SD 2026-05 unverdicted novelty 6.0

LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
cs.CV 2026-05 unverdicted novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
cs.SD 2026-05 accept novelty 6.0

MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
eess.AS 2026-04 unverdicted novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
eess.AS 2026-04 unverdicted novelty 6.0

HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.
StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection
eess.AS 2026-04 unverdicted novelty 6.0

StreamMark trains an Encoder-Distortion-Decoder network to embed semi-fragile watermarks that remain recoverable after benign audio transformations but drop to random accuracy under voice conversion and editing attacks.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
Borderless Long Speech Synthesis
cs.SD 2026-03 unverdicted novelty 6.0

Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
cs.SD 2026-04 unverdicted novelty 5.0

HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.
Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
cs.SD 2026-04 unverdicted novelty 5.0

A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.
Voxtral TTS
cs.AI 2026-03 unverdicted novelty 5.0

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...
WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
cs.CL 2026-03 unverdicted novelty 5.0

WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.
Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
cs.SD 2024-12 unverdicted novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
eess.AS 2026-04 unverdicted novelty 4.0

A system based on OmniVoice with multi-model ensemble distillation for fine-tuning shows consistent gains in intelligibility metrics while keeping speaker similarity for cross-lingual scientific speech.
ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis
cs.SD 2026-04 unverdicted novelty 4.0

ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended...
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
