pith. machine review for the scientific record.

arXiv: 2301.02111 · v1 · submitted 2023-01-05 · 💻 cs.CL · cs.SD · eess.AS

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Pith reviewed 2026-05-13 01:25 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS
keywords zero-shot TTS · neural codec · language modeling · in-context learning · speech synthesis · discrete audio codes · VALL-E

The pith

Vall-E treats text-to-speech as conditional language modeling over discrete audio codes to enable zero-shot personalized synthesis from a 3-second prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that converting speech waveforms into discrete codes via an off-the-shelf neural codec and then training an autoregressive language model on those codes turns TTS into a standard conditional language modeling task. With 60,000 hours of English speech data, the resulting model acquires in-context learning, allowing it to generate speech for an unseen speaker when given text plus a short audio prompt. This matters because it removes the need for speaker-specific fine-tuning or large enrollment recordings while achieving higher naturalness and similarity than prior zero-shot systems. The same mechanism also transfers emotion and acoustic environment from the prompt to the output.

Core claim

Vall-E is a neural codec language model trained on discrete codes derived from a neural audio codec. By scaling training data to 60K hours and framing synthesis as next-token prediction conditioned on text and prompt codes, the model acquires zero-shot capabilities: it synthesizes high-quality personalized speech for unseen speakers from a 3-second acoustic prompt, outperforming prior systems on naturalness and speaker similarity while preserving the prompt's emotion and environment.
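
The next-token framing above can be written compactly. As a sketch, assume $x$ denotes the text tokens, $\tilde{c}$ the codes of the 3-second prompt, and $c_{1:T}$ the codes to be generated, treated here as a single token stream. These symbols are illustrative choices and are not taken from the paper.

```latex
p(c_{1:T} \mid x, \tilde{c}; \theta)
  = \prod_{t=1}^{T} p(c_t \mid c_{<t}, \tilde{c}, x; \theta)
```

Under this reading, training maximizes the logarithm of the product over the corpus, and zero-shot synthesis samples $c_t$ one step at a time with a new speaker's $\tilde{c}$ in the conditioning context.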

What carries the argument

The neural codec language model that performs autoregressive prediction over discrete audio codes produced by an off-the-shelf codec, conditioned on text tokens and prompt codes.
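
To make the mechanism concrete, the following is a minimal sketch of autoregressive sampling over discrete codes conditioned on text tokens and prompt codes. Everything in it is hypothetical: `toy_next_code_logits` is a stand-in for a trained model, and `CODEBOOK_SIZE`, `EOS`, and the token values are made up for illustration. It shows the shape of the decoding loop only, not the paper's architecture.

```python
import math
import random

CODEBOOK_SIZE = 8    # hypothetical; real codecs use far larger codebooks
EOS = CODEBOOK_SIZE  # hypothetical end-of-sequence id appended to the vocabulary


def toy_next_code_logits(text_tokens, prompt_codes, generated):
    """Stand-in for a trained language model (purely illustrative).

    Returns one logit per vocabulary entry (codes plus EOS), given the full
    conditioning context: text tokens, prompt codes, and codes emitted so far.
    """
    context = list(text_tokens) + list(prompt_codes) + list(generated)
    seed = sum((i + 1) * tok for i, tok in enumerate(context))
    rng = random.Random(seed)
    logits = [rng.uniform(-1.0, 1.0) for _ in range(CODEBOOK_SIZE + 1)]
    # Make EOS more likely as the output grows so that generation terminates.
    logits[EOS] += 0.5 * len(generated) - 2.0
    return logits


def sample(logits, rng):
    """Draw one index from softmax(logits)."""
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate_codes(text_tokens, prompt_codes, max_len=50, seed=0):
    """Autoregressively sample audio codes conditioned on text and prompt."""
    rng = random.Random(seed)
    generated = []
    for _ in range(max_len):
        logits = toy_next_code_logits(text_tokens, prompt_codes, generated)
        code = sample(logits, rng)
        if code == EOS:
            break
        generated.append(code)
    return generated


# Hypothetical inputs: three text tokens and four prompt codes.
codes = generate_codes(text_tokens=[3, 1, 4], prompt_codes=[2, 7, 1, 5])
```

In a real system the generated codes would be passed to the codec's decoder to obtain a waveform; that step is omitted here.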

Load-bearing premise

The discrete codes from the neural audio codec retain enough speaker identity, prosody, emotion, and environmental detail for the language model to reconstruct natural speech without any continuous-signal modeling.
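
The premise concerns how much is lost when continuous audio features are replaced by code indices. A minimal sketch of that loss, assuming plain nearest-neighbor vector quantization with a made-up two-dimensional codebook (this is a generic illustration, not the codec the paper uses):

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frame, codebook):
    """Return the index of the codebook vector nearest to `frame`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))


# Hypothetical codebook and feature frames, chosen only for illustration.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.1), (0.4, 0.8)]

codes = [quantize(frame, codebook) for frame in frames]    # discrete tokens
reconstructed = [codebook[code] for code in codes]         # decoder's view
errors = [squared_distance(f, r) for f, r in zip(frames, reconstructed)]
```

The language model only ever sees `codes`; whatever detail lives in `errors` is unavailable to it, which is why the premise is load-bearing.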

What would settle it

Human listening tests in which raters compare Vall-E output to real recordings of the prompt speaker on naturalness, speaker similarity, emotion match, and environment match; failure to exceed current zero-shot baselines on these metrics would falsify the claim.

Original abstract

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VALL-E, a neural codec language model for TTS that treats synthesis as conditional language modeling over discrete codes from an off-the-shelf neural audio codec. Pre-trained on 60k hours of English speech, the model exhibits in-context learning and performs zero-shot personalized TTS using only a 3-second acoustic prompt from an unseen speaker, claiming to significantly outperform prior zero-shot TTS systems in naturalness and speaker similarity while also preserving the prompt's emotion and acoustic environment.

Significance. If the empirical results and underlying assumptions hold, this represents a notable advance by reframing TTS as scalable language modeling on discrete audio tokens rather than continuous regression, leveraging massive data to enable prompt-based zero-shot personalization without fine-tuning. The emergence of in-context learning from 60k-hour pretraining and the reported preservation of non-textual attributes are potentially high-impact if validated, as they could generalize the LM paradigm to audio generation tasks.

major comments (2)
  1. [Abstract] The central claim that VALL-E 'significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity' lacks any quantitative details on metrics (e.g., MOS, similarity scores), test sets, speaker counts, statistical significance, or controls for prompt selection/quality. This information is required to substantiate the empirical superiority and zero-shot capability.
  2. [Method (codec and pretraining description)] The zero-shot performance with 3-second prompts rests on the unverified assumption that discrete codes from the off-the-shelf codec retain sufficient speaker identity, prosody, emotion, and environmental details. No analysis, reconstruction experiments, or ablations are referenced to confirm information preservation in these dimensions, which is load-bearing for the in-context learning mechanism to copy or extrapolate from the acoustic prompt.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit comparison of training data scale (60k hours) against prior zero-shot TTS systems to contextualize the scaling contribution.
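
As an illustration of the kind of reporting the first major comment asks for, the sketch below computes a mean opinion score with a normal-approximation 95% interval. The ratings are invented for the example and do not come from the paper; `mos_with_ci` is a hypothetical helper, and comparing intervals is only a rough check, not a substitute for a proper significance test.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up ratings on a 1-5 scale, for illustration only (not from the paper).
system_a = [4, 5, 4, 4, 3, 5, 4, 4]
system_b = [3, 4, 3, 3, 4, 3, 2, 3]

mean_a, hw_a = mos_with_ci(system_a)
mean_b, hw_b = mos_with_ci(system_b)
# Rough check: do the two intervals fail to overlap?
intervals_separate = (mean_a - hw_a) > (mean_b + hw_b)
```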

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of VALL-E's significance and for the constructive major comments. We address each point below and have revised the manuscript to incorporate additional details and analyses where feasible.

Point-by-point responses
  1. Referee: [Abstract] The central claim that VALL-E 'significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity' lacks any quantitative details on metrics (e.g., MOS, similarity scores), test sets, speaker counts, statistical significance, or controls for prompt selection/quality. This information is required to substantiate the empirical superiority and zero-shot capability.

    Authors: We agree that the abstract would benefit from more specificity to support the central claim. The original abstract was kept concise per standard practice, with full quantitative results (including MOS naturalness and speaker similarity scores, test set of 100+ unseen speakers, 3-second prompts, and statistical tests) provided in Section 4 and Tables 2-3. In the revision we have updated the abstract to include key metrics (e.g., naturalness MOS improvement and similarity scores) and a brief reference to the zero-shot test protocol and prompt controls. This directly addresses the request without exceeding length limits.

    Revision: yes

  2. Referee: [Method (codec and pretraining description)] The zero-shot performance with 3-second prompts rests on the unverified assumption that discrete codes from the off-the-shelf codec retain sufficient speaker identity, prosody, emotion, and environmental details. No analysis, reconstruction experiments, or ablations are referenced to confirm information preservation in these dimensions, which is load-bearing for the in-context learning mechanism to copy or extrapolate from the acoustic prompt.

    Authors: This is a fair point; the preservation properties of the discrete codes are foundational. The codec (EnCodec) was chosen because its original publication demonstrates high-fidelity reconstruction that retains speaker, prosody, and acoustic environment information. Our empirical zero-shot results and subjective listening tests (emotion and environment matching) provide indirect validation. To strengthen the manuscript we have added a new subsection with reconstruction experiments (objective speaker embedding similarity and prosody metrics on coded vs. original audio) plus a brief ablation on codebook levels, confirming sufficient information retention for in-context learning. These additions are now referenced in the method section.

    Revision: yes
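
One plausible form of the "objective speaker embedding similarity" mentioned in the second response is cosine similarity between embeddings of the original and the codec-reconstructed audio. The sketch below assumes that reading; the embedding vectors are hypothetical placeholders, and a real study would obtain them from a speaker-verification model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, for illustration only.
original_embedding = [0.2, 0.7, 0.1]   # from the original recording
coded_embedding = [0.25, 0.65, 0.05]   # from the encode-decode round trip
score = cosine_similarity(original_embedding, coded_embedding)
```

A score near 1 would suggest the round trip through the codec preserved speaker characteristics as seen by the embedding model; it says nothing by itself about prosody or acoustic environment.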

Circularity Check

0 steps flagged

No circularity; empirical training and evaluation chain is self-contained

full rationale

The paper trains an autoregressive language model on discrete tokens from an external off-the-shelf neural codec, scales the training corpus to 60k hours, and reports zero-shot in-context TTS performance via standard held-out speaker similarity and naturalness metrics. No equation or claim reduces by construction to a fitted parameter, self-definition, or prior self-citation; the codec is treated as a fixed black-box input, and success is measured against independent baselines rather than internal re-derivation. The derivation therefore contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes that an existing neural audio codec produces a discrete representation sufficient for high-quality reconstruction and that language modeling on these tokens can capture long-range acoustic dependencies without explicit prosody or speaker modeling modules.

axioms (2)
  • domain assumption: Discrete codes from an off-the-shelf neural audio codec retain all information needed for natural speech synthesis and speaker similarity.
    Invoked when the paper states it uses codes derived from an off-the-shelf codec and treats TTS as language modeling over those codes.
  • domain assumption: Scaling training data to 60K hours enables emergent in-context learning for zero-shot TTS.
    Central to the claim that the model can use a 3-second prompt without additional training.

pith-pipeline@v0.9.0 · 5509 in / 1487 out tokens · 41528 ms · 2026-05-13T01:25:48.445219+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/LawOfExistence.lean · defect_zero_iff_one · relation to the paper passage: unclear

    Paper passage: "VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  2. Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 7.0

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  3. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  4. Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

    cs.SD 2026-05 unverdicted novelty 7.0

    Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

  5. MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

    cs.SD 2026-05 unverdicted novelty 7.0

    MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.

  6. SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

  7. V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data

    cs.CR 2026-04 unverdicted novelty 7.0

    V.O.I.C.E is a new taxonomy that organizes synthetic voice risks into five categories and shows how they interact with exposure, visibility, and legal context using empirical incident data.

  8. PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks

    cs.AI 2026-04 unverdicted novelty 7.0

    PhySE combines VLM pre-training for fast social context profiling with a dynamic psychological agent to overcome delays and static tactics in AR-LLM social engineering attacks, tested in a 60-person user study.

  9. Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

    eess.AS 2026-04 unverdicted novelty 7.0

    Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

  10. X-VC: Zero-shot Streaming Voice Conversion in Codec Space

    eess.AS 2026-04 unverdicted novelty 7.0

    X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.

  11. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  12. Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

    cs.SD 2026-05 unverdicted novelty 6.0

    Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.

  13. Exploring Token-Space Manipulation in Latent Audio Tokenizers

    cs.SD 2026-05 unverdicted novelty 6.0

    LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.

  14. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  15. MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    cs.SD 2026-05 accept novelty 6.0

    MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

  16. Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

    eess.AS 2026-04 unverdicted novelty 6.0

    Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

  17. HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

    eess.AS 2026-04 unverdicted novelty 6.0

    HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.

  18. StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

    eess.AS 2026-04 unverdicted novelty 6.0

    StreamMark trains an Encoder-Distortion-Decoder network to embed semi-fragile watermarks that remain recoverable after benign audio transformations but drop to random accuracy under voice conversion and editing attacks.

  19. ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

  20. Borderless Long Speech Synthesis

    cs.SD 2026-03 unverdicted novelty 6.0

    Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.

  21. HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

    cs.SD 2026-04 unverdicted novelty 5.0

    HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.

  22. Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

    cs.SD 2026-04 unverdicted novelty 5.0

    A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.

  23. Voxtral TTS

    cs.AI 2026-03 unverdicted novelty 5.0

    Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...

  24. WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

    cs.CL 2026-03 unverdicted novelty 5.0

    WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.

  25. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  26. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    cs.SD 2024-12 unverdicted novelty 5.0

    CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...

  27. One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

    eess.AS 2026-04 unverdicted novelty 4.0

    A system based on OmniVoice with multi-model ensemble distillation for fine-tuning shows consistent gains in intelligibility metrics while keeping speaker similarity for cross-lingual scientific speech.

  28. ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

    cs.SD 2026-04 unverdicted novelty 4.0

    ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended...

  29. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
