hub

arXiv preprint arXiv:1904.02882 , year=

Libritts: A corpus derived from librispeech for text-to-speech , author= · 2019 · cs.SD · arXiv 1904.02882

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

open full Pith review browse 15 citing papers arXiv PDF

abstract

This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained from the LibriTTS corpus achieved above 4.0 in mean opinion scores in naturalness in five out of six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 3 method 1

citation-polarity summary

use dataset 3 use method 1

representative citing papers

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

cs.CL · 2025-09-26 · unverdicted · novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

MLS: A Large-Scale Multilingual Dataset for Speech Research

eess.AS · 2020-12-07 · accept · novelty 6.0

MLS is a new large-scale multilingual speech corpus derived from LibriVox with 44.5k hours of English and 6k hours across seven other languages, plus baseline ASR and LM models.

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

eess.AS · 2026-05-06 · unverdicted · novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

eess.AS · 2026-04-21 · unverdicted · novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

cs.CL · 2025-08-25 · unverdicted · novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

Speech bandwidth extension with WaveNet

eess.AS · 2019-07-05 · conditional · novelty 5.0

WaveNet conditioned on log-mel spectrograms upsamples 8 kHz GSM-FR speech to 24 kHz and reaches perceptual quality close to 16 kHz AMR-WB in MUSHRA listening tests.

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

cs.CL · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

eess.AS · 2026-03-18 · unverdicted · novelty 4.0 · 2 refs

FLAIR enables simultaneous latent reasoning during speech input in full-duplex dialogue models via recursive latent embeddings and an ELBO-based training objective without added latency.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

eess.AS · 2019-07-05 · unverdicted · novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

eess.AS · 2026-05-15 · unverdicted · novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.

citing papers explorer

Showing 15 of 15 citing papers.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling cs.SD · 2026-05-11 · unverdicted · none · ref 3
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 117
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Two-Dimensional Quantization for Geometry-Aware Audio Coding cs.SD · 2025-12-01 · unverdicted · none · ref 79 · internal anchor
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs cs.CL · 2025-09-26 · unverdicted · none · ref 83 · internal anchor
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
MLS: A Large-Scale Multilingual Dataset for Speech Research eess.AS · 2020-12-07 · accept · none · ref 25 · internal anchor
MLS is a new large-scale multilingual speech corpus derived from LibriVox with 44.5k hours of English and 6k hours across seven other languages, plus baseline ASR and LM models.
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions eess.AS · 2026-05-06 · unverdicted · none · ref 40
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation eess.AS · 2026-04-21 · unverdicted · none · ref 44
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
Enhancing Speech Large Language Models through Reinforced Behavior Alignment cs.CL · 2025-08-25 · unverdicted · none · ref 56 · internal anchor
Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 153 · internal anchor
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
Speech bandwidth extension with WaveNet eess.AS · 2019-07-05 · conditional · none · ref 21 · internal anchor
WaveNet conditioned on log-mel spectrograms upsamples 8 kHz GSM-FR speech to 24 kHz and reaches perceptual quality close to 16 kHz AMR-WB in MUSHRA listening tests.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM cs.CL · 2026-05-07 · unverdicted · none · ref 57 · 2 links
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 83
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning eess.AS · 2026-03-18 · unverdicted · none · ref 39 · 2 links · internal anchor
FLAIR enables simultaneous latent reasoning during speech input in full-duplex dialogue models via recursive latent embeddings and an ELBO-based training objective without added latency.
A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach eess.AS · 2019-07-05 · unverdicted · none · ref 18 · internal anchor
A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models eess.AS · 2026-05-15 · unverdicted · none · ref 68 · internal anchor
A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.

arXiv preprint arXiv:1904.02882 , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer