MLS: A Large-Scale Multilingual Dataset for Speech Research
Pith reviewed 2026-05-18 14:50 UTC · model grok-4.3
The pith
A new multilingual speech dataset supplies tens of thousands of hours of audiobook audio paired with text across eight languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone.
What carries the argument
The MLS dataset constructed from LibriVox audiobook recordings with their associated transcriptions.
If this is right
- Multilingual automatic speech recognition systems can be trained on substantially more data than before.
- Text-to-speech models gain access to large amounts of natural read speech in multiple languages.
- Baseline models for each language provide reference points for future research improvements.
- Public release allows independent groups to build upon the same starting resource.
Where Pith is reading between the lines
- Speech technologies for the included languages could see faster improvement due to the increased training volume.
- If the read audiobook style transfers well, the dataset might support applications beyond formal speech.
- Combining this with other datasets could address gaps in spontaneous or conversational speech data.
Load-bearing premise
The provided transcriptions accurately match the spoken audio without significant errors that would mislead model training.
What would settle it
Evaluation results where speech recognition models trained solely on MLS data show no improvement over models trained on much smaller prior datasets when tested on held-out speech.
Original abstract
This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models and for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Multilingual LibriSpeech (MLS) dataset, a large-scale multilingual corpus derived from LibriVox audiobooks. It covers 8 languages with approximately 44.5K hours of English speech and 6K hours across the other seven languages. Baseline ASR and language model results are provided for each language, and the authors announce that the full dataset will be released publicly.
Significance. If the audio-transcript alignments prove sufficiently clean, MLS would constitute a valuable public resource for multilingual ASR and TTS research, especially given its scale and the inclusion of reference baselines. The public release itself is a clear strength that directly benefits the community.
major comments (2)
- [Dataset construction] Dataset construction section: the alignment procedure for the seven non-English languages is described at a high level but no quantitative validation of transcription accuracy (e.g., WER against a small set of independently verified native-speaker transcripts or alignment error statistics) is reported. This measurement is load-bearing for the claim that the ~6K hours constitute a ready-to-use corpus for speech research.
- [Baseline results] Baseline ASR results section: the reported WER numbers are presented without any analysis of how performance might be affected by possible transcription noise in the non-English portions or comparison to models trained on existing cleaner monolingual corpora.
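The validation requested in the first major comment can be sketched as a corpus-level word error rate (WER) between independently verified transcripts and the released ones. The snippet below is a minimal illustration, not the authors' pipeline: the function names `word_edit_distance` and `corpus_wer`, the lowercase-and-split normalization, and the sample sentences are all assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words.

    Each pair is (verified_transcript, released_transcript). The
    normalization (lowercasing, whitespace split) is an assumption;
    a real study would fix a per-language normalization.
    """
    errors = 0
    ref_words = 0
    for verified, released in pairs:
        ref = verified.lower().split()
        hyp = released.lower().split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("need at least one reference word")
    return errors / ref_words


# Hypothetical sample: one substitution across eight reference words.
sample = [
    ("the cat sat on the mat", "the cat sat on a mat"),
    ("hello world", "hello world"),
]
print(corpus_wer(sample))
```

Pooling errors over the whole subset, rather than averaging per-utterance rates, keeps short utterances from dominating the estimate.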
minor comments (1)
- [Abstract] Abstract: the sentence contains a repeated conjunction ('provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models and for all the languages').
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below and outline the changes we will make to the manuscript.
Point-by-point responses
- Referee: [Dataset construction] Dataset construction section: the alignment procedure for the seven non-English languages is described at a high level but no quantitative validation of transcription accuracy (e.g., WER against a small set of independently verified native-speaker transcripts or alignment error statistics) is reported. This measurement is load-bearing for the claim that the ~6K hours constitute a ready-to-use corpus for speech research.
  Authors: We agree that explicit quantitative validation strengthens the claim that the non-English portions are ready for research use. The alignment pipeline is identical to the one validated for the English LibriSpeech corpus in prior work, and the source transcripts are the original public LibriVox texts. Nevertheless, we will add a new subsection reporting alignment error rates and WERs computed on small, independently verified native-speaker transcript subsets for each of the seven languages.
  revision: yes
- Referee: [Baseline results] Baseline ASR results section: the reported WER numbers are presented without any analysis of how performance might be affected by possible transcription noise in the non-English portions or comparison to models trained on existing cleaner monolingual corpora.
  Authors: The baselines are provided as initial reference points rather than exhaustive benchmarks. We will expand the discussion to include a short analysis of how residual alignment noise may influence the reported WERs and will add comparisons, where feasible, against models trained on existing cleaner monolingual corpora for the same languages.
  revision: yes
Circularity Check
No circularity: dataset release with independent baselines
Full rationale
The paper's central contribution is the release of the MLS corpus (audio + transcripts from LibriVox) plus trained baseline ASR/LM models. No equations, fitted parameters, or predictions are derived from the data in a way that reduces back to the inputs by construction. The English portion re-uses an existing LibriSpeech pipeline (external to this work), and non-English processing is described as a direct alignment step without any self-referential fitting or uniqueness theorems. Self-citations, if present, are not load-bearing for any claim. The skeptic's concern about transcription accuracy is a data-quality assumption, not a circular derivation.
Forward citations
Cited by 17 Pith papers
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
- Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
  Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
  Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
  Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
  FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
- Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement
  Replacing early-reflected speech with time-shifted anechoic clean speech as the training target, combined with a two-stage distortion-perception framework, yields state-of-the-art universal speech enhancement.
- StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
  StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
- AudioPaLM: A Large Language Model That Can Speak and Listen
  AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
- Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
  Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
- Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts
  Pretraining audio SSL encoders on diverse French broadcast content rather than clean speech yields better downstream performance on ASR, music detection, and speaker recognition, with deduplication mitigating memorization.
- AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
  AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.
- Kimi-Audio Technical Report
  Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
- NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
  NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.
- In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
  Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?
  Direct prompting scales more consistently than CoT prompting for speech-to-text translation as the amount of S2TT data increases.