arxiv: 2012.03411 · v2 · pith:J5I65L4W · submitted 2020-12-07 · 📡 eess.AS · cs.CL · cs.SD

MLS: A Large-Scale Multilingual Dataset for Speech Research

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

Pith reviewed 2026-05-18 14:50 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.SD

keywords multilingual dataset · speech recognition · LibriVox · automatic speech recognition · text-to-speech · large-scale corpus · multilingual ASR · LibriSpeech



The pith

A new multilingual speech dataset supplies tens of thousands of hours of audiobook audio paired with text across eight languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Multilingual LibriSpeech dataset as a large resource for speech research. It draws from LibriVox public domain audiobooks to create aligned audio and transcriptions in English and seven other languages. The collection totals roughly 50,000 hours, with the bulk in English. The authors release accompanying language models and baseline automatic speech recognition systems for each language. This scale of data is meant to enable broader experimentation in multilingual speech recognition and text-to-speech synthesis.

Core claim

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone.

What carries the argument

The MLS dataset constructed from LibriVox audiobook recordings with their associated transcriptions.

If this is right

Multilingual automatic speech recognition systems can be trained on substantially more data than before.
Text-to-speech models gain access to large amounts of natural read speech in multiple languages.
Baseline models for each language provide reference points for future research improvements.
Public release allows independent groups to build upon the same starting resource.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Speech technologies for the included languages could see faster improvement due to the increased training volume.
If the read audiobook style transfers well, the dataset might support applications beyond formal speech.
Combining this with other datasets could address gaps in spontaneous or conversational speech data.

Load-bearing premise

The provided transcriptions accurately match the spoken audio without significant errors that would mislead model training.

What would settle it

Evaluation results where speech recognition models trained solely on MLS data show no improvement over models trained on much smaller prior datasets when tested on held-out speech.

Original abstract

This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models and for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MLS is a practical data release that supplies needed scale for multilingual speech work, though non-English transcription quality is assumed rather than measured.


The main thing here is that the paper releases MLS, a multilingual LibriVox-derived corpus with 44.5k English hours and roughly 6k hours across seven other languages, plus baseline ASR and LM models for each. That directly tackles data scarcity for non-English research and lets people run larger experiments right away. The English side reuses the LibriSpeech pipeline, which is already vetted, and the construction details for the rest are laid out plainly enough to follow. Releasing the data publicly and giving reference numbers instead of overclaiming is the right approach for this kind of paper. It earns credit for being straightforward and useful on those points.

The soft spot is the one the stress test highlights. There is no quantitative check on alignment or transcription accuracy for the non-English audio, such as WER against held-out native transcripts. The paper treats the volunteer readings as sufficiently clean, but without that measurement users cannot know how much noise they are inheriting. It is not a load-bearing flaw for a data paper, yet it is worth flagging so readers do not treat the full set as equally reliable.

This work is aimed at speech researchers who need scale for ASR or TTS in those languages. Anyone building multilingual systems or testing cross-lingual methods will get immediate value from the released resource and baselines. The thinking is clear and the contribution is grounded in the actual data rather than new modeling tricks. I would bring it to a reading group focused on speech and would cite the dataset in my own work. It deserves peer review so referees can confirm the pipeline details and perhaps ask for the missing quality numbers.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Multilingual LibriSpeech (MLS) dataset, a large-scale multilingual corpus derived from LibriVox audiobooks. It covers 8 languages with approximately 44.5K hours of English speech and 6K hours across the other seven languages. Baseline ASR and language model results are provided for each language, and the authors announce that the full dataset will be released publicly.

Significance. If the audio-transcript alignments prove sufficiently clean, MLS would constitute a valuable public resource for multilingual ASR and TTS research, especially given its scale and the inclusion of reference baselines. The public release itself is a clear strength that directly benefits the community.

major comments (2)

[Dataset construction] Dataset construction section: the alignment procedure for the seven non-English languages is described at a high level but no quantitative validation of transcription accuracy (e.g., WER against a small set of independently verified native-speaker transcripts or alignment error statistics) is reported. This measurement is load-bearing for the claim that the ~6K hours constitute a ready-to-use corpus for speech research.
[Baseline results] Baseline ASR results section: the reported WER numbers are presented without any analysis of how performance might be affected by possible transcription noise in the non-English portions or comparison to models trained on existing cleaner monolingual corpora.
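
The WER check requested in the first major comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names `word_edits` and `corpus_wer` are hypothetical, tokenization is plain whitespace splitting (real evaluations usually normalize case and punctuation first), and the sentence pairs are invented for the example.

```python
from typing import Iterable, List, Tuple


def word_edits(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()  # assumption: whitespace tokenization
        hyp_words = hypothesis.split()
        total_edits += word_edits(ref_words, hyp_words)
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return total_edits / total_ref_words


if __name__ == "__main__":
    # Hypothetical (verified transcript, dataset transcript) pairs.
    sample = [
        ("the cat sat", "the cat sat"),
        ("on the mat", "on a mat"),
    ]
    print(f"WER: {corpus_wer(sample):.3f}")
```

Here the verified native-speaker transcript plays the role of the reference and the dataset's aligned transcript plays the role of the hypothesis, so the resulting rate estimates how much label noise a user would inherit for that language.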

minor comments (1)

[Abstract] Abstract: the sentence contains a repeated conjunction ('provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models and for all the languages').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below and outline the changes we will make to the manuscript.


Referee: [Dataset construction] Dataset construction section: the alignment procedure for the seven non-English languages is described at a high level but no quantitative validation of transcription accuracy (e.g., WER against a small set of independently verified native-speaker transcripts or alignment error statistics) is reported. This measurement is load-bearing for the claim that the ~6K hours constitute a ready-to-use corpus for speech research.

Authors: We agree that explicit quantitative validation strengthens the claim that the non-English portions are ready for research use. The alignment pipeline is identical to the one validated for the English LibriSpeech corpus in prior work, and the source transcripts are the original public LibriVox texts. Nevertheless, we will add a new subsection reporting alignment error rates and WERs computed on small, independently verified native-speaker transcript subsets for each of the seven languages.

revision: yes

Referee: [Baseline results] Baseline ASR results section: the reported WER numbers are presented without any analysis of how performance might be affected by possible transcription noise in the non-English portions or comparison to models trained on existing cleaner monolingual corpora.

Authors: The baselines are provided as initial reference points rather than exhaustive benchmarks. We will expand the discussion to include a short analysis of how residual alignment noise may influence the reported WERs and will add comparisons, where feasible, against models trained on existing cleaner monolingual corpora for the same languages.

revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release with independent baselines


The paper's central contribution is the release of the MLS corpus (audio + transcripts from LibriVox) plus trained baseline ASR/LM models. No equations, fitted parameters, or predictions are derived from the data in a way that reduces back to the inputs by construction. The English portion re-uses an existing LibriSpeech pipeline (external to this work), and non-English processing is described as a direct alignment step without any self-referential fitting or uniqueness theorems. Self-citations, if present, are not load-bearing for any claim. The skeptic concern about transcription accuracy is a data-quality assumption, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical parameters, axioms, or postulated entities; it relies on existing public-domain audio and standard ASR training pipelines.



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
eess.AS 2026-05 unverdicted novelty 7.0

Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
cs.SD 2025-07 unverdicted novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
eess.AS 2026-04 unverdicted novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
eess.AS 2026-04 unverdicted novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
cs.SD 2026-04 unverdicted novelty 6.0

FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.

Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement
cs.SD 2026-03 unverdicted novelty 6.0

Replacing early-reflected speech with time-shifted anechoic clean speech as the training target, combined with a two-stage distortion-perception framework, yields state-of-the-art universal speech enhancement.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
cs.CL 2025-09 unverdicted novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

AudioPaLM: A Large Language Model That Can Speak and Listen
cs.CL 2023-06 unverdicted novelty 6.0

AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
cs.CV 2026-05 unverdicted novelty 5.0

Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts
eess.AS 2026-04 unverdicted novelty 5.0

Pretraining audio SSL encoders on diverse French broadcast content rather than clean speech yields better downstream performance on ASR, music detection, and speaker recognition, with deduplication mitigating memorization.

AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
cs.SD 2026-04 unverdicted novelty 5.0

AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
eess.AS 2026-04 unverdicted novelty 4.0

NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
eess.AS 2026-04 unverdicted novelty 4.0

Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.

Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?
cs.CL 2025-10 conditional novelty 4.0

Direct prompting scales more consistently than CoT prompting for speech-to-text translation as the amount of S2TT data increases.
