pith. the verified trust layer for science.

arxiv: 2012.03411 · v2 · pith:J5I65L4W · submitted 2020-12-07 · eess.AS · cs.CL · cs.SD

MLS: A Large-Scale Multilingual Dataset for Speech Research

Pith reviewed 2026-05-18 14:50 UTC · model grok-4.3

classification eess.AS · cs.CL · cs.SD
keywords multilingual dataset · speech recognition · LibriVox · automatic speech recognition · text-to-speech · large-scale corpus · multilingual ASR · LibriSpeech

The pith

A new multilingual speech dataset supplies tens of thousands of hours of audiobook audio paired with text across eight languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Multilingual LibriSpeech dataset as a large resource for speech research. It draws from LibriVox public domain audiobooks to create aligned audio and transcriptions in English and seven other languages. The collection totals roughly 50,000 hours, with the bulk in English. The authors release accompanying language models and baseline automatic speech recognition systems for each language. This scale of data is meant to enable broader experimentation in multilingual speech recognition and text-to-speech synthesis.

Core claim

This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone.
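
As a quick check on the scale quoted above, the rounded hour counts can be added up. This is only a sketch based on the approximate figures in the abstract ("about 44.5K" and "about 6K"); the variable names are illustrative and the exact per-language totals are not given here.

```python
# Approximate hour counts as quoted in the abstract (rounded figures).
english_hours = 44_500          # "about 44.5K hours of English"
other_hours = 6_000             # "about 6K hours for other languages" (seven languages)

total_hours = english_hours + other_hours
english_share = english_hours / total_hours

print(f"total: {total_hours} hours")            # total: 50500 hours
print(f"English share: {english_share:.1%}")    # English share: 88.1%
```

On these rounded numbers, English accounts for close to nine tenths of the corpus, so the seven other languages share roughly 6K hours between them.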

What carries the argument

The MLS dataset constructed from LibriVox audiobook recordings with their associated transcriptions.

If this is right

  • Multilingual automatic speech recognition systems can be trained on substantially more data than before.
  • Text-to-speech models gain access to large amounts of natural read speech in multiple languages.
  • Baseline models for each language provide reference points for future research improvements.
  • Public release allows independent groups to build upon the same starting resource.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Speech technologies for the included languages could see faster improvement due to the increased training volume.
  • If the read audiobook style transfers well, the dataset might support applications beyond formal speech.
  • Combining this with other datasets could address gaps in spontaneous or conversational speech data.

Load-bearing premise

The provided transcriptions accurately match the spoken audio without significant errors that would mislead model training.

What would settle it

Evaluation results where speech recognition models trained solely on MLS data show no improvement over models trained on much smaller prior datasets when tested on held-out speech.

Original abstract

This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models and for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Multilingual LibriSpeech (MLS) dataset, a large-scale multilingual corpus derived from LibriVox audiobooks. It covers 8 languages with approximately 44.5K hours of English speech and 6K hours across the other seven languages. Baseline ASR and language model results are provided for each language, and the authors announce that the full dataset will be released publicly.

Significance. If the audio-transcript alignments prove sufficiently clean, MLS would constitute a valuable public resource for multilingual ASR and TTS research, especially given its scale and the inclusion of reference baselines. The public release itself is a clear strength that directly benefits the community.

major comments (2)
  1. [Dataset construction] Dataset construction section: the alignment procedure for the seven non-English languages is described at a high level but no quantitative validation of transcription accuracy (e.g., WER against a small set of independently verified native-speaker transcripts or alignment error statistics) is reported. This measurement is load-bearing for the claim that the ~6K hours constitute a ready-to-use corpus for speech research.
  2. [Baseline results] Baseline ASR results section: the reported WER numbers are presented without any analysis of how performance might be affected by possible transcription noise in the non-English portions or comparison to models trained on existing cleaner monolingual corpora.
minor comments (1)
  1. [Abstract] Abstract: the sentence contains a stray 'and' ('provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models and for all the languages').
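
The validation requested in major comment 1 rests on word error rate (WER): the word-level edit distance between a verified reference transcript and the corpus transcript, divided by the number of reference words. The following is a minimal sketch of that metric, assuming plain whitespace tokenization and no text normalization. The function name `wer` and the example sentences are hypothetical and are not taken from the paper or its tooling.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a verified transcript versus a corpus transcript
# that is missing one word.
verified = "the cat sat on the mat"
corpus = "the cat sat on mat"
print(wer(verified, corpus))  # one deletion over six reference words
```

In practice the two transcripts would usually be normalized the same way (case, punctuation, numerals) before scoring, since differences in normalization otherwise count as errors.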

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below and outline the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the alignment procedure for the seven non-English languages is described at a high level but no quantitative validation of transcription accuracy (e.g., WER against a small set of independently verified native-speaker transcripts or alignment error statistics) is reported. This measurement is load-bearing for the claim that the ~6K hours constitute a ready-to-use corpus for speech research.

    Authors: We agree that explicit quantitative validation strengthens the claim that the non-English portions are ready for research use. The alignment pipeline is identical to the one validated for the English LibriSpeech corpus in prior work, and the source transcripts are the original public LibriVox texts. Nevertheless, we will add a new subsection reporting alignment error rates and WERs computed on small, independently verified native-speaker transcript subsets for each of the seven languages. revision: yes

  2. Referee: [Baseline results] Baseline ASR results section: the reported WER numbers are presented without any analysis of how performance might be affected by possible transcription noise in the non-English portions or comparison to models trained on existing cleaner monolingual corpora.

    Authors: The baselines are provided as initial reference points rather than exhaustive benchmarks. We will expand the discussion to include a short analysis of how residual alignment noise may influence the reported WERs and will add comparisons, where feasible, against models trained on existing cleaner monolingual corpora for the same languages. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release with independent baselines

full rationale

The paper's central contribution is the release of the MLS corpus (audio + transcripts from LibriVox) plus trained baseline ASR/LM models. No equations, fitted parameters, or predictions are derived from the data in a way that reduces back to the inputs by construction. The English portion re-uses an existing LibriSpeech pipeline (external to this work), and non-English processing is described as a direct alignment step without any self-referential fitting or uniqueness theorems. Self-citations, if present, are not load-bearing for any claim. The skeptic's concern about transcription accuracy is a data-quality assumption, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical parameters, axioms, or postulated entities; it relies on existing public-domain audio and standard ASR training pipelines.



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  2. Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

    eess.AS 2026-05 unverdicted novelty 7.0

    Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.

  3. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  4. Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

    eess.AS 2026-04 unverdicted novelty 6.0

    Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

  5. Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

    eess.AS 2026-04 unverdicted novelty 6.0

    A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

  6. FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

    cs.SD 2026-04 unverdicted novelty 6.0

    FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.

  7. Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement

    cs.SD 2026-03 unverdicted novelty 6.0

    Replacing early-reflected speech with time-shifted anechoic clean speech as the training target, combined with a two-stage distortion-perception framework, yields state-of-the-art universal speech enhancement.

  8. StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

    cs.CL 2025-09 unverdicted novelty 6.0

    StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

  9. AudioPaLM: A Large Language Model That Can Speak and Listen

    cs.CL 2023-06 unverdicted novelty 6.0

    AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.

  10. Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

  11. Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

    eess.AS 2026-04 unverdicted novelty 5.0

    Pretraining audio SSL encoders on diverse French broadcast content rather than clean speech yields better downstream performance on ASR, music detection, and speaker recognition, with deduplication mitigating memorization.

  12. AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 5.0

    AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

  13. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  14. NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

    eess.AS 2026-04 unverdicted novelty 4.0

    NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.

  15. In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

    eess.AS 2026-04 unverdicted novelty 4.0

    Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.

  16. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  17. Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

    cs.CL 2025-10 conditional novelty 4.0

    Direct prompting scales more consistently than CoT prompting for speech-to-text translation as the amount of S2TT data increases.
