arxiv: 2605.17846 · v1 · pith:OXAC2W3Mnew · submitted 2026-05-18 · 📡 eess.AS

UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

Pith reviewed 2026-05-20 01:11 UTC · model grok-4.3

classification 📡 eess.AS
keywords Urdu speech corpus · paralinguistic annotations · code-switching · low-resource languages · speech technology · audio dataset · natural language processing · linguistic inclusivity

The pith

UrduSpeech supplies 156 hours of Urdu audio annotated across 12 paralinguistic dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a large speech corpus to address the shortage of training data for Urdu, a language with 230 million speakers. It compiles 156 hours of audio from diverse sources including news, drama, and literary forms, with metadata on 12 paralinguistic dimensions. The collection covers standard Urdu (US-Std), code-switched Urdu-English (US-CS), and Pakistani-accented English (US-EngPk), and an automated pipeline handles the right-to-left script and frequent code-switching. A smaller, manually verified 9-hour benchmark set is also provided for evaluation. The reported human checks indicate high audio quality and annotation consistency, which supports using the resource to build speech recognition and synthesis tools.

Core claim

UrduSpeech is a high-fidelity corpus of 156 hours of Urdu audio with 12-dimension paralinguistic metadata that spans US-Std, US-CS, and US-EngPk varieties. The data set draws from 12 categories and is assembled by an LLM-driven curation pipeline that manages script direction and frequent code-switching. A 9-hour benchmark subset receives manual correction by native annotators. Quality validation yields a mean opinion score of 4.6 and Cohen's kappa of 0.68, supporting the pipeline's reported 97.6 percent confidence level. The full set maintains a 60-40 gender balance over 71,792 utterances.

What carries the argument

LLM-driven curation pipeline that selects utterances from 12 categories and applies 12-dimension paralinguistic annotations while resolving right-to-left script and code-switching constraints.

If this is right

  • Speech recognition models can now train directly on balanced, multi-variety Urdu data rather than relying on translated or synthetic material.
  • Paralinguistic features such as emotion or emphasis become measurable targets for Urdu-specific natural language understanding systems.
  • Researchers gain a standard benchmark set for consistent comparison of new Urdu speech models.
  • Code-switching behavior can be modeled explicitly, improving performance on everyday mixed-language conversations.
  • Open release of both corpus and curation code allows direct replication and incremental addition of new categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curation approach could be adapted to other right-to-left or heavily code-switched languages such as Arabic or Hindi-English mixes.
  • Voice interfaces built on this data may better capture culturally specific prosody that generic multilingual models currently miss.
  • Pairing the corpus with existing English datasets could produce stronger joint models for South Asian code-switched speech.
  • Long-term tracking of model performance on the released benchmark would quantify whether the added paralinguistic labels actually improve downstream tasks.

Load-bearing premise

The LLM pipeline's 97.6 percent score together with the human MOS of 4.6 and Cohen's kappa of 0.68 correctly indicate accurate annotations and faithful audio across every category and language variety.

What would settle it

Independent native speakers re-annotating a random sample of the 9-hour benchmark set and reporting frequent label errors or quality scores below 4.0 would falsify the claim of high-fidelity curation.
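The check described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the label names, the six clips, and the scores are hypothetical, and the paper does not say how its kappa was aggregated across its six raters, so the two-rater form of Cohen's kappa is an assumption here.

```python
from collections import Counter
from statistics import mean, stdev


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical re-annotation of six benchmark clips (illustrative values only).
released_labels = ["neutral", "happy", "neutral", "sad", "happy", "neutral"]
recheck_labels = ["neutral", "happy", "sad", "sad", "happy", "happy"]
recheck_scores = [5, 4, 4, 5, 3, 4]  # 1-5 opinion scores from the re-check

kappa = cohen_kappa(released_labels, recheck_labels)
mos = mean(recheck_scores)
mos_std = stdev(recheck_scores)
below_threshold = mos < 4.0  # the 4.0 cut-off named in the text above

print(f"kappa={kappa:.2f} MOS={mos:.2f} (std={mos_std:.2f}) below 4.0: {below_threshold}")
```

A real re-check would draw a random sample from the 9-hour set and compute the same statistics for each annotated dimension separately.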

Figures

Figures reproduced from arXiv: 2605.17846 by Attia Nafees ul Haq, Chunjiang He, Jingbin Hu, Lei Xie, Zeyu Zhu.

Figure 1
Figure 1: Overview of the UrduSpeech data curation pipeline. This preprocessing resulted in our 9-hour, manually-verified US-benchmark set. 2.1. Transcription model selection. We compared three models for transcription: Whisper-v3 [22], as it is the most commonly used model for Urdu; the recently released OmniASR-LLM-1B [7], which supports 1,600 languages and classifies Arab-Urdu as a high-resource…
Figure 2
Figure 2: Detailed results for human-centric assessment. 5. UrduSpeech corpus. 5.1. Corpus Distribution and Statistics. The UrduSpeech corpus comprises 91 GB (156 hours) of diarized audio. As shown in…
Figure 3
Figure 3: Corpus data distribution across subsets and categories. The corpus contains 71,792 diarized segments, categorized by duration into short-format (55,407 segments) and long-format (16,243 segments) clips. Detailed demographic and linguistic insights are provided in…
Original abstract

Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, US-EngPk. To address Right-to-Left script constraints and frequent code-switching, we developed UrduSpeech, a LLM-driven pipeline to curate data across 12 diverse categories, including news, drama, and rare literary forms like Bait-Bazi. We also release a 9-hour US-Benchmark set, manually corrected by native annotators to serve as a standard. Human quality assessment of the primary 156-hour corpus yielded a Mean Opinion Score (MOS) of 4.6 (std = 0.7) with inter-rater reliability confirmed by a 0.68 Cohen's Kappa, validating our curation pipeline's 97.6% confidence score. The corpus maintains a 60-40 gender balance across 71,792 utterances. Our work represents a significant leap toward linguistic inclusivity in global AI. The corpus and code are open-sourced, and a demo page is available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces UrduSpeech, a 156-hour high-fidelity Urdu speech corpus with 12-dimension paralinguistic annotations covering US-Std, US-CS, and US-EngPk varieties. It describes an LLM-driven curation pipeline addressing RTL script and code-switching issues, releases a 9-hour manually corrected US-Benchmark, and reports aggregate quality metrics: MOS of 4.6 (std=0.7), Cohen's Kappa of 0.68, and 97.6% pipeline confidence, with 60-40 gender balance across 71,792 utterances from diverse sources including news, drama, and literary forms.

Significance. If the 12-dimension annotations are shown to be reliable, this corpus would provide a valuable open resource for an under-resourced language spoken by 230 million people, supporting research in paralinguistic speech processing, code-switching ASR, and inclusive AI. The inclusion of a benchmark set and open-sourcing are clear strengths for reproducibility and community use.

major comments (1)
  1. [Quality Assessment] Quality Assessment section: The central claim of reliable 12-dimension paralinguistic annotations rests on aggregate MOS 4.6, Cohen's Kappa 0.68 for the primary corpus, and 97.6% LLM confidence. No per-dimension breakdown of these metrics, no per-variety (US-Std/US-CS/US-EngPk) results, and no quantitative agreement (e.g., Kappa or error rate) between LLM labels and the manually corrected 9-hour US-Benchmark are reported. This gap directly affects verification that the annotations are accurate across the full claimed scope.
minor comments (2)
  1. [Methods] Provide explicit definitions and annotation protocols for each of the 12 paralinguistic dimensions to support reproducibility.
  2. [Corpus Description] Include utterance counts or duration breakdowns by variety and by the 12 source categories to allow assessment of balance.
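The breakdown requested in minor comment 2 could be tabulated from a per-segment manifest along the following lines. This is a sketch under stated assumptions: the manifest fields (`variety`, `category`, `duration_s`) and the five example rows are hypothetical, since the released corpus format is not described on this page.

```python
from collections import defaultdict

# Hypothetical manifest rows; field names and values are illustrative only.
segments = [
    {"variety": "US-Std", "category": "news", "duration_s": 12.0},
    {"variety": "US-Std", "category": "drama", "duration_s": 30.0},
    {"variety": "US-CS", "category": "interview", "duration_s": 18.0},
    {"variety": "US-CS", "category": "interview", "duration_s": 6.0},
    {"variety": "US-EngPk", "category": "news", "duration_s": 24.0},
]


def breakdown(rows, field):
    """Utterance count, total seconds, and duration share per value of `field`."""
    totals = defaultdict(lambda: {"utterances": 0, "seconds": 0.0})
    for row in rows:
        bucket = totals[row[field]]
        bucket["utterances"] += 1
        bucket["seconds"] += row["duration_s"]
    grand_total = sum(b["seconds"] for b in totals.values())
    return {
        key: {
            "utterances": b["utterances"],
            "seconds": b["seconds"],
            # Guard against an empty or zero-length manifest.
            "share": b["seconds"] / grand_total if grand_total else 0.0,
        }
        for key, b in totals.items()
    }


by_variety = breakdown(segments, "variety")
by_category = breakdown(segments, "category")

for name, stats in sorted(by_variety.items()):
    print(f"{name}: {stats['utterances']} utterances, {stats['seconds']:.0f} s, {stats['share']:.1%}")
```

Running the same function once per field keeps the variety table and the category table directly comparable.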

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the quality assessment of the annotations. We agree that additional granularity is needed to strengthen verification of the 12-dimension labels and will revise the manuscript to address these points.

  1. Referee: [Quality Assessment] Quality Assessment section: The central claim of reliable 12-dimension paralinguistic annotations rests on aggregate MOS 4.6, Cohen's Kappa 0.68 for the primary corpus, and 97.6% LLM confidence. No per-dimension breakdown of these metrics, no per-variety (US-Std/US-CS/US-EngPk) results, and no quantitative agreement (e.g., Kappa or error rate) between LLM labels and the manually corrected 9-hour US-Benchmark are reported. This gap directly affects verification that the annotations are accurate across the full claimed scope.

    Authors: We acknowledge that the manuscript reports only aggregate MOS, Kappa, and pipeline confidence values. In the revised version we will add (1) per-dimension breakdowns of MOS and Cohen's Kappa for the 12 paralinguistic attributes, (2) the same metrics stratified by variety (US-Std, US-CS, US-EngPk), and (3) quantitative agreement statistics (Cohen's Kappa and/or error rates) between the LLM labels and the manually corrected 9-hour US-Benchmark. These additions will be placed in an expanded Quality Assessment section to allow direct verification across the claimed scope. revision: yes
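The stratified agreement statistics promised in this response could be computed roughly as follows. This is an illustrative sketch, not the authors' code: the row layout, the dimension names (`emotion`, `gender`), and the labels are hypothetical, because the paper's 12 dimensions are not enumerated on this page.

```python
from collections import Counter, defaultdict


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # degenerate case: one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical rows: (variety, dimension, LLM label, manually corrected label).
rows = [
    ("US-Std", "emotion", "neutral", "neutral"),
    ("US-Std", "emotion", "happy", "happy"),
    ("US-Std", "emotion", "sad", "neutral"),
    ("US-Std", "emotion", "happy", "happy"),
    ("US-CS", "emotion", "neutral", "neutral"),
    ("US-CS", "emotion", "happy", "happy"),
    ("US-EngPk", "gender", "male", "male"),
    ("US-EngPk", "gender", "female", "female"),
    ("US-EngPk", "gender", "female", "male"),
]


def agreement_report(records):
    """Kappa and error rate of LLM labels against manual labels per (variety, dimension)."""
    groups = defaultdict(lambda: ([], []))
    for variety, dimension, llm_label, human_label in records:
        llm, human = groups[(variety, dimension)]
        llm.append(llm_label)
        human.append(human_label)
    report = {}
    for key, (llm, human) in groups.items():
        errors = sum(a != b for a, b in zip(llm, human))
        report[key] = {
            "n": len(llm),
            "kappa": cohen_kappa(llm, human),
            "error_rate": errors / len(llm),
        }
    return report


report = agreement_report(rows)
for (variety, dimension), stats in sorted(report.items()):
    print(f"{variety:9s} {dimension:8s} n={stats['n']} "
          f"kappa={stats['kappa']:.2f} error={stats['error_rate']:.2f}")
```

Reporting the group size next to each kappa matters here, since small strata can produce unstable agreement values.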

Circularity Check

0 steps flagged

No circularity: direct corpus release with independent quality metrics

This is a data collection and release paper with no derivations, equations, predictions, or first-principles results. The central claims concern the creation of the 156-hour UrduSpeech corpus and the reporting of quality metrics (MOS 4.6, Cohen's Kappa 0.68, 97.6% LLM confidence) obtained from human assessments and pipeline outputs. These metrics are presented as direct empirical results rather than outputs derived from or fitted to the same inputs by construction. No self-citation chains, ansatzes, or renamings of known results appear in any load-bearing step. The work is self-contained against external benchmarks of corpus quality.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a data resource paper rather than a mathematical derivation. No numbers are fitted to data and no new physical entities are postulated. The 12 annotation dimensions and 12 curation categories are design choices rather than fitted parameters.

axioms (1)
  • domain assumption: Urdu remains critically under-resourced in speech technology despite 230 million speakers.
    This premise motivates the entire work and is stated directly in the abstract.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

  1. [1]

    Yet, despite Urdu’s global significance and its vast diaspora, it remains remarkably under-resourced in the context of multimodal foundation models and Speech LLMs

    Introduction. In recent years, the digital preservation of languages within the AI landscape has become a cornerstone of linguistic equality [1]. Yet, despite Urdu’s global significance and its vast diaspora, it remains remarkably under-resourced in the context of multimodal foundation models and Speech LLMs. Recent benchmarks highlight a persistent pe...

  2. [2]

    and WenetSpeech-Chuan [13] pipelines, we developed a specialized solution for the Urdu-English paradigm. We build upon foundational datasets including ARL Urdu [14], CLE Pakistan [15], and LDC-IL [16] while addressing the critical shortage of high-fidelity data in modern Urdu TTS [17] and ASR.

  3. [3]

    We gathered this raw audio “in-the-wild” and processed it according to stage 1 of our curation pipeline, as shown in Figure 1

    Model selection and benchmark set. Prior to large-scale development, we conducted a 13-hour audio pilot study across 12 categories, including poetry, news, and vlogs. We gathered this raw audio “in-the-wild” and processed it according to stage 1 of our curation pipeline, as shown in Figure 1. We utilized Spleeter [23] for noise removal and Pyannote

  4. [4]

    UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

    for speaker diarization. To ensure high data quality, we discarded single-speaker clips and segments shorter than two seconds. Additionally, all audio clips were capped at a maximum duration of 35 seconds to optimize downstream transcription performance. (Footnote 2, Ethical Statement: All data sourced from public repositories; no personal identifiers retained. Content is non-poli...)

  5. [5]

    We further incorporated audio format metadata (short vs

    UrduSpeech corpus curation pipeline. Building on our US-benchmark set pilot, we scaled the corpus development into a multi-stage pipeline, as illustrated in Figure 1. We further incorporated audio format metadata (short vs. long form) and integrated model confidence scores alongside quality assessments conducted by native annotators. 3.1. Data collection...

  6. [6]

    Experimental setup and recruitment To validate the corpus, 180 clips across three sets (A, B, and C) were randomly sampled by complexity using an anchor set strategy (Table 2)

    Human-centric quality assessment. 4.1. Experimental setup and recruitment. To validate the corpus, 180 clips across three sets (A, B, and C) were randomly sampled by complexity using an anchor set strategy (Table 2). Six university-recruited native Urdu speakers (3M/3F) evaluated the data in a controlled laboratory setting. To ensure independent, high-q...

  7. [7]

    Corpus Distribution and Statistics. The UrduSpeech corpus comprises 91 GB (156 hours) of diarized audio

    UrduSpeech corpus. 5.1. Corpus Distribution and Statistics. The UrduSpeech corpus comprises 91 GB (156 hours) of diarized audio. As shown in Figure 3, the Interview category represents the largest share, accounting for approximately 34 hours (21% of the total volume). Traditional genres such as drama and poetry contain a higher volume of US-Std, whereas conver...

  8. [8]

    Limitation and future work. Our corpus provides a substantial resource for Urdu, code-switched Urdu-English, and Pakistani-accent English speech research, yet several limitations exist. First, while automated diarization via Pyannote 3.1 identified over 3,000 unique speaker clusters, we conservatively estimate the count at 1,000+ unique speakers to ac...

  9. [9]

    By developing a robust and reproducible pipeline, we successfully addressed the complexities of “in-the-wild” Urdu speech and the high prevalence of Urdu-English code-switching

    Conclusion. In this study, we introduced UrduSpeech, a 156-hour (91 GB) multi-domain speech corpus featuring 12-dimension paralinguistic metadata. By developing a robust and reproducible pipeline, we successfully addressed the complexities of “in-the-wild” Urdu speech and the high prevalence of Urdu-English code-switching. Our stratified methodology re...

  10. [10]

    All technical methodologies, data collection, and original research contributions were conceived and executed entirely by the authors

    Generative AI Use Disclosure. The authors acknowledge the use of generative AI tools solely for text refinement, grammar corrections, and proofreading of the manuscript. All technical methodologies, data collection, and original research contributions were conceived and executed entirely by the authors

  11. [11]

    Systematic inequalities in language technology performance across the world’s languages

    D. Blasi, A. Anastasopoulos, and G. Neubig, “Systematic inequalities in language technology performance across the world’s languages,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5486–5505

  12. [12]

    WER we stand: Benchmarking Urdu ASR models

    S. Arif, A. J. Khan, M. Abbas, A. A. Raza, and A. Athar, “WER we stand: Benchmarking Urdu ASR models,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 5952–5961

  13. [13]

    Towards unified processing of Perso-Arabic scripts for ASR

    S. Bandarupalli, B. Akkiraju, S. C. Devarakonda, H. Sivaramasethu, V. Narasinga, and A. Vuppala, “Towards unified processing of Perso-Arabic scripts for ASR,” in Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, 2025, pp. 23–28

  14. [14]

    From statistical methods to pre-trained models: A survey on automatic speech recognition for resource-scarce Urdu language

    M. Sharif, Z. Abbas, J. Yi, and C. Liu, “From statistical methods to pre-trained models: A survey on automatic speech recognition for resource-scarce Urdu language,” arXiv preprint arXiv:2411.14493, 2024

  15. [15]

    Challenges and opportunities in Urdu-English code-switched speech recognition

    M. Sadeqi et al., “Challenges and opportunities in Urdu-English code-switched speech recognition,” Journal of Linguistic Engineering, 2023

  16. [16]

    Urdu language processing: a survey

    A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey,” Artificial Intelligence Review, vol. 47, no. 3, pp. 279–311, 2017

  17. [17]

    Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages

    A. Omnilingual, G. Keren, A. Kozhevnikov, Y. Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu et al., “Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages,” arXiv preprint arXiv:2511.09690, 2025

  18. [18]

    Common Voice: A massively-multilingual speech corpus

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222

  19. [19]

    Uquad+: Benchmark dataset for Urdu machine reading comprehension

    S. Kazi and S. Khoja, “Uquad+: Benchmark dataset for Urdu machine reading comprehension,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 25, no. 2, pp. 1–34, 2026

  20. [20]

    Deepfake audio detection in low-resource languages: A case study of Urdu

    M. Owais, K. K. Jadoon, A. I. Sandhu, Z. Ali, Z. Mahmood, M. Yahya, and A. Wahid, “Deepfake audio detection in low-resource languages: A case study of Urdu,” IEEE Access, 2026

  21. [21]

    Cross-lingual speech emotion recognition with attention-driven Bi-LSTM: Advancing Kashmiri and multilingual adaptation

    G. M. Dar and R. Delhibabu, “Cross-lingual speech emotion recognition with attention-driven Bi-LSTM: Advancing Kashmiri and multilingual adaptation,” International Journal of Analysis and Applications, vol. 24, pp. 43–43, 2026

  22. [22]

    WenetSpeech-Yue: A large-scale Cantonese speech corpus with multi-dimensional annotation

    L. Li, Z. Guo, H. Chen, Y. Dai, Z. Zhang, H. Xue, T. Zuo, C. Wang, S. Wang, J. Li et al., “WenetSpeech-Yue: A large-scale Cantonese speech corpus with multi-dimensional annotation,” arXiv preprint arXiv:2509.03959, 2025

  23. [23]

    WenetSpeech-Chuan: A large-scale Sichuanese corpus with rich annotation for dialectal speech processing

    Y. Dai, Z. Zhang, S. Wang, L. Li, Z. Guo, T. Zuo, S. Wang, H. Xue, C. Wang, Q. Wang et al., “WenetSpeech-Chuan: A large-scale Sichuanese corpus with rich annotation for dialectal speech processing,” arXiv preprint arXiv:2509.18004, 2025

  24. [24]

    ARL Urdu speech database, training data LDC2007S03

    Appen Pty Ltd, “ARL Urdu speech database, training data LDC2007S03,” Linguistic Data Consortium, Philadelphia, 2007, ISBN: 1-58563-412-3

  25. [25]

    CLE Pakistan district names speech corpus - Urdu speakers

    Center for Language Engineering (CLE), “CLE Pakistan district names speech corpus - Urdu speakers,” https://www.cle.org.pk/clestore/speech-urdu.htm, 2016, accessed: 2026-03-02

  26. [26]

    Urdu sentence aligned speech corpus

    M. Khan, S. Alam, B. B. Mariyam, N. Rajesha, G. Manasa, D. Srikanth, S. Fernandes, S. Nithin, N. K. Choudhary, and S. Mohan, “Urdu sentence aligned speech corpus,” Central Institute of Indian Languages, Mysore, 2023, ISBN: 978-81-19411-87-0. Catalogue Number: 1434. [Online]. Available: https://data.ldcil.org/urdu-sentence-aligned-corpus

  27. [27]

    Overcoming linguistic barriers developing advanced Urdu text-to-speech systems

    S. A. Khan, M. Mansoor, and A. Habib, “Overcoming linguistic barriers developing advanced Urdu text-to-speech systems,” in 2024 19th International Conference on Emerging Technologies (ICET). IEEE, 2024, pp. 1–6

  28. [28]

    Phonological variations of English in Pakistan

    S. Sarfraz et al., “Phonological variations of English in Pakistan,” in Proceedings of the Conference on Language and Technology (CLT10), 2010

  29. [29]

    VoxCeleb: a large-scale speaker identification dataset

    A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620

  30. [30]

    The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism

    B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer et al., “The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” Proc. INTERSPEECH, pp. 148–152, 2013

  31. [31]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  32. [32]

    Robust speech recognition via large-scale weak supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

  33. [33]

    Spleeter: a fast and efficient music source separation tool with pre-trained models

    R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, “Spleeter: a fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, p. 2154, 2020

  34. [34]

    pyannote.audio: neural building blocks for speaker diarization

    H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020, pp. 7124–7128

  35. [35]

    From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition

    A. C. Morris, V. Maier, and P. D. Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Interspeech, no. 4-8, 2004, p. 2004

  36. [36]

    Demucs: Deep extractor for music sources with extra unlabeled data remixed

    A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Demucs: Deep extractor for music sources with extra unlabeled data remixed,” arXiv preprint arXiv:1909.01174, 2019

  37. [37]

    Recommendation P.800: Methods for subjective determination of transmission quality

    ITU-T, “Recommendation P.800: Methods for subjective determination of transmission quality,” International Telecommunication Union, 1996

  38. [38]

    Data structures for statistical computing in Python

    W. McKinney et al., “Data structures for statistical computing in Python,” scipy, vol. 445, no. 1, pp. 51–56, 2010

  39. [39]

    A coefficient of agreement for nominal scales

    J. Cohen, “A coefficient of agreement for nominal scales,” Educational and psychological measurement, vol. 20, no. 1, pp. 37–46, 1960

  40. [40]

    Measuring nominal scale agreement among many raters

    J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological bulletin, vol. 76, no. 5, p. 378, 1971

  41. [41]

    Scikit-learn: Machine learning in Python

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in Python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011

  42. [42]

    Statsmodels: econometric and statistical modeling with Python

    S. Seabold, J. Perktold et al., “Statsmodels: econometric and statistical modeling with Python,” scipy, vol. 7, no. 1, pp. 92–96, 2010

  43. [43]

    FLEURS: Few-shot learning evaluation of universal representations of speech

    A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805