UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations
Pith reviewed 2026-05-20 01:11 UTC · model grok-4.3
The pith
UrduSpeech supplies 156 hours of Urdu audio annotated across 12 paralinguistic dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UrduSpeech is a high-fidelity corpus of 156 hours of Urdu audio with 12-dimension paralinguistic metadata that spans US-Std, US-CS, and US-EngPk varieties. The data set draws from 12 categories and is assembled by an LLM-driven curation pipeline that manages script direction and frequent code-switching. A 9-hour benchmark subset receives manual correction by native annotators. Quality validation yields a mean opinion score of 4.6 and Cohen's kappa of 0.68, supporting the pipeline's reported 97.6 percent confidence level. The full set maintains a 60-40 gender balance over 71,792 utterances.
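As a quick sanity check on the headline figures, the short sketch below derives the mean utterance length and the size of the benchmark relative to the corpus. It assumes the reported 156 hours and 71,792 utterances are exact, which may not hold since the hour count is probably rounded; the constant names are illustrative.

```python
# Headline figures as reported in the abstract (assumed exact for this sketch).
TOTAL_HOURS = 156
BENCHMARK_HOURS = 9
UTTERANCES = 71_792

# Mean utterance length in seconds: total seconds divided by utterance count.
mean_utterance_seconds = TOTAL_HOURS * 3600 / UTTERANCES

# Size of the manually corrected benchmark relative to the full corpus.
benchmark_ratio = BENCHMARK_HOURS / TOTAL_HOURS

print(f"mean utterance length: {mean_utterance_seconds:.2f} s")
print(f"benchmark relative to corpus: {benchmark_ratio:.1%}")
```

Under these assumptions the mean utterance is roughly 7.8 seconds, and the manually corrected benchmark is a little under 6 percent the size of the corpus.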
What carries the argument
An LLM-driven curation pipeline that selects utterances from 12 categories and applies 12-dimension paralinguistic annotations while resolving right-to-left script and code-switching constraints.
If this is right
- Speech recognition models can now train directly on balanced, multi-variety Urdu data rather than relying on translated or synthetic material.
- Paralinguistic features such as emotion or emphasis become measurable targets for Urdu-specific natural language understanding systems.
- Researchers gain a standard benchmark set for consistent comparison of new Urdu speech models.
- Code-switching behavior can be modeled explicitly, improving performance on everyday mixed-language conversations.
- Open release of both corpus and curation code allows direct replication and incremental addition of new categories.
Where Pith is reading between the lines
- The same curation approach could be adapted to other right-to-left or heavily code-switched languages such as Arabic or Hindi-English mixes.
- Voice interfaces built on this data may better capture culturally specific prosody that generic multilingual models currently miss.
- Pairing the corpus with existing English datasets could produce stronger joint models for South Asian code-switched speech.
- Long-term tracking of model performance on the released benchmark would quantify whether the added paralinguistic labels actually improve downstream tasks.
Load-bearing premise
The LLM pipeline's 97.6 percent score together with the human MOS of 4.6 and Cohen's kappa of 0.68 correctly indicate accurate annotations and faithful audio across every category and language variety.
What would settle it
Independent native speakers re-annotating a random sample of the 9-hour benchmark set and reporting frequent label errors or quality scores below 4.0 would falsify the claim of high-fidelity curation.
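The quality-score half of this test could be sketched as follows: draw a reproducible random sample of benchmark clips, collect fresh ratings, and check the mean against the 4.0 threshold. The clip IDs, sample size, seed, and ratings are all hypothetical, and the label-error half of the test is not covered here.

```python
import random
from statistics import mean

MOS_THRESHOLD = 4.0  # threshold named in the text above


def draw_sample(clip_ids, k, seed=0):
    """Return k clip IDs chosen uniformly at random, reproducibly."""
    return random.Random(seed).sample(sorted(clip_ids), k)


def claim_survives(ratings_by_clip):
    """True if the mean of the per-clip mean ratings meets the threshold."""
    per_clip = [mean(scores) for scores in ratings_by_clip.values()]
    return mean(per_clip) >= MOS_THRESHOLD


# Hypothetical benchmark clip IDs and made-up ratings on a 1-5 scale.
clip_ids = [f"clip_{i:04d}" for i in range(500)]
sample = draw_sample(clip_ids, k=5, seed=42)
ratings = {clip_id: [5, 4, 5] for clip_id in sample}
print(sample, claim_survives(ratings))
```

Fixing the seed lets a second group draw the same clips, so a disagreement would reflect the ratings rather than the sample.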
Original abstract
Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, and US-EngPk. To address Right-to-Left script constraints and frequent code-switching, we developed UrduSpeech, an LLM-driven pipeline to curate data across 12 diverse categories, including news, drama, and rare literary forms like Bait-Bazi. We also release a 9-hour US-Benchmark set, manually corrected by native annotators to serve as a standard. Human quality assessment of the primary 156-hour corpus yielded a Mean Opinion Score (MOS) of 4.6 (std = 0.7) with inter-rater reliability confirmed by a 0.68 Cohen's Kappa, validating our curation pipeline's 97.6% confidence score. The corpus maintains a 60-40 gender balance across 71,792 utterances. Our work represents a significant leap toward linguistic inclusivity in global AI. The corpus and code are open-sourced, and a demo page is available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UrduSpeech, a 156-hour high-fidelity Urdu speech corpus with 12-dimension paralinguistic annotations covering US-Std, US-CS, and US-EngPk varieties. It describes an LLM-driven curation pipeline addressing RTL script and code-switching issues, releases a 9-hour manually corrected US-Benchmark, and reports aggregate quality metrics: MOS of 4.6 (std=0.7), Cohen's Kappa of 0.68, and 97.6% pipeline confidence, with 60-40 gender balance across 71,792 utterances from diverse sources including news, drama, and literary forms.
Significance. If the 12-dimension annotations are shown to be reliable, this corpus would provide a valuable open resource for an under-resourced language spoken by 230 million people, supporting research in paralinguistic speech processing, code-switching ASR, and inclusive AI. The inclusion of a benchmark set and open-sourcing are clear strengths for reproducibility and community use.
major comments (1)
- [Quality Assessment] Quality Assessment section: The central claim of reliable 12-dimension paralinguistic annotations rests on aggregate MOS 4.6, Cohen's Kappa 0.68 for the primary corpus, and 97.6% LLM confidence. No per-dimension breakdown of these metrics, no per-variety (US-Std/US-CS/US-EngPk) results, and no quantitative agreement (e.g., Kappa or error rate) between LLM labels and the manually corrected 9-hour US-Benchmark are reported. This gap directly affects verification that the annotations are accurate across the full claimed scope.
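The missing stratified analysis could look roughly like the sketch below, which computes Cohen's kappa between LLM labels and human-corrected labels separately per variety and per dimension. The record fields, the `emotion` dimension name, and all label values are made up for illustration; the paper's actual 12 dimensions and data layout are not given here.

```python
from collections import Counter, defaultdict


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of nominal labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty sequences of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_by_group(records, group_key):
    """Group records by `group_key` and compute LLM-vs-human kappa per group."""
    grouped = defaultdict(lambda: ([], []))
    for rec in records:
        llm, human = grouped[rec[group_key]]
        llm.append(rec["llm"])
        human.append(rec["human"])
    return {g: cohens_kappa(llm, human) for g, (llm, human) in grouped.items()}


# Made-up records: (variety, dimension, LLM label, human-corrected label).
rows = [
    ("US-Std", "emotion", "neutral", "neutral"),
    ("US-Std", "emotion", "happy", "happy"),
    ("US-Std", "emotion", "neutral", "neutral"),
    ("US-Std", "emotion", "happy", "happy"),
    ("US-CS", "emotion", "neutral", "neutral"),
    ("US-CS", "emotion", "happy", "neutral"),
    ("US-CS", "emotion", "happy", "happy"),
    ("US-CS", "emotion", "neutral", "happy"),
]
records = [dict(zip(("variety", "dimension", "llm", "human"), r)) for r in rows]
by_variety = kappa_by_group(records, "variety")
by_dimension = kappa_by_group(records, "dimension")
print(by_variety, by_dimension)
```

On this toy data the pooled kappa is 0.5, while the per-variety values are 1.0 for US-Std and 0.0 for US-CS. An aggregate figure can hide exactly this kind of spread, which is the concern the comment raises.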
minor comments (2)
- [Methods] Provide explicit definitions and annotation protocols for each of the 12 paralinguistic dimensions to support reproducibility.
- [Corpus Description] Include utterance counts or duration breakdowns by variety and by the 12 source categories to allow assessment of balance.
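The requested breakdown is straightforward to produce from a clip-level manifest. The sketch below assumes a hypothetical manifest with `variety`, `category`, and `duration_s` fields; none of these names or values come from the released corpus.

```python
from collections import defaultdict


def hours_by(manifest, *keys):
    """Sum clip durations (in seconds) per combination of `keys`, in hours."""
    totals = defaultdict(float)
    for row in manifest:
        totals[tuple(row[k] for k in keys)] += row["duration_s"]
    return {group: seconds / 3600 for group, seconds in totals.items()}


# Hypothetical manifest rows; field names and values are illustrative only.
manifest = [
    {"variety": "US-Std", "category": "news", "duration_s": 1800.0},
    {"variety": "US-Std", "category": "drama", "duration_s": 3600.0},
    {"variety": "US-CS", "category": "news", "duration_s": 900.0},
    {"variety": "US-CS", "category": "news", "duration_s": 900.0},
]

for group, hours in sorted(hours_by(manifest, "variety", "category").items()):
    print(group, f"{hours:.2f} h")
```

Passing a single key, for example `hours_by(manifest, "variety")`, gives the per-variety totals; utterance counts follow the same pattern with a counter in place of the duration sum.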
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the quality assessment of the annotations. We agree that additional granularity is needed to strengthen verification of the 12-dimension labels and will revise the manuscript to address these points.
Referee: [Quality Assessment] Quality Assessment section: The central claim of reliable 12-dimension paralinguistic annotations rests on aggregate MOS 4.6, Cohen's Kappa 0.68 for the primary corpus, and 97.6% LLM confidence. No per-dimension breakdown of these metrics, no per-variety (US-Std/US-CS/US-EngPk) results, and no quantitative agreement (e.g., Kappa or error rate) between LLM labels and the manually corrected 9-hour US-Benchmark are reported. This gap directly affects verification that the annotations are accurate across the full claimed scope.
Authors: We acknowledge that the manuscript reports only aggregate MOS, Kappa, and pipeline confidence values. In the revised version we will add (1) per-dimension breakdowns of MOS and Cohen's Kappa for the 12 paralinguistic attributes, (2) the same metrics stratified by variety (US-Std, US-CS, US-EngPk), and (3) quantitative agreement statistics (Cohen's Kappa and/or error rates) between the LLM labels and the manually corrected 9-hour US-Benchmark. These additions will be placed in an expanded Quality Assessment section to allow direct verification across the claimed scope.
Revision: yes
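For the error-rate statistics promised in item (3), one possible shape is a per-dimension report that also surfaces the most frequent confusion between the LLM label and the corrected label. This is only a sketch: the dimension names (`emotion`, `speaking_rate`) and label values are invented, and the authors' actual analysis may differ.

```python
from collections import Counter, defaultdict


def error_report(pairs):
    """Summarize LLM label errors per dimension.

    pairs: iterable of (dimension, llm_label, corrected_label).
    """
    totals, errors = Counter(), Counter()
    confusions = defaultdict(Counter)
    for dimension, llm_label, corrected in pairs:
        totals[dimension] += 1
        if llm_label != corrected:
            errors[dimension] += 1
            confusions[dimension][(llm_label, corrected)] += 1
    return {
        d: {
            "error_rate": errors[d] / totals[d],
            # Most frequent (llm_label, corrected_label) pair, if any errors.
            "top_confusion": confusions[d].most_common(1),
        }
        for d in totals
    }


# Made-up comparison of LLM labels against benchmark corrections.
pairs = [
    ("emotion", "neutral", "neutral"),
    ("emotion", "happy", "neutral"),
    ("emotion", "happy", "neutral"),
    ("emotion", "sad", "sad"),
    ("speaking_rate", "fast", "fast"),
    ("speaking_rate", "slow", "slow"),
]
report = error_report(pairs)
print(report)
```

Reporting the dominant confusion alongside the rate shows whether errors are systematic (one label consistently mistaken for another) or scattered, which an error rate alone would not reveal.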
Circularity Check
No circularity: direct corpus release with independent quality metrics
This is a data collection and release paper with no derivations, equations, predictions, or first-principles results. The central claims concern the creation of the 156-hour UrduSpeech corpus and the reporting of quality metrics (MOS 4.6, Cohen's Kappa 0.68, 97.6% LLM confidence) obtained from human assessments and pipeline outputs. These metrics are presented as direct empirical results rather than outputs derived from or fitted to the same inputs by construction. No self-citation chains, ansatzes, or renamings of known results appear in any load-bearing step. The work is self-contained against external benchmarks of corpus quality.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Urdu remains critically under-resourced in speech technology despite 230 million speakers.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata... Human quality assessment... MOS of 4.6... Cohen's Kappa... 97.6% confidence score.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction. In recent years, the digital preservation of languages within the AI landscape has become a cornerstone of linguistic equality [1]. Yet, despite Urdu’s global significance and its vast diaspora, it remains remarkably under-resourced in the context of multimodal foundation models and Speech LLMs. Recent benchmarks highlight a persistent pe...
- [2] and WenetSpeech-Chuan [13] pipelines, we developed a specialized solution for the Urdu-English paradigm. We build upon foundational datasets including ARL Urdu [14], CLE Pakistan [15], and LDC-IL [16] while addressing the critical shortage of high-fidelity data in modern Urdu TTS [17] and ASR. **indicates the corresponding author. 1https://interspeech...
- [3] Model selection and benchmark set. Prior to large-scale development, we conducted a 13-hour audio pilot study across 12 categories, including poetry, news, and vlogs. We gathered this raw audio "in-the-wild" and processed it according to our curation pipeline stage 1 as shown in Figure 1. We utilized Spleeter [23] for noise removal and Pyannote
- [4] for speaker diarization. To ensure high data quality, we discarded single-speaker clips and segments shorter than two seconds. Additionally, all audio clips were capped at a maximum duration of 35 seconds to optimize downstream transcrip- 2Ethical Statement: All data sourced from public repositories; no personal identifiers retained. Content is non-poli...
- [5] UrduSpeech corpus curation pipeline. Building on our US-benchmark set pilot, we scaled the corpus development into a multi-stage pipeline, as illustrated in Figure 1. We further incorporated audio format metadata (short vs. long form) and integrated model confidence scores alongside quality assessments conducted by native annotators. 3.1. Data collection...
- [6] Human-centric quality assessment. 4.1. Experimental setup and recruitment. To validate the corpus, 180 clips across three sets (A, B, and C) were randomly sampled by complexity using an anchor set strategy (Table 2). Six university-recruited native Urdu speakers (3M/3F) evaluated the data in a controlled laboratory setting. To ensure independent, high-q...
- [7] UrduSpeech corpus. 5.1. Corpus Distribution and Statistics. The UrduSpeech corpus comprises 156 hours (91 GB) of diarized audio. As shown in Figure 3, the Interview category represents the largest share, accounting for approximately 34 hours (21% of the total volume). Traditional genres such as drama and poetry contain a higher volume of US-Std, whereas conver...
- [8] Limitation and future work. Our corpus provides a substantial resource for Urdu, code-switched Urdu-English, and Pakistani-accent English speech research, yet several limitations exist. First, while automated diarization via Pyannote 3.1 identified over 3,000 unique speaker clusters, we conservatively estimate the count at 1,000+ unique speakers to ac...
- [9] Conclusion. In this study, we introduced UrduSpeech, a 156-hour (91 GB) multi-domain speech corpus featuring 12-dimension paralinguistic metadata. By developing a robust and reproducible pipeline, we successfully addressed the complexities of "in-the-wild" Urdu speech and the high prevalence of Urdu-English code-switching. Our stratified methodology re...
- [10] Generative AI Use Disclosure. The authors acknowledge the use of generative AI tools solely for text refinement, grammar corrections, and proofreading of the manuscript. All technical methodologies, data collection, and original research contributions were conceived and executed entirely by the authors.
- [11] D. Blasi, A. Anastasopoulos, and G. Neubig, “Systematic inequalities in language technology performance across the world’s languages,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5486–5505.
- [12] S. Arif, A. J. Khan, M. Abbas, A. A. Raza, and A. Athar, “Wer we stand: Benchmarking urdu asr models,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 5952–5961.
- [13] S. Bandarupalli, B. Akkiraju, S. C. Devarakonda, H. Sivaramasethu, V. Narasinga, and A. Vuppala, “Towards unified processing of perso-arabic scripts for asr,” in Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, 2025, pp. 23–28.
- [14] M. Sharif, Z. Abbas, J. Yi, and C. Liu, “From statistical methods to pre-trained models: A survey on automatic speech recognition for resource-scarce urdu language,” arXiv preprint arXiv:2411.14493, 2024.
- [15] M. Sadeqi et al., “Challenges and opportunities in urdu-english code-switched speech recognition,” Journal of Linguistic Engineering, 2023.
- [16] A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey,” Artificial Intelligence Review, vol. 47, no. 3, pp. 279–311, 2017.
- [17] A. Omnilingual, G. Keren, A. Kozhevnikov, Y. Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu et al., “Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages,” arXiv preprint arXiv:2511.09690, 2025.
- [18] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222.
- [19] S. Kazi and S. Khoja, “Uquad+: Benchmark dataset for urdu machine reading comprehension,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 25, no. 2, pp. 1–34, 2026.
- [20] M. Owais, K. K. Jadoon, A. I. Sandhu, Z. Ali, Z. Mahmood, M. Yahya, and A. Wahid, “Deepfake audio detection in low-resource languages: A case study of urdu,” IEEE Access, 2026.
- [21] G. M. Dar and R. Delhibabu, “Cross-lingual speech emotion recognition with attention-driven bi-lstm: Advancing kashmiri and multilingual adaptation,” International Journal of Analysis and Applications, vol. 24, pp. 43–43, 2026.
- [22] L. Li, Z. Guo, H. Chen, Y. Dai, Z. Zhang, H. Xue, T. Zuo, C. Wang, S. Wang, J. Li et al., “Wenetspeech-yue: A large-scale cantonese speech corpus with multi-dimensional annotation,” arXiv preprint arXiv:2509.03959, 2025.
- [23] Y. Dai, Z. Zhang, S. Wang, L. Li, Z. Guo, T. Zuo, S. Wang, H. Xue, C. Wang, Q. Wang et al., “Wenetspeech-chuan: A large-scale sichuanese corpus with rich annotation for dialectal speech processing,” arXiv preprint arXiv:2509.18004, 2025.
- [24] Appen Pty Ltd, “Arl urdu speech database, training data ldc2007s03,” Linguistic Data Consortium, Philadelphia, 2007, ISBN: 1-58563-412-3.
- [25] Center for Language Engineering (CLE), “Cle pakistan district names speech corpus - urdu speakers,” https://www.cle.org.pk/clestore/speech-urdu.htm, 2016, accessed: 2026-03-02.
- [26] M. Khan, S. Alam, B. B. Mariyam, N. Rajesha, G. Manasa, D. Srikanth, S. Fernandes, S. Nithin, N. K. Choudhary, and S. Mohan, “Urdu sentence aligned speech corpus,” Central Institute of Indian Languages, Mysore, 2023, ISBN: 978-81-19411-87-0. Catalogue Number: 1434. [Online]. Available: https://data.ldcil.org/urdu-sentence-aligned-corpus
- [27] S. A. Khan, M. Mansoor, and A. Habib, “Overcoming linguistic barriers developing advanced urdu text-to-speech systems,” in 2024 19th International Conference on Emerging Technologies (ICET). IEEE, 2024, pp. 1–6.
- [28] S. Sarfraz et al., “Phonological variations of english in pakistan,” in Proceedings of the Conference on Language and Technology (CLT10), 2010.
- [29] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620.
- [30] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer et al., “The interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” Proc. INTERSPEECH, pp. 148–152, 2013.
- [31] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.
- [32] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518.
- [33] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, “Spleeter: a fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, p. 2154, 2020.
- [34] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020, pp. 7124–7128.
- [35] A. C. Morris, V. Maier, and P. D. Green, “From wer and ril to mer and wil: improved evaluation measures for connected speech recognition.” in Interspeech, no. 4-8, 2004, p. 2004.
- [36] A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Demucs: Deep extractor for music sources with extra unlabeled data remixed,” arXiv preprint arXiv:1909.01174, 2019.
- [37] T. ITU, “Recommendation p. 800. methods for subjective determination of transmission quality,” International Telecommunication Union, 1996.
- [38] W. McKinney et al., “Data structures for statistical computing in python.” scipy, vol. 445, no. 1, pp. 51–56, 2010.
- [39] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and psychological measurement, vol. 20, no. 1, pp. 37–46, 1960.
- [40] J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological bulletin, vol. 76, no. 5, p. 378, 1971.
- [41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011.
- [42] S. Seabold, J. Perktold et al., “Statsmodels: econometric and statistical modeling with python.” scipy, vol. 7, no. 1, pp. 92–96, 2010.
- [43] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805.