UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

Attia Nafees ul Haq; Chunjiang He; Jingbin Hu; Lei Xie; Zeyu Zhu

arxiv: 2605.17846 · v1 · pith:OXAC2W3Mnew · submitted 2026-05-18 · 📡 eess.AS

Pith reviewed 2026-05-20 01:11 UTC · model grok-4.3

classification 📡 eess.AS

keywords Urdu speech corpus · paralinguistic annotations · code-switching · low-resource languages · speech technology · audio dataset · natural language processing · linguistic inclusivity

The pith

UrduSpeech supplies 156 hours of Urdu audio annotated across 12 paralinguistic dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a large speech corpus to address the shortage of training data for Urdu, a language with 230 million speakers. It compiles 156 hours of audio from 12 source categories, including news, drama, and literary forms, with metadata on 12 paralinguistic dimensions. The collection covers standard Urdu, code-switched Urdu-English, and Pakistani-accented English, and an LLM-driven pipeline handles the right-to-left script and frequent code-switching. A smaller 9-hour benchmark set, manually corrected by native annotators, is provided for evaluation. The reported human checks (MOS 4.6, Cohen's kappa 0.68) are presented as evidence of audio quality and rater consistency, which would make the resource usable for building speech recognition and synthesis tools.

Core claim

UrduSpeech is a high-fidelity corpus of 156 hours of Urdu audio with 12-dimension paralinguistic metadata that spans three varieties: US-Std (standard Urdu), US-CS (code-switched Urdu-English), and US-EngPk (Pakistani-accented English). The data set draws from 12 categories and is assembled by an LLM-driven curation pipeline that manages script direction and frequent code-switching. A 9-hour benchmark subset receives manual correction by native annotators. Human quality assessment yields a mean opinion score (MOS) of 4.6 (std = 0.7) and a Cohen's kappa of 0.68, which the authors take as validating the pipeline's 97.6 percent confidence score. The full set maintains a 60-40 gender balance over 71,792 utterances.

What carries the argument

LLM-driven curation pipeline that selects utterances from 12 categories and applies 12-dimension paralinguistic annotations while resolving right-to-left script and code-switching constraints.

If this is right

Speech recognition models can now train directly on balanced, multi-variety Urdu data rather than relying on translated or synthetic material.
Paralinguistic features such as emotion or emphasis become measurable targets for Urdu-specific natural language understanding systems.
Researchers gain a standard benchmark set for consistent comparison of new Urdu speech models.
Code-switching behavior can be modeled explicitly, improving performance on everyday mixed-language conversations.
Open release of both corpus and curation code allows direct replication and incremental addition of new categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same curation approach could be adapted to other right-to-left or heavily code-switched languages such as Arabic or Hindi-English mixes.
Voice interfaces built on this data may better capture culturally specific prosody that generic multilingual models currently miss.
Pairing the corpus with existing English datasets could produce stronger joint models for South Asian code-switched speech.
Long-term tracking of model performance on the released benchmark would quantify whether the added paralinguistic labels actually improve downstream tasks.

Load-bearing premise

The LLM pipeline's 97.6 percent score together with the human MOS of 4.6 and Cohen's kappa of 0.68 correctly indicate accurate annotations and faithful audio across every category and language variety.

What would settle it

Independent native speakers re-annotating a random sample of the 9-hour benchmark set and reporting frequent label errors or quality scores below 4.0 would falsify the claim of high-fidelity curation.
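The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's protocol: the record fields (`pipeline_label`, `reannotated_label`, `quality_score`), the sampling helper, and the 10 percent cutoff used to stand in for "frequent" label errors are all assumptions. Only the 4.0 quality threshold comes from the text.

```python
import random
from statistics import mean


def draw_sample(clip_ids, k, seed=0):
    """Draw a reproducible random sample of clip ids for re-annotation."""
    rng = random.Random(seed)
    return rng.sample(list(clip_ids), k)


def audit(records, mos_threshold=4.0, max_error_rate=0.10):
    """Summarize a re-annotation pass.

    Each record is a dict with hypothetical keys:
    pipeline_label, reannotated_label, quality_score (1 to 5).
    max_error_rate is an assumed cutoff, not a value from the paper.
    """
    errors = sum(r["pipeline_label"] != r["reannotated_label"] for r in records)
    error_rate = errors / len(records)
    mos = mean(r["quality_score"] for r in records)
    return {
        "error_rate": error_rate,
        "mos": mos,
        "claim_survives": error_rate <= max_error_rate and mos >= mos_threshold,
    }


# Toy data for illustration only; not drawn from the corpus.
toy = [
    {"pipeline_label": "neutral", "reannotated_label": "neutral", "quality_score": 5},
    {"pipeline_label": "happy", "reannotated_label": "neutral", "quality_score": 4},
    {"pipeline_label": "sad", "reannotated_label": "sad", "quality_score": 5},
    {"pipeline_label": "neutral", "reannotated_label": "neutral", "quality_score": 4},
]
result = audit(toy)
```

On the toy data one of four labels disagrees, so the error rate exceeds the assumed cutoff even though the mean quality score stays above 4.0. Fixing the seed keeps the sample reproducible, so a second group of annotators can be given the same clips.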

Figures

Figures reproduced from arXiv: 2605.17846 by Attia Nafees ul Haq, Chunjiang He, Jingbin Hu, Lei Xie, Zeyu Zhu.

**Figure 1.** Overview of the UrduSpeech data curation pipeline.

**Figure 2.** Detailed results for human-centric assessment.

**Figure 3.** Corpus data distribution across subsets and categories. The corpus contains 71,792 diarized segments, categorized by duration into short-format (55,407 segments) and long-format (16,243 segments) clips.

Original abstract

Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, and US-EngPk. To address Right-to-Left script constraints and frequent code-switching, we developed UrduSpeech, an LLM-driven pipeline to curate data across 12 diverse categories, including news, drama, and rare literary forms like Bait-Bazi. We also release a 9-hour US-Benchmark set, manually corrected by native annotators to serve as a standard. Human quality assessment of the primary 156-hour corpus yielded a Mean Opinion Score (MOS) of 4.6 (std = 0.7) with inter-rater reliability confirmed by a 0.68 Cohen's Kappa, validating our curation pipeline's 97.6% confidence score. The corpus maintains a 60-40 gender balance across 71,792 utterances. Our work represents a significant leap toward linguistic inclusivity in global AI. The corpus and code are open-sourced, and a demo page is available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UrduSpeech releases a practical new 156-hour corpus for an under-resourced language, but the annotation quality rests on aggregate metrics without per-dimension or per-variety checks.

UrduSpeech is mainly a data-release paper: it puts out 156 hours of Urdu audio with 12 paralinguistic annotation dimensions, plus a 9-hour benchmark set. That is the key contribution for a language with 230 million speakers but thin existing resources.

The authors handle some real practical issues well. They built an LLM pipeline to manage right-to-left script and code-switching, drew from a range of categories such as news and drama, and report a 60-40 gender balance. The open release, together with the reported MOS of 4.6 and kappa of 0.68, gives a starting point for judging quality.

The soft spot is the lack of granular checks. The quality metrics are all aggregate, with no breakdown by the 12 dimensions or by variety (standard, code-switched, Pakistani-accented English). There is also no reported agreement rate between the LLM labels and the manual corrections on the benchmark. This leaves it uncertain how reliable each annotation type actually is.

The paper is aimed at researchers in speech recognition or paralinguistics who need data for Urdu or similar low-resource languages, whether for training models or as a reference. I think it deserves peer review to verify the curation details and to find out whether the finer-grained analysis exists.

Referee Report

1 major / 2 minor

Summary. The paper introduces UrduSpeech, a 156-hour high-fidelity Urdu speech corpus with 12-dimension paralinguistic annotations covering US-Std, US-CS, and US-EngPk varieties. It describes an LLM-driven curation pipeline addressing RTL script and code-switching issues, releases a 9-hour manually corrected US-Benchmark, and reports aggregate quality metrics: MOS of 4.6 (std=0.7), Cohen's Kappa of 0.68, and 97.6% pipeline confidence, with 60-40 gender balance across 71,792 utterances from diverse sources including news, drama, and literary forms.

Significance. If the 12-dimension annotations are shown to be reliable, this corpus would provide a valuable open resource for an under-resourced language spoken by 230 million people, supporting research in paralinguistic speech processing, code-switching ASR, and inclusive AI. The inclusion of a benchmark set and open-sourcing are clear strengths for reproducibility and community use.

major comments (1)

[Quality Assessment] Quality Assessment section: The central claim of reliable 12-dimension paralinguistic annotations rests on aggregate MOS 4.6, Cohen's Kappa 0.68 for the primary corpus, and 97.6% LLM confidence. No per-dimension breakdown of these metrics, no per-variety (US-Std/US-CS/US-EngPk) results, and no quantitative agreement (e.g., Kappa or error rate) between LLM labels and the manually corrected 9-hour US-Benchmark are reported. This gap directly affects verification that the annotations are accurate across the full claimed scope.
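The missing LLM-versus-manual agreement statistic could be computed per dimension along the following lines. This is a sketch under stated assumptions: the dimension names and label values are placeholders, since the page does not list the paper's 12 dimensions or their label sets, and the function implements the standard two-rater Cohen's kappa rather than anything taken from the authors' code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_by_dimension(llm, manual):
    """Map each annotation dimension to its LLM-vs-manual kappa."""
    return {dim: cohen_kappa(llm[dim], manual[dim]) for dim in llm}


# Toy labels; "emotion" and "gender" are hypothetical dimension names.
llm = {"emotion": ["neu", "hap", "neu", "sad"], "gender": ["f", "m", "f", "m"]}
manual = {"emotion": ["neu", "neu", "neu", "sad"], "gender": ["f", "m", "f", "m"]}
scores = kappa_by_dimension(llm, manual)
```

Reporting one kappa per dimension, rather than a pooled value, would show whether subjective dimensions agree less well than objective ones. An aggregate figure cannot reveal that.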

minor comments (2)

[Methods] Provide explicit definitions and annotation protocols for each of the 12 paralinguistic dimensions to support reproducibility.
[Corpus Description] Include utterance counts or duration breakdowns by variety and by the 12 source categories to allow assessment of balance.
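The breakdown requested in the second minor comment is a simple grouped aggregation over segment metadata. The sketch below assumes each segment record carries `variety`, `category`, and `duration_s` fields. Those field names are hypothetical and the released metadata schema may differ.

```python
from collections import defaultdict


def breakdown(segments, key):
    """Return {group: (utterance_count, total_seconds)} for one metadata key."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for seg in segments:
        counts[seg[key]] += 1
        seconds[seg[key]] += seg["duration_s"]
    return {group: (counts[group], seconds[group]) for group in counts}


# Toy segments for illustration only; field names are assumptions.
toy_segments = [
    {"variety": "US-Std", "category": "news", "duration_s": 12.0},
    {"variety": "US-CS", "category": "interview", "duration_s": 30.0},
    {"variety": "US-Std", "category": "drama", "duration_s": 18.0},
    {"variety": "US-EngPk", "category": "interview", "duration_s": 6.0},
]
by_variety = breakdown(toy_segments, "variety")
by_category = breakdown(toy_segments, "category")
```

Running the same function once per key gives both tables the referee asks for. Dividing the second element by 3600 converts seconds to hours for reporting.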

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the quality assessment of the annotations. We agree that additional granularity is needed to strengthen verification of the 12-dimension labels and will revise the manuscript to address these points.

Referee: [Quality Assessment] Quality Assessment section: The central claim of reliable 12-dimension paralinguistic annotations rests on aggregate MOS 4.6, Cohen's Kappa 0.68 for the primary corpus, and 97.6% LLM confidence. No per-dimension breakdown of these metrics, no per-variety (US-Std/US-CS/US-EngPk) results, and no quantitative agreement (e.g., Kappa or error rate) between LLM labels and the manually corrected 9-hour US-Benchmark are reported. This gap directly affects verification that the annotations are accurate across the full claimed scope.

Authors: We acknowledge that the manuscript reports only aggregate MOS, Kappa, and pipeline confidence values. In the revised version we will add (1) per-dimension breakdowns of MOS and Cohen's Kappa for the 12 paralinguistic attributes, (2) the same metrics stratified by variety (US-Std, US-CS, US-EngPk), and (3) quantitative agreement statistics (Cohen's Kappa and/or error rates) between the LLM labels and the manually corrected 9-hour US-Benchmark. These additions will be placed in an expanded Quality Assessment section to allow direct verification across the claimed scope.

Revision: yes
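Item (2) of the response, stratifying MOS by variety, amounts to a grouped mean and standard deviation over the individual ratings. A minimal sketch follows, assuming each rating record has a `variety` tag and an integer `score` from 1 to 5. The record layout is an assumption and the toy scores are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def mos_by_group(ratings, key):
    """Return {group: (mean_score, sample_std)} for ratings grouped by key."""
    grouped = defaultdict(list)
    for rating in ratings:
        grouped[rating[key]].append(rating["score"])
    return {
        group: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for group, scores in grouped.items()
    }


# Toy ratings; not taken from the paper's assessment.
toy_ratings = [
    {"variety": "US-Std", "score": 5},
    {"variety": "US-Std", "score": 4},
    {"variety": "US-CS", "score": 4},
    {"variety": "US-CS", "score": 4},
    {"variety": "US-EngPk", "score": 3},
]
mos = mos_by_group(toy_ratings, "variety")
```

A group with a single rating gets a standard deviation of 0.0 here instead of raising an error. In a real report such groups would be better flagged as too small to summarize.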

Circularity Check

0 steps flagged

No circularity: direct corpus release with independent quality metrics

This is a data collection and release paper with no derivations, equations, predictions, or first-principles results. The central claims concern the creation of the 156-hour UrduSpeech corpus and the reporting of quality metrics (MOS 4.6, Cohen's Kappa 0.68, 97.6% LLM confidence) obtained from human assessments and pipeline outputs. These metrics are presented as direct empirical results rather than outputs derived from or fitted to the same inputs by construction. No self-citation chains, ansatzes, or renamings of known results appear in any load-bearing step. The work is self-contained against external benchmarks of corpus quality.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a data resource paper rather than a mathematical derivation. No numbers are fitted to data and no new physical entities are postulated. The 12 annotation dimensions and 12 curation categories are design choices rather than fitted parameters.

axioms (1)

domain assumption: Urdu remains critically under-resourced in speech technology despite 230 million speakers.
This premise motivates the entire work and is stated directly in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear.

We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata... Human quality assessment... MOS of 4.6... Cohen's Kappa... 97.6% confidence score.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

[1] Introduction. In recent years, the digital preservation of languages within the AI landscape has become a cornerstone of linguistic equality [1]. Yet, despite Urdu’s global significance and its vast diaspora, it remains remarkably under-resourced in the context of multimodal foundation models and Speech LLMs. Recent benchmarks highlight a persistent pe...

[2] and WenetSpeech-Chuan [13] pipelines, we developed a specialized solution for the Urdu-English paradigm. We build upon foundational datasets including ARL Urdu [14], CLE Pakistan [15], and LDC-IL [16] while addressing the critical shortage of high-fidelity data in modern Urdu TTS [17] and ASR. ...

[3] Model selection and benchmark set. Prior to large-scale development, we conducted a 13-hour audio pilot study across 12 categories, including poetry, news, and vlogs. We gathered this raw audio “in-the-wild” and processed it according to stage 1 of our curation pipeline, as shown in Figure 1. We utilized Spleeter [23] for noise removal and Pyannote

[4] for speaker diarization. To ensure high data quality, we discarded single-speaker clips and segments shorter than two seconds. Additionally, all audio clips were capped at a maximum duration of 35 seconds to optimize downstream transcription [...] Ethical Statement (footnote 2): All data sourced from public repositories; no personal identifiers retained. Content is non-poli...

[5] UrduSpeech corpus curation pipeline. Building on our US-benchmark set pilot, we scaled the corpus development into a multi-stage pipeline, as illustrated in Figure 1. We further incorporated audio format metadata (short vs. long form) and integrated model confidence scores alongside quality assessments conducted by native annotators. 3.1. Data collection...

[6] Human-centric quality assessment. 4.1. Experimental setup and recruitment. To validate the corpus, 180 clips across three sets (A, B, and C) were randomly sampled by complexity using an anchor set strategy (Table 2). Six university-recruited native Urdu speakers (3M/3F) evaluated the data in a controlled laboratory setting. To ensure independent, high-q...

[7] UrduSpeech corpus. 5.1. Corpus Distribution and Statistics. The UrduSpeech corpus comprises 156 hours (91 GB) of diarized audio. As shown in Figure 3, the Interview category represents the largest share, accounting for approximately 34 hours (21% of the total volume). Traditional genres such as drama and poetry contain a higher volume of US-Std, whereas conver...

[8] Limitation and future work. Our corpus provides a substantial resource for Urdu, code-switched Urdu-English, and Pakistani-accent English speech research, yet several limitations exist. First, while automated diarization via Pyannote 3.1 identified over 3,000 unique speaker clusters, we conservatively estimate the count at 1,000+ unique speakers to ac...

[9] Conclusion. In this study, we introduced UrduSpeech, a 156-hour (91 GB) multi-domain speech corpus featuring 12-dimension paralinguistic metadata. By developing a robust and reproducible pipeline, we successfully addressed the complexities of “in-the-wild” Urdu speech and the high prevalence of Urdu-English code-switching. Our stratified methodology re...

[10] Generative AI Use Disclosure. The authors acknowledge the use of generative AI tools solely for text refinement, grammar corrections, and proofreading of the manuscript. All technical methodologies, data collection, and original research contributions were conceived and executed entirely by the authors.

[11] D. Blasi, A. Anastasopoulos, and G. Neubig, “Systematic inequalities in language technology performance across the world’s languages,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5486–5505.

[12] S. Arif, A. J. Khan, M. Abbas, A. A. Raza, and A. Athar, “Wer we stand: Benchmarking urdu asr models,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 5952–5961.

[13] S. Bandarupalli, B. Akkiraju, S. C. Devarakonda, H. Sivaramasethu, V. Narasinga, and A. Vuppala, “Towards unified processing of perso-arabic scripts for asr,” in Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, 2025, pp. 23–28.

[14] M. Sharif, Z. Abbas, J. Yi, and C. Liu, “From statistical methods to pre-trained models: A survey on automatic speech recognition for resource-scarce urdu language,” arXiv preprint arXiv:2411.14493, 2024.

[15] M. Sadeqi et al., “Challenges and opportunities in urdu-english code-switched speech recognition,” Journal of Linguistic Engineering, 2023.

[16] A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey,” Artificial Intelligence Review, vol. 47, no. 3, pp. 279–311, 2017.

[17] A. Omnilingual, G. Keren, A. Kozhevnikov, Y. Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu et al., “Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages,” arXiv preprint arXiv:2511.09690, 2025.

[18] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222.

[19] S. Kazi and S. Khoja, “Uquad+: Benchmark dataset for urdu machine reading comprehension,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 25, no. 2, pp. 1–34, 2026.

[20] M. Owais, K. K. Jadoon, A. I. Sandhu, Z. Ali, Z. Mahmood, M. Yahya, and A. Wahid, “Deepfake audio detection in low-resource languages: A case study of urdu,” IEEE Access, 2026.

[21] G. M. Dar and R. Delhibabu, “Cross-lingual speech emotion recognition with attention-driven bi-lstm: Advancing kashmiri and multilingual adaptation,” International Journal of Analysis and Applications, vol. 24, pp. 43–43, 2026.

[22] L. Li, Z. Guo, H. Chen, Y. Dai, Z. Zhang, H. Xue, T. Zuo, C. Wang, S. Wang, J. Li et al., “Wenetspeech-yue: A large-scale cantonese speech corpus with multi-dimensional annotation,” arXiv preprint arXiv:2509.03959, 2025.

[23] Y. Dai, Z. Zhang, S. Wang, L. Li, Z. Guo, T. Zuo, S. Wang, H. Xue, C. Wang, Q. Wang et al., “Wenetspeech-chuan: A large-scale sichuanese corpus with rich annotation for dialectal speech processing,” arXiv preprint arXiv:2509.18004, 2025.

[24] Appen Pty Ltd, “Arl urdu speech database, training data ldc2007s03,” Linguistic Data Consortium, Philadelphia, 2007, ISBN: 1-58563-412-3.

[25] Center for Language Engineering (CLE), “Cle pakistan district names speech corpus - urdu speakers,” https://www.cle.org.pk/clestore/speech-urdu.htm, 2016, accessed: 2026-03-02.

[26] M. Khan, S. Alam, B. B. Mariyam, N. Rajesha, G. Manasa, D. Srikanth, S. Fernandes, S. Nithin, N. K. Choudhary, and S. Mohan, “Urdu sentence aligned speech corpus,” Central Institute of Indian Languages, Mysore, 2023, ISBN: 978-81-19411-87-0. Catalogue Number: 1434. [Online]. Available: https://data.ldcil.org/urdu-sentence-aligned-corpus

[27] S. A. Khan, M. Mansoor, and A. Habib, “Overcoming linguistic barriers developing advanced urdu text-to-speech systems,” in 2024 19th International Conference on Emerging Technologies (ICET). IEEE, 2024, pp. 1–6.

[28] S. Sarfraz et al., “Phonological variations of english in pakistan,” in Proceedings of the Conference on Language and Technology (CLT10), 2010.

[29] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620.

[30] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer et al., “The interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” Proc. INTERSPEECH, pp. 148–152, 2013.

[31] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.

[32] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518.

[33] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, “Spleeter: a fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, p. 2154, 2020.

[34] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020, pp. 7124–7128.

[35] A. C. Morris, V. Maier, and P. D. Green, “From wer and ril to mer and wil: improved evaluation measures for connected speech recognition,” in Interspeech, no. 4-8, 2004, p. 2004.

[36] A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Demucs: Deep extractor for music sources with extra unlabeled data remixed,” arXiv preprint arXiv:1909.01174, 2019.

[37] T. ITU, “Recommendation p. 800. methods for subjective determination of transmission quality,” International Telecommunication Union, 1996.

[38] W. McKinney et al., “Data structures for statistical computing in python,” scipy, vol. 445, no. 1, pp. 51–56, 2010.

[39] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and psychological measurement, vol. 20, no. 1, pp. 37–46, 1960.

[40] J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological bulletin, vol. 76, no. 5, p. 378, 1971.

[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011.

[42] S. Seabold, J. Perktold et al., “Statsmodels: econometric and statistical modeling with python,” scipy, vol. 7, no. 1, pp. 92–96, 2010.

[43] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805.

[1] [1]

Yet, despite Urdu’s global significance and its vast diaspora, it remains remarkably under-resourced in the context of multi- modal foundation models and Speech LLMs

Introduction In recent years, the digital preservation of languages within the AI landscape has become a cornerstone of linguistic equality [1]. Yet, despite Urdu’s global significance and its vast diaspora, it remains remarkably under-resourced in the context of multi- modal foundation models and Speech LLMs. Recent bench- marks highlight a persistent pe...

work page

[2] [2]

and WenetSpeech-Chuan [13] pipelines, we developed a specialized solution for the Urdu-English paradigm. We build upon foundational datasets including ARL Urdu [14], CLE Pak- istan [15], and LDC-IL [16] while addressing the critical short- age of high-fidelity data in modern Urdu TTS [17] and ASR. **indicates the corresponding author. 1https://interspeech...

work page

[3] [3]

We gathered this raw audio ”in-the-wild” and processed it according to our curation pipeline stage 1 as show in the fig- ure 1

Model selection and benchmark set Prior to large-scale development, we conducted a 13-hour au- dio pilot study across 12 categories, including poetry, news, and vlogs. We gathered this raw audio ”in-the-wild” and processed it according to our curation pipeline stage 1 as show in the fig- ure 1. We utilized Spleeter [23] for noise removal and Pyannote

work page

[4] [4]

UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

for speaker diarization. To ensure high data quality, we discarded single-speaker clips and segments shorter than two seconds. Additionally, all audio clips were capped at a maxi- mum duration of 35 seconds to optimize downstream transcrip- 2Ethical Statement: All data sourced from public repositories; no personal identifiers retained. Content is non-poli...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

We further incorporated audio format metadata (short vs

UrduSpeech corpus curation pipeline Building on our US-benchmark set pilot, we scaled the corpus development into a multi-stage pipeline, as illustrated in Fig- ure 1. We further incorporated audio format metadata (short vs. long form) and integrated model confidence scores alongside quality assessments conducted by native annotators. 3.1. Data collection...

work page

[6] [6]

Experimental setup and recruitment To validate the corpus, 180 clips across three sets (A, B, and C) were randomly sampled by complexity using an anchor set strategy (Table 2)

Human-centric quality assessment 4.1. Experimental setup and recruitment To validate the corpus, 180 clips across three sets (A, B, and C) were randomly sampled by complexity using an anchor set strategy (Table 2). Six university-recruited native Urdu speak- ers (3M/3F) evaluated the data in a controlled laboratory set- ting. To ensure independent, high-q...

work page

[7] [7]

UrduSpeech corpus

5.1. Corpus Distribution and Statistics

The UrduSpeech corpus comprises 91 GB (156 hours) of diarized audio. As shown in Figure 3, the Interview category represents the largest share, accounting for approximately 34 hours (21% of the total volume). Traditional genres such as drama and poetry contain a higher volume of US-Std, whereas conver...
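A quick arithmetic check of the headline figures can be sketched as below. The hour, gigabyte, and Interview-category values come from the text, and the utterance count (71,792) is taken from the abstract; the variable names are illustrative only. Because the Interview figure is stated as approximate, the computed share will not necessarily match the quoted 21% exactly.

```python
TOTAL_HOURS = 156        # corpus duration, from the text
TOTAL_GB = 91            # corpus size on disk, from the text
INTERVIEW_HOURS = 34     # approximate Interview-category duration, from the text
UTTERANCES = 71_792      # utterance count, from the abstract

# Share of the corpus taken by the Interview category.
interview_share = INTERVIEW_HOURS / TOTAL_HOURS

# Average utterance length in seconds.
mean_utterance_sec = TOTAL_HOURS * 3600 / UTTERANCES

# Average storage per hour of audio.
gb_per_hour = TOTAL_GB / TOTAL_HOURS

print(f"Interview share: {interview_share:.1%}")
print(f"Mean utterance length: {mean_utterance_sec:.2f} s")
print(f"Storage per hour: {gb_per_hour:.2f} GB")
```

The mean utterance length of roughly eight seconds sits well inside the 2 to 35 second window imposed during curation.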

[8] [8]

Limitation and future work

Our corpus provides a substantial resource for Urdu, code-switched Urdu-English, and Pakistani-accent English speech research, yet several limitations exist. First, while automated diarization via Pyannote 3.1 identified over 3,000 unique speaker clusters, we conservatively estimate the count at 1,000+ unique speakers to ac...

[9] [9]

Conclusion

In this study, we introduced UrduSpeech, a 156-hour (91 GB) multi-domain speech corpus featuring 12-dimension paralinguistic metadata. By developing a robust and reproducible pipeline, we successfully addressed the complexities of “in-the-wild” Urdu speech and the high prevalence of Urdu-English code-switching. Our stratified methodology re...

[10] [10]

Generative AI Use Disclosure

The authors acknowledge the use of generative AI tools solely for text refinement, grammar corrections, and proofreading of the manuscript. All technical methodologies, data collection, and original research contributions were conceived and executed entirely by the authors.

[11] [11]

D. Blasi, A. Anastasopoulos, and G. Neubig, “Systematic inequalities in language technology performance across the world’s languages,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5486–5505

[12] [12]

S. Arif, A. J. Khan, M. Abbas, A. A. Raza, and A. Athar, “Wer we stand: Benchmarking urdu asr models,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 5952–5961

[13] [13]

S. Bandarupalli, B. Akkiraju, S. C. Devarakonda, H. Sivaramasethu, V. Narasinga, and A. Vuppala, “Towards unified processing of perso-arabic scripts for asr,” in Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, 2025, pp. 23–28

[14] [14]

M. Sharif, Z. Abbas, J. Yi, and C. Liu, “From statistical methods to pre-trained models: A survey on automatic speech recognition for resource-scarce urdu language,” arXiv preprint arXiv:2411.14493, 2024

[15] [15]

M. Sadeqi et al., “Challenges and opportunities in urdu-english code-switched speech recognition,” Journal of Linguistic Engineering, 2023

[16] [16]

A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey,” Artificial Intelligence Review, vol. 47, no. 3, pp. 279–311, 2017

[17] [17]

A. Omnilingual, G. Keren, A. Kozhevnikov, Y. Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu et al., “Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages,” arXiv preprint arXiv:2511.09690, 2025

[18] [18]

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222

[19] [19]

S. Kazi and S. Khoja, “Uquad+: Benchmark dataset for urdu machine reading comprehension,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 25, no. 2, pp. 1–34, 2026

[20] [20]

M. Owais, K. K. Jadoon, A. I. Sandhu, Z. Ali, Z. Mahmood, M. Yahya, and A. Wahid, “Deepfake audio detection in low-resource languages: A case study of urdu,” IEEE Access, 2026

[21] [21]

G. M. Dar and R. Delhibabu, “Cross-lingual speech emotion recognition with attention-driven bi-lstm: Advancing kashmiri and multilingual adaptation,” International Journal of Analysis and Applications, vol. 24, pp. 43–43, 2026

[22] [22]

L. Li, Z. Guo, H. Chen, Y. Dai, Z. Zhang, H. Xue, T. Zuo, C. Wang, S. Wang, J. Li et al., “Wenetspeech-yue: A large-scale cantonese speech corpus with multi-dimensional annotation,” arXiv preprint arXiv:2509.03959, 2025

[23] [23]

Y. Dai, Z. Zhang, S. Wang, L. Li, Z. Guo, T. Zuo, S. Wang, H. Xue, C. Wang, Q. Wang et al., “Wenetspeech-chuan: A large-scale sichuanese corpus with rich annotation for dialectal speech processing,” arXiv preprint arXiv:2509.18004, 2025

[24] [24]

Appen Pty Ltd, “Arl urdu speech database, training data ldc2007s03,” Linguistic Data Consortium, Philadelphia, 2007, ISBN: 1-58563-412-3

[25] [25]

Center for Language Engineering (CLE), “Cle pakistan district names speech corpus - urdu speakers,” https://www.cle.org.pk/clestore/speech-urdu.htm, 2016, accessed: 2026-03-02

[26] [26]

M. Khan, S. Alam, B. B. Mariyam, N. Rajesha, G. Manasa, D. Srikanth, S. Fernandes, S. Nithin, N. K. Choudhary, and S. Mohan, “Urdu sentence aligned speech corpus,” Central Institute of Indian Languages, Mysore, 2023, ISBN: 978-81-19411-87-0. Catalogue Number: 1434. [Online]. Available: https://data.ldcil.org/urdu-sentence-aligned-corpus

[27] [27]

S. A. Khan, M. Mansoor, and A. Habib, “Overcoming linguistic barriers developing advanced urdu text-to-speech systems,” in 2024 19th International Conference on Emerging Technologies (ICET). IEEE, 2024, pp. 1–6

[28] [28]

S. Sarfraz et al., “Phonological variations of english in pakistan,” in Proceedings of the Conference on Language and Technology (CLT10), 2010

[29] [29]

A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620

[30] [30]

B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer et al., “The interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” Proc. INTERSPEECH, pp. 148–152, 2013

[31] [31]

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

[32] [32]

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

[33] [33]

R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, “Spleeter: a fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, p. 2154, 2020

[34] [34]

H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020, pp. 7124–7128

[35] [35]

A. C. Morris, V. Maier, and P. D. Green, “From wer and ril to mer and wil: improved evaluation measures for connected speech recognition,” in Interspeech, no. 4-8, 2004, p. 2004

[36] [36]

A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Demucs: Deep extractor for music sources with extra unlabeled data remixed,” arXiv preprint arXiv:1909.01174, 2019

[37] [37]

T. ITU, “Recommendation p. 800. methods for subjective determination of transmission quality,” International Telecommunication Union, 1996

[38] [38]

W. McKinney et al., “Data structures for statistical computing in python,” scipy, vol. 445, no. 1, pp. 51–56, 2010

[39] [39]

J. Cohen, “A coefficient of agreement for nominal scales,” Educational and psychological measurement, vol. 20, no. 1, pp. 37–46, 1960

[40] [40]

J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological bulletin, vol. 76, no. 5, p. 378, 1971

[41] [41]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011

[42] [42]

S. Seabold, J. Perktold et al., “Statsmodels: econometric and statistical modeling with python,” scipy, vol. 7, no. 1, pp. 92–96, 2010

[43] [43]

A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805