
arxiv: 2605.26935 · v1 · pith:2PEPIXAGnew · submitted 2026-05-26 · 💻 cs.CL

DunbaaBERT: From Sacrifice to Semantics

Pith reviewed 2026-06-29 18:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords Urdu NLP · language-specific pretraining · RoBERTa · Byte-BPE · multilingual baselines · model efficiency · low-resource languages · vocabulary size

The pith

Urdu RoBERTa models trained from scratch on a 17GB corpus match multilingual baselines with better efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that dedicated Urdu encoder models, built as RoBERTa-base variants on a carefully deduplicated 17GB Urdu corpus using Byte-BPE tokenizers of 32k, 52k, and 96k tokens, reach performance levels comparable to strong multilingual models on tasks such as linguistic acceptability, news classification, offensive language detection, and sentiment analysis. A sympathetic reader would care because Urdu has lacked sufficient resources for high-quality models, and this work tests whether language-specific training at modest scale can close that gap without the overhead of massive multilingual pretraining. The results indicate that vocabulary size does not improve downstream results in a simple linear way, with the 32k variant frequently delivering the strongest accuracy-efficiency balance.

Core claim

DunbaaBERT models trained from scratch on the 17GB deduplicated Urdu corpus achieve competitive results against multilingual baselines across intrinsic and downstream Urdu benchmarks while maintaining favorable efficiency trade-offs, with the 32k-vocabulary variant repeatedly showing the best overall profile.

What carries the argument

RoBERTa-base encoders pretrained from scratch on a deduplicated 17GB Urdu corpus using Byte-BPE vocabularies of varying sizes.
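
To make the tokenizer side of this concrete, here is a minimal sketch of byte-level BPE training in plain Python. It is not the authors' tokenizer: the function names (`train_byte_bpe`, `apply_merge`, `encode`) are hypothetical, and the sketch omits the pre-tokenization and special tokens that production Byte-BPE implementations use. It only illustrates the core idea: start from UTF-8 bytes, then repeatedly merge the most frequent adjacent pair until the merge budget is spent.

```python
from collections import Counter


def apply_merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_bpe(texts, num_merges):
    """Learn up to `num_merges` merges over UTF-8 bytes (toy byte-level BPE)."""
    # Every text starts as a list of raw byte values in 0..255.
    sequences = [list(t.encode("utf-8")) for t in texts]
    merges = {}    # (left_id, right_id) -> new token id, in learned order
    next_id = 256  # ids 0..255 are reserved for the raw bytes
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges[best] = next_id
        sequences = [apply_merge(seq, best, next_id) for seq in sequences]
        next_id += 1
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        seq = apply_merge(seq, pair, new_id)
    return seq
```

In this toy setup the final vocabulary size is 256 plus the number of learned merges, so a 32k, 52k, or 96k vocabulary corresponds to a different merge budget over the same corpus. Because Urdu characters take two bytes each in UTF-8, the earliest merges tend to reassemble individual characters before any multi-character units appear.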

If this is right

  • Language-specific pretraining at compact scale can serve as a practical alternative to multilingual models for Urdu NLP applications.
  • Larger token vocabularies do not guarantee better downstream performance and can increase computational cost.
  • Releasing the models under an open license supports further development of Urdu-specific tools and datasets.
  • Efficiency advantages of the smaller-vocabulary variant enable deployment in lower-resource environments.
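
One concrete way larger vocabularies add cost is the size of the token-embedding table. The sketch below works through that arithmetic under two assumptions that are not taken from the paper: that "32k", "52k", and "96k" mean exactly 32,000, 52,000, and 96,000 entries, and that the hidden width is the standard RoBERTa-base value of 768. The helper name `embedding_params` is hypothetical.

```python
HIDDEN_SIZE = 768  # standard RoBERTa-base hidden width

# Assumed exact vocabulary sizes; the paper's precise counts may differ.
VOCAB_SIZES = {"32k": 32_000, "52k": 52_000, "96k": 96_000}


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a vocab_size x hidden_size token-embedding matrix."""
    return vocab_size * hidden_size


for name, vocab_size in VOCAB_SIZES.items():
    millions = embedding_params(vocab_size) / 1e6
    print(f"{name}: {millions:.1f}M token-embedding parameters")
```

Under these assumptions the embedding table grows from roughly 24.6M to 73.7M parameters between the smallest and largest variant. This counts only the embedding matrix; a larger vocabulary can also shorten tokenized sequences, which pushes cost in the other direction, so the net effect has to be measured rather than assumed.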

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curation and training approach could be tested on other languages with similarly sized corpora to check whether the competitiveness pattern holds.
  • If the efficiency edge persists, it may reduce the need for full-scale multilingual pretraining when building task-specific systems for a single language.
  • The non-monotonic vocabulary effect suggests that tokenizer design choices deserve systematic study rather than defaulting to larger sizes.

Load-bearing premise

The chosen 17GB Urdu corpus and the selected downstream benchmarks are representative enough to show that language-specific training competes with multilingual baselines.

What would settle it

A broader or more diverse Urdu evaluation set where the multilingual baselines significantly outperform all DunbaaBERT variants on multiple tasks would falsify the competitiveness claim.

Figures

Figures reproduced from arXiv: 2605.26935 by Iffat Maab, Raphael Schmitt, Waleed Jamil.

Figure 1: Validation perplexity across DunbaaBERT models with 32k, 52k, and 96k vocabularies. Nearly overlapping convergence curves and only marginal differences in final perplexity suggest limited impact of vocabulary size on intrinsic pre-training behavior.
B Technical Specifications. B.1 Computational Setup. All experiments were conducted on one GPU compute node. GPU compute node is equipped with NVIDIA A100 GPU…
Figure 2: Predictive performance (test Macro-F1) versus inference throughput (samples per second) across down…
Figure 3: Training-time versus inference-efficiency trade-off across downstream Urdu benchmarks. The x-axis…
Figure 4: Aggregate relationship between total hyperparameter-search training cost and model performance–…
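
The two axes named in the Figure 2 caption, test Macro-F1 and inference throughput in samples per second, can be computed as in the following sketch. It shows the standard definitions only; the function names are hypothetical, and nothing here reproduces the paper's evaluation code or its numbers.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def throughput(num_samples, elapsed_seconds):
    """Inference throughput in samples per second."""
    return num_samples / elapsed_seconds
```

Macro-F1 weights every class equally regardless of frequency, which matters for imbalanced tasks such as offensive language detection, where a majority-class predictor can score well on accuracy but poorly on Macro-F1.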
Original abstract

Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch on a deduplicated 17GB Urdu corpus using Byte-BPE vocabularies of sizes 32k, 52k, and 96k. It reports evaluations on intrinsic and downstream Urdu NLP tasks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis, along with vocabulary-size ablations. The central claim is that these compact, language-specific models achieve competitive performance against multilingual baselines while offering favorable efficiency trade-offs, with the 32k variant often strongest; all models are released under MIT license.

Significance. If the empirical results hold with proper documentation, this would provide evidence that carefully curated language-specific pretraining on modest scales (17GB corpus, RoBERTa-base) can compete with multilingual models for Urdu, with practical implications for efficiency in low-resource settings. The public release of the models under MIT is a clear strength that enables reproducibility and follow-on work. The vocabulary-size analysis could inform tokenizer design choices, though its generalizability depends on the breadth of the reported benchmarks.

major comments (2)
  1. [Abstract] Abstract: The claim that 'DunbaaBERT variants achieve competitive performance against strong multilingual baselines' is stated without any accompanying metrics, tables, error bars, statistical tests, or baseline details. This is load-bearing for the central empirical claim and prevents verification of competitiveness or efficiency trade-offs from the provided text.
  2. [Corpus description] Corpus section (referenced in abstract as 'deduplicated 17GB Urdu corpus'): No domain breakdown, dialect coverage, or deduplication procedure is described. This directly affects the representativeness assumption required for the claim that language-specific training on this corpus yields general competitiveness.
minor comments (1)
  1. [Abstract] The notation 'DunbaaBERT$_{\text{32k}}$' is used without prior definition of the subscript convention in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'DunbaaBERT variants achieve competitive performance against strong multilingual baselines' is stated without any accompanying metrics, tables, error bars, statistical tests, or baseline details. This is load-bearing for the central empirical claim and prevents verification of competitiveness or efficiency trade-offs from the provided text.

    Authors: We agree that the abstract would be more informative if it included key quantitative results. The body of the manuscript contains full tables with performance metrics, baseline comparisons (including XLM-R and mBERT), and efficiency measurements across tasks. We will revise the abstract to incorporate representative numbers (e.g., average scores on acceptability, classification, and sentiment tasks) and a brief reference to the main baselines, while preserving the word limit. revision: yes

  2. Referee: [Corpus description] Corpus section (referenced in abstract as 'deduplicated 17GB Urdu corpus'): No domain breakdown, dialect coverage, or deduplication procedure is described. This directly affects the representativeness assumption required for the claim that language-specific training on this corpus yields general competitiveness.

    Authors: The referee correctly identifies that the current corpus description is insufficiently detailed. We will expand the relevant section to provide a domain breakdown of the source data, notes on dialect coverage (primarily formal Urdu with limited regional variants), and the exact deduplication method applied (combination of exact matching and MinHash-based near-duplicate removal). These additions will directly support the representativeness claim. revision: yes
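
The deduplication method named in this simulated response (exact matching plus MinHash-based near-duplicate removal) can be sketched as follows. This is an illustration of the general technique, not the authors' pipeline: the shingle length, the number of hash functions, the 0.8 threshold, and all function names are assumptions chosen for the example, and the pairwise comparison is quadratic, whereas pipelines at the scale of a 17GB corpus typically add locality-sensitive hashing to avoid comparing every pair.

```python
import hashlib


def normalize(text):
    """Collapse runs of whitespace so trivially different copies match exactly."""
    return " ".join(text.split())


def shingles(text, k=5):
    """Set of overlapping character k-grams of the normalized text."""
    text = normalize(text)
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_perm=64, k=5):
    """One minimum hash value per seeded hash function (a MinHash signature)."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep the first copy of each document; drop exact and near duplicates."""
    seen_exact, kept, kept_sigs = set(), [], []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue  # exact duplicate after whitespace normalization
        seen_exact.add(digest)
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # near duplicate of a document that was already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

The exact-match pass is cheap and removes verbatim copies first, so the more expensive MinHash comparison only runs on documents that survive it.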

Circularity Check

0 steps flagged

No circularity; empirical training and benchmark comparison is self-contained

full rationale

The paper trains RoBERTa-base models from scratch on a fixed 17GB Urdu corpus using Byte-BPE vocabularies and reports direct performance numbers on four downstream tasks plus intrinsic metrics. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the central claim of competitiveness is an empirical observation against external multilingual baselines rather than a reduction to self-defined inputs or self-citations. No load-bearing self-citation chains, uniqueness theorems, or ansatzes appear in the provided text. The result therefore stands or falls on the representativeness of the corpus and breadth of the benchmarks, which is a validity question rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are present; the work is an empirical model-release paper whose central claim rests on the representativeness of the 17GB corpus and the chosen evaluation benchmarks.

pith-pipeline@v0.9.1-grok · 5709 in / 1072 out tokens · 18269 ms · 2026-06-29T18:04:12.214191+00:00 · methodology

