IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
Pith reviewed 2026-06-26 17:27 UTC · model grok-4.3
The pith
Vector-based semantic deduplication on a 45 GB Persian corpus produces a RoBERTa model that leads on extractive QA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IHUBERT is trained from scratch with the RoBERTa-base encoder on a 45 GB curated subset of the Sepahr-Danesh collection after normalization, duplicate removal, anonymization, and vector-database-based semantic deduplication for domain balancing. A 139k-vocabulary BPE tokenizer is trained on the full corpus. The model records F1 scores of 88.3542 on PQuAD and 49.0987 on ParsiNLU-RC, the highest reported, and Macro-F1 of 0.8350 on FarsTail; it stays competitive on NER and topic classification while trailing on relation extraction.
What carries the argument
Vector-database-based semantic deduplication that enforces distribution balance across domains and registers during corpus curation.
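The abstract does not specify the embedding model, the similarity threshold, or the vector database, so the mechanism can only be illustrated under assumptions. The sketch below is a minimal, hypothetical version: a greedy pass keeps a document only if its cosine similarity to every already-kept document stays below a threshold, and an optional per-domain cap stands in for the balancing control. The function name `semantic_dedup`, the threshold of 0.95, the cap, and the toy three-dimensional embeddings are all illustrative choices, and the linear scan replaces the nearest-neighbour query a real vector database would serve.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_dedup(docs, threshold=0.95, max_per_domain=None):
    """Greedy semantic deduplication with an optional per-domain cap.

    docs: list of (doc_id, domain, embedding) tuples.
    Returns the kept doc_ids in input order.
    All parameters are hypothetical; the paper does not state its settings.
    """
    kept = []                      # (doc_id, domain, embedding) of retained docs
    per_domain = defaultdict(int)  # how many docs were kept per domain
    for doc_id, domain, emb in docs:
        # Balancing step (assumed): stop admitting docs from a full domain.
        if max_per_domain is not None and per_domain[domain] >= max_per_domain:
            continue
        # Dedup step: a vector database would answer this with a
        # nearest-neighbour query instead of a linear scan.
        if any(cosine(emb, kept_emb) >= threshold for _, _, kept_emb in kept):
            continue
        kept.append((doc_id, domain, emb))
        per_domain[domain] += 1
    return [doc_id for doc_id, _, _ in kept]


docs = [
    ("news-1", "news", [1.0, 0.0, 0.0]),
    ("news-2", "news", [0.99, 0.05, 0.0]),  # near-duplicate of news-1
    ("news-3", "news", [0.0, 1.0, 0.0]),
    ("blog-1", "blog", [0.0, 0.0, 1.0]),
    ("news-4", "news", [0.5, 0.5, 0.7]),    # distinct, but news is capped
]
kept_ids = semantic_dedup(docs, threshold=0.95, max_per_domain=2)
print(kept_ids)  # ['news-1', 'news-3', 'blog-1']
```

In this toy run `news-2` is dropped as a semantic near-duplicate and `news-4` is dropped only because of the domain cap, which shows why the two effects are hard to separate without a dedicated ablation.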
If this is right
- The model sets new state-of-the-art results on two extractive QA benchmarks and one NLI benchmark.
- A controlled ablation confirms that the chosen BPE tokenizer produces modestly lower subword fragmentation than WordPiece at the same vocabulary size.
- Performance remains competitive on NER and topic classification while leaving the largest gap on relation extraction.
- The overall pipeline demonstrates that large-scale monolingual pretraining for Persian can be improved by focusing on corpus quality and balance rather than scale alone.
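The text here does not define how subword fragmentation was measured in the tokenizer ablation. One common choice, assumed for this sketch, is fertility: the average number of subword pieces per whitespace-delimited word. The greedy longest-match segmenter below is a toy stand-in, not BPE or WordPiece, and the two vocabularies and the transliterated sample words are invented for illustration.

```python
def fertility(words, tokenize):
    """Average number of subword pieces per word (lower = less fragmentation)."""
    if not words:
        raise ValueError("words must be non-empty")
    return sum(len(tokenize(w)) for w in words) / len(words)


def make_greedy_tokenizer(vocab):
    """Build a longest-match-first segmenter over a fixed vocabulary.

    This is a toy stand-in for a trained tokenizer, used only so the
    metric can be computed without external libraries.
    """
    def tokenize(word):
        pieces, i = [], 0
        while i < len(word):
            # Try the longest candidate first, shrinking until a match.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary entry matched: emit a single character.
                pieces.append(word[i])
                i += 1
        return pieces
    return tokenize


# Hypothetical vocabularies and transliterated sample words.
tok_a = make_greedy_tokenizer({"ketab", "ha", "mi", "ravam"})
tok_b = make_greedy_tokenizer({"ket", "ab", "ha", "mi", "rav", "am"})
sample = ["ketabha", "miravam", "ketab"]

fert_a = fertility(sample, tok_a)  # 5 pieces over 3 words
fert_b = fertility(sample, tok_b)  # 8 pieces over 3 words
print(round(fert_a, 3), round(fert_b, 3))
```

With real tokenizers the same `fertility` function would be applied to a held-out sample of the corpus, keeping the vocabulary size matched as the ablation describes.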
Where Pith is reading between the lines
- The same deduplication-plus-balancing pipeline could be tested on other low-resource languages where raw web data is noisy and domain-skewed.
- If domain balance reduces certain failure modes on comprehension tasks, the method might also lower unintended biases in generated text for those languages.
- Releasing the cleaned corpus would allow direct measurement of how much the vector step improves downstream metrics compared with simpler deduplication.
- Extending the approach to larger token counts while keeping the same balancing controls would test whether quality gains compound with scale.
Load-bearing premise
The reported gains on downstream tasks arise directly from the semantic deduplication and domain balancing steps rather than tokenizer choice, random variation, or other unmeasured differences in training.
What would settle it
Retrain an identical RoBERTa-base model on the same 45 GB corpus after removing only the vector-database semantic deduplication step and check whether the F1 scores on PQuAD and ParsiNLU-RC fall to the level of prior Persian models.
Original abstract
Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.
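For readers unfamiliar with the EM/F1 metrics named in the abstract, the sketch below shows the usual SQuAD-style definitions: exact match compares the whole answer string, and F1 is the harmonic mean of token-level precision and recall. It assumes plain whitespace tokenization; the official PQuAD and ParsiNLU-RC scripts may apply additional normalization that is omitted here, and the example strings are invented.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the whitespace-tokenized answers are identical, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer span."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("in Tehran Iran", "Tehran Iran")
f1 = token_f1("in Tehran Iran", "Tehran Iran")
print(em, round(f1, 4))  # 0.0 0.8
```

Note that the abstract reports the QA scores on a 0-100 scale (for example 88.3542) and the Macro-F1 scores on a 0-1 scale (for example 0.8350), so the value above would correspond to 80.0 in the QA convention.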
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents IHUBERT, a 125M-parameter RoBERTa-base Persian PLM pretrained from scratch on a 45 GB curated subset (~7-8B tokens) of the Sepahr-Danesh collection. Curation uses a multi-stage pipeline (normalization, exact/near-duplicate removal, anonymization, and vector-database semantic deduplication for domain/register balancing), plus a custom 139k-vocabulary BPE tokenizer. The model is evaluated on seven Persian NLU benchmarks (NER, sentiment, topic classification, NLI, extractive QA, relation extraction) and reports top results on PQuAD (F1 88.3542), ParsiNLU-RC (F1 49.0987), and FarsTail (Macro-F1 0.8350), with a controlled BPE vs. WordPiece tokenizer ablation.
Significance. If the performance gains on extractive QA and NLI tasks can be attributed to the semantic deduplication and domain-balancing steps, the work would provide concrete evidence that vector-database curation improves monolingual pretraining for Persian, a lower-resource language. The broad task coverage and tokenizer ablation supply useful reference points for future Persian PLM development.
major comments (2)
- [Abstract] Abstract: The central claim attributes the strongest gains (PQuAD F1 88.3542, ParsiNLU-RC F1 49.0987) to the multi-stage pipeline that includes vector-database-based semantic deduplication for distribution balancing. However, the only controlled ablation reported is tokenizer choice (BPE vs. WordPiece); no ablation trains otherwise identical models on the corpus before versus after the semantic deduplication step, so the performance delta cannot be isolated from corpus size, training schedule, or the 139k vocabulary.
- [Abstract] Abstract (evaluation paragraph): Reported metrics lack error bars, number of random seeds, or statistical significance tests against baselines. Without these, it is not possible to determine whether the reported improvements over prior Persian models are robust or within run-to-run variance.
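One inexpensive way to respond to the second comment without extra pretraining runs is a paired bootstrap over per-example scores on a fixed test set. The sketch below is a hypothetical illustration: the function name, the resample count, and the per-example scores are invented, and the procedure captures only test-set sampling variance, not the seed-to-seed variance that the referee also raises.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Percentile bootstrap for the mean per-example difference A - B.

    Returns (observed_mean_diff, ci_low, ci_high) for a 95% interval.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping A/B pairs together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Invented per-example F1 scores for two systems on the same eight items.
scores_a = [1.0, 0.8, 0.6, 1.0, 0.9, 0.7, 1.0, 0.5]
scores_b = [0.9, 0.8, 0.5, 1.0, 0.7, 0.7, 0.9, 0.5]
observed, low, high = paired_bootstrap(scores_a, scores_b)
print(round(observed, 4), round(low, 4), round(high, 4))
```

If the interval for the difference excludes zero, the improvement is unlikely to be an artifact of which test examples were drawn; a claim about robustness to training randomness would still need multiple seeds.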
minor comments (2)
- [Abstract] Abstract: The token count is given only as 'about 7-8B tokens'; an exact figure after all filtering steps would allow clearer comparison with other Persian corpora.
- [Abstract] Abstract: The relation-extraction result (0.6684 Macro-F1 on PERLEX) is described as 'the main remaining gap' but no analysis or error breakdown is supplied to explain why this task lags while QA improves.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, acknowledging where the manuscript requires clarification or qualification.
- Referee: [Abstract] Abstract: The central claim attributes the strongest gains (PQuAD F1 88.3542, ParsiNLU-RC F1 49.0987) to the multi-stage pipeline that includes vector-database-based semantic deduplication for distribution balancing. However, the only controlled ablation reported is tokenizer choice (BPE vs. WordPiece); no ablation trains otherwise identical models on the corpus before versus after the semantic deduplication step, so the performance delta cannot be isolated from corpus size, training schedule, or the 139k vocabulary.
  Authors: We agree that a direct before/after ablation on the semantic deduplication step would be required to isolate its contribution from other pipeline elements. Our available compute permitted only the reported tokenizer ablation. In the revised manuscript we have added explicit discussion in the Experiments and Limitations sections clarifying that performance gains reflect the full curation pipeline and cannot be attributed solely to deduplication. We frame the results accordingly rather than claiming isolated credit for that component.
  Revision: partial
- Referee: [Abstract] Abstract (evaluation paragraph): Reported metrics lack error bars, number of random seeds, or statistical significance tests against baselines. Without these, it is not possible to determine whether the reported improvements over prior Persian models are robust or within run-to-run variance.
  Authors: We acknowledge this limitation. All models were trained with a single random seed owing to the high cost of 125M-parameter pretraining. The revised manuscript now states the single-seed nature of the results and qualifies all comparisons accordingly. Multiple independent runs for error bars and significance tests were not feasible within our resource constraints.
  Revision: yes
- Additional pretraining runs (for deduplication ablation or multiple seeds) cannot be performed due to computational cost.
Circularity Check
Empirical pretraining study with no derivations or self-referential predictions
The paper describes an empirical pipeline for curating a Persian corpus (normalization, duplicate removal, semantic deduplication via vector database) and training a RoBERTa-base model, followed by benchmark evaluation on tasks like QA and NLI. No equations, fitted parameters, or predictions appear in the provided text. The tokenizer ablation (BPE vs. WordPiece) is a direct controlled comparison on the same corpus, not a renamed fit. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. All claims rest on externally reported benchmark scores (e.g., PQuAD F1), which are falsifiable outside the paper's own definitions. This is a standard self-contained empirical study.
Axiom & Free-Parameter Ledger
free parameters (1)
- BPE vocabulary size = 139000
axioms (1)
- domain assumption: The Sepahr-Danesh collection after the described multi-stage pipeline yields higher-quality pretraining data than raw or less-curated alternatives.