IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Arash Ghafouri; Hossein Saberi; Mahdi Firouzmandi; Mohammad Reza Hasani Ahangar

arxiv: 2606.20089 · v1 · pith:W2JN5T2Fnew · submitted 2026-06-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-26 17:27 UTC · model grok-4.3

keywords: Persian language model, semantic deduplication, domain balancing, pretraining corpus, extractive QA, BPE tokenizer, RoBERTa, NLU benchmarks

The pith

Vector-based semantic deduplication on a 45 GB Persian corpus produces a RoBERTa model that leads on extractive QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains IHUBERT, a 125M-parameter Persian language model, from scratch with the RoBERTa-base architecture on a curated 45 GB subset of the Sepahr-Danesh collection. A multi-stage pipeline normalizes the text, removes exact and near duplicates, anonymizes the data, and applies vector-database semantic deduplication to balance domains and registers; a custom 139k-vocabulary BPE tokenizer is then trained on the resulting corpus. The model reaches the top reported scores on extractive question answering benchmarks and competitive results on several other Persian NLU tasks. If the curation approach is what drives the gains, this would indicate that reducing semantic redundancy and enforcing domain balance can improve model quality even when the total token count remains modest.
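As a rough illustration of the exact and near-duplicate stages, the sketch below drops exact copies by hashing normalized text and then compares character-shingle sets by Jaccard similarity. The summary does not say which near-duplicate method or threshold the authors used, so the shingle size `k=5`, the `0.8` threshold, and all function names here are assumptions.

```python
import hashlib

def normalize(text):
    # Collapse whitespace and lowercase; a stand-in for real Persian normalization.
    return " ".join(text.split()).lower()

def shingles(text, k=5):
    # Set of overlapping character k-grams (at least one shingle for short texts).
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    # Jaccard similarity of two sets; two empty sets count as identical.
    return len(a & b) / len(a | b) if (a or b) else 1.0

def dedup(docs, threshold=0.8, k=5):
    seen_hashes = set()
    kept = []  # list of (normalized text, shingle set)
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        sh = shingles(norm, k)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue  # near duplicate of a document already kept
        kept.append((norm, sh))
    return [text for text, _ in kept]
```

The pairwise comparison is quadratic in the number of kept documents, which is only workable for toy inputs; at corpus scale an approximate scheme such as MinHash with locality-sensitive hashing would normally replace it.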

Core claim

IHUBERT is trained from scratch with the RoBERTa-base encoder on a 45 GB curated subset of the Sepahr-Danesh collection after normalization, duplicate removal, anonymization, and vector-database-based semantic deduplication for domain balancing. A 139k-vocabulary BPE tokenizer is trained on the full corpus. The model records F1 scores of 88.3542 on PQuAD and 49.0987 on ParsiNLU-RC, the highest reported, and Macro-F1 of 0.8350 on FarsTail; it stays competitive on NER and topic classification while trailing on relation extraction.
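For readers less familiar with the EM/F1 metrics behind the extractive QA numbers, here is a minimal sketch of SQuAD-style exact match and token-overlap F1 for a single prediction. Whitespace tokenization is an assumption; the official PQuAD and ParsiNLU-RC scoring scripts may normalize text differently, and the reported scores appear to be on a 0 to 100 scale rather than the 0 to 1 scale used here.

```python
from collections import Counter

def exact_match(pred, gold):
    # 1.0 when the whitespace-tokenized answers are identical, else 0.0.
    return float(pred.split() == gold.split())

def token_f1(pred, gold):
    # Harmonic mean of token-level precision and recall.
    p, g = pred.split(), gold.split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the gold span plus an extra token gets partial F1 credit but zero exact match, which is why the two numbers are usually reported together.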

What carries the argument

Vector-database-based semantic deduplication that enforces distribution balance across domains and registers during corpus curation.
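One way such a step could work is sketched below: each document carries an embedding and a domain label, a document is dropped when it is too similar to one already kept, and an optional per-domain cap limits how much any one domain contributes. This is not the authors' implementation. The summary names no embedding model, vector database, threshold, or quota, so `sim_threshold`, `max_per_domain`, and the linear scan standing in for a nearest-neighbour index are all assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_dedup(items, sim_threshold=0.95, max_per_domain=None):
    """items: iterable of (doc_id, domain, embedding). Returns kept doc_ids in order."""
    kept = []        # (doc_id, domain, embedding) for documents retained so far
    per_domain = {}  # domain -> number of retained documents
    for doc_id, domain, emb in items:
        if max_per_domain is not None and per_domain.get(domain, 0) >= max_per_domain:
            continue  # hypothetical balancing rule: domain quota already filled
        if any(cosine(emb, other) >= sim_threshold for _, _, other in kept):
            continue  # semantically redundant with a retained document
        kept.append((doc_id, domain, emb))
        per_domain[domain] = per_domain.get(domain, 0) + 1
    return [doc_id for doc_id, _, _ in kept]
```

The result depends on the order in which documents are visited, so a real pipeline would need to decide which member of a redundant cluster to keep, for example the longest or the highest-quality one.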

If this is right

The model sets new state-of-the-art results on two extractive QA benchmarks and one NLI benchmark.
A controlled ablation confirms that the chosen BPE tokenizer produces modestly lower subword fragmentation than WordPiece at the same vocabulary size.
Performance remains competitive on NER and topic classification while leaving the largest gap on relation extraction.
The overall pipeline demonstrates that large-scale monolingual pretraining for Persian can be improved by focusing on corpus quality and balance rather than scale alone.
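The tokenizer ablation turns on subword fragmentation. The summary does not define the exact metric, so the sketch below uses a common proxy, fertility (average number of subword pieces per word), and a toy greedy longest-match tokenizer. The vocabularies and the transliterated example word are hypothetical.

```python
def greedy_tokenize(word, vocab):
    # Greedy longest-match segmentation; single characters are the fallback.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def fertility(words, tokenize):
    # Average number of subword pieces per word; lower means less fragmentation.
    if not words:
        return 0.0
    return sum(len(tokenize(w)) for w in words) / len(words)
```

Comparing two tokenizers this way is only meaningful when both are scored on the same held-out text and, as in the paper's ablation, at a matched vocabulary size.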

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same deduplication-plus-balancing pipeline could be tested on other low-resource languages where raw web data is noisy and domain-skewed.
If domain balance reduces certain failure modes on comprehension tasks, the method might also lower unintended biases in generated text for those languages.
Releasing the cleaned corpus would allow direct measurement of how much the vector step improves downstream metrics compared with simpler deduplication.
Extending the approach to larger token counts while keeping the same balancing controls would test whether quality gains compound with scale.

Load-bearing premise

The reported gains on downstream tasks arise directly from the semantic deduplication and domain balancing steps rather than tokenizer choice, random variation, or other unmeasured differences in training.

What would settle it

Retrain an identical RoBERTa-base model on the same 45 GB corpus after removing only the vector-database semantic deduplication step and check whether the F1 scores on PQuAD and ParsiNLU-RC fall to the level of prior Persian models.

Original abstract

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IHUBERT gives a new Persian RoBERTa with strong QA numbers after semantic deduplication, but lacks the ablation that would tie those gains to the dedup step.

This paper introduces IHUBERT, a 125M-parameter RoBERTa-base model for Persian trained from scratch on a 45 GB curated subset of Sepahr-Danesh (roughly 7-8B tokens). The main addition is a multi-stage pipeline that adds vector-database semantic deduplication on top of standard normalization, exact/near-duplicate removal, and anonymization, plus a 139k-vocabulary BPE tokenizer trained on the full corpus. It reports the best results on PQuAD (F1 88.35) and ParsiNLU-RC (F1 49.1), tops FarsTail (Macro-F1 0.835), and stays competitive on NER and topic classification while lagging on relation extraction.

The work is useful because it expands evaluation beyond the usual classification and NER tasks for Persian and includes a controlled tokenizer ablation that favors BPE over WordPiece on subword fragmentation. That part is concrete and reproducible enough to help others.

The soft spot is exactly the one flagged in the stress-test note: there is no ablation that trains an otherwise identical model on the corpus before versus after the semantic deduplication and domain-balancing step. The tokenizer control exists, but without isolating the vector-based dedup the paper cannot show that this step, rather than corpus size, training schedule, or seed effects, drives the QA gains. The abstract presents the full pipeline as the driver, yet the evidence for that specific component remains indirect.

The paper is for researchers who need Persian resources or who work on data curation pipelines for other lower-resource languages. The numbers are reported with task-standard metrics and the methods are described at a level that lets others replicate the pipeline.

It deserves peer review. The resource is real, the evaluation is broader than typical, and the main gap is fixable with one additional controlled run or clearer discussion of what can and cannot be attributed to each stage.

Referee Report

2 major / 2 minor

Summary. The paper presents IHUBERT, a 125M-parameter RoBERTa-base Persian PLM pretrained from scratch on a 45 GB curated subset (~7-8B tokens) of the Sepahr-Danesh collection. Curation uses a multi-stage pipeline (normalization, exact/near-duplicate removal, anonymization, and vector-database semantic deduplication for domain/register balancing), plus a custom 139k-vocabulary BPE tokenizer. The model is evaluated on seven Persian NLU benchmarks (NER, sentiment, topic classification, NLI, extractive QA, relation extraction) and reports top results on PQuAD (F1 88.3542), ParsiNLU-RC (F1 49.0987), and FarsTail (Macro-F1 0.8350), with a controlled BPE vs. WordPiece tokenizer ablation.

Significance. If the performance gains on extractive QA and NLI tasks can be attributed to the semantic deduplication and domain-balancing steps, the work would provide concrete evidence that vector-database curation improves monolingual pretraining for Persian, a lower-resource language. The broad task coverage and tokenizer ablation supply useful reference points for future Persian PLM development.

major comments (2)

[Abstract] Abstract: The central claim attributes the strongest gains (PQuAD F1 88.3542, ParsiNLU-RC F1 49.0987) to the multi-stage pipeline that includes vector-database-based semantic deduplication for distribution balancing. However, the only controlled ablation reported is tokenizer choice (BPE vs. WordPiece); no ablation trains otherwise identical models on the corpus before versus after the semantic deduplication step, so the performance delta cannot be isolated from corpus size, training schedule, or the 139k vocabulary.
[Abstract] Abstract (evaluation paragraph): Reported metrics lack error bars, number of random seeds, or statistical significance tests against baselines. Without these, it is not possible to determine whether the reported improvements over prior Persian models are robust or within run-to-run variance.
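The second major comment asks for variance estimates and significance tests. A minimal sketch of two such checks follows: mean and sample standard deviation over seeds, and a paired bootstrap over per-example scores from two systems. The function names, the resample count, and the one-sided formulation are assumptions chosen for illustration; nothing here comes from the paper.

```python
import random
import statistics

def mean_std(scores):
    # Mean and sample standard deviation across seeds (0.0 std for a single run).
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), std

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system A does not beat system B.

    scores_a and scores_b are per-example scores on the same test items.
    A small value suggests A's advantage is unlikely to be resampling noise.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples
```

The bootstrap only captures test-set sampling noise for fixed checkpoints. It does not replace multiple pretraining seeds, which is the more expensive gap the referee and the simulated rebuttal both point to.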

minor comments (2)

[Abstract] Abstract: The token count is given only as 'about 7-8B tokens'; an exact figure after all filtering steps would allow clearer comparison with other Persian corpora.
[Abstract] Abstract: The relation-extraction result (0.6684 Macro-F1 on PERLEX) is described as 'the main remaining gap' but no analysis or error breakdown is supplied to explain why this task lags while QA improves.
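The error breakdown requested for PERLEX could start from per-class F1, since Macro-F1 is the unweighted mean of those values and a few weak relation types can pull it down sharply. The sketch below is a generic implementation with made-up labels, not the paper's evaluation code.

```python
def per_class_f1(gold, pred):
    # F1 for every label that appears in either the gold or predicted labels.
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(gold, pred):
    # Unweighted mean of per-class F1, so rare classes count as much as common ones.
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Sorting the per-class scores and inspecting the lowest few alongside their support counts is usually enough to tell whether a gap comes from rare classes or from systematic confusion between frequent ones.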

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, acknowledging where the manuscript requires clarification or qualification.

Referee: [Abstract] Abstract: The central claim attributes the strongest gains (PQuAD F1 88.3542, ParsiNLU-RC F1 49.0987) to the multi-stage pipeline that includes vector-database-based semantic deduplication for distribution balancing. However, the only controlled ablation reported is tokenizer choice (BPE vs. WordPiece); no ablation trains otherwise identical models on the corpus before versus after the semantic deduplication step, so the performance delta cannot be isolated from corpus size, training schedule, or the 139k vocabulary.

Authors: We agree that a direct before/after ablation on the semantic deduplication step would be required to isolate its contribution from other pipeline elements. Our available compute permitted only the reported tokenizer ablation. In the revised manuscript we have added explicit discussion in the Experiments and Limitations sections clarifying that performance gains reflect the full curation pipeline and cannot be attributed solely to deduplication. We frame the results accordingly rather than claiming isolated credit for that component.

revision: partial

Referee: [Abstract] Abstract (evaluation paragraph): Reported metrics lack error bars, number of random seeds, or statistical significance tests against baselines. Without these, it is not possible to determine whether the reported improvements over prior Persian models are robust or within run-to-run variance.

Authors: We acknowledge this limitation. All models were trained with a single random seed owing to the high cost of 125M-parameter pretraining. The revised manuscript now states the single-seed nature of the results and qualifies all comparisons accordingly. Multiple independent runs for error bars and significance tests were not feasible within our resource constraints.

revision: yes

standing simulated objections not resolved

Additional pretraining runs (for deduplication ablation or multiple seeds) cannot be performed due to computational cost.

Circularity Check

0 steps flagged

Empirical pretraining study with no derivations or self-referential predictions

The paper describes an empirical pipeline for curating a Persian corpus (normalization, duplicate removal, semantic deduplication via vector database) and training a RoBERTa-base model, followed by benchmark evaluation on tasks like QA and NLI. No equations, fitted parameters, or predictions appear in the provided text. The tokenizer ablation (BPE vs. WordPiece) is a direct controlled comparison on the same corpus, not a renamed fit. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. All claims rest on externally reported benchmark scores (e.g., PQuAD F1), which are falsifiable outside the paper's own definitions. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the quality of the Sepahr-Danesh source after curation and on the assumption that semantic deduplication improves rather than harms downstream performance; tokenizer vocabulary size is an explicit design choice.

free parameters (1)

BPE vocabulary size = 139000
Chosen at 139k to capture Persian morphology and orthographic variation.
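To make the role of the vocabulary-size parameter concrete, the toy sketch below learns BPE merges from a word list. The final vocabulary is the base symbol set plus one entry per merge, so a target such as 139k effectively fixes the number of merges. This is the textbook character-level algorithm with an arbitrary deterministic tie-break; the tokenizer actually trained for IHUBERT may differ, for example by operating on bytes.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent adjacent pair; ties broken by the larger pair for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

For example, `learn_bpe(["low", "low", "lower", "lowest"], 2)` first merges `o` with `w` and then `l` with `ow`, so the frequent stem becomes a single symbol after two merges.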

axioms (1)

domain assumption: The Sepahr-Danesh collection after the described multi-stage pipeline yields higher-quality pretraining data than raw or less-curated alternatives.
Invoked as the basis for the 45 GB subset used in training.

