pith. machine review for the scientific record.

arxiv: 2604.14907 · v1 · submitted 2026-04-16 · 💻 cs.CL · cs.LG

Recognition: unknown

Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 10:50 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords hate speech detection · multilingual embeddings · Lithuanian language · supervised classification · anomaly detection · CatBoost · sentence transformers · PCA

The pith

Supervised two-class models using multilingual embeddings substantially outperform one-class anomaly detection for hate speech across three languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether modern multilingual sentence embedding models can effectively support hate speech detection in Lithuanian, Russian, and English. It introduces LtHate, a new Lithuanian dataset built from news portals and social media, and compares six embedding models in a standard pipeline. The key result is that two-class supervised classifiers based on CatBoost consistently beat one-class anomaly detectors like HBOS by a large margin. The best setups reach over 80 percent accuracy in Lithuanian, 92 percent in Russian, and 77 percent in English. Dimensionality reduction via PCA works well in the supervised case, with little performance loss.

Core claim

Across all three datasets, two-class supervised models consistently and substantially outperform one-class anomaly detection. The best configurations achieve up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian using jina embeddings, 92.19% accuracy and AUC ROC of 0.978 in Russian using e5 embeddings, and 77.21% accuracy and AUC ROC of 0.859 in English using e5 with PCA. PCA compression preserves almost all discriminative power in the supervised setting.

What carries the argument

The comparison of six multilingual sentence encoders (potion, gemma, bge, snow, jina, e5) as input features to either a one-class HBOS anomaly detector or a two-class CatBoost classifier, optionally with PCA to 64 dimensions.
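The one-class half of that comparison is simple enough to sketch. As a rough illustration only (not the authors' implementation; the function name, bin count, and toy data are our own), a histogram-based outlier score fits one histogram per embedding dimension on the reference class and scores new points by summed negative log density:

```python
import numpy as np

def hbos_scores(X_train, X_test, n_bins=10):
    """Histogram-Based Outlier Score: one histogram per feature,
    score = sum of negative log densities (higher = more anomalous)."""
    eps = 1e-9
    scores = np.zeros(len(X_test))
    for j in range(X_train.shape[1]):
        hist, edges = np.histogram(X_train[:, j], bins=n_bins, density=True)
        # Map each test value to its training-histogram bin (clamped to range).
        idx = np.clip(np.searchsorted(edges, X_test[:, j], side="right") - 1,
                      0, n_bins - 1)
        scores += -np.log(hist[idx] + eps)
    return scores

rng = np.random.default_rng(0)
inliers = rng.normal(size=(500, 8))          # stand-in for "normal" embeddings
queries = np.vstack([np.zeros((1, 8)),       # a central, inlier-like point
                     np.full((1, 8), 8.0)])  # a far-out, anomalous point
s = hbos_scores(inliers, queries)            # the second score should be larger
```

Detectors of this kind see only one class at training time, which is precisely the handicap the paper's labeled two-class setup avoids.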

If this is right

  • Two-class CatBoost classifiers on these embeddings deliver the highest accuracies and AUC scores on all tested datasets.
  • PCA reduction to 64 features maintains nearly full performance for supervised classifiers but degrades anomaly detection results.
  • The new LtHate corpus provides a benchmark for hate speech detection in Lithuanian.
  • Gradient boosted decision trees paired with multilingual embeddings offer practical solutions for content moderation in multiple languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • For deployment in content moderation systems, labeled data enabling supervised training is likely more valuable than relying on anomaly detection alone.
  • The success of PCA in supervised settings implies that lower-dimensional embeddings could reduce storage and computation costs for large-scale applications.
  • Similar benchmarking could be applied to other low-resource languages to identify effective embedding models for their hate speech detection needs.
  • Fine-tuning the embedding models on hate speech data might further improve the reported accuracies beyond the frozen-encoder approach used here.

Load-bearing premise

The introduced LtHate corpus and the other datasets have accurate labels that represent typical real-world hate speech without significant bias.

What would settle it

Evaluation on a new test set of Lithuanian social media posts with fresh human annotations would settle it: if the top-performing supervised model falls below 70% accuracy, or the anomaly detector matches or exceeds it, the claimed superiority of supervised models would be falsified.

Figures

Figures reproduced from arXiv: 2604.14907 by Algirdas Sukys, Edgaras Dambrauskas, Evaldas Vaiciukynas, Linas Ablonskis, Paulius Danenas, Rimantas Butleris, Rita Butkiene, Voldemaras Zitkus.

Figure 1. Lithuanian language hate speech detection curves using original embeddings: ROC (left) and PRC (right).
Figure 2. Lithuanian hate speech detection curves using compressed embeddings: ROC (left) and PRC (right).
Figure 3. Russian hate speech detection curves using original embeddings: ROC (left) and PRC (right).
Figure 4. Russian language hate speech detection curves using compressed embeddings: ROC (left) and PRC (right).
Figure 5. English language hate speech detection curves using original embeddings: ROC (left) and PRC (right).
Figure 6. English language hate speech detection curves using compressed embeddings: ROC (left) and PRC (right).
Original abstract

Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one class HBOS anomaly detector and a two class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two class supervised models consistently and substantially outperform one class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case. These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the LtHate corpus and benchmarks six multilingual sentence embeddings (potion, gemma, bge, snow, jina, e5) for hate speech detection on LtHate, RuToxic, and EnSuperset. It compares one-class HBOS anomaly detection against two-class CatBoost classification, with and without 64-dimensional PCA compression, reporting that supervised models substantially outperform unsupervised ones, with peak results of 80.96% accuracy / 0.887 AUC (Lithuanian, jina), 92.19% / 0.978 (Russian, e5), and 77.21% / 0.859 (English, e5+PCA). PCA largely preserves supervised performance while degrading unsupervised results.

Significance. If the label quality and dataset representativeness hold, the work supplies a useful empirical benchmark for multilingual hate speech detection, particularly for low-resource Lithuanian, and demonstrates the practical value of combining modern embeddings with gradient-boosted trees. The unified Python pipeline and release of LtHate constitute clear strengths for reproducibility and future research.

major comments (2)
  1. [§3] §3 (LtHate corpus description): No annotation protocol, number of annotators, inter-annotator agreement (e.g., Fleiss' kappa), or external validation is reported for the newly introduced LtHate corpus. Because the headline performance numbers (80.96% accuracy, 0.887 AUC) and the supervised-vs-unsupervised gap rest directly on these labels, the absence of this information is load-bearing for the central empirical claims.
  2. [§4] §4 (Experimental setup): The manuscript does not specify train/test split ratios, the hyperparameter search procedure for CatBoost and HBOS, or the precise PCA implementation and variance retained. These omissions prevent verification of the reported margins (e.g., 0.978 AUC on Russian) and limit reproducibility of the finding that PCA preserves discriminative power in the supervised case.
minor comments (2)
  1. [Abstract] Abstract and §2: The six embedding models are referred to only by short names (potion, gemma, bge, snow, jina, e5) without citations to their source papers; adding these references would improve traceability.
  2. [§5] Table captions (presumed in §5): Ensure that all reported metrics include both accuracy and AUC-ROC for every configuration, and clarify whether the numbers are macro-averaged or weighted.
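The macro-vs-weighted distinction matters because hate speech datasets are usually imbalanced. A minimal illustration with toy labels and scikit-learn's f1_score (none of these numbers come from the paper):

```python
from sklearn.metrics import f1_score

# Toy imbalanced split: 8 non-hate (0) vs 2 hate (1) examples.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # the classifier misses one of the two hate examples

macro = f1_score(y_true, y_pred, average="macro")        # classes weighted equally
weighted = f1_score(y_true, y_pred, average="weighted")  # weighted by class support

# The weighted score is flattered by the easy majority class,
# while macro exposes the weak minority-class performance.
```

Here weighted exceeds macro, which is why a results table that does not say which averaging was used can quietly overstate minority-class performance.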

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects for improving the clarity and reproducibility of our work on multilingual hate speech detection. We address each major comment below and will incorporate the necessary revisions into the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (LtHate corpus description): No annotation protocol, number of annotators, inter-annotator agreement (e.g., Fleiss' kappa), or external validation is reported for the newly introduced LtHate corpus. Because the headline performance numbers (80.96% accuracy, 0.887 AUC) and the supervised-vs-unsupervised gap rest directly on these labels, the absence of this information is load-bearing for the central empirical claims.

    Authors: We agree that a detailed account of the annotation process is essential for validating the LtHate corpus and supporting the empirical claims. In the revised manuscript, we will expand §3 to include the full annotation protocol, the number of annotators, inter-annotator agreement statistics (including Fleiss' kappa), and any steps taken for external validation or quality control. These additions will directly address the load-bearing nature of the label quality for the reported performance metrics. revision: yes

  2. Referee: [§4] §4 (Experimental setup): The manuscript does not specify train/test split ratios, the hyperparameter search procedure for CatBoost and HBOS, or the precise PCA implementation and variance retained. These omissions prevent verification of the reported margins (e.g., 0.978 AUC on Russian) and limit reproducibility of the finding that PCA preserves discriminative power in the supervised case.

    Authors: We concur that these experimental details are required for reproducibility and verification of the results, including the AUC margins and the PCA findings. In the revised version of §4, we will explicitly state the train/test split ratios, describe the hyperparameter search procedures used for CatBoost and HBOS (including any grid or random search configurations and validation strategy), and provide the precise PCA implementation details along with the variance retained when reducing to 64 dimensions. This will allow readers to replicate the supervised vs. unsupervised comparisons and the dimensionality reduction effects. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct measurements

full rationale

The paper conducts an experimental comparison of six multilingual sentence embeddings on three hate-speech datasets (including a newly introduced LtHate corpus), training HBOS anomaly detectors and CatBoost classifiers with optional PCA. All reported accuracies and AUC values are direct empirical outcomes from model training and evaluation on held-out data; the manuscript contains no equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to their own inputs by construction. The work is therefore self-contained as a standard benchmarking study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard machine-learning assumptions about data representativeness and label quality plus the reliability of pre-trained embedding models; no new free parameters, axioms, or invented entities are introduced.

axioms (2)
  • domain assumption Datasets are representative samples of real-world hate speech and labels are accurate.
    Required for the reported accuracies to generalize beyond the specific corpora.
  • domain assumption Pre-trained multilingual embeddings capture discriminative features for hate speech.
    Implicit in using the embeddings as input features without further adaptation.

pith-pipeline@v0.9.0 · 5613 in / 1347 out tokens · 35636 ms · 2026-05-10T10:50:33.394761+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 47 canonical work pages · 8 internal anchors

  1. [1]

    Fortuna and S

    Paula Fortuna and Sérgio Nunes. A survey on automatic detection of hate speech in text.ACM Computing Surveys, 51(4):85:1–85:30, 2018. doi:10.1145/3232676

  2. [2]

    Resources and benchmark corpora for hate speech detection: a systematic review.Language Resources and Evaluation, 55:477–523, 2021

    Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. Resources and benchmark corpora for hate speech detection: a systematic review.Language Resources and Evaluation, 55:477–523, 2021. doi:10.1007/s10579-020-09502-8

  3. [3]

    Survey on the impact of online disinformation and hate speech, 2023

    UNESCO and Ipsos. Survey on the impact of online disinformation and hate speech, 2023. URL https: //www.unesco.org/sites/default/files/medias/fichiers/2023/11/unesco_ipsos_survey.pdf

  4. [4]

    Hate speech: A systematized review.SAGE Open, 13(1):1–18, 2023

    María Antonia Paz, Julio Montero-Díaz, and Alicia Moreno-Delgado. Hate speech: A systematized review.SAGE Open, 13(1):1–18, 2023. doi:10.1177/21582440231181311

  5. [5]

    Exposure to online hate in four nations: A cross-national consideration.Deviant Behavior, 38(3):254–266, 2017

    James Hawdon, Atte Oksanen, and Pekka Räsänen. Exposure to online hate in four nations: A cross-national consideration.Deviant Behavior, 38(3):254–266, 2017. doi:10.1080/01639625.2016.1196985

  6. [6]

    Fanning the flames of hate: Social media and hate crime.Journal of the European Economic Association, 19(4):2131–2167, 2021

    Karsten Müller and Carlo Schwarz. Fanning the flames of hate: Social media and hate crime.Journal of the European Economic Association, 19(4):2131–2167, 2021. doi:10.1093/jeea/jvaa045. 14 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT

  7. [7]

    Report of the independent international fact-finding mission on Myanmar,

    United Nations Human Rights Council. Report of the independent international fact-finding mission on Myanmar,

  8. [8]

    A/HRC/39/64

    URLhttps://www.ohchr.org/en/hr-bodies/hrc/myanmar-ffm/index. A/HRC/39/64

  9. [9]

    Regulation (EU) 2022/2065 of the European Parliament and of the Council on a single market for digital services and amending Directive 2000/31/EC (Digital Services Act), 2022

    European Parliament and Council of the European Union. Regulation (EU) 2022/2065 of the European Parliament and of the Council on a single market for digital services and amending Directive 2000/31/EC (Digital Services Act), 2022. URL https://eur-lex.europa.eu/eli/reg/2022/2065/oj. Official Journal of the European Union, L 277, 1–102

  10. [10]

    Directions in abusive language training data, a systematic review: Garbage in, garbage out.PLOS ONE, 15(12):e0243300, 2020

    Bertie Vidgen and Leon Derczynski. Directions in abusive language training data, a systematic review: Garbage in, garbage out.PLOS ONE, 15(12):e0243300, 2020. doi:10.1371/journal.pone.0243300

  11. [11]

    Schmidt and M

    Anna Schmidt and Michael Wiegand. A survey on hate speech detection using natural language processing. In Lun-Wei Ku and Cheng-Te Li, editors,Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, SocialNLP@EACL 2017, Valencia, Spain, April 3, 2017, pages 1–10. Association for Computational Linguistics, 2017. doi:...

  12. [12]

    Macy, and Ingmar Weber

    Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. Automated hate speech detection and the problem of offensive language. InProceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017, pages 512–515. AAAI Press, 2017. URL https://aaai.org/ocs/index.php/ICWSM/ICWSM17/pa...

  13. [13]

    Using convolutional neural networks to classify hate-speech

    Björn Gambäck and Utpal Kumar Sikdar. Using convolutional neural networks to classify hate-speech. In Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel R. Tetreault, editors,Proceedings of the First Workshop on Abusive Language Online, ALW@ACL 2017, Vancouver, BC, Canada, August 4, 2017, pages 85–90. Association for Computational Linguistics, 2017...

  14. [14]

    Deep learning for hate speech detection in tweets

    Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. Deep learning for hate speech detection in tweets. In Rick Barrett, Rick Cummings, Eugene Agichtein, and Evgeniy Gabrilovich, editors,Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017, pages 759–760. ACM, 2017. doi:10.1145/304...

  15. [15]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017. URLhttp://arxiv.org/abs/1706.03762

  16. [16]

    Hate speech detection and racial bias mitigation in social media based on bert model.PLOS ONE, 15(8):1–26, 08 2020

    Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi. Hate speech detection and racial bias mitigation in social media based on bert model.PLOS ONE, 15(8):1–26, 08 2020. doi:10.1371/journal.pone.0237861

  17. [17]

    Hatebert: Retraining bert for abusive language detection in english

    Tommaso Caselli, Valerio Basile, Jelena Mitrovi´c, and Michael Granitzer. Hatebert: Retraining bert for abusive language detection in english. InProceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 17–25. Association for Computational Linguistics, 2021. doi:10.18653/v1/2021.woah-1.3

  18. [18]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.CoRR, abs/1810.04805, 2018. doi:10.48550/arXiv.1810.04805

  19. [19]

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale.CoRR, abs/1911.02116, 2019. doi:10.48550/arXiv.1911.02116

  20. [20]

    SemEval-2020 task 12: Multilingual offensive language identifi- cation in social media (OffensEval 2020)

    Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Ça˘grı Çöltekin. SemEval-2020 task 12: Multilingual offensive language identifi- cation in social media (OffensEval 2020). In Aurelie Herbelot, Xiaodan Zhu, Alexis Palmer, Nathan Schneider, Jonathan May, and Ekaterina Shutov...

  21. [21]

    Multilingual offensive language identification with cross-lingual em- beddings

    Tharindu Ranasinghe and Marcos Zampieri. Multilingual offensive language identification with cross-lingual em- beddings. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5838–5844, Online, November 2020. Association for Computational Linguis...

  22. [22]

    Leveraging multilingual transformers for hate speech detection

    Sayar Ghosh Roy, Ujwal Narayan, Tathagata Raha, Zubair Abid, and Vasudeva Varma. Leveraging multilingual transformers for hate speech detection. In Parth Mehta, Thomas Mandl, Prasenjit Majumder, and Mandar Mitra, editors,Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 ofCEUR Work...

  23. [23]

    Hale, and Paul Röttger

    Manuel Tonneau, Diyi Liu, Samuel Fraiberger, Ralph Schroeder, Scott A. Hale, and Paul Röttger. From languages to geographies: Towards evaluating cultural bias in hate speech datasets. In Yi-Ling Chung, Zeerak Talat, Debora 15 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT Nozza, Flor Miriam Plaza-del Arco, Paul Röttger, Aida Mostafazadeh...

  24. [24]

    Cross-lingual transfer learning for hate speech detection

    Irina Bigoulaeva, Viktor Hangya, and Alexander Fraser. Cross-lingual transfer learning for hate speech detection. In Bharathi Raja Chakravarthi, John P. McCrae, Manel Zarrouk, Rajeev K. Bali, and Paul Buitelaar, editors, Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, LT-EDI@EACL 2021, Online, April 19, 2021...

  25. [25]

    Addressing the challenges of cross- lingual hate speech detection.CoRR, abs/2201.05922, 2022

    Irina Bigoulaeva, Viktor Hangya, Iryna Gurevych, and Alexander Fraser. Addressing the challenges of cross- lingual hate speech detection.CoRR, abs/2201.05922, 2022. doi:10.48550/arxiv.2201.05922

  26. [26]

    Exploring hate speech detection models for Lithuanian language

    Justina Mandravickait˙e, Egl˙e Rimkien˙e, Mindaugas Petkeviˇcius, Milita Songailait˙e, Eimantas Zaranka, and Tomas Krilaviˇcius. Exploring hate speech detection models for Lithuanian language. In Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del Arco, Zeerak Talat, and Francielle Vargas, editors,Proceedings of the The 9th Workshop...

  27. [27]

    Multilingual E5 Text Embeddings: A Technical Report

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual E5 text embeddings: A technical report.arXiv, 2024. doi:10.48550/arXiv.2402.05672

  28. [28]

    jina- embeddings-v3: Multilingual embeddings with task lora.arXiv preprint arXiv:2409.10173, 2024

    Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, and Han Xiao. jina-embeddings-v3: Multilingual embeddings with task lora.arXiv, 2024. doi:10.48550/arXiv.2409.10173

  29. [29]

    Arctic-embed 2.0: Multilingual retrieval without compromise.arXiv preprint arXiv:2412.04506, 2024

    Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos. Arctic-embed 2.0: Multilingual retrieval without compromise.arXiv, 2024. doi:10.48550/arXiv.2412.04506

  30. [30]

    M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv,

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv,

  31. [31]

    doi:10.48550/arXiv.2402.03216

  32. [32]

    Hateful symbols or hateful people? predictive features for hate speech detection on twitter

    Zeerak Waseem and Dirk Hovy. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. InProceedings of the Student Research Workshop, SRW@HLT-NAACL 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17,...

  33. [33]

    doi: 10.18653/v1/S19-2010

    Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). In Jonathan May, Ekaterina Shutova, Aurélie Herbelot, Xiaodan Zhu, Marianna Apidianaki, and Saif M. Mohammad, editors,Proceedings of the 13th International Worksh...

  34. [34]

    Cross-lingual hate speech detec- tion using domain-specific word embeddings.PLOS ONE, 19(7):e0306521, July 2024

    Ayme Arango Monnar, Jorge Perez Rojas, and Barbara Polete Labra. Cross-lingual hate speech detec- tion using domain-specific word embeddings.PLOS ONE, 19(7):e0306521, July 2024. ISSN 1932-6203. doi:10.1371/journal.pone.0306521

  35. [35]

    A multilingual evaluation for online hate speech detection.arXiv, 20(2), March 2020

    Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. A multilingual evaluation for online hate speech detection.arXiv, 20(2), March 2020. ISSN 1533-5399. doi:10.1145/3377323

  36. [36]

    Transformers at hsd-2lang 2024: Hate speech detection in arabic and turkish tweets using BERT based architectures

    Kriti Singhal and Jatin Bedi. Transformers at hsd-2lang 2024: Hate speech detection in arabic and turkish tweets using BERT based architectures. In Ali Hürriyetoglu, Hristo Tanev, Surendrabikram Thapa, and Gökçe Uludogan, editors,Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, CAS...

  37. [37]

    Generalizable multilingual hate speech detection on low resource indian languages using fair selection in federated learning

    Akshay Singh and Rahul Thakur. Generalizable multilingual hate speech detection on low resource indian languages using fair selection in federated learning. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technol...

  38. [38]

    Sayan Ghosh and Suman Kumar Senapati. Hate speech detection in low-resourced indian languages: An analysis of transformer-based monolingual and multilingual models with cross-lingual experiments.Natural Language Engineering, 2024. doi:10.1017/S1351324924000281. 16 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT

  39. [39]

    A dual contrastive learning framework for enhanced hate speech detection in low-resource languages

    Krishan Chavinda and Uthayasanker Thayasivam. A dual contrastive learning framework for enhanced hate speech detection in low-resource languages. In Kengatharaiyer Sarveswaran, Ashwini Vaidya, Bal Krishna Bal, Sana Shams, and Surendrabikram Thapa, editors,Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), p...

  40. [40]

    Distributed Representations of Words and Phrases and their Compositionality

    Tomas Mikolov, I. Sutskever, Kai Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality.Neural Information Processing Systems, 2013. doi:10.48550/arXiv.1310.4546

  41. [41]

    A survey of cross-lingual word embedding models.Journal of Artificial Intelligence Research, 2017

    Sebastian Ruder, Ivan Vulic, and Anders Søgaard. A survey of cross-lingual word embedding models.Journal of Artificial Intelligence Research, 2017. doi:10.1613/jair.1.11640

  42. [42]

    Kanayama, Trevor Cohn, Tengfei Ma, Steven Bird, and Long Duong

    H. Kanayama, Trevor Cohn, Tengfei Ma, Steven Bird, and Long Duong. Multilingual training of crosslingual word embeddings.Conference of the European Chapter of the Association for Computational Linguistics, 2017. doi:10.18653/V1/E17-1084

  43. [43]

    Unsupervised multilingual word embeddings

    Xilun Chen and Claire Cardie. Unsupervised multilingual word embeddings. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270. Association for Computational Linguistics, 2018. doi:10.18653/v1/d18-1024

  44. [44]

    In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic bert sentence embedding. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891. Association for Computational Linguistics, 2022. doi:10.18653/v1/2022.acl- long.62

  45. [45]

    Improving text embeddings with large language models

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. InProceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 11897–11916. Association for Computational Linguistics,

  46. [46]

    doi:10.18653/v1/2024.acl-long.642

  47. [47]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025. doi:10.48550/arXiv.2506.05176

  48. [48]

    Lithuanian hate speech corpus v.1, 2025

    Rita Butkien˙e, Dambrauskas Edgaras, Šukys Algirdas, and Žitkus V oldemaras. Lithuanian hate speech corpus v.1, 2025. URL http://hdl.handle.net/20.500.11821/69. CLARIN-LT digital library in the Republic of Lithuania

  49. [49]

    LITIS v.1, 2016

    Darius Amileviˇcius and Mažvydas Petkevi ˇcius. LITIS v.1, 2016. URL http://hdl.handle.net/20.500. 11821/11. CLARIN-LT digital library in the Republic of Lithuania

  50. [50]

    An Italian Twitter corpus of hate speech against immigrants

    Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. An Italian Twitter corpus of hate speech against immigrants. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperid...

  51. [51]

    Methods for detoxification of texts for the russian language.Multimodal Technologies and Interaction, 5(9), 2021

    Daryna Dementieva, Daniil Moskovskiy, Varvara Logacheva, David Dale, Olga Kozlova, Nikita Semenov, and Alexander Panchenko. Methods for detoxification of texts for the russian language.Multimodal Technologies and Interaction, 5(9), 2021. ISSN 2414-4088. doi:10.3390/mti5090054

  52. [52]

    Model2vec: Fast state-of-the-art static embeddings, 2024

    Stephan Tulkens and Thomas van Dongen. Model2vec: Fast state-of-the-art static embeddings, 2024. URL https://github.com/MinishLab/model2vec

  53. [53]

minishlab/potion-multilingual-128M at Hugging Face

    minishlab/potion-multilingual-128M at Hugging Face, 2025. URL https://huggingface.co/minishlab/potion-multilingual-128M

  54. [54]

BAAI/bge-m3 at Hugging Face

    BAAI/bge-m3 at Hugging Face, 2024. URL https://huggingface.co/BAAI/bge-m3

  55. [55]

Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv, 2023. doi:10.48550/arXiv.1910.10683

  56. [56]

allenai/C4 datasets at Hugging Face

    allenai/C4 datasets at Hugging Face, 2020. URL https://huggingface.co/datasets/allenai/c4

  57. [57]

    POTION: bag of tricks leads to better models, 2024

Stephan Tulkens and Thomas van Dongen. POTION: bag of tricks leads to better models, 2024. URL https://minishlab.github.io/tokenlearn_blogpost/

  58. [58]

    EmbeddingGemma: Powerful and Lightweight Text Representations

    Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, ...

  59. [59]

google/embeddinggemma-300m at Hugging Face

    google/embeddinggemma-300m at Hugging Face, 2025. URL https://huggingface.co/google/embeddinggemma-300m

  60. [60]

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shi...

  61. [61]

RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. doi:10.1016/j.neucom.2023.127063

  62. [62]

    Matryoshka representation learning

Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi. Matryoshka representation learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pa...

  63. [63]

Shitao/MLDR datasets at Hugging Face

    Shitao/MLDR datasets at Hugging Face, 2025. URL https://huggingface.co/datasets/Shitao/MLDR

  64. [64]

Shitao/bge-m3-data datasets at Hugging Face

    Shitao/bge-m3-data datasets at Hugging Face, 2024. URL https://huggingface.co/datasets/Shitao/bge-m3-data

  65. [65]

Snowflake/snowflake-arctic-embed-l-v2.0 at Hugging Face

    Snowflake/snowflake-arctic-embed-l-v2.0 at Hugging Face, 2025. URL https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0

  66. [66]

MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

    Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. MIRACL: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11:1114–1131, 09 2023. ISSN 2307-387X. doi:10.1162/tacl_a_00595

  67. [67]

jinaai/jina-embeddings-v3 at Hugging Face

    jinaai/jina-embeddings-v3 at Hugging Face, 2024. URL https://huggingface.co/jinaai/jina-embeddings-v3

  68. [68]

    LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv, 2021. doi:10.48550/arXiv.2106.09685

  69. [69]

intfloat/multilingual-e5-large-instruct at Hugging Face

    intfloat/multilingual-e5-large-instruct at Hugging Face, 2023. URL https://huggingface.co/intfloat/multilingual-e5-large-instruct

  70. [70]

FacebookAI/xlm-roberta-large at Hugging Face

    FacebookAI/xlm-roberta-large at Hugging Face, 2019. URL https://huggingface.co/FacebookAI/xlm-roberta-large

  71. [71]

    Improving text embeddings with large language models

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897–11916, 2024

  72. [72]

Analysis of a complex of statistical variables into principal components

    H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417–441, September 1933. ISSN 0022-0663. doi:10.1037/h0071325

  73. [73]

Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm

    Markus Goldstein and Andreas Dengel. Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm. KI-2012: poster and demo track, 1:59–63, 2012. URL https://www.dfki.de/fileadmin/user_upload/import/6431_HBOS-poster.pdf

  74. [74]

    CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://pap...

  75. [75]

Using AUC and accuracy in evaluating learning algorithms

    Jin Huang and C.X. Ling. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17:299–310, March 2005. doi:10.1109/tkde.2005.50

  76. [76]

Precision-recall curves

    Andreas Beger. Precision-recall curves. SSRN Electronic Journal, 2016. doi:10.2139/ssrn.2765419

  77. [77]

A coefficient of agreement for nominal scales

    Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960. doi:10.1177/001316446002000104

  78. [78]

Interrater reliability: the kappa statistic

    Mary L. McHugh. Interrater reliability: the kappa statistic. Biochemia Medica, pages 276–282, 2012. doi:10.11613/bm.2012.031