Recognition: unknown
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Pith reviewed 2026-05-10 10:50 UTC · model grok-4.3
The pith
Supervised two-class models using multilingual embeddings substantially outperform one-class anomaly detection for hate speech across three languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across all three datasets, two-class supervised models consistently and substantially outperform one-class anomaly detection. The best configurations achieve up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian using jina embeddings, 92.19% accuracy and AUC ROC of 0.978 in Russian using e5 embeddings, and 77.21% accuracy and AUC ROC of 0.859 in English using e5 with PCA. PCA compression preserves almost all discriminative power in the supervised setting.
What carries the argument
The comparison of six multilingual sentence encoders (potion, gemma, bge, snow, jina, e5) as input features to either a one-class HBOS anomaly detector or a two-class CatBoost classifier, optionally with PCA to 64 dimensions.
If this is right
- Two-class CatBoost classifiers on these embeddings deliver the highest accuracies and AUC scores on all tested datasets.
- PCA reduction to 64 features maintains nearly full performance for supervised classifiers but degrades anomaly detection results.
- The new LtHate corpus provides a benchmark for hate speech detection in Lithuanian.
- Gradient boosted decision trees paired with multilingual embeddings offer practical solutions for content moderation in multiple languages.
Where Pith is reading between the lines
- For deployment in content moderation systems, labeled data enabling supervised training is likely more valuable than relying on anomaly detection alone.
- The success of PCA in supervised settings implies that lower-dimensional embeddings could reduce storage and computation costs for large-scale applications.
- Similar benchmarking could be applied to other low-resource languages to identify effective embedding models for their hate speech detection needs.
- Fine-tuning the embedding models on hate speech data might further improve the reported accuracies beyond the frozen-encoder approach used here.
Load-bearing premise
The introduced LtHate corpus and the other datasets have accurate labels that represent typical real-world hate speech without significant bias.
What would settle it
A new test set of Lithuanian social media posts with fresh human annotations where the top-performing supervised model falls below 70% accuracy or the anomaly detector matches or exceeds it would falsify the superiority of supervised models.
Figures
read the original abstract
Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one class HBOS anomaly detector and a two class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two class supervised models consistently and substantially outperform one class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case. These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the LtHate corpus and benchmarks six multilingual sentence embeddings (potion, gemma, bge, snow, jina, e5) for hate speech detection on LtHate, RuToxic, and EnSuperset. It compares one-class HBOS anomaly detection against two-class CatBoost classification, with and without 64-dimensional PCA compression, reporting that supervised models substantially outperform unsupervised ones, with peak results of 80.96% accuracy / 0.887 AUC (Lithuanian, jina), 92.19% / 0.978 (Russian, e5), and 77.21% / 0.859 (English, e5+PCA). PCA largely preserves supervised performance while degrading unsupervised results.
Significance. If the label quality and dataset representativeness hold, the work supplies a useful empirical benchmark for multilingual hate speech detection, particularly for low-resource Lithuanian, and demonstrates the practical value of combining modern embeddings with gradient-boosted trees. The unified Python pipeline and release of LtHate constitute clear strengths for reproducibility and future research.
major comments (2)
- [§3] §3 (LtHate corpus description): No annotation protocol, number of annotators, inter-annotator agreement (e.g., Fleiss' kappa), or external validation is reported for the newly introduced LtHate corpus. Because the headline performance numbers (80.96% accuracy, 0.887 AUC) and the supervised-vs-unsupervised gap rest directly on these labels, the absence of this information is load-bearing for the central empirical claims.
- [§4] §4 (Experimental setup): The manuscript does not specify train/test split ratios, the hyperparameter search procedure for CatBoost and HBOS, or the precise PCA implementation and variance retained. These omissions prevent verification of the reported margins (e.g., 0.978 AUC on Russian) and limit reproducibility of the finding that PCA preserves discriminative power in the supervised case.
minor comments (2)
- [Abstract] Abstract and §2: The six embedding models are referred to only by short names (potion, gemma, bge, snow, jina, e5) without citations to their source papers; adding these references would improve traceability.
- [§5] Table captions (presumed in §5): Ensure that all reported metrics include both accuracy and AUC-ROC for every configuration, and clarify whether the numbers are macro-averaged or weighted.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects for improving the clarity and reproducibility of our work on multilingual hate speech detection. We address each major comment below and will incorporate the necessary revisions into the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (LtHate corpus description): No annotation protocol, number of annotators, inter-annotator agreement (e.g., Fleiss' kappa), or external validation is reported for the newly introduced LtHate corpus. Because the headline performance numbers (80.96% accuracy, 0.887 AUC) and the supervised-vs-unsupervised gap rest directly on these labels, the absence of this information is load-bearing for the central empirical claims.
Authors: We agree that a detailed account of the annotation process is essential for validating the LtHate corpus and supporting the empirical claims. In the revised manuscript, we will expand §3 to include the full annotation protocol, the number of annotators, inter-annotator agreement statistics (including Fleiss' kappa), and any steps taken for external validation or quality control. These additions will directly address the load-bearing nature of the label quality for the reported performance metrics. revision: yes
-
Referee: [§4] §4 (Experimental setup): The manuscript does not specify train/test split ratios, the hyperparameter search procedure for CatBoost and HBOS, or the precise PCA implementation and variance retained. These omissions prevent verification of the reported margins (e.g., 0.978 AUC on Russian) and limit reproducibility of the finding that PCA preserves discriminative power in the supervised case.
Authors: We concur that these experimental details are required for reproducibility and verification of the results, including the AUC margins and the PCA findings. In the revised version of §4, we will explicitly state the train/test split ratios, describe the hyperparameter search procedures used for CatBoost and HBOS (including any grid or random search configurations and validation strategy), and provide the precise PCA implementation details along with the variance retained when reducing to 64 dimensions. This will allow readers to replicate the supervised vs. unsupervised comparisons and the dimensionality reduction effects. revision: yes
Circularity Check
No circularity: purely empirical benchmarking with direct measurements
full rationale
The paper conducts an experimental comparison of six multilingual sentence embeddings on three hate-speech datasets (including a newly introduced LtHate corpus), training HBOS anomaly detectors and CatBoost classifiers with optional PCA. All reported accuracies and AUC values are direct empirical outcomes from model training and evaluation on held-out data; the manuscript contains no equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to their own inputs by construction. The work is therefore self-contained as a standard benchmarking study.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Datasets are representative samples of real-world hate speech and labels are accurate.
- domain assumption Pre-trained multilingual embeddings capture discriminative features for hate speech.
Reference graph
Works this paper leans on
-
[1]
Paula Fortuna and Sérgio Nunes. A survey on automatic detection of hate speech in text.ACM Computing Surveys, 51(4):85:1–85:30, 2018. doi:10.1145/3232676
-
[2]
Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. Resources and benchmark corpora for hate speech detection: a systematic review.Language Resources and Evaluation, 55:477–523, 2021. doi:10.1007/s10579-020-09502-8
-
[3]
Survey on the impact of online disinformation and hate speech, 2023
UNESCO and Ipsos. Survey on the impact of online disinformation and hate speech, 2023. URL https: //www.unesco.org/sites/default/files/medias/fichiers/2023/11/unesco_ipsos_survey.pdf
2023
-
[4]
Hate speech: A systematized review.SAGE Open, 13(1):1–18, 2023
María Antonia Paz, Julio Montero-Díaz, and Alicia Moreno-Delgado. Hate speech: A systematized review.SAGE Open, 13(1):1–18, 2023. doi:10.1177/21582440231181311
-
[5]
James Hawdon, Atte Oksanen, and Pekka Räsänen. Exposure to online hate in four nations: A cross-national consideration.Deviant Behavior, 38(3):254–266, 2017. doi:10.1080/01639625.2016.1196985
-
[6]
Karsten Müller and Carlo Schwarz. Fanning the flames of hate: Social media and hate crime.Journal of the European Economic Association, 19(4):2131–2167, 2021. doi:10.1093/jeea/jvaa045. 14 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT
-
[7]
Report of the independent international fact-finding mission on Myanmar,
United Nations Human Rights Council. Report of the independent international fact-finding mission on Myanmar,
-
[8]
A/HRC/39/64
URLhttps://www.ohchr.org/en/hr-bodies/hrc/myanmar-ffm/index. A/HRC/39/64
-
[9]
Regulation (EU) 2022/2065 of the European Parliament and of the Council on a single market for digital services and amending Directive 2000/31/EC (Digital Services Act), 2022
European Parliament and Council of the European Union. Regulation (EU) 2022/2065 of the European Parliament and of the Council on a single market for digital services and amending Directive 2000/31/EC (Digital Services Act), 2022. URL https://eur-lex.europa.eu/eli/reg/2022/2065/oj. Official Journal of the European Union, L 277, 1–102
2022
-
[10]
Bertie Vidgen and Leon Derczynski. Directions in abusive language training data, a systematic review: Garbage in, garbage out.PLOS ONE, 15(12):e0243300, 2020. doi:10.1371/journal.pone.0243300
-
[11]
Anna Schmidt and Michael Wiegand. A survey on hate speech detection using natural language processing. In Lun-Wei Ku and Cheng-Te Li, editors,Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, SocialNLP@EACL 2017, Valencia, Spain, April 3, 2017, pages 1–10. Association for Computational Linguistics, 2017. doi:...
-
[12]
Macy, and Ingmar Weber
Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. Automated hate speech detection and the problem of offensive language. InProceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017, pages 512–515. AAAI Press, 2017. URL https://aaai.org/ocs/index.php/ICWSM/ICWSM17/pa...
2017
-
[13]
Using convolutional neural networks to classify hate-speech
Björn Gambäck and Utpal Kumar Sikdar. Using convolutional neural networks to classify hate-speech. In Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel R. Tetreault, editors,Proceedings of the First Workshop on Abusive Language Online, ALW@ACL 2017, Vancouver, BC, Canada, August 4, 2017, pages 85–90. Association for Computational Linguistics, 2017...
-
[14]
Deep learning for hate speech detection in tweets
Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. Deep learning for hate speech detection in tweets. In Rick Barrett, Rick Cummings, Eugene Agichtein, and Evgeniy Gabrilovich, editors,Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017, pages 759–760. ACM, 2017. doi:10.1145/304...
-
[15]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017. URLhttp://arxiv.org/abs/1706.03762
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[16]
Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi. Hate speech detection and racial bias mitigation in social media based on bert model.PLOS ONE, 15(8):1–26, 08 2020. doi:10.1371/journal.pone.0237861
-
[17]
Hatebert: Retraining bert for abusive language detection in english
Tommaso Caselli, Valerio Basile, Jelena Mitrovi´c, and Michael Granitzer. Hatebert: Retraining bert for abusive language detection in english. InProceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 17–25. Association for Computational Linguistics, 2021. doi:10.18653/v1/2021.woah-1.3
-
[18]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.CoRR, abs/1810.04805, 2018. doi:10.48550/arXiv.1810.04805
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04805 2018
-
[19]
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale.CoRR, abs/1911.02116, 2019. doi:10.48550/arXiv.1911.02116
-
[20]
Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Ça˘grı Çöltekin. SemEval-2020 task 12: Multilingual offensive language identifi- cation in social media (OffensEval 2020). In Aurelie Herbelot, Xiaodan Zhu, Alexis Palmer, Nathan Schneider, Jonathan May, and Ekaterina Shutov...
-
[21]
Multilingual offensive language identification with cross-lingual em- beddings
Tharindu Ranasinghe and Marcos Zampieri. Multilingual offensive language identification with cross-lingual em- beddings. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5838–5844, Online, November 2020. Association for Computational Linguis...
-
[22]
Leveraging multilingual transformers for hate speech detection
Sayar Ghosh Roy, Ujwal Narayan, Tathagata Raha, Zubair Abid, and Vasudeva Varma. Leveraging multilingual transformers for hate speech detection. In Parth Mehta, Thomas Mandl, Prasenjit Majumder, and Mandar Mitra, editors,Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 ofCEUR Work...
2020
-
[23]
Manuel Tonneau, Diyi Liu, Samuel Fraiberger, Ralph Schroeder, Scott A. Hale, and Paul Röttger. From languages to geographies: Towards evaluating cultural bias in hate speech datasets. In Yi-Ling Chung, Zeerak Talat, Debora 15 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT Nozza, Flor Miriam Plaza-del Arco, Paul Röttger, Aida Mostafazadeh...
-
[24]
Cross-lingual transfer learning for hate speech detection
Irina Bigoulaeva, Viktor Hangya, and Alexander Fraser. Cross-lingual transfer learning for hate speech detection. In Bharathi Raja Chakravarthi, John P. McCrae, Manel Zarrouk, Rajeev K. Bali, and Paul Buitelaar, editors, Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, LT-EDI@EACL 2021, Online, April 19, 2021...
2021
-
[25]
Addressing the challenges of cross- lingual hate speech detection.CoRR, abs/2201.05922, 2022
Irina Bigoulaeva, Viktor Hangya, Iryna Gurevych, and Alexander Fraser. Addressing the challenges of cross- lingual hate speech detection.CoRR, abs/2201.05922, 2022. doi:10.48550/arxiv.2201.05922
-
[26]
Exploring hate speech detection models for Lithuanian language
Justina Mandravickait˙e, Egl˙e Rimkien˙e, Mindaugas Petkeviˇcius, Milita Songailait˙e, Eimantas Zaranka, and Tomas Krilaviˇcius. Exploring hate speech detection models for Lithuanian language. In Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del Arco, Zeerak Talat, and Francielle Vargas, editors,Proceedings of the The 9th Workshop...
2025
-
[27]
Multilingual E5 Text Embeddings: A Technical Report
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual E5 text embeddings: A technical report.arXiv, 2024. doi:10.48550/arXiv.2402.05672
work page internal anchor Pith review doi:10.48550/arxiv.2402.05672 2024
-
[28]
jina- embeddings-v3: Multilingual embeddings with task lora.arXiv preprint arXiv:2409.10173, 2024
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, and Han Xiao. jina-embeddings-v3: Multilingual embeddings with task lora.arXiv, 2024. doi:10.48550/arXiv.2409.10173
-
[29]
Arctic-embed 2.0: Multilingual retrieval without compromise.arXiv preprint arXiv:2412.04506, 2024
Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos. Arctic-embed 2.0: Multilingual retrieval without compromise.arXiv, 2024. doi:10.48550/arXiv.2412.04506
-
[30]
M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv,
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv,
-
[31]
doi:10.48550/arXiv.2402.03216
work page internal anchor Pith review doi:10.48550/arxiv.2402.03216
-
[32]
Hateful symbols or hateful people? predictive features for hate speech detection on twitter
Zeerak Waseem and Dirk Hovy. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. InProceedings of the Student Research Workshop, SRW@HLT-NAACL 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17,...
-
[33]
Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). In Jonathan May, Ekaterina Shutova, Aurélie Herbelot, Xiaodan Zhu, Marianna Apidianaki, and Saif M. Mohammad, editors,Proceedings of the 13th International Worksh...
-
[34]
Ayme Arango Monnar, Jorge Perez Rojas, and Barbara Polete Labra. Cross-lingual hate speech detec- tion using domain-specific word embeddings.PLOS ONE, 19(7):e0306521, July 2024. ISSN 1932-6203. doi:10.1371/journal.pone.0306521
-
[35]
A multilingual evaluation for online hate speech detection.arXiv, 20(2), March 2020
Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. A multilingual evaluation for online hate speech detection.arXiv, 20(2), March 2020. ISSN 1533-5399. doi:10.1145/3377323
-
[36]
Transformers at hsd-2lang 2024: Hate speech detection in arabic and turkish tweets using BERT based architectures
Kriti Singhal and Jatin Bedi. Transformers at hsd-2lang 2024: Hate speech detection in arabic and turkish tweets using BERT based architectures. In Ali Hürriyetoglu, Hristo Tanev, Surendrabikram Thapa, and Gökçe Uludogan, editors,Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, CAS...
2024
-
[37]
Akshay Singh and Rahul Thakur. Generalizable multilingual hate speech detection on low resource indian languages using fair selection in federated learning. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technol...
-
[38]
Sayan Ghosh and Suman Kumar Senapati. Hate speech detection in low-resourced indian languages: An analysis of transformer-based monolingual and multilingual models with cross-lingual experiments.Natural Language Engineering, 2024. doi:10.1017/S1351324924000281. 16 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT
-
[39]
A dual contrastive learning framework for enhanced hate speech detection in low-resource languages
Krishan Chavinda and Uthayasanker Thayasivam. A dual contrastive learning framework for enhanced hate speech detection in low-resource languages. In Kengatharaiyer Sarveswaran, Ashwini Vaidya, Bal Krishna Bal, Sana Shams, and Surendrabikram Thapa, editors,Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), p...
2025
-
[40]
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, I. Sutskever, Kai Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality.Neural Information Processing Systems, 2013. doi:10.48550/arXiv.1310.4546
-
[41]
A survey of cross-lingual word embedding models.Journal of Artificial Intelligence Research, 2017
Sebastian Ruder, Ivan Vulic, and Anders Søgaard. A survey of cross-lingual word embedding models.Journal of Artificial Intelligence Research, 2017. doi:10.1613/jair.1.11640
-
[42]
Kanayama, Trevor Cohn, Tengfei Ma, Steven Bird, and Long Duong
H. Kanayama, Trevor Cohn, Tengfei Ma, Steven Bird, and Long Duong. Multilingual training of crosslingual word embeddings.Conference of the European Chapter of the Association for Computational Linguistics, 2017. doi:10.18653/V1/E17-1084
-
[43]
Unsupervised multilingual word embeddings
Xilun Chen and Claire Cardie. Unsupervised multilingual word embeddings. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270. Association for Computational Linguistics, 2018. doi:10.18653/v1/d18-1024
-
[44]
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic bert sentence embedding. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891. Association for Computational Linguistics, 2022. doi:10.18653/v1/2022.acl- long.62
-
[45]
Improving text embeddings with large language models
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. InProceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 11897–11916. Association for Computational Linguistics,
-
[46]
doi:10.18653/v1/2024.acl-long.642
-
[47]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025. doi:10.48550/arXiv.2506.05176
work page internal anchor Pith review doi:10.48550/arxiv.2506.05176 2025
-
[48]
Lithuanian hate speech corpus v.1, 2025
Rita Butkien˙e, Dambrauskas Edgaras, Šukys Algirdas, and Žitkus V oldemaras. Lithuanian hate speech corpus v.1, 2025. URL http://hdl.handle.net/20.500.11821/69. CLARIN-LT digital library in the Republic of Lithuania
2025
-
[49]
LITIS v.1, 2016
Darius Amileviˇcius and Mažvydas Petkevi ˇcius. LITIS v.1, 2016. URL http://hdl.handle.net/20.500. 11821/11. CLARIN-LT digital library in the Republic of Lithuania
2016
-
[50]
An Italian Twitter corpus of hate speech against immigrants
Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. An Italian Twitter corpus of hate speech against immigrants. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperid...
2018
-
[51]
Daryna Dementieva, Daniil Moskovskiy, Varvara Logacheva, David Dale, Olga Kozlova, Nikita Semenov, and Alexander Panchenko. Methods for detoxification of texts for the russian language.Multimodal Technologies and Interaction, 5(9), 2021. ISSN 2414-4088. doi:10.3390/mti5090054
-
[52]
Model2vec: Fast state-of-the-art static embeddings, 2024
Stephan Tulkens and Thomas van Dongen. Model2vec: Fast state-of-the-art static embeddings, 2024. URL https://github.com/MinishLab/model2vec
2024
-
[53]
URL https://huggingface.co/minishlab/ potion-multilingual-128M
minishlab/potion-multilingual-128M at Hugging Face, 2025. URL https://huggingface.co/minishlab/ potion-multilingual-128M
2025
-
[54]
URLhttps://huggingface.co/BAAI/bge-m3
BAAI/bge-m3 at Hugging Face, 2024. URLhttps://huggingface.co/BAAI/bge-m3
2024
-
[55]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.arXiv, 2023. doi:10.48550/arXiv.1910.10683
work page internal anchor Pith review doi:10.48550/arxiv.1910.10683 2023
-
[56]
URLhttps://huggingface.co/datasets/allenai/c4
allenai/C4 datasets at Hugging Face, 2020. URLhttps://huggingface.co/datasets/allenai/c4
2020
-
[57]
POTION: bag of tricks leads to better models, 2024
Stephan Tulkens and Thomas van Dongen. POTION: bag of tricks leads to better models, 2024. URL https: //minishlab.github.io/tokenlearn_blogpost/. 17 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT
2024
-
[58]
EmbeddingGemma: Powerful and Lightweight Text Representations
Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, ...
work page internal anchor Pith review doi:10.48550/arxiv.2509.20354 2025
-
[59]
URL https://huggingface.co/google/ embeddinggemma-300m
google/embeddinggemma-300m at Hugging Face, 2025. URL https://huggingface.co/google/ embeddinggemma-300m
2025
-
[60]
URL https://arxiv.org/abs/ 2502.13595
Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi ´nski, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Ça˘gatan, Akash Kundu, Martin Bernstorff, Shi...
-
[61]
RoFormer: Enhanced Transformer with Rotary Position Embedding,
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. doi:10.1016/j.neucom.2023.127063
-
[62]
Matryoshka representation learning
Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi. Matryoshka representation learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pa...
2022
-
[63]
URLhttps://huggingface.co/datasets/Shitao/MLDR
Shitao/MLDR datasets at Hugging Face, 2025. URLhttps://huggingface.co/datasets/Shitao/MLDR
2025
-
[64]
URL https://huggingface.co/datasets/Shitao/ bge-m3-data
Shitao/bge-m3-data datasets at Hugging Face, 2024. URL https://huggingface.co/datasets/Shitao/ bge-m3-data
2024
-
[65]
URL https://huggingface.co/ Snowflake/snowflake-arctic-embed-l-v2.0
Snowflake/snowflake-arctic-embed-l-v2.0 at Hugging Face, 2025. URL https://huggingface.co/ Snowflake/snowflake-arctic-embed-l-v2.0
2025
-
[66]
MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages
Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. MIRACL: A multilingual retrieval dataset covering 18 diverse languages.Transactions of the Association for Computational Linguistics, 11:1114–1131, 09 2023. ISSN 2307-387X. doi:10.1162/tacl_a_00595
-
[67]
URL https://huggingface.co/jinaai/ jina-embeddings-v3
jinaai/jina-embeddings-v3 at Hugging Face, 2024. URL https://huggingface.co/jinaai/ jina-embeddings-v3
2024
-
[68]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models.arXiv, 2021. doi:10.48550/arXiv.2106.09685
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2106.09685 2021
-
[69]
URL https://huggingface.co/intfloat/ multilingual-e5-large-instruct
intfloat/multilingual-e5-large-instruct at Hugging Face, 2023. URL https://huggingface.co/intfloat/ multilingual-e5-large-instruct
2023
-
[70]
URL https://huggingface.co/FacebookAI/ xlm-roberta-large
FacebookAI/xlm-roberta-large at Hugging Face, 2019. URL https://huggingface.co/FacebookAI/ xlm-roberta-large. 18 Multilingual Text Embeddings for Hate Speech DetectionA PREPRINT
2019
-
[71]
Improving text embeddings with large language models
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897–11916, 2024
2024
-
[72]
H. Hotelling. Analysis of a complex of statistical variables into principal components.Journal of Educational Psychology, 24(6):417–441, September 1933. ISSN 0022-0663. doi:10.1037/h0071325
-
[73]
Histogram-based outlier score (HBOS): A fast unsupervised anomaly de- tection algorithm.KI-2012: poster and demo track, 1:59–63, 2012
Markus Goldstein and Andreas Dengel. Histogram-based outlier score (HBOS): A fast unsupervised anomaly de- tection algorithm.KI-2012: poster and demo track, 1:59–63, 2012. URL https://www.dfki.de/fileadmin/ user_upload/import/6431_HBOS-poster.pdf
2012
-
[74]
CatBoost: unbiased boosting with categorical features
Liudmila Prokhorenkova, Gleb Gusev, Aleksandr V orobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Process- ing Systems, volume 31. Curran Associates, Inc., 2018. URL https://pap...
2018
-
[75]
Jin Huang and C.X. Ling. Using AUC and accuracy in evaluating learning algorithms.IEEE Transactions on Knowledge and Data Engineering, 17:299–310, March 2005. doi:10.1109/tkde.2005.50
-
[76]
Precision-recall curves.SSRN Electronic Journal, 2016
Andreas Beger. Precision-recall curves.SSRN Electronic Journal, 2016. doi:10.2139/ssrn.2765419
-
[77]
Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20: 37–46, 1960. doi:10.1177/001316446002000104
-
[78]
Marry L. McHugh. Interrater reliability: the kappa statistic.Biochemia Medica, pages 276–282, 2012. doi:10.11613/bm.2012.031. 19
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.