KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

Barathi Ganesh HB; Juuso Eronen; Michal Ptaszynski; Rene Melendez

arXiv: 2605.13415 · v2 · pith:HHMTDWQRnew · submitted 2026-05-13 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-20 21:27 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords reclaimed slurs · multilingual classification · threshold optimization · social media analysis · transfer learning · data augmentation · LGBTQ+ discourse

The pith

Language-specific threshold refinement improves reclaimed slur detection by 2-5% F1 without model retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a multi-stage framework to classify reclaimed versus non-reclaimed uses of LGBTQ+-related slurs in English, Spanish, and Italian tweets. It combines back-translation for data augmentation, dynamic undersampling during transfer learning, and masked language modeling on top of a selected multilingual embedding model. The central result is that optimizing the classification threshold separately for each language through ROC analysis captures differences in how reclamatory usage appears across languages. This yields a consistent performance lift while avoiding the cost of retraining. A sympathetic reader would care because it offers a practical way to handle linguistic variation and data scarcity in social media analysis of sensitive discourse.

Core claim

The framework evaluates eight multilingual models and selects XLM-RoBERTa, then augments the corpus threefold via GPT-4o-mini back-translation while preserving class ratios. After inductive transfer learning with undersampling and optional masked language modeling pre-training, the authors apply language-specific decision thresholds optimized via ROC analysis. These per-language thresholds produce 2-5% absolute F1 gains over a single global threshold by accounting for distributional differences in model outputs and cross-linguistic variation in reclamatory expression.

What carries the argument

Language-specific decision thresholds optimized via ROC analysis on model confidence scores.
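
A minimal sketch of how per-language thresholds could be chosen from validation confidence scores. It assumes the selection criterion is positive-class F1 and that every observed score is tried as a cut-off; the function names and the toy data are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict

def f1_score(labels, preds):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try each observed score as a cut-off; keep the one with the highest F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

def per_language_thresholds(records):
    """records: iterable of (language, confidence_score, gold_label)."""
    by_lang = defaultdict(lambda: ([], []))
    for lang, score, label in records:
        by_lang[lang][0].append(score)
        by_lang[lang][1].append(label)
    return {lang: best_threshold(s, y) for lang, (s, y) in by_lang.items()}

# Toy validation scores, made up for illustration (not from the paper).
validation = [
    ("en", 0.90, 1), ("en", 0.70, 1), ("en", 0.60, 0), ("en", 0.20, 0),
    ("it", 0.40, 1), ("it", 0.35, 1), ("it", 0.30, 0), ("it", 0.10, 0),
]
thresholds = per_language_thresholds(validation)
for lang, (t, f1) in thresholds.items():
    print(lang, t, f1)
```

In this toy data the Italian scores sit lower than the English ones, so a single cut-off of 0.5 would miss every Italian positive while a per-language cut-off recovers them. Whether real model outputs behave this way is exactly what the paper's claim depends on.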

If this is right

Per-language threshold tuning delivers measurable gains in multilingual classification without additional training compute.
Optimal decision boundaries differ across languages because of distinct patterns in how reclamatory usage is expressed in each.
The combination of augmentation, undersampling, and threshold adjustment addresses class imbalance in low-resource settings for rare linguistic categories.
The full pipeline, including code and setup, supports direct replication and extension to similar cross-lingual detection tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-language threshold approach could be tested on other multilingual sentiment or stance detection problems where model confidence varies systematically by language.
Dynamic threshold selection based on recent data batches might further adapt the system to evolving usage patterns in online discourse.
Extending the evaluation to additional languages or different categories of reclaimed terms would test whether the observed variation in thresholds is a general property of multilingual models.

Load-bearing premise

Back-translation via GPT-4o-mini preserves both semantic content and the original class distribution ratios without introducing new biases that would change whether a slur instance counts as reclamatory.

What would settle it

Applying the language-specific thresholds versus a single global threshold to a fresh held-out test set of tweets and finding no F1 improvement or a reversal would falsify the claimed benefit.
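
The check described above can be sketched as a side-by-side comparison on held-out data. This is an illustration under stated assumptions: the threshold values, the record layout `(language, score, gold_label)`, and the function names are hypothetical, and positive-class F1 stands in for whatever metric the shared task uses.

```python
def f1_positive(labels, preds):
    """Positive-class F1 computed as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def apply_thresholds(records, thresholds, default=0.5):
    """Label each record with its language's cut-off, or the default."""
    return [1 if score >= thresholds.get(lang, default) else 0
            for lang, score, _ in records]

def compare(records, global_threshold, language_thresholds):
    """Return (F1 with one global cut-off, F1 with per-language cut-offs, gain)."""
    labels = [y for _, _, y in records]
    f1_global = f1_positive(
        labels, apply_thresholds(records, {}, default=global_threshold))
    f1_lang = f1_positive(
        labels, apply_thresholds(records, language_thresholds,
                                 default=global_threshold))
    return f1_global, f1_lang, f1_lang - f1_global

# Toy held-out records, made up for illustration (not from the paper).
held_out = [
    ("en", 0.80, 1), ("en", 0.55, 0), ("en", 0.30, 0),
    ("it", 0.45, 1), ("it", 0.38, 1), ("it", 0.20, 0),
]
f1_global, f1_lang, gain = compare(held_out, 0.5, {"en": 0.70, "it": 0.35})
print(f1_global, f1_lang, gain)
```

A gain near zero or below zero on genuinely fresh data would count against the claim; the thresholds must be fixed on validation data before the held-out set is touched for the comparison to mean anything.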

Figures

Figures reproduced from arXiv: 2605.13415 by Barathi Ganesh HB, Juuso Eronen, Michal Ptaszynski, Rene Melendez.

**Figure 1.** Multi-stage multilingual hate-speech classification framework with four sequential runs refining performance via data-driven model selection, augmentation, hyperparameter selection, 5-fold CV, MLM adaptation, and threshold calibration. RUN 1: Inductive transfer learning with optimal foundation model. RUN 2: Transductive transfer learning on optimal foundation model followed by inductive transfer learning. …

**Figure 2.** Data distribution statistics: label imbalance in original dataset, augmented dataset after backtranslation, and chi-square analysis of language-label associations. A stratified 5-fold cross-validation framework was set up, which kept the same distribution of classes in the folds (80% training, 20% validation for each fold). Conventional machine learning baselines were then trained using the computed embed…
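
The stratified 5-fold split mentioned in the caption can be illustrated with a small standard-library sketch. It is not the authors' code: the round-robin dealing, the seed, and the toy label counts are assumptions chosen to show how each fold keeps the overall class ratio and yields an 80/20 train/validation split.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Return k lists of indices with roughly equal class ratios in each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy imbalanced label set: 20 positives, 80 negatives (illustrative only).
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, k=5)

for held_out in range(5):
    val_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # 80% of the data trains the model, 20% validates it, in every fold.
    print(held_out, len(train_idx), len(val_idx))
```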

**Figure 3.** Fold Level Performance Metrics for Inductive Transfer Learning. Training was conducted for a maximum of 10 epochs per fold using the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) with linear learning rate warmup over the first 10% of total training steps, followed by linear decay to zero over remaining steps. The loss function was weighted cross entropy, computed as $L = -[w_0 \log(p_0) + w_1 \log(p_1)]$. In wh…
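
A short sketch of the two training ingredients named in the caption: a learning rate that warms up linearly over the first 10% of steps and then decays linearly to zero, and a class-weighted cross-entropy. The peak learning rate and the class weights below are hypothetical, and the loss is written in the common per-example form where only the gold-class term is active, which is an assumption about how the caption's formula is meant to be read.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

def weighted_cross_entropy(p1, label, w0=1.0, w1=3.0):
    """Weighted cross-entropy for one example.

    p1 is the predicted probability of class 1; w0 and w1 are
    illustrative class weights, not values from the paper.
    """
    p0 = 1.0 - p1
    if label == 1:
        return -w1 * math.log(p1)
    return -w0 * math.log(p0)

# Hypothetical run: 1000 optimizer steps with a peak rate of 2e-5.
schedule = [lr_at_step(s, 1000, 2e-5) for s in (0, 50, 100, 550, 1000)]
print(schedule)
print(weighted_cross_entropy(0.5, 1), weighted_cross_entropy(0.5, 0))
```

With a larger weight on the rarer class, a mistake on a positive example costs more than the same mistake on a negative one, which is the usual motivation for weighting under class imbalance.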

**Figure 4.** Transductive Transfer Learning: Parameter vs. Validation Loss. Following MLM adaptation, the fine-tuned model was saved and subsequently used as the initialization for the downstream fine-tuning task. This downstream fine-tuning pipeline was identical to run 1 where dynamic undersampling (1:3 ratio), Optuna hyperparameter optimization (50 trials, TPE sampler, MedianPruner), 5-fold stratified cross validation, and …
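
Dynamic epoch-level undersampling at a 1:3 ratio can be sketched as follows. The sketch assumes the ratio means one minority example for every three majority examples and that label 1 is the minority class; both readings, along with the function name and the commented-out training call, are hypothetical.

```python
import random

def undersample_epoch(examples, ratio=3, seed=None, minority_label=1):
    """Keep all minority examples plus a fresh random majority subset.

    examples: list of (text, label) pairs. A new subset is drawn on every
    call, so the model sees different majority examples across epochs.
    """
    rng = random.Random(seed)
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] != minority_label]
    keep = min(len(majority), ratio * len(minority))
    subset = minority + rng.sample(majority, keep)
    rng.shuffle(subset)
    return subset

# Toy corpus: 10 minority and 100 majority examples (illustrative only).
data = [(f"tweet {i}", 1) for i in range(10)]
data += [(f"tweet {i}", 0) for i in range(10, 110)]

for epoch in range(3):
    epoch_data = undersample_epoch(data, ratio=3, seed=epoch)
    # train_one_epoch(model, epoch_data)  # hypothetical training call
    print(epoch, len(epoch_data), sum(label for _, label in epoch_data))
```

Resampling per epoch, rather than once before training, lets the model eventually see most of the majority class while each individual epoch stays at the target ratio.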

**Figure 5.** Inductive Transfer Learning on Transductive Model. Impact on F1 Score with respect to the parameters. 3.5. Language-Specific Threshold Refinement and Prediction Reclassification. Both run 1 and run 2 models produced continuous confidence scores via softmax normalization of the final layer logits, $\mathrm{conf\_score} = \exp(\mathrm{logit}_1) / (\exp(\mathrm{logit}_0) + \exp(\mathrm{logit}_1))$, where $\mathrm{logit}_0$ and $\mathrm{logit}_1$ denote the class-specific logi…
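
The confidence score in the caption is the two-class softmax probability of class 1. A minimal sketch, with the maximum logit subtracted before exponentiation for numerical stability (a standard trick that leaves the result unchanged); the function name is hypothetical.

```python
import math

def confidence_score(logit_0, logit_1):
    """Two-class softmax probability of class 1, computed stably."""
    m = max(logit_0, logit_1)
    e0 = math.exp(logit_0 - m)
    e1 = math.exp(logit_1 - m)
    return e1 / (e0 + e1)

print(confidence_score(0.0, 0.0))        # equal logits give 0.5
print(confidence_score(-1.2, 0.8))       # class 1 favoured
print(confidence_score(1000.0, 1001.0))  # large logits do not overflow
```

For two classes this equals the logistic sigmoid of `logit_1 - logit_0`, so thresholding the score is the same as thresholding the logit difference.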

**Figure 6.** Threshold Analysis: Language-specific Optimal Thresholds. The predictions of the run 1 model were reclassified using the learned language-specific thresholds, giving run 3. Run 2 predictions were reclassified in the same way, giving run 4. Threshold refinement is a post-prediction optimization step that does not require retraining the model. This step usually r…

**Figure 7.** Final Test Set Results of Submitted Runs 1-4. The integration of domain knowledge through MLM (RUN 2) therefore gave language-dependent results, showing that MLM adaptation is not universally advantageous across multilingual contexts. While English performance showed a marginal improvement, Spanish and Italian displayed more variable responses to MLM pre-training. It would appear that morphologically ric…

Original abstract

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 adds masked language modeling pre-training, and RUN 3 and RUN 4 are the RUN 1 and RUN 2 predictions refined via language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a straightforward shared-task system paper that gets a small F1 lift from per-language threshold tuning on top of routine augmentation and fine-tuning.

The main takeaway is that language-specific decision thresholds add 2-5% absolute F1 after training one model on augmented data, and the authors made the whole pipeline reproducible with public code. That is the practical contribution here. They evaluated eight multilingual models, chose XLM-RoBERTa, tripled the training set via GPT-4o-mini back-translation while keeping class ratios, added masked language modeling adaptation, and used epoch-level undersampling for imbalance. The final runs show that tuning separate ROC thresholds per language improves results without retraining.

The GitHub link is a real plus for anyone who wants to inspect or reuse the setup. Credit where it is due: they actually shipped the code and described the steps clearly enough that the work can be checked.

The soft spots are mostly about missing detail rather than fatal errors. The abstract states the gains but does not include the raw baseline numbers, confidence intervals, or full ablation tables, so it is hard to tell how much the thresholds are carrying versus the earlier steps. The back-translation assumption also looks thin. Reclamation is context-heavy and tied to speaker identity and cultural valence; machine translation can shift or neutralize those signals, which would make some of the added labels noisy. If that happens, the confidence scores shift and the language-specific thresholds end up correcting for artifacts instead of reflecting genuine cross-lingual differences.

This paper is for teams working on the MultiPride shared task or similar low-resource social-media classification in a handful of languages. A reader who needs a working recipe for slur detection across English, Spanish, and Italian will find usable steps. It will not move broader theory on continual learning or foundation models. I would send it to peer review for the workshop proceedings because it is a complete, reproducible system description even if the gains stay incremental.

Referee Report

2 major / 2 minor

Summary. The paper presents a multi-stage framework for detecting reclaimed slurs in English, Spanish, and Italian tweets. It evaluates eight multilingual embedding models and selects XLM-RoBERTa, augments the training data threefold via GPT-4o-mini back-translation while claiming to preserve semantics and class ratios, applies inductive transfer learning with dynamic undersampling, optionally adds masked language modeling pre-training, and refines outputs using language-specific decision thresholds derived from ROC analysis. The central empirical claim is that these thresholds yield a 2-5% absolute F1 improvement without retraining.

Significance. If the reported gains are reproducible and the augmentation preserves label validity, the work supplies a practical, lightweight adaptation technique for handling cross-lingual variation and imbalance in pragmatically nuanced classification tasks. The public GitHub release of code and setup is a clear strength that supports reproducibility. The contribution remains primarily task-specific rather than advancing general continual-learning or multilingual modeling theory.

major comments (2)

[Abstract] Abstract: the manuscript asserts a 2-5% absolute F1 improvement from language-specific threshold refinement yet supplies no baseline F1 scores, confidence intervals, ablation tables, or statistical tests. Without these quantities the magnitude, reliability, and attribution of the gain cannot be verified.
[Abstract] Abstract: the claim that GPT-4o-mini back-translation 'preserves semantic content and class distribution ratios' is presented without human validation, semantic-similarity metrics, or label-consistency checks. Because reclamation is a context-, speaker-, and culture-dependent pragmatic property, any systematic shift in label semantics would bias the base classifier’s confidence scores and render the subsequent ROC-derived thresholds artifactual rather than reflective of genuine cross-lingual variation.

minor comments (2)

The title foregrounds 'Continual Learning' while the described pipeline centers on inductive transfer learning and post-hoc threshold tuning; an explicit mapping between the two concepts would reduce reader confusion.
[Abstract] The abstract states that eight models were evaluated systematically but does not report the macro-F1 scores or ranking that justified selecting XLM-RoBERTa; adding this table would strengthen the model-selection narrative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying the empirical support for our claims and acknowledging limitations in the augmentation validation. Revisions will be incorporated to improve transparency.

Referee: [Abstract] Abstract: the manuscript asserts a 2-5% absolute F1 improvement from language-specific threshold refinement yet supplies no baseline F1 scores, confidence intervals, ablation tables, or statistical tests. Without these quantities the magnitude, reliability, and attribution of the gain cannot be verified.

Authors: We agree that the abstract would be strengthened by including explicit baseline F1 scores to contextualize the reported 2-5% gains. The full manuscript (Section 4 and associated tables) presents performance for all four runs, with RUN 3/4 showing the threshold-refined results compared against the inductive transfer learning baseline (RUN 1). The 2-5% absolute F1 improvement is measured per language against these baselines. To address the comment directly, we will revise the abstract to report the specific baseline macro-F1 values per language alongside the improved scores. Ablation tables comparing the runs are already included in the experimental section. Confidence intervals and formal statistical tests (e.g., McNemar) were not computed in the original experiments; we will add them in the revision where feasible using the existing predictions, though this may require additional computation. revision: yes
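
As an illustration of the kind of test the response mentions, the sketch below runs an exact (binomial) McNemar test on paired predictions from two runs over the same test items. It uses only the standard library; the function name and the toy predictions are hypothetical, and nothing here reflects numbers from the paper.

```python
from math import comb

def mcnemar_exact(gold, preds_a, preds_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only system A gets
    right and c counts items only system B gets right.
    """
    b = sum(1 for y, pa, pb in zip(gold, preds_a, preds_b)
            if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(gold, preds_a, preds_b)
            if pa != y and pb == y)
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy paired predictions, made up for illustration.
gold  = [1, 1, 1, 1, 1, 1, 1, 1]
run_1 = [1, 0, 0, 0, 0, 0, 0, 1]  # e.g. a global threshold
run_3 = [0, 1, 1, 1, 1, 1, 1, 1]  # e.g. per-language thresholds
b, c, p_value = mcnemar_exact(gold, run_1, run_3)
print(b, c, p_value)
```

Because the test only needs the stored predictions of both runs and the gold labels, it can be computed after the fact without retraining.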
Referee: [Abstract] Abstract: the claim that GPT-4o-mini back-translation 'preserves semantic content and class distribution ratios' is presented without human validation, semantic-similarity metrics, or label-consistency checks. Because reclamation is a context-, speaker-, and culture-dependent pragmatic property, any systematic shift in label semantics would bias the base classifier’s confidence scores and render the subsequent ROC-derived thresholds artifactual rather than reflective of genuine cross-lingual variation.

Authors: We acknowledge that reclamation is a pragmatically nuanced phenomenon and that automated augmentation introduces risks of semantic drift. Class distribution ratios were preserved by design: each original instance was back-translated while retaining its original label, resulting in a balanced tripling of the corpus. Semantic preservation was assumed based on the quality of GPT-4o-mini translations, but no human validation, label-consistency checks, or semantic similarity metrics (e.g., embedding cosine or BLEU) were performed. We agree this constitutes a limitation for a context-dependent task. In the revision we will add an explicit limitations paragraph in the methodology or discussion section noting the absence of such validation and the potential for cultural shifts, while still reporting the empirical gains observed. We will not strengthen the preservation claim beyond what the data supports. revision: partial
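
The "preserved by design" argument can be made concrete with a small sketch: if every original instance is translated into the two other languages and keeps its label, the corpus triples and the class ratios cannot change. The `translate` callable is a hypothetical stand-in for whatever translation service is used, and the toy dataset is made up. Note what the sketch does not show: whether the copied label is still correct for the translated text.

```python
from collections import Counter

LANGUAGES = ("en", "es", "it")

def augment(dataset, translate):
    """Add one label-preserving copy of each item per other language.

    dataset: list of (text, language, label).
    translate: hypothetical callable (text, source, target) -> text.
    """
    augmented = []
    for text, lang, label in dataset:
        augmented.append((text, lang, label))
        for target in LANGUAGES:
            if target != lang:
                # The label is copied, never re-annotated.
                augmented.append((translate(text, lang, target), target, label))
    return augmented

def class_ratios(dataset):
    """Fraction of items carrying each label."""
    counts = Counter(label for _, _, label in dataset)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def fake_translate(text, source, target):
    """Placeholder that tags the text instead of calling a real model."""
    return f"[{source}->{target}] {text}"

original = [
    ("example a", "en", 1), ("example b", "en", 0),
    ("example c", "es", 0), ("example d", "it", 0),
]
augmented = augment(original, fake_translate)
print(len(original), len(augmented))
print(class_ratios(original), class_ratios(augmented))
```

The identical ratios follow from the bookkeeping alone, so they say nothing about semantic preservation; that would need the human or metric-based checks the referee asks for.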

Circularity Check

0 steps flagged

No significant circularity; empirical results from standard validation and test splits

The paper's derivation chain consists of model selection via cross-validation F1 scores, back-translation augmentation described as preserving class ratios, and language-specific threshold tuning via ROC analysis on held-out validation data, with final F1 gains measured on evaluation data. No equations or steps reduce the reported 2-5% improvement to a quantity defined by construction from the fitted thresholds themselves, nor do any load-bearing claims rely on self-citations or ansatzes that presuppose the target result. The framework is presented as externally reproducible via linked code, rendering the central claims self-contained against independent benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on the empirical effectiveness of standard transfer-learning and augmentation steps plus the validity of language-specific threshold tuning; no new entities are postulated.

free parameters (1)

language-specific decision thresholds
Chosen via ROC analysis on validation data for each language to maximize F1.

axioms (1)

domain assumption Back-translation preserves semantic content and class distribution ratios
Invoked to justify tripling the training corpus with GPT-4o-mini translations.

pith-pipeline@v0.9.0 · 5823 in / 1260 out tokens · 84974 ms · 2026-05-20T21:27:31.441819+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

[1] E. Zsisku, A. Zubiaga, H. Dubossarsky, Hate speech detection and reclaimed language: Mitigating false positives and compounded discrimination, in: Proceedings of the 16th ACM Web Science Conference, 2024, pp. 241–249

[2] B. R. Chakravarthi, R. Priyadharshini, T. Durairaj, J. P. McCrae, P. Buitelaar, P. Kumaresan, R. Ponnusamy, Overview of the shared task on homophobia and transphobia detection in social media comments, in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, 2022, pp. 369–377

[3] M. Popa-Wyatt, Reclamation: Taking back control of words, Grazer Philosophische Studien 97 (2020) 159–176

[4] C. Ferrando, L. Draetta, M. Madeddu, M. Sosto, V. Patti, P. Rosso, C. Bosco, J. Mata, E. Gualda, Multipride at evalita 2026: Overview of the multilingual automatic detection of slur reclamation in the lgbtq+ context task, in: Proceedings of the Ninth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2...

[5] R. J. Tallarida, R. B. Murray, Chi-square test, in: Manual of pharmacologic calculations: with computer programs, Springer, 1987, pp. 140–142

[6] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (10) (2010) 1345

[7] OpenAI, GPT-4o mini: advancing cost-efficient intelligence, https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024. Accessed: 2026-01-07

[8] A. Taheri, A. Zamanifar, A. Farhadi, Enhancing aspect-based sentiment analysis using data augmentation based on back-translation, International Journal of Data Science and Analytics 19 (2025) 491–516

[9] S. Pouyanfar, Y. Tao, A. Mohan, H. Tian, A. S. Kaseb, K. Gauen, R. Dailey, S. Aghajanzadeh, Y.-H. Lu, S.-C. Chen, et al., Dynamic sampling in convolutional neural networks for imbalanced data classification, in: 2018 IEEE conference on multimedia information processing and retrieval (MIPR), IEEE, 2018, pp. 112–117

[10] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual e5 text embeddings: A technical report, arXiv preprint arXiv:2402.05672 (2024)

[11] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, arXiv preprint arXiv:2402.03216 (2024)

[12] X. Zhang, Y. Zhang, D. Long, W. Xie, Z. Dai, J. Tang, H. Lin, B. Yang, P. Xie, F. Huang, et al., mgte: Generalized long-context text representation and reranking models for multilingual text retrieval, arXiv preprint arXiv:2407.19669 (2024)

[13] S. Sturua, I. Mohr, M. K. Akram, M. Günther, B. Wang, M. Krimmel, F. Wang, G. Mastrapas, A. Koukounas, N. Wang, et al., jina-embeddings-v3: Multilingual embeddings with task lora, arXiv preprint arXiv:2409.10173 (2024)

[14] S. Labs, Snowflake’s arctic embed 2.0 goes multilingual, 2024. URL: https://www.snowflake.com/en/engineering-blog/snowflake-arctic-embed-2-multilingual/

[15] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic bert sentence embedding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 878–891

[16] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y.-H. Sung, et al., Multilingual universal sentence encoder for semantic retrieval, in: Proceedings of the 58th annual meeting of the Association for Computational Linguistics: system demonstrations, 2020, pp. 87–94

[17] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 8440–8451

[18] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2623–2631

[19] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020)
