
arxiv: 2605.13415 · v2 · pith:HHMTDWQRnew · submitted 2026-05-13 · 💻 cs.CL · cs.AI · cs.LG

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

Pith reviewed 2026-05-20 21:27 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: reclaimed slurs, multilingual classification, threshold optimization, social media analysis, transfer learning, data augmentation, LGBTQ+ discourse

The pith

Language-specific threshold refinement improves reclaimed slur detection by 2-5% absolute F1 without model retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a multi-stage framework to classify reclaimed versus non-reclaimed uses of LGBTQ+-related slurs in English, Spanish, and Italian tweets. It combines back-translation for data augmentation, dynamic undersampling during transfer learning, and masked language modeling on top of a selected multilingual embedding model. The central result is that optimizing the classification threshold separately for each language through ROC analysis captures differences in how reclamatory usage appears across languages. This yields a consistent performance lift while avoiding the cost of retraining. A sympathetic reader would care because it offers a practical way to handle linguistic variation and data scarcity in social media analysis of sensitive discourse.

Core claim

The framework evaluates eight multilingual models and selects XLM-RoBERTa, then augments the corpus threefold via GPT-4o-mini back-translation while preserving class ratios. After inductive transfer learning with undersampling and optional masked language modeling pre-training, the authors apply language-specific decision thresholds optimized via ROC analysis. These per-language thresholds produce 2-5% absolute F1 gains over a single global threshold by accounting for distributional differences in model outputs and cross-linguistic variation in reclamatory expression.

What carries the argument

Language-specific decision thresholds optimized via ROC analysis on model confidence scores.
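A minimal sketch of how such per-language thresholds could be chosen. It assumes the confidence score is the softmax probability of class 1 (as in the excerpt reproduced under Figure 5) and that the selection criterion is positive-class F1 on a validation split. Sweeping the cut-off over the observed scores visits the same operating points an ROC curve would. The function names, the tie-breaking rule, and the toy records are illustrative assumptions, not the authors' code.

```python
import math
from collections import defaultdict


def conf_score(logit_0, logit_1):
    """Softmax probability of class 1, shifted by the max logit for stability."""
    m = max(logit_0, logit_1)
    e0, e1 = math.exp(logit_0 - m), math.exp(logit_1 - m)
    return e1 / (e0 + e1)


def f1_at(scores, labels, threshold):
    """Positive-class F1 when predicting 1 for every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels):
    """Try each observed score as a cut-off; ties go to the lowest cut-off."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(scores, labels, t))


def per_language_thresholds(records):
    """records: iterable of (language, score, label) from a validation split."""
    by_lang = defaultdict(lambda: ([], []))
    for lang, score, label in records:
        by_lang[lang][0].append(score)
        by_lang[lang][1].append(label)
    return {lang: best_threshold(s, y) for lang, (s, y) in by_lang.items()}


# Made-up validation scores: the Italian scores sit lower than the English ones,
# so a single global cut-off of 0.5 would miss every Italian positive.
validation = [
    ("en", 0.90, 1), ("en", 0.70, 1), ("en", 0.60, 0), ("en", 0.20, 0),
    ("it", 0.40, 1), ("it", 0.35, 1), ("it", 0.30, 0), ("it", 0.10, 0),
]
thresholds = per_language_thresholds(validation)
```

Because the thresholds are applied to stored scores, changing them requires no forward pass through the model, which is the sense in which the step avoids retraining.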

If this is right

  • Per-language threshold tuning delivers measurable gains in multilingual classification without additional training compute.
  • Optimal decision boundaries differ across languages because of distinct patterns in how reclamatory usage is expressed in each.
  • The combination of augmentation, undersampling, and threshold adjustment addresses class imbalance in low-resource settings for rare linguistic categories.
  • The full pipeline, including code and setup, supports direct replication and extension to similar cross-lingual detection tasks.
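The dynamic undersampling named in the third point can be sketched as follows. The excerpt under Figure 4 mentions a 1:3 ratio; reading that as minority:majority, treating label 1 as the minority class, and redrawing the majority subset at every epoch are assumptions made here for illustration.

```python
import random


def undersample_epoch(examples, ratio=3, seed=None):
    """Keep every minority example and draw at most `ratio` times as many
    majority examples at random, so each epoch sees a different subset.

    examples: list of (text, label); label 1 is assumed to be the minority class.
    """
    rng = random.Random(seed)
    minority = [ex for ex in examples if ex[1] == 1]
    majority = [ex for ex in examples if ex[1] == 0]
    k = min(len(majority), ratio * len(minority))
    subset = minority + rng.sample(majority, k)
    rng.shuffle(subset)
    return subset


# Hypothetical imbalanced corpus: 10 positives, 100 negatives.
data = [(f"pos-{i}", 1) for i in range(10)] + [(f"neg-{i}", 0) for i in range(100)]
for epoch in range(3):
    epoch_data = undersample_epoch(data, ratio=3, seed=epoch)
    # ... one training epoch over epoch_data would go here ...
```

Resampling per epoch lets the model eventually see most majority examples while each individual epoch stays close to the target ratio.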

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same per-language threshold approach could be tested on other multilingual sentiment or stance detection problems where model confidence varies systematically by language.
  • Dynamic threshold selection based on recent data batches might further adapt the system to evolving usage patterns in online discourse.
  • Extending the evaluation to additional languages or different categories of reclaimed terms would test whether the observed variation in thresholds is a general property of multilingual models.

Load-bearing premise

Back-translation via GPT-4o-mini preserves both semantic content and the original class distribution ratios without introducing new biases that would change whether a slur instance counts as reclamatory.
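The class-ratio half of this premise holds by construction if every augmented copy inherits its source label, which the sketch below illustrates. It assumes each tweet is rendered into the two other task languages; `fake_translate` is a hypothetical placeholder for a real translation call, and the records are made up. The sketch says nothing about the semantic half of the premise, which cannot be checked by counting labels.

```python
from collections import Counter

LANGS = ("en", "es", "it")


def augment(records, translate):
    """records: list of (text, lang, label). Each record is kept and also
    rendered into the other two languages with its label unchanged."""
    out = []
    for text, lang, label in records:
        out.append((text, lang, label))
        for target in LANGS:
            if target != lang:
                out.append((translate(text, lang, target), target, label))
    return out


def class_ratio(records):
    """Fraction of records carrying each label."""
    counts = Counter(label for _, _, label in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def fake_translate(text, src, tgt):
    # Placeholder standing in for a real translation service (hypothetical).
    return f"[{src}->{tgt}] {text}"


original = [("a", "en", 1), ("b", "es", 0), ("c", "it", 0), ("d", "en", 0)]
augmented = augment(original, fake_translate)
```

In this sketch the pooled label ratio is unchanged, but every language ends up with the pooled distribution, so per-language ratios need not match those of the original data.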

What would settle it

Applying the language-specific thresholds versus a single global threshold to a fresh held-out test set of tweets and finding no F1 improvement or a reversal would falsify the claimed benefit.
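A hedged sketch of that comparison, assuming macro-averaged F1 pooled over all test tweets is the metric and that both sets of thresholds were fixed on validation data before the test set was touched. The records and threshold values are invented for illustration and say nothing about the paper's results.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the F1 scores for class 0 and class 1."""
    per_class = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def compare_thresholds(test_records, global_threshold, lang_thresholds):
    """test_records: list of (language, score, label).
    Returns (macro-F1 with one global cut-off, macro-F1 with per-language cut-offs)."""
    labels = [y for _, _, y in test_records]
    global_preds = [int(s >= global_threshold) for _, s, _ in test_records]
    lang_preds = [int(s >= lang_thresholds[lang]) for lang, s, _ in test_records]
    return macro_f1(labels, global_preds), macro_f1(labels, lang_preds)


# Invented held-out scores and thresholds.
held_out = [("en", 0.80, 1), ("en", 0.30, 0), ("it", 0.45, 1), ("it", 0.20, 0)]
global_f1, per_lang_f1 = compare_thresholds(held_out, 0.5, {"en": 0.5, "it": 0.4})
```

A result with `per_lang_f1 <= global_f1` on genuinely fresh data would count against the claim; the toy numbers above are constructed to show the opposite case only.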

Figures

Figures reproduced from arXiv: 2605.13415 by Barathi Ganesh HB, Juuso Eronen, Michal Ptaszynski, Rene Melendez.

Figure 1: Multi-stage multilingual hate-speech classification framework with four sequential runs refining performance via data-driven model selection, augmentation, hyperparameter selection, 5-fold CV, MLM adaptation, and threshold calibration. RUN 1: Inductive transfer learning with optimal foundation model. RUN 2: Transductive transfer learning on optimal foundation model followed by Inductive transfer learning. …
Figure 2: Data distribution statistics: label imbalance in original dataset, augmented dataset after back-translation, and chi-square analysis of language-label associations.
A stratified 5-fold cross-validation framework was set up, which kept the same distribution of classes in the folds (80% training, 20% validation for each fold). Conventional machine learning baselines were then trained using the computed embed…
Figure 3: Fold Level Performance Metrics for Inductive Transfer Learning.
Training was conducted for a maximum of 10 epochs per fold using the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) with linear learning rate warmup over the first 10% of total training steps, followed by linear decay to zero over remaining steps. The loss function was weighted cross entropy, computed as $L = -[w_0 \log(p_0) + w_1 \log(p_1)]$. In wh…
Figure 4: Transductive Transfer Learning: Parameter VS Validation Loss
Following MLM adaptation, the finetuned model was saved and subsequently used as the initialization for downstream finetuning task. This downstream finetuning pipeline was identical to run 1 where dynamic undersampling (1:3 ratio), Optuna hyperparameter optimization (50 trials, TPE sampler, MedianPruner), 5-fold stratified cross validation, and …
Figure 5: Inductive Transfer Learning on Transductive Model. Impact on F1 Score with respect to the parameters.
3.5. Language-Specific Threshold Refinement and Prediction Reclassification
Both run 1 and run 2 models produced continuous confidence scores via softmax normalization of the final layer logits $\text{conf\_score} = \exp(\text{logit}_1) / (\exp(\text{logit}_0) + \exp(\text{logit}_1))$, where $\text{logit}_0$ and $\text{logit}_1$ denote the class-specific logi…
Figure 6: Threshold Analysis: Language-specific Optimal Thresholds
The predictions of run 1 model were reclassified by means of learned language specific thresholds giving rise to run 3. In the same way, reclassification of run 2 predictions was done yielding run 4. The refining of the thresholds is a very crucial post-prediction optimization step which does not require extra computational power. This step usually r…
Figure 7: Final Test Set Results of Submitted Runs 1 - 4.
The integration of domain knowledge through MLM (RUN 2) therefore gave language-dependent results, showing that MLM adaptation is not universally advantageous across multilingual contexts. While English performance showed a marginal improvement, Spanish and Italian displayed more variable responses to MLM pre-training. It would appear that morphologically ric…
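The training recipe quoted under Figure 3 (weighted cross entropy, linear warmup over the first 10% of steps, then linear decay to zero) can be sketched in plain Python. The excerpt is truncated before it defines the symbols, so reading the loss as the usual weighted cross entropy, where only the true-class term contributes for a given example, is an assumption. The class weights and base learning rate below are hypothetical values.

```python
import math


def weighted_cross_entropy(probs, target, weights):
    """Per-example loss: -w[target] * log(p[target]).

    probs: predicted class probabilities (p_0, p_1); target: true class index;
    weights: per-class weights (w_0, w_1).
    """
    return -weights[target] * math.log(probs[target])


def linear_warmup_decay(step, total_steps, base_lr, warmup_frac=0.1):
    """Learning rate at `step`: linear ramp from 0 to base_lr during warmup,
    then linear decay back to 0 at total_steps."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


# Hypothetical values: up-weight the rarer class 1, 100 optimizer steps in total.
loss = weighted_cross_entropy(probs=(0.5, 0.5), target=1, weights=(1.0, 3.0))
schedule = [linear_warmup_decay(s, 100, base_lr=2e-5) for s in range(101)]
```

Up-weighting the rarer class makes each of its errors cost more, which complements undersampling as a second handle on class imbalance.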
Original abstract

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre-training, RUN 3 and RUN 4 are previous predictions refined via language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a multi-stage framework for detecting reclaimed slurs in English, Spanish, and Italian tweets. It evaluates eight multilingual embedding models and selects XLM-RoBERTa, augments the training data threefold via GPT-4o-mini back-translation while claiming to preserve semantics and class ratios, applies inductive transfer learning with dynamic undersampling, optionally adds masked language modeling pre-training, and refines outputs using language-specific decision thresholds derived from ROC analysis. The central empirical claim is that these thresholds yield a 2-5% absolute F1 improvement without retraining.

Significance. If the reported gains are reproducible and the augmentation preserves label validity, the work supplies a practical, lightweight adaptation technique for handling cross-lingual variation and imbalance in pragmatically nuanced classification tasks. The public GitHub release of code and setup is a clear strength that supports reproducibility. The contribution remains primarily task-specific rather than advancing general continual-learning or multilingual modeling theory.

major comments (2)
  1. [Abstract] The manuscript asserts a 2-5% absolute F1 improvement from language-specific threshold refinement yet supplies no baseline F1 scores, confidence intervals, ablation tables, or statistical tests. Without these quantities the magnitude, reliability, and attribution of the gain cannot be verified.
  2. [Abstract] The claim that GPT-4o-mini back-translation 'preserves semantic content and class distribution ratios' is presented without human validation, semantic-similarity metrics, or label-consistency checks. Because reclamation is a context-, speaker-, and culture-dependent pragmatic property, any systematic shift in label semantics would bias the base classifier’s confidence scores and render the subsequent ROC-derived thresholds artifactual rather than reflective of genuine cross-lingual variation.
minor comments (2)
  1. The title foregrounds 'Continual Learning' while the described pipeline centers on inductive transfer learning and post-hoc threshold tuning; an explicit mapping between the two concepts would reduce reader confusion.
  2. [Abstract] The abstract states that eight models were evaluated systematically but does not report the macro-F1 scores or ranking that justified selecting XLM-RoBERTa; adding this table would strengthen the model-selection narrative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying the empirical support for our claims and acknowledging limitations in the augmentation validation. Revisions will be incorporated to improve transparency.

  1. Referee: [Abstract] The manuscript asserts a 2-5% absolute F1 improvement from language-specific threshold refinement yet supplies no baseline F1 scores, confidence intervals, ablation tables, or statistical tests. Without these quantities the magnitude, reliability, and attribution of the gain cannot be verified.

    Authors: We agree that the abstract would be strengthened by including explicit baseline F1 scores to contextualize the reported 2-5% gains. The full manuscript (Section 4 and associated tables) presents performance for all four runs, with RUN 3/4 showing the threshold-refined results compared against the inductive transfer learning baseline (RUN 1). The 2-5% absolute F1 improvement is measured per language against these baselines. To address the comment directly, we will revise the abstract to report the specific baseline macro-F1 values per language alongside the improved scores. Ablation tables comparing the runs are already included in the experimental section. Confidence intervals and formal statistical tests (e.g., McNemar) were not computed in the original experiments; we will add them in the revision where feasible using the existing predictions, though this may require additional computation.
    revision: yes

  2. Referee: [Abstract] The claim that GPT-4o-mini back-translation 'preserves semantic content and class distribution ratios' is presented without human validation, semantic-similarity metrics, or label-consistency checks. Because reclamation is a context-, speaker-, and culture-dependent pragmatic property, any systematic shift in label semantics would bias the base classifier’s confidence scores and render the subsequent ROC-derived thresholds artifactual rather than reflective of genuine cross-lingual variation.

    Authors: We acknowledge that reclamation is a pragmatically nuanced phenomenon and that automated augmentation introduces risks of semantic drift. Class distribution ratios were preserved by design: each original instance was back-translated while retaining its original label, resulting in a balanced tripling of the corpus. Semantic preservation was assumed based on the quality of GPT-4o-mini translations, but no human validation, label-consistency checks, or semantic similarity metrics (e.g., embedding cosine or BLEU) were performed. We agree this constitutes a limitation for a context-dependent task. In the revision we will add an explicit limitations paragraph in the methodology or discussion section noting the absence of such validation and the potential for cultural shifts, while still reporting the empirical gains observed. We will not strengthen the preservation claim beyond what the data supports.
    revision: partial
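The McNemar test mentioned in the first response needs only the paired predictions that already exist. Below is a minimal sketch of the exact two-sided version, which compares two systems on the examples where exactly one of them is correct. The variable names and the toy predictions are hypothetical; nothing here reproduces the paper's numbers.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples that A gets right and B
    gets wrong, and c counts examples that A gets wrong and B gets right.
    """
    b = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a == y and p != y)
    c = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a != y and p == y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the systems never disagree in correctness
    k = min(b, c)
    # Under H0 each discordant pair favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy example: system B fixes 8 errors of system A and introduces none.
y_true = [1] * 10
run_a = [0] * 8 + [1] * 2
run_b = [1] * 10
b, c, p_value = mcnemar_exact(y_true, run_a, run_b)
```

The exact binomial form is used here because the number of discordant pairs in a small shared-task test set may be too low for the chi-square approximation to be reliable.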

Circularity Check

0 steps flagged

No significant circularity; empirical results from standard validation and test splits


The paper's derivation chain consists of model selection via cross-validation F1 scores, back-translation augmentation described as preserving class ratios, and language-specific threshold tuning via ROC analysis on held-out validation data, with final F1 gains measured on evaluation data. No equations or steps reduce the reported 2-5% improvement to a quantity defined by construction from the fitted thresholds themselves, nor do any load-bearing claims rely on self-citations or ansatzes that presuppose the target result. The framework is presented as externally reproducible via linked code, rendering the central claims self-contained against independent benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on the empirical effectiveness of standard transfer-learning and augmentation steps plus the validity of language-specific threshold tuning; no new entities are postulated.

free parameters (1)
  • language-specific decision thresholds
    Chosen via ROC analysis on validation data for each language to maximize F1.
axioms (1)
  • domain assumption: Back-translation preserves semantic content and class distribution ratios
    Invoked to justify tripling the training corpus with GPT-4o-mini translations.

pith-pipeline@v0.9.0 · 5823 in / 1260 out tokens · 84974 ms · 2026-05-20T21:27:31.441819+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
