arXiv: 2605.05159 · v1 · submitted 2026-05-06 · cs.CL · cs.AI · cs.LG

PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

Pith reviewed 2026-05-08 16:37 UTC · model grok-4.3

keywords: multilingual polarization detection · Gemma models · synthetic data augmentation · LoRA fine-tuning · ensemble methods · SemEval task · threshold tuning

The pith

Per-language fine-tuned Gemma models augmented with GPT-4o-mini synthetic data and threshold tuning reach a mean macro-F1 of 0.811 across 22 languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a system for binary polarization detection that processes text in 22 languages. Separate Gemma 3 models of 12B and 27B parameters are fine-tuned per language with LoRA on a combination of original and synthetic examples. Three synthetic generation methods are used, followed by embedding-based filtering and deduplication. Per-language threshold tuning on the development set and weighted ensembles of the two model sizes produce the reported performance, placing the system second overall with first place in three languages.
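The embedding-based deduplication step mentioned above can be sketched as a greedy near-duplicate filter. This is a minimal illustration only: the source does not state the embedding model, the similarity measure, or the cutoff, so the cosine measure, the `max_similarity=0.95` value, and the toy 2-dimensional vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(embeddings, max_similarity=0.95):
    """Greedy filter: keep an item only if it is not too similar to any
    item already kept. Returns the indices of the kept items.

    max_similarity is a hypothetical cutoff, not a value from the paper.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < max_similarity for j in kept):
            kept.append(i)
    return kept


# Toy embeddings standing in for sentence vectors of synthetic examples.
toy_embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.7, 0.7]]
kept_indices = deduplicate(toy_embeddings)
print(kept_indices)
```

In practice the vectors would come from a sentence encoder, and the quadratic comparison would likely be replaced by an approximate nearest-neighbour index for large synthetic pools.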

Core claim

Fine-tuning Gemma-3 models separately for each of 22 languages on real data plus synthetic polarized text generated by GPT-4o-mini, then forming per-language weighted ensembles and applying development-set threshold tuning, yields a mean macro-F1 of 0.811 while alternative architectures suffer 30-50% F1 drops on the test set.
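For reference, the headline metric can be written out as follows. This assumes the standard definition of macro-F1 for a binary task and assumes that "mean" denotes an unweighted average over the 22 languages; the source does not spell out either point. Here $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ are the true positives, false positives, and false negatives for class $c$, and $\ell$ indexes languages.

```latex
\mathrm{F1}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},
\qquad
\text{macro-F1} = \frac{1}{2}\left(\mathrm{F1}_0 + \mathrm{F1}_1\right),
\qquad
\overline{\text{macro-F1}} = \frac{1}{22}\sum_{\ell=1}^{22} \text{macro-F1}_{\ell}
```

Because both classes contribute equally, macro-F1 penalizes a classifier that neglects the minority class, which is one reason the decision threshold matters for this metric.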

What carries the argument

Per-language LoRA fine-tuning of Gemma-3 12B and 27B models on filtered synthetic data from GPT-4o-mini, combined via weighted ensembles with language-specific threshold selection.
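A weighted ensemble with a language-specific threshold can be sketched as below. The referee report notes that the exact weights and selection rules are not specified, so the convex-combination form, the language keys `lang_a` and `lang_b`, and every number in `LANGUAGE_CONFIG` are hypothetical placeholders rather than the system's settings.

```python
def ensemble_probability(p_12b, p_27b, weight_27b):
    """Convex combination of the two models' positive-class probabilities."""
    if not 0.0 <= weight_27b <= 1.0:
        raise ValueError("weight_27b must lie in [0, 1]")
    return (1.0 - weight_27b) * p_12b + weight_27b * p_27b


def predict(p_12b, p_27b, weight_27b, threshold):
    """Return 1 (polarized) if the ensembled probability reaches the threshold."""
    return int(ensemble_probability(p_12b, p_27b, weight_27b) >= threshold)


# Hypothetical per-language settings; keys and values are illustrative only.
LANGUAGE_CONFIG = {
    "lang_a": {"weight_27b": 0.6, "threshold": 0.45},
    "lang_b": {"weight_27b": 0.5, "threshold": 0.55},
}

# The same pair of model outputs can receive different labels per language.
predictions = {
    lang: predict(0.30, 0.60, cfg["weight_27b"], cfg["threshold"])
    for lang, cfg in LANGUAGE_CONFIG.items()
}
print(predictions)
```

Keeping the weight and threshold in one per-language record makes the two tuned quantities easy to report, which would address the reproducibility concern raised in the referee's minor comments.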

If this is right

  • Synthetic data augmentation combined with per-language specialization improves robustness in multilingual classification.
  • Threshold tuning on a development set delivers 2-4% F1 gains without retraining the models.
  • Architectures that excel on development data can still lose 30-50% F1 on test data, showing that generalization must be verified explicitly.
  • Ensemble selection of model size and synthetic strategy per language increases consistency across diverse languages.
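Threshold tuning without retraining, as described in the list above, amounts to sweeping a cutoff over fixed development-set probabilities and keeping the value that maximizes macro-F1. The sketch below assumes a simple grid from 0.05 to 0.95 and uses invented toy data; the source does not describe the search procedure actually used.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for classes 0 and 1."""
    scores = []
    for label in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def tune_threshold(dev_labels, dev_probs, candidates=None):
    """Return (threshold, macro-F1) maximizing dev-set macro-F1.

    Starts from the default 0.5 so the tuned value is never worse on dev.
    The candidate grid is an assumption, not the paper's procedure.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(5, 96)]
    best_threshold = 0.5
    best_score = macro_f1(dev_labels, [int(p >= 0.5) for p in dev_probs])
    for threshold in candidates:
        preds = [int(p >= threshold) for p in dev_probs]
        score = macro_f1(dev_labels, preds)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy development set for one language (invented values).
dev_labels = [0, 0, 0, 1, 1, 1]
dev_probs = [0.10, 0.20, 0.30, 0.40, 0.45, 0.80]

baseline = macro_f1(dev_labels, [int(p >= 0.5) for p in dev_probs])
best_threshold, best_score = tune_threshold(dev_labels, dev_probs)
print(baseline, best_threshold, best_score)
```

The gain measured this way is a development-set gain by construction; whether it carries over to the test set is exactly the question the referee raises, and it can only be checked on held-out labels.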

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic-augmentation pipeline could be tested on other multilingual tasks where labeled polarized examples are scarce.
  • Language-specific fine-tuning may capture polarization cues that a single shared model would miss.
  • Future experiments could measure how much performance depends on the particular LLM used to create the synthetic examples.
  • If the threshold-tuning gains hold only because the development set mirrors the test set, the reported ranking may not transfer to new domains.

Load-bearing premise

Synthetic data generated by GPT-4o-mini matches the distribution of real polarized text in every language, and thresholds tuned on the development set generalize to the unseen test set.

What would settle it

Evaluating the identical pipeline on a new test collection drawn from different sources and observing whether mean macro-F1 falls substantially below 0.811 or whether removing the synthetic data causes comparable degradation.

Figures

Figures reproduced from arXiv: 2605.05159 by Srikar Kashyap Pulipaka.

Figure 1: Overview of our multilingual polarization detection pipeline, from real-data splitting and synthetic …
Abstract

We present our system for SemEval-2026 Task 9: Multilingual Polarization Detection, a binary classification task spanning 22 languages. Our approach fine-tunes separate Gemma 3 models (12B and 27B parameters) per language using Low-Rank Adaptation (LoRA), augmented with synthetic data generated by a large language model (LLM). We employ three synthetic data strategies (direct generation, paraphrasing, and contrastive pair creation) using GPT-4o-mini, with a multi-stage quality filtering pipeline including embedding-based deduplication. We find that per-language threshold tuning on the development set yields 2 to 4% F1 improvements without retraining. We also use weighted ensembles of 12B and 27B model predictions with per-language strategy selection. Our final system achieves a mean macro-F1 of 0.811 across all 22 languages, ranking 2nd overall of the participating teams, with 1st place finishes in 3 languages and top-3 in 8 languages. We also find that alternative architectures (XLM-RoBERTa, Qwen3) that showed strong development set performance suffered 30 to 50% F1 drops on the test set, highlighting the importance of generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the PSK team's submission to SemEval-2026 Task 9 on multilingual polarization detection across 22 languages. It fine-tunes separate Gemma-3 (12B and 27B) models per language with LoRA, augments training data using three synthetic strategies (direct generation, paraphrasing, contrastive pairs) from GPT-4o-mini plus embedding-based deduplication, applies per-language threshold tuning on the development set, and uses weighted ensembles with per-language strategy selection. The system reports a mean macro-F1 of 0.811, 2nd place overall, 1st in 3 languages and top-3 in 8, while noting that XLM-RoBERTa and Qwen3 suffer 30-50% F1 drops on the test set.

Significance. If the synthetic data augmentation and per-language threshold tuning are shown to generalize, the work supplies a competitive baseline for multilingual polarization detection and useful evidence of generalization fragility across architectures. The concrete ranking and cross-architecture comparison are strengths. Without ablations isolating the contributions of the synthetic data and tuning, however, the result remains primarily a competition outcome rather than a source of transferable methodological insight.

major comments (2)
  1. [Abstract and Results] The reported 2-4% F1 gains from per-language threshold tuning on the development set (Abstract) are load-bearing for the final ranking, yet no ablation or analysis is provided to show that the chosen thresholds generalize to the unseen test distribution rather than capturing dev-set artifacts. This is especially salient given the 30-50% drops observed for other models.
  2. [Method and Experiments] No ablation is presented comparing Gemma-3 performance with versus without the GPT-4o-mini synthetic data (direct generation, paraphrasing, contrastive pairs plus deduplication). The central claim attributes success to this augmentation pipeline, but the assumption that the synthetic distribution matches real polarized text across all 22 languages remains untested.
minor comments (2)
  1. [Method] Exact ensemble weights, per-language strategy selection rules, and the precise criteria/thresholds used in the multi-stage quality filtering pipeline are not specified, hindering reproducibility.
  2. [Results] Performance numbers lack error bars, confidence intervals, or statistical significance tests comparing the final system to baselines or ablated variants.
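One way to supply the confidence intervals requested in the second minor comment is a percentile bootstrap over test examples. The sketch below is an illustration under assumptions: the resample count, the 95% level, the fixed seed, and the toy labels are all invented, and the paper reports no such analysis.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for classes 0 and 1."""
    scores = []
    for label in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro-F1.

    Resamples (label, prediction) pairs with replacement and reads the
    alpha/2 and 1 - alpha/2 quantiles off the sorted scores.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(macro_f1([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy test set: 20 negatives and 20 positives with a few errors each.
y_true = [0] * 20 + [1] * 20
y_pred = [0] * 16 + [1] * 4 + [1] * 15 + [0] * 5

point_estimate = macro_f1(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred)
print(point_estimate, lower, upper)
```

Comparing two systems would call for a paired variant that resamples the same indices for both sets of predictions, so that the interval reflects the difference rather than each score separately.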

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback emphasizing the value of ablations for establishing transferable insights beyond the competition results. We address each major comment below and will revise the manuscript to incorporate additional discussion and analysis of limitations.

point-by-point responses
  1. Referee: [Abstract and Results] The reported 2-4% F1 gains from per-language threshold tuning on the development set (Abstract) are load-bearing for the final ranking, yet no ablation or analysis is provided to show that the chosen thresholds generalize to the unseen test distribution rather than capturing dev-set artifacts. This is especially salient given the 30-50% drops observed for other models.

    Authors: We agree that an explicit ablation comparing tuned versus untuned thresholds on the test set would strengthen claims of generalization. Thresholds were selected exclusively on the development set under competition constraints where test labels were unavailable. Our final test performance (mean macro-F1 0.811, 2nd place) and the 30-50% drops for XLM-RoBERTa and Qwen3 provide indirect support for the approach. In the revised manuscript we will add a subsection analyzing the per-language threshold values, their variance, and a discussion of possible dev-set artifacts, while acknowledging the absence of a direct test-set ablation.
    revision: partial

  2. Referee: [Method and Experiments] No ablation is presented comparing Gemma-3 performance with versus without the GPT-4o-mini synthetic data (direct generation, paraphrasing, contrastive pairs plus deduplication). The central claim attributes success to this augmentation pipeline, but the assumption that the synthetic distribution matches real polarized text across all 22 languages remains untested.

    Authors: We acknowledge that a controlled ablation isolating the synthetic data pipeline on the test set is missing. Internal development runs indicated performance gains from the three strategies and embedding-based deduplication, motivating their use. However, a full test-set comparison was not performed within the shared-task timeline. We will revise the manuscript to include a limitations paragraph explicitly addressing the untested synthetic-to-real distribution match across languages, along with qualitative examples of generated data and the quality filtering steps to aid reader assessment of the method.
    revision: partial

standing simulated objections not resolved
  • Full quantitative ablations isolating the test-set contributions of per-language threshold tuning and the GPT-4o-mini synthetic data augmentation remain outstanding; these experiments were not conducted during the original competition timeline due to resource constraints.

Circularity Check

0 steps flagged

No circularity: empirical system results on external test set

full rationale

The paper is a shared-task systems description with no mathematical derivation chain, equations, or first-principles claims. All reported outcomes (mean macro-F1 of 0.811, per-language rankings, 2-4% F1 gains from threshold tuning) are direct empirical measurements on the organizer-provided test set after standard dev-set hyperparameter choices. Synthetic data generation, LoRA fine-tuning, embedding deduplication, and per-language threshold selection are procedural steps whose outputs are evaluated externally; none reduces to its own inputs by construction, self-definition, or self-citation. The absence of any fitted parameter presented as an independent prediction, and of any load-bearing self-citation chain, confirms that the reported results are self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on standard machine-learning assumptions about data augmentation quality and per-language adaptation rather than new axioms or entities. The one fitted element is the set of per-language decision thresholds.

free parameters (1)
  • per-language decision thresholds
    Tuned on development set to obtain 2-4% F1 gain; assumed to generalize to test data.
axioms (2)
  • domain assumption: Synthetic data produced by GPT-4o-mini is of high quality and matches the distribution of real polarized text in each of the 22 languages.
    Invoked to justify the three augmentation strategies without independent validation reported in the abstract.
  • domain assumption: Per-language fine-tuning with LoRA captures language-specific polarization signals better than a single multilingual model.
    Core premise behind training separate models and strategy selection per language.
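As background for the LoRA assumption above, the standard low-rank update can be written as follows. This is the general formulation of the method, not a statement of this system's configuration: the rank $r$, the scaling factor $\alpha$, and the choice of adapted layers are not given in the source. $W_0$ is a frozen pretrained weight matrix, $x$ the layer input, and only $A$ and $B$ are trained.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Each adapted matrix therefore adds $r(d + k)$ trainable parameters instead of $dk$, which is what makes training a separate adapter per language and per model size tractable.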


