arxiv: 2604.09625 · v1 · submitted 2026-03-18 · 💻 cs.CL

Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

Pith reviewed 2026-05-15 09:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords hate speech detection · cross-lingual · LLM synthetic annotations · web-scale pretraining · ensemble learning · multilingual NLP · low-resource languages

The pith

Combining web-scale unlabelled data with ensemble LLM annotations substantially improves cross-lingual hate speech detection for small models and low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large amounts of unlabelled web text, together with labels generated by multiple large language models, can improve hateful language detection across languages. The authors first continue pre-training BERT models on crawled texts in English, German, Spanish, and Vietnamese, which raises average macro-F1 by roughly 3% across sixteen benchmarks and helps most in low-resource settings. They then have four open-source LLMs annotate additional data and combine the outputs with three ensemble methods, finding that a LightGBM meta-learner works best. Fine-tuning on these synthetic labels lifts pooled F1 by 11% for a 1B-parameter model but by only 0.6% for a 14B-parameter model. The approach matters because it offers a practical route to stronger detectors without large new sets of human labels.

Core claim

Continued pre-training on web-scale unlabelled data from four languages yields an average macro-F1 gain of approximately 3% across sixteen benchmarks. Synthetic annotations produced by an ensemble of four open-source LLMs via a LightGBM meta-learner allow fine-tuning that boosts a 1B-parameter model by 11% pooled F1 but only 0.6% for a 14B-parameter model, with the approach proving most useful for smaller models and low-resource languages.
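
To make the two metrics in this claim concrete, the sketch below computes macro-F1 for binary labels and contrasts two ways of aggregating over benchmarks. It assumes that "pooled F1" means macro-F1 computed once over the concatenated predictions of all test sets, which this summary does not confirm; the function names are illustrative and not taken from the paper's code.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for a single class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def average_macro_f1(benchmarks):
    """Mean of per-benchmark macro-F1; `benchmarks` is a list of (y_true, y_pred)."""
    return sum(macro_f1(t, p) for t, p in benchmarks) / len(benchmarks)


def pooled_macro_f1(benchmarks):
    """Macro-F1 over all examples concatenated (assumed meaning of 'pooled')."""
    all_true = [t for ts, _ in benchmarks for t in ts]
    all_pred = [p for _, ps in benchmarks for p in ps]
    return macro_f1(all_true, all_pred)
```

Averaging per-benchmark scores weights every benchmark equally, while pooling weights every example equally, so the two numbers can diverge when test sets differ in size or class balance.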

What carries the argument

LLM ensemble for synthetic annotation using LightGBM meta-learner combined with continued masked language modelling pre-training on web-crawled texts.
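
The two untrained ensemble rules that the LightGBM meta-learner is compared against can be sketched as below, following the descriptions in the paper's extracted method text (a hard vote needing at least two of four models, and a comparison of mean class probabilities). The 0.5 threshold, the assumption that each model's two class probabilities sum to 1, and all function names are illustrative choices, not details confirmed by the paper. The meta-learner itself is not shown; only the eight-dimensional feature vector it would consume is built.

```python
HATE, NOT_HATE = 1, 0


def vote_label(hate_probs, threshold=0.5, min_votes=2):
    """Majority voting: threshold each model's hate probability to a hard
    label, then mark the text as hate if at least `min_votes` models agree."""
    votes = sum(1 for p in hate_probs if p >= threshold)
    return HATE if votes >= min_votes else NOT_HATE


def mean_label(hate_probs):
    """Mean averaging: average each class probability across models and
    assign the class with the higher mean."""
    mean_hate = sum(hate_probs) / len(hate_probs)
    mean_not_hate = 1.0 - mean_hate  # assumes binary probabilities sum to 1
    return HATE if mean_hate > mean_not_hate else NOT_HATE


def meta_features(hate_probs):
    """Feature vector for a meta-learner: (not-hate, hate) probability per
    model, i.e. eight values for four models."""
    features = []
    for p in hate_probs:
        features.extend([1.0 - p, p])
    return features


# Hypothetical probabilities from four annotator models for one text.
example = [0.9, 0.6, 0.1, 0.1]
print(vote_label(example), mean_label(example), len(meta_features(example)))
```

With these toy probabilities the two rules disagree: two models clear the threshold, so voting says hate, while the mean hate probability stays below one half. A trained meta-learner can, in principle, resolve such cases by weighting the models unequally.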

If this is right

  • Smaller models receive large gains from the synthetic labels while larger models receive only modest gains.
  • Low-resource languages benefit more from the continued pre-training step than high-resource languages.
  • The LightGBM meta-learner ensemble consistently outperforms mean averaging and majority voting for creating usable synthetic labels.
  • The overall method improves detection performance across multiple languages without requiring additional human annotation effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same web-plus-ensemble recipe could be applied to other low-resource NLP tasks such as sentiment or toxicity detection.
  • Gains may shrink if the web crawl distribution drifts far from the target social-media domain over time.
  • Practitioners could test whether mixing a small amount of human labels with the synthetic set further stabilizes the small-model improvements.

Load-bearing premise

The web-crawled texts from OpenWebSearch.eu match the distribution of the sixteen benchmark datasets and the LLM-generated labels contain no systematic biases that would hurt real-world performance.

What would settle it

If a model trained on the synthetic labels and web pre-training shows no gain or a loss in F1 score on a new collection of real social-media posts collected after the original crawl, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.09625 by Dang H. Dang, Jelena Mitrović, Michael Granitzer.

Figure 1
Figure 1 provides a high-level overview of the pipeline. The following subsections detail each component. 3.1. OpenWebSearch.eu Data Collection. We used OWS (Granitzer et al., 2024) and OWI (Hendriksen et al., 2024) to collect large-scale web texts in English, German, Spanish, and Vietnamese. To increase the proportion of conversational and user-generated content, we filtered the OWS index to retain only URLs whos…
Original abstract

We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu (OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1), but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that the combination of web-scale unlabelled data and LLM-ensemble annotations is the most valuable for smaller models and low-resource languages.
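
The first strategy in the abstract, continuing masked language modelling on unlabelled text, relies on corrupting input tokens and asking the model to recover them. The sketch below shows only that corruption step. It assumes the standard BERT recipe (select about 15% of tokens; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged), since the abstract does not state the masking hyperparameters; whitespace tokens and the function name are likewise illustrative.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modelling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed on selected positions.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # most selected tokens become the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # some become a random token
        # otherwise the original token is kept but still predicted
    return inputs, labels


# Toy usage with a made-up sentence and vocabulary.
sentence = "this forum thread is getting heated".split()
masked, targets = mask_tokens(sentence, vocab=sentence, mask_prob=0.15, seed=3)
print(masked, targets)
```

In practice a subword tokenizer and a library data collator would replace this hand-rolled version; the point here is only to show what "continuing masked language modelling" asks of the model before supervised fine-tuning starts.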

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that continued pre-training on large-scale unlabeled web data crawled via OpenWebSearch.eu in four languages yields an average macro-F1 gain of ~3% over standard baselines across sixteen hate-speech benchmarks (with stronger gains in low-resource settings), and that fine-tuning smaller models on synthetic labels produced by an ensemble of four open-source LLMs (via mean averaging, majority voting, or LightGBM meta-learner) further improves performance, most notably +11% pooled F1 for Llama3.2-1B while only +0.6% for Qwen2.5-14B.

Significance. If the empirical claims hold after addressing the noted gaps, the work would be significant for low-resource multilingual NLP: it demonstrates a practical route to leverage abundant web-scale data and LLM-generated labels to boost hate-speech detectors without requiring large amounts of human-annotated data, with particular value for smaller models and under-resourced languages.

major comments (2)
  1. [Abstract and §4 (Experimental Setup)] The central empirical claims (abstract: ~3% average macro-F1 from continued pre-training; +11% pooled F1 for Llama3.2-1B from synthetic labels) rest on the untested assumption that the OWS corpus is distributionally close to the sixteen evaluation benchmarks. No domain-similarity statistics (token overlap, embedding MMD, or perplexity under a held-out LM) are reported in the data or experimental sections, so observed gains could arise from accidental overlap rather than the proposed pipeline; this is load-bearing for the low-resource-language results highlighted in the abstract.
  2. [Abstract and Results] The abstract reports average gains and specific improvements without stating the exact baselines, number of runs, or error bars. This omission makes it impossible to assess whether the +3% macro-F1 and +11% pooled-F1 figures are statistically reliable or could be explained by variance or data leakage.
minor comments (2)
  1. [§3 (Annotation Pipeline)] Clarify the precise implementation details of the three ensemble strategies (mean averaging, majority voting, LightGBM meta-learner) and how label quality was validated against any human-annotated subsets.
  2. [§4] Add a table or appendix listing all sixteen benchmarks with language, size, and source information to allow readers to evaluate the low-resource claims.
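
The domain-similarity statistics requested in major comment 1 could be computed along the following lines. This is a minimal sketch, not the paper's method: whitespace tokenisation, the Jaccard formulation of token overlap, the RBF kernel and its `gamma` value, and the idea that `xs` and `ys` are sentence embeddings from some encoder are all assumptions made for illustration.

```python
import math


def vocab_jaccard(corpus_a, corpus_b):
    """Token overlap as Jaccard similarity between two corpus vocabularies."""
    vocab_a = {tok for text in corpus_a for tok in text.lower().split()}
    vocab_b = {tok for text in corpus_b for tok in text.lower().split()}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between two
    samples of embedding vectors; 0 when the samples are identical."""
    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, gamma) for p in ps for q in qs)
        return total / (len(ps) * len(qs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)
```

A low overlap or a large discrepancy between the crawl and a benchmark would suggest the gains come from general language adaptation rather than near-duplicate content, while very high overlap would make the leakage concern more pressing.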

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical claims. We address each major comment below and will revise the manuscript to improve clarity and address the noted gaps.

  1. Referee: [Abstract and §4 (Experimental Setup)] The central empirical claims (abstract: ~3% average macro-F1 from continued pre-training; +11% pooled F1 for Llama3.2-1B from synthetic labels) rest on the untested assumption that the OWS corpus is distributionally close to the sixteen evaluation benchmarks. No domain-similarity statistics (token overlap, embedding MMD, or perplexity under a held-out LM) are reported in the data or experimental sections, so observed gains could arise from accidental overlap rather than the proposed pipeline; this is load-bearing for the low-resource-language results highlighted in the abstract.

    Authors: We agree that explicit domain-similarity statistics would strengthen the paper by helping rule out overlap as an alternative explanation. The OWS corpus is a broad web crawl in the target languages while the benchmarks are curated hate-speech collections, but without metrics this remains an untested assumption. In the revision we will add token-overlap rates, embedding MMD distances, and perplexity comparisons (using a held-out LM) between OWS and each benchmark, plus a brief discussion of potential leakage in the limitations section. revision: yes

  2. Referee: [Abstract and Results] The abstract reports average gains and specific improvements without stating the exact baselines, number of runs, or error bars. This omission makes it impossible to assess whether the +3% macro-F1 and +11% pooled-F1 figures are statistically reliable or could be explained by variance or data leakage.

    Authors: We accept this criticism. The abstract will be updated to name the baselines (standard fine-tuned BERT models without OWS continued pre-training), state that results are averaged over five random seeds, and include standard deviations or error bars for the reported gains. The results section will be expanded with per-run variances and paired statistical significance tests to demonstrate reliability. revision: yes
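
The seed-level reporting promised in the second response can be sketched as follows. This is an illustration under assumptions: the paired t-test is one common choice of paired significance test (the rebuttal does not name one), and the function names are hypothetical.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(baseline, treated):
    """Paired t statistic for per-seed scores of two systems.

    Both lists must be aligned by seed, so that baseline[i] and treated[i]
    come from runs that share the same random seed.
    """
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two aligned score lists with at least 2 seeds")
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

With five seeds the statistic has four degrees of freedom, so a two-sided test at the 5% level requires an absolute value above roughly 2.776. Reporting the mean and standard deviation from `summarize` next to each gain lets readers judge the effect size as well as its significance.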

Circularity Check

0 steps flagged

No circularity in empirical pipeline

The paper is a purely empirical study describing continued pre-training on OpenWebSearch.eu crawls followed by supervised fine-tuning and LLM-ensemble annotation strategies. No mathematical derivations, equations, or self-referential predictions appear in the text. All reported gains are measured against independent baselines on sixteen external benchmark datasets. No self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are present. The central results rest on experimental outcomes rather than any reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper builds on standard transfer learning and data augmentation techniques without introducing new free parameters or entities.

axioms (2)
  • domain assumption: Masked language modeling on web data improves representations for downstream hate speech classification.
    Invoked in the continued pre-training strategy.
  • domain assumption: LLM-generated labels can serve as effective proxies for human annotations in training hate speech detectors.
    Central to the synthetic annotation approach.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

  1. [1]

    While raw web text can be collected at scale, annotating it remains costly (Ross et al., 2017) and human annotators inevitably introduce subjective biases (Caselli et al., 2021)

    Introduction. A central bottleneck in building robust detectors for hateful and offensive language is the scarcity of high-quality labelled training data (Vidgen et al., 2020; Fortuna and Nunes, 2018). While raw web text can be collected at scale, annotating it remains costly (Ross et al., 2017) and human annotators inevitably introduce subjectiv...

  2. [2]

    However, existing work in this space remains narrow in scope, typically covering a single language and lacking rigorous comparison with models trained on human labels

    and have therefore been explored as automated annotators (Hartvigsen et al., 2022). However, existing work in this space remains narrow in scope, typically covering a single language and lacking rigorous comparison with models trained on human labels. Large web crawls such as OpenWebSearch.eu (OWS) (Granitzer et al., 2024) and OpenWebIndex (OWI) (He...

  3. [3]

    We address this gap along two axes:

    make billions of multilingual pages available, yet how to best leverage them for hate-speech detection is an open question. We address this gap along two axes:

  4. [4]

    We investigate whether domain-adaptive continued pre-training on large unlabelled OWS corpora improves the downstream performance of BERT-family models

  5. [5]

    We study whether ensemble-based LLM annotation can replace or supplement human labelling for multilingual hate-speech detection. Terminology. Throughout this paper, continued pre-training denotes an additional masked-language-modelling adaptation step applied to a general-purpose pre-trained BERT using domain-relevant but unlabelled OWS texts. This precedes th...

  6. [6]

    Related Work. Domain-specific continued pre-training for hate speech. Gururangan et al. (2020) established that domain-adaptive pre-training (DAPT), continuing masked language modelling on in-domain text before fine-tuning, consistently improves downstream performance. HateBERT (Caselli et al.,

  7. [7]

    Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

    applied DAPT to abusive language using 1.5M English Reddit posts, while Belay et al. (2025) extended it multilingually with AfroXLMR-Social across 19 African languages (F1 gains of 1–30%). None of these works use web-crawled data at OWS scale or cov... (Footnote 1: Repository includes all training scripts, annotation prompts, and evaluation code.)

  8. [8]

    Our work instead applies the ensemble paradigm to the annotation phase, combining token-probability outputs from four LLMs

    built multimodal and Arabic BERT ensembles respectively for downstream classification. Our work instead applies the ensemble paradigm to the annotation phase, combining token-probability outputs from four LLMs. LLM-based annotation and synthetic data. Zhu et al. (2023) found ChatGPT achieves 60.9% average accuracy relabelling social-computing datasets, whi...

  9. [9]

    The following subsections detail each component

    Methodology and Setup. Figure 1 provides a high-level overview of the pipeline. The following subsections detail each component. 3.1. OpenWebSearch.eu Data Collection. We used OWS (Granitzer et al., 2024) and OWI (Hendriksen et al., 2024) to collect large-scale web texts in English, German, Spanish, and Vietnamese. To increase the proportion of conversa...

  10. [10]

    Majority Voting (Vote): Each model’s probability is thresholded to a hard label; a text is marked Hate if at least two of the four models vote for it

  11. [11]

    Mean Averaging (Mean): For each class, the average probability across all models is computed; the class with the higher mean is assigned

  12. [12]

    LightGBM Meta-Learner (LGB): A LightGBM classifier (Ke et al., 2017) is trained on the eight-dimensional probability vectors (two classes × four models) using the seven human-labelled training sets as supervision. Unlike Vote and Mean, which treat all models equally, LGB learns to weight each annotator differentially based on its reliability, and discovers confid...

  13. [13]

    Human-labeled training used the 7-Set (excluding Spanish)

    to fine-tune Llama3.2-1B and Qwen2.5-14B. Human-labeled training used the 7-Set (excluding Spanish). The synthetic subset contains 240,647 texts: 125,617 German, 108,375 English, and 6,655 Vietnamese. After ensemble annotation, Vote assigned 4,717 texts as hate, Mean 3,994, and LGB 3,122, confirming the pronounced class imbalance. 3.4.3. Computational Resource...

  14. [14]

    RQ1: OWS Continued Pre-Training with BERT Table 3 reports macro-F1 across all sixteen test sets

    Results. 4.1. RQ1: OWS Continued Pre-Training with BERT. Table 3 reports macro-F1 across all sixteen test sets. All four OWS-continually-pre-trained models outperform both BERT and HateBERT on every per-language average and on the overall 16-set average under multilingual training (7-Set Mix, 7-Set + Synth., 16-Mix). Monolingual OWS models additionally b...

  15. [15]

    Answers to Research Questions. RQ1: Value of unlabelled web data. OWS continued pre-training reliably improves BERT-family models, especially in multilingual low-data settings

    Discussion. 5.1. Answers to Research Questions. RQ1: Value of unlabelled web data. OWS continued pre-training reliably improves BERT-family models, especially in multilingual low-data settings. The effect is largest for Ows4L when training data is scarce (+3% average F1 under 7-Set) and diminishes as supervised data grows (+1% under 16-Mix). Neither monol...

  16. [16]

    Our key findings are:

    Conclusion. We presented a large-scale benchmark study examining how unlabelled web data and ensemble LLM annotations can improve multilingual hate-speech detection across sixteen benchmarks in four languages. Our key findings are:

  17. [17]

    Domain-adaptive continued pre-training on OWS data provides consistent F1 gains, with the multilingual Ows4L achieving the best average macro-F1 (77.0%) across all configurations

  18. [18]

    The LightGBM ensemble outperforms mean averaging and majority voting and is the only synthetic strategy that reliably avoids regressions on unseen languages

  19. [19]

    The benefit of synthetic data scales inversely with model capacity: +10.6% pooled F1 for Llama3.2-1B, but only +0.6% for Qwen2.5-14B

  20. [20]

    These findings motivate future work on scaling OWS data, adopting updated models, and developing richer annotation strategies

    Severe class imbalance in OWS-derived synthetic sets remains a critical bottleneck, especially for low-resource languages. These findings motivate future work on scaling OWS data, adopting updated models, and developing richer annotation strategies. Acknowledgements. This work has received funding from the Bavarian State Ministry of Economic Affairs,...

  21. [21]

    References Ayme Arango Monnar, Jorge Perez, Barbara Poblete, Magdalena Saldaña, and Valentina Proust. 2022. Resources for multilingual hate speech detection. Seattle, Washington (Hybrid). Association for Computational Linguistics. Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, a...

  22. [22]

    Qwen et al

    The Llama 3 herd of models. Qwen et al. 2025. Qwen2.5 technical report. Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys. Tommaso Giorgi, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, and Stefano Cresci. 2025. Human and LLM biases in hate speech annotations: A socio-demographic analysis of annot...

  23. [23]

    call me sexist, but

    Gemma 2: Improving open language models at a practical size. Björn Ross et al. 2017. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In Workshop on Natural Language Processing for Computer-Mediated Communication. Mattia Samory, Indira Sen, Julian Kohne, Fabian Floeck, and Claudia Wagner. 2021. "call ...

  24. [24]

    arXiv preprint arXiv:2304.10145 , year=

    Overview of GermEval task 2, 2019 shared task on the identification of offensive language. pages 352–363. UnslothAI. 2023. Unsloth: A lightweight and fast framework for large-scale NLP models. Accessed: 2025-02-18. Muhammad Usman, Muhammad Ahmad, M. Shahiki Tash, Irina Gelbukh, Rolando Quintero Tellez, and Grigori Sidorov. 2025. Multilingual hate sp...