Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations
Pith reviewed 2026-05-15 09:43 UTC · model grok-4.3
The pith
Combining web-scale unlabelled data with ensemble LLM annotations substantially improves cross-lingual hate speech detection for small models and low-resource languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continued pre-training on web-scale unlabelled data in four languages yields an average macro-F1 gain of approximately 3% across sixteen benchmarks. Fine-tuning on synthetic annotations, produced by combining four open-source LLMs through a LightGBM meta-learner, raises pooled F1 by 11% for a 1B-parameter model but by only 0.6% for a 14B-parameter model. The approach is most useful for smaller models and low-resource languages.
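To make the headline metric concrete, the following is a minimal sketch of macro-F1 for a binary hate / not-hate task. The labels are hypothetical toy data, not results from the paper, and the function names are illustrative. "Pooled F1" is not defined on this page, so the sketch covers macro-F1 only.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1, so the rare class counts equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical, imbalanced toy data: 1 = hate, 0 = not hate.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one of the two hateful texts is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

On imbalanced data, accuracy stays high when the rare class is missed, while macro-F1 drops, which is presumably why the paper reports it.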
What carries the argument
An ensemble of LLMs produces synthetic annotations, combined by a LightGBM meta-learner, alongside continued masked-language-modelling pre-training on web-crawled texts.
If this is right
- Smaller models receive large gains from the synthetic labels while larger models receive only modest gains.
- Low-resource languages benefit more from the continued pre-training step than high-resource languages.
- The LightGBM meta-learner ensemble consistently outperforms mean averaging and majority voting for creating usable synthetic labels.
- The overall method improves detection performance across multiple languages without requiring additional human annotation effort.
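The two simpler ensemble rules named above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the 0.5 threshold, the function names, and the example probabilities are hypothetical. The "at least two of four" vote rule and the eight-dimensional feature layout follow the paper's description. Training the LightGBM meta-learner itself is not shown.

```python
HATE, NOT_HATE = 1, 0


def majority_vote(p_hate, threshold=0.5, min_votes=2):
    """Threshold each model's hate probability, then count votes.

    threshold=0.5 is an assumption; min_votes=2 mirrors the
    'at least two of the four models' rule in the paper.
    """
    votes = sum(1 for p in p_hate if p >= threshold)
    return HATE if votes >= min_votes else NOT_HATE


def mean_average(p_hate):
    """Assign the class with the higher mean probability across models."""
    mean_hate = sum(p_hate) / len(p_hate)
    mean_not = 1.0 - mean_hate  # assumes the two class probabilities sum to 1
    return HATE if mean_hate > mean_not else NOT_HATE


def meta_features(p_hate):
    """Build the (two classes x four models) vector a meta-learner would see."""
    feats = []
    for p in p_hate:
        feats.extend([1.0 - p, p])
    return feats


# Hypothetical hate probabilities from four annotator LLMs for one text.
probs = [0.9, 0.6, 0.1, 0.2]
print("vote:", majority_vote(probs))
print("mean:", mean_average(probs))
print("meta-learner input:", meta_features(probs))
```

The example is chosen so that the two rules disagree: two models clear the threshold, yet the mean hate probability stays below one half. A learned meta-learner can resolve such cases by weighting annotators unequally.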
Where Pith is reading between the lines
- The same web-plus-ensemble recipe could be applied to other low-resource NLP tasks such as sentiment or toxicity detection.
- Gains may shrink if the web crawl distribution drifts far from the target social-media domain over time.
- Practitioners could test whether mixing a small amount of human labels with the synthetic set further stabilizes the small-model improvements.
Load-bearing premise
The web-crawled texts from OpenWebSearch.eu match the distribution of the sixteen benchmark datasets and the LLM-generated labels contain no systematic biases that would hurt real-world performance.
What would settle it
If a model trained on the synthetic labels and web pre-training shows no gain or a loss in F1 score on a new collection of real social-media posts collected after the original crawl, the central claim would be falsified.
Original abstract
We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu (OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1), but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that the combination of web-scale unlabelled data and LLM-ensemble annotations is the most valuable for smaller models and low-resource languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that continued pre-training on large-scale unlabeled web data crawled via OpenWebSearch.eu in four languages yields an average macro-F1 gain of ~3% over standard baselines across sixteen hate-speech benchmarks (with stronger gains in low-resource settings), and that fine-tuning smaller models on synthetic labels produced by an ensemble of four open-source LLMs (via mean averaging, majority voting, or LightGBM meta-learner) further improves performance, most notably +11% pooled F1 for Llama3.2-1B while only +0.6% for Qwen2.5-14B.
Significance. If the empirical claims hold after addressing the noted gaps, the work would be significant for low-resource multilingual NLP: it demonstrates a practical route to leverage abundant web-scale data and LLM-generated labels to boost hate-speech detectors without requiring large amounts of human-annotated data, with particular value for smaller models and under-resourced languages.
major comments (2)
- [Abstract and §4 (Experimental Setup)] The central empirical claims (abstract: ~3% average macro-F1 from continued pre-training; +11% pooled F1 for Llama3.2-1B from synthetic labels) rest on the untested assumption that the OWS corpus is distributionally close to the sixteen evaluation benchmarks. No domain-similarity statistics (token overlap, embedding MMD, or perplexity under a held-out LM) are reported in the data or experimental sections, so observed gains could arise from accidental overlap rather than the proposed pipeline; this is load-bearing for the low-resource-language results highlighted in the abstract.
- [Abstract and Results] The abstract reports average gains and specific improvements without stating the exact baselines, number of runs, or error bars. This omission makes it impossible to assess whether the +3% macro-F1 and +11% pooled-F1 figures are statistically reliable or could be explained by variance or data leakage.
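Two of the domain-similarity statistics requested in the first major comment can be sketched as follows. This is a minimal illustration, assuming whitespace tokenisation, an RBF kernel with a hypothetical bandwidth, and toy inputs. None of the texts or vectors come from OWS or the benchmarks, and real embeddings would come from an encoder that is not shown here.

```python
import math


def jaccard_token_overlap(texts_a, texts_b):
    """Vocabulary overlap between two corpora (whitespace tokens, lowercased)."""
    vocab_a = {tok for t in texts_a for tok in t.lower().split()}
    vocab_b = {tok for t in texts_b for tok in t.lower().split()}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def rbf(x, y, gamma=1.0):
    """RBF kernel; gamma is a hypothetical bandwidth choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between two samples."""
    def mean_kernel(us, vs):
        return sum(rbf(u, v, gamma) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Hypothetical toy corpora standing in for web text and a benchmark.
web_texts = ["great match last night", "new phone review"]
bench_texts = ["great phone", "that match was awful"]
print("token overlap:", jaccard_token_overlap(web_texts, bench_texts))

# Hypothetical 2-D "embeddings"; real ones would be sentence vectors.
web_vecs = [(0.0, 0.0), (1.0, 0.0)]
bench_vecs = [(5.0, 5.0), (6.0, 5.0)]
print("MMD^2 (different):", mmd_squared(web_vecs, bench_vecs))
print("MMD^2 (identical):", mmd_squared(web_vecs, web_vecs))
```

A low overlap and a large MMD would indicate that the web corpus is far from the benchmark domain. Near-duplicate detection would still be needed to address the leakage concern directly.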
minor comments (2)
- [§3 (Annotation Pipeline)] Clarify the precise implementation details of the three ensemble strategies (mean averaging, majority voting, LightGBM meta-learner) and how label quality was validated against any human-annotated subsets.
- [§4] Add a table or appendix listing all sixteen benchmarks with language, size, and source information to allow readers to evaluate the low-resource claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our empirical claims. We address each major comment below and will revise the manuscript to improve clarity and address the noted gaps.
- Referee: [Abstract and §4 (Experimental Setup)] The central empirical claims (abstract: ~3% average macro-F1 from continued pre-training; +11% pooled F1 for Llama3.2-1B from synthetic labels) rest on the untested assumption that the OWS corpus is distributionally close to the sixteen evaluation benchmarks. No domain-similarity statistics (token overlap, embedding MMD, or perplexity under a held-out LM) are reported in the data or experimental sections, so observed gains could arise from accidental overlap rather than the proposed pipeline; this is load-bearing for the low-resource-language results highlighted in the abstract.
  Authors: We agree that explicit domain-similarity statistics would strengthen the paper by helping rule out overlap as an alternative explanation. The OWS corpus is a broad web crawl in the target languages while the benchmarks are curated hate-speech collections, but without metrics this remains an untested assumption. In the revision we will add token-overlap rates, embedding MMD distances, and perplexity comparisons (using a held-out LM) between OWS and each benchmark, plus a brief discussion of potential leakage in the limitations section.
  Revision: yes
- Referee: [Abstract and Results] The abstract reports average gains and specific improvements without stating the exact baselines, number of runs, or error bars. This omission makes it impossible to assess whether the +3% macro-F1 and +11% pooled-F1 figures are statistically reliable or could be explained by variance or data leakage.
  Authors: We accept this criticism. The abstract will be updated to name the baselines (standard fine-tuned BERT models without OWS continued pre-training), state that results are averaged over five random seeds, and include standard deviations or error bars for the reported gains. The results section will be expanded with per-run variances and paired statistical significance tests to demonstrate reliability.
  Revision: yes
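The seed-level reporting promised in the second response could look like the sketch below. The scores are hypothetical placeholders, not numbers from the paper, and the exact sign-flip permutation test is one possible choice of paired test, not necessarily the one the authors will use.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_p(baseline, treated):
    """Exact two-sided paired permutation (sign-flip) test on per-seed gains."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical macro-F1 per seed (five seeds, same seeds for both systems).
baseline = [0.731, 0.742, 0.738, 0.729, 0.745]
treated = [0.762, 0.770, 0.765, 0.758, 0.771]

gains = [t - b for b, t in zip(baseline, treated)]
print(f"mean gain = {mean(gains):.4f}, std = {stdev(gains):.4f}")
print(f"sign-flip p-value = {paired_sign_flip_p(baseline, treated)}")
```

With five seeds there are only 2^5 = 32 sign patterns, so the smallest two-sided p-value this test can return is 2/32 = 0.0625, even when every seed improves. Clearing a 0.05 threshold with this test therefore requires more seeds, or pairing over benchmarks as well as seeds.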
Circularity Check
No circularity in empirical pipeline
The paper is a purely empirical study describing continued pre-training on OpenWebSearch.eu crawls followed by supervised fine-tuning and LLM-ensemble annotation strategies. No mathematical derivations, equations, or self-referential predictions appear in the text. All reported gains are measured against independent baselines on sixteen external benchmark datasets. No self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are present. The central results rest on experimental outcomes rather than any reduction to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Masked language modeling on web data improves representation for downstream hate speech classification
- domain assumption: LLM-generated labels can serve as effective proxies for human annotations in training hate speech detectors
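The first assumption concerns masked language modelling, whose input corruption step can be sketched as follows. This uses the standard BERT recipe (select 15% of tokens; of those, 80% become [MASK], 10% a random token, 10% stay unchanged). Whether the paper uses exactly these rates is an assumption, and the function name, vocabulary, and sentence are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels) for masked language modelling.

    labels[i] holds the original token where a prediction is required
    and None where the position is ignored by the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Hypothetical unlabelled web sentence and toy vocabulary.
tokens = "raw web text can be collected at scale".split()
vocab = ["web", "text", "data", "scale"]
inputs, labels = mask_tokens(tokens, vocab)
print(inputs)
print(labels)
```

Because the labels come from the text itself, this step needs no human annotation, which is what lets the web crawl be used at scale before supervised fine-tuning.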
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: continued pre-training of BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction. A central bottleneck in building robust detectors for hateful and offensive language is the scarcity of high-quality labelled training data (Vidgen et al., 2020; Fortuna and Nunes, 2018). While raw web text can be collected at scale, annotating it remains costly (Ross et al., 2017) and human annotators inevitably introduce subjectiv...
-
[2]
and have therefore been explored as automated annotators (Hartvigsen et al., 2022). However, existing work in this space remains narrow in scope, typically covering a single language and lacking rigorous comparison with models trained on human labels. Large web crawls such as OpenWebSearch.eu (OWS) (Granitzer et al., 2024) and OpenWebIndex (OWI) (He...
-
[3]
make billions of multilingual pages available, yet how to best leverage them for hate-speech detection is an open question. We address this gap along two axes:
-
[4]
We investigate whether domain-adaptive continued pre-training on large unlabelled OWS corpora improves the downstream performance of BERT-family models
-
[5]
We study whether ensemble-based LLM annotation can replace or supplement human labelling for multilingual hate-speech detection. Terminology. Throughout this paper, continued pre-training denotes an additional masked-language-modelling adaptation step applied to a general-purpose pre-trained BERT using domain-relevant but unlabelled OWS texts. This precedes th...
-
[6]
Related Work. Domain-specific continued pre-training for hate speech. Gururangan et al. (2020) established that domain-adaptive pre-training (DAPT), continuing masked language modelling on in-domain text before fine-tuning, consistently improves downstream performance. HateBERT (Caselli et al.,
-
[7]
applied DAPT to abusive language using 1.5M English Reddit posts, while Belay et al. (2025) extended it multilingually with AfroXLMR-Social across 19 African languages (F1 gains of 1–30%). None of these works use web-crawled data at OWS scale or cov... (Footnote 1: Repository includes all training scripts, annotation prompts, and evaluation code. arXiv:2604.09625v1 [cs.CL] 18 Mar 2026)
-
[8]
built multimodal and Arabic BERT ensembles respectively for downstream classification. Our work instead applies the ensemble paradigm to the annotation phase, combining token-probability outputs from four LLMs. LLM-based annotation and synthetic data. Zhu et al. (2023) found ChatGPT achieves 60.9% average accuracy relabelling social-computing datasets, whi...
-
[9]
Methodology and Setup. Figure 1 provides a high-level overview of the pipeline. The following subsections detail each component. 3.1. OpenWebSearch.eu Data Collection. We used OWS (Granitzer et al., 2024) and OWI (Hendriksen et al., 2024) to collect large-scale web texts in English, German, Spanish, and Vietnamese. To increase the proportion of conversa...
-
[10]
Majority Voting (Vote): Each model’s probability is thresholded to a hard label; a text is marked Hate if at least two of the four models vote for it
-
[11]
Mean Averaging (Mean): For each class, the average probability across all models is computed; the class with the higher mean is assigned
-
[12]
LightGBM Meta-Learner (LGB): A LightGBM classifier (Ke et al., 2017) is trained on the eight-dimensional probability vectors (two classes × four models) using the seven human-labelled training sets as supervision. Unlike Vote and Mean, which treat all models equally, LGB learns to weight each annotator differentially based on its reliability, and discovers confid...
-
[13]
to fine-tune Llama3.2-1B and Qwen2.5-14B. Human-labeled training used the 7-Set (excluding Spanish). The synthetic subset contains 240,647 texts: 125,617 German, 108,375 English, and 6,655 Vietnamese. After ensemble annotation, Vote assigned 4,717 texts as hate, Mean 3,994, and LGB 3,122, confirming the pronounced class imbalance. 3.4.3. Computational Resource...
-
[14]
Results 4.1. RQ1: OWS Continued Pre-Training with BERT. Table 3 reports macro-F1 across all sixteen test sets. All four OWS-continually-pre-trained models outperform both BERT and HateBERT on every per-language average and on the overall 16-set average under multilingual training (7-Set Mix, 7-Set + Synth., 16-Mix). Monolingual OWS models additionally b...
-
[15]
Discussion 5.1. Answers to Research Questions. RQ1: Value of unlabelled web data. OWS continued pre-training reliably improves BERT-family models, especially in multilingual low-data settings. The effect is largest for Ows4L when training data is scarce (+3% average F1 under 7-Set) and diminishes as supervised data grows (+1% under 16-Mix). Neither monol...
-
[16]
Conclusion. We presented a large-scale benchmark study examining how unlabelled web data and ensemble LLM annotations can improve multilingual hate-speech detection across sixteen benchmarks in four languages. Our key findings are:
-
[17]
Domain-adaptive continued pre-training on OWS data provides consistent F1 gains, with the multilingual Ows4L achieving the best average macro-F1 (77.0%) across all configurations
-
[18]
The LightGBM ensemble outperforms mean averaging and majority voting and is the only synthetic strategy that reliably avoids regressions on unseen languages
-
[19]
The benefit of synthetic data scales inversely with model capacity: +10.6% pooled F1 for Llama3.2-1B, but only +0.6% for Qwen2.5-14B
-
[20]
Severe class imbalance in OWS-derived synthetic sets remains a critical bottleneck, especially for low-resource languages. These findings motivate future work on scaling OWS data, adopting updated models, and developing richer annotation strategies. Acknowledgements. This work has received funding from the Bavarian State Ministry of Economic Affairs,...
-
[21]
References Ayme Arango Monnar, Jorge Perez, Barbara Poblete, Magdalena Saldaña, and Valentina Proust. 2022. Resources for multilingual hate speech detection. Seattle, Washington (Hybrid). Association for Computational Linguistics. Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, a...
-
[22]
The llama 3 herd of models. Qwen et al. 2025. Qwen2.5 technical report. Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys. Tommaso Giorgi, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, and Stefano Cresci. 2025. Human and LLM biases in hate speech annotations: A socio-demographic analysis of annot...
-
[23]
Gemma 2: Improving open language models at a practical size. Björn Ross et al. 2017. Measuring the reliability of hate speech annotations: The case of the european refugee crisis. In Workshop on Natural Language Processing for Computer-Mediated Communication. Mattia Samory, Indira Sen, Julian Kohne, Fabian Floeck, and Claudia Wagner. 2021. "call ...
-
[24]
arXiv preprint arXiv:2304.10145
Overview of germeval task 2, 2019 shared task on the identification of offensive language. pages 352–363. UnslothAI. 2023. Unsloth: A lightweight and fast framework for large-scale nlp models. Accessed: 2025-02-18. Muhammad Usman, Muhammad Ahmad, M. Shahiki Tash, Irina Gelbukh, Rolando Quintero Tellez, and Grigori Sidorov. 2025. Multilingual hate sp...