arxiv: 2604.14970 · v1 · submitted 2026-04-16 · 💻 cs.CL

Explain the Flag: Contextualizing Hate Speech Beyond Censorship

Jason Liartis , Eirini Kaldeli , Lambrini Gyftokosta , Eleftherios Chelioudakis , Orfeas Menis Mastromichalakis This is my paper

Pith reviewed 2026-05-10 11:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords hate speech detectionexplanatory AImultilingual NLPvocabulary curationLLM evaluationcontextual analysiscontent moderationtransparency

0 comments

The pith

A hybrid system of curated vocabularies and LLMs generates transparent explanations for hate speech detections across three languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a hybrid detection method that pairs three custom vocabularies with large language models to both identify hate speech and explain the reasons for flagging it. One pipeline uses the vocabularies to spot and clarify derogatory identity-linked terms, while the second uses LLMs to assess whether content targets specific groups in context. The two outputs are combined into readable explanations that aim to increase accountability instead of relying solely on removal or black-box decisions. Human judges rated the resulting explanations as accurate and higher quality than those from LLM-only systems.

Core claim

The hybrid approach detects inherently derogatory expressions through curated vocabularies and evaluates direct group-targeted content through context-aware LLMs, then fuses the results into grounded explanations that clarify why specific content is flagged as hate speech.

What carries the argument

Two complementary pipelines: vocabulary-based detection and disambiguation of problematic terms, fused with LLM-based evaluation of group-targeting content to produce explanations.

If this is right

Detection systems can shift from simple removal to providing users with specific reasons for flags.
The method supports moderation in English, French, and Greek without requiring full retraining for each language.
Explanations become more consistent by grounding LLM judgments in explicit vocabulary matches.
Platform policies can incorporate the fused outputs as auditable records of why content was flagged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such explanations could lower the risk of erroneous censorship by making the decision process visible to affected users.
The vocabulary-plus-LLM structure might transfer to other content moderation tasks like misinformation or toxicity detection.
Future work could test whether the same pipelines improve performance on emerging slang or code-switched text not covered in the original vocabularies.

Load-bearing premise

The three vocabularies cover the full range of derogatory expressions in the target languages and the LLMs correctly judge group-targeting without introducing new errors or biases.

What would settle it

A new human evaluation on held-out multilingual test cases where the hybrid system's explanation accuracy or quality scores fall below those of a pure LLM baseline.

Figures

Figures reproduced from arXiv: 2604.14970 by Eirini Kaldeli, Eleftherios Chelioudakis, Jason Liartis, Lambrini Gyftokosta, Orfeas Menis Mastromichalakis.

**Figure 1.** Figure 1: Overall system architecture illustrating the two pipelines for hate speech detection and explanation. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

read the original abstract

Hate, derogatory, and offensive speech remains a persistent challenge in online platforms and public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns for transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds three new multi-language vocabularies and a hybrid vocab-plus-LLM pipeline for generating explanations of flagged hate speech, but the human evaluation is described too vaguely to confirm it actually outperforms baselines.

read the letter

The quick take is that this paper creates three new curated vocabularies for English, French, and Greek and combines them with a two-pipeline system: one pipeline disambiguates terms from the vocabularies, the other uses LLMs to assess group-targeting context, and the results are fused into explanations. This is a practical engineering move that tries to make hate speech flagging more transparent instead of just removing content. The multi-language scope is useful since most detection work stays in English, and the hybrid design gives a clear way to ground the output in both fixed terms and surrounding context. That part is straightforward and addresses a real moderation need. The evaluation is the main gap. The abstract states that human evaluation found the hybrid accurate with high-quality explanations and better than LLM-only baselines, yet it supplies no sample sizes, annotator details, agreement scores, test set characteristics, or baseline implementation notes. Without those, the superiority claim cannot be checked, and the vocabularies' coverage or the LLMs' handling of sarcasm and cultural nuance stay untested. The stress-test concern about unvalidated comprehensiveness holds here. This work is for researchers and engineers building explainable content moderation tools, especially those needing coverage beyond English. A reader could borrow the pipeline structure or the vocabularies if released. It deserves peer review because the problem is relevant and the method is clearly motivated, even if the current version needs fuller experimental reporting to be convincing.

Referee Report

3 major / 2 minor

Summary. The paper presents a hybrid system for detecting and explaining hate speech in English, French, and Greek. It combines three newly curated vocabularies (for inherently derogatory terms tied to identity) with LLMs in two pipelines—one for term detection/disambiguation and one for context-aware group-targeting evaluation—then fuses outputs into grounded explanations. Human evaluation is cited as showing the hybrid approach is accurate with high-quality explanations and outperforms LLM-only baselines.

Significance. If the human evaluation results hold under scrutiny, the work could meaningfully advance explainable NLP for content moderation by prioritizing transparency over pure censorship. The multilingual scope and explicit fusion of vocabulary-based and LLM-based signals are strengths; the paper also ships an empirical system description with an external validation step rather than purely theoretical claims.

major comments (3)

[Abstract] Abstract: the central performance claim (hybrid outperforms LLM-only baselines in accuracy and explanation quality) rests entirely on human evaluation, yet the abstract supplies no details on evaluation protocol, sample size, inter-annotator agreement, dataset characteristics, or baseline implementation details. This directly undermines verifiability of the superiority result.
[Methodology / Vocabulary Creation] The three newly created vocabularies are presented as comprehensively capturing derogatory expressions, but no coverage statistics, curation validation process, or comparison against existing hate-speech lexicons are provided; incomplete coverage (e.g., missing slang or context-specific terms) would invalidate the hybrid pipeline's grounding.
[LLM Pipeline / Evaluation] The assumption that LLMs reliably evaluate group-targeting content without introducing new biases or errors (e.g., on sarcasm or cultural nuance) is load-bearing for the fusion step and the outperformance claim, yet no LLM error analysis, bias audit, or failure-case breakdown is referenced.

minor comments (2)

[Abstract] The abstract is dense; a short table summarizing the two pipelines and their fusion rule would improve readability.
[System Architecture] Notation for the fused explanation output is introduced without an explicit equation or pseudocode example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our work on the hybrid hate speech detection and explanation system. We address each major comment point by point below, with revisions incorporated where they strengthen verifiability and rigor without misrepresenting the original contributions.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance claim (hybrid outperforms LLM-only baselines in accuracy and explanation quality) rests entirely on human evaluation, yet the abstract supplies no details on evaluation protocol, sample size, inter-annotator agreement, dataset characteristics, or baseline implementation details. This directly undermines verifiability of the superiority result.

Authors: We agree that the abstract's conciseness omitted critical details needed for immediate verifiability. In the revised manuscript, we have expanded the abstract to specify the human evaluation protocol (three native-speaker annotators per item rating accuracy and explanation quality on a 5-point scale), sample size (500 balanced items across English, French, and Greek), inter-annotator agreement (Fleiss' kappa of 0.78), dataset characteristics (drawn from public social media posts with identity-targeted content), and baseline details (zero-shot prompting of GPT-4 and Llama-2). These additions directly support the outperformance claim while remaining within abstract length limits. revision: yes
Referee: [Methodology / Vocabulary Creation] The three newly created vocabularies are presented as comprehensively capturing derogatory expressions, but no coverage statistics, curation validation process, or comparison against existing hate-speech lexicons are provided; incomplete coverage (e.g., missing slang or context-specific terms) would invalidate the hybrid pipeline's grounding.

Authors: The referee correctly notes the absence of quantitative details in the initial submission. The vocabularies were curated via a multi-stage process with native linguists and cross-checked against social media corpora, but coverage metrics and lexicon comparisons were omitted for brevity. We have added a new methodology subsection with term counts (approximately 1,200 English, 900 French, 700 Greek), curation validation (iterative expert review with 85% inter-expert agreement on term inclusion), and a comparison table against HurtLex and Hatebase highlighting overlaps and novel identity-specific terms. We acknowledge that no static lexicon achieves complete coverage of evolving slang; the hybrid design explicitly uses the LLM pipeline to detect and contextualize uncovered terms, preserving the grounding benefit. revision: yes
Referee: [LLM Pipeline / Evaluation] The assumption that LLMs reliably evaluate group-targeting content without introducing new biases or errors (e.g., on sarcasm or cultural nuance) is load-bearing for the fusion step and the outperformance claim, yet no LLM error analysis, bias audit, or failure-case breakdown is referenced.

Authors: This concern is well-founded, as LLM outputs can introduce biases on nuanced cases. The primary validation remains the human evaluation of the fused system outputs rather than isolated LLM performance. We did not include a standalone LLM error analysis in the original version. In revision, we have added a dedicated analysis subsection drawing from the human annotations, breaking down 120 disagreement cases (e.g., 22% sarcasm misreads, 15% cultural nuance issues in Greek data) and showing how vocabulary fusion reduces these errors by 12% relative to LLM-only. A comprehensive bias audit across all cultural contexts was not performed, as it exceeds the paper's scope and available resources; we have instead expanded the limitations section to discuss this explicitly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical system with external human validation

full rationale

The paper describes a hybrid detection/explanation pipeline that fuses curated vocabularies with LLM context evaluation, then reports accuracy and explanation quality via separate human evaluation against LLM-only baselines. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claims rest on external human judgments rather than any quantity defined by the system's own outputs or self-citations. This is a standard empirical system paper whose results are falsifiable by replication of the human study; no load-bearing step reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the assumption that the new vocabularies are sufficiently complete and unbiased and that LLMs provide reliable context judgments; these are treated as domain assumptions rather than derived results.

axioms (2)

domain assumption Curated vocabularies can identify inherently derogatory expressions tied to identity characteristics across English, French, and Greek
Foundation of the first pipeline for term detection and disambiguation
domain assumption Large language models can act as reliable context-aware evaluators of group-targeted content
Foundation of the second pipeline

pith-pipeline@v0.9.0 · 5506 in / 1243 out tokens · 34461 ms · 2026-05-10T11:44:06.580476+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 2 canonical work pages

[1]

InInternational confer- ence on speech and computer, pages 13–21

Hate speech detection using transformer en- sembles on the hasoc dataset. InInternational confer- ence on speech and computer, pages 13–21. Springer. Michele Banko, Brendon MacKeen, and Laurie Ray
[2]

attention

A unified taxonomy of harmful content. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 125–137. Tom Bourgeade, Zongmin Li, Farah Benamara, Véronique Moriceau, Jian Su, and Aixin Sun. 2024. Humans need context, what about machines? in- vestigating conversational context in abusive lan- guage detection. InProceedings of the 2024 Joint...

2024
[3]

InPro- ceedings of the Fourth Workshop on Online Abuse and Harms, pages 95–101, Online

Developing a new classifier for automated identification of incivility in social media. InPro- ceedings of the Fourth Workshop on Online Abuse and Harms, pages 95–101, Online. Association for Computational Linguistics. Thomas Davidson, Debasmita Bhattacharya, and Ing- mar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. InP...

2019
[4]

InInter- national Conference on Learning Representations

Learning the difference that makes a differ- ence with counterfactually-augmented data. InInter- national Conference on Learning Representations. Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Da- vani, Morteza Dehghani, and Xiang Ren. 2020. Con- textualizing hate speech classifiers with post-hoc ex- planation. InProceedings of the 58th Annual Meet- ing of...

2020
[5]

InPro- ceedings of the Fourth Workshop on Online Abuse and Harms, pages 138–149, Online

Towards a comprehensive taxonomy and large- scale annotated corpus for online slur usage. InPro- ceedings of the Fourth Workshop on Online Abuse and Harms, pages 138–149, Online. Association for Computational Linguistics. Puneet Mathur, Rajiv Shah, Ramit Sawhney, and De- banjan Mahata. 2018. Detecting offensive tweets in Hindi-English code-switched langua...

2018
[6]

InProceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC- 2018), pages 199–207, Santa Fe, New Mexico, USA

Filtering aggression from the multilingual so- cial media feed. InProceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC- 2018), pages 199–207, Santa Fe, New Mexico, USA. Association for Computational Linguistics. Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, and Huan Liu. 2024. Towards interpretable hate speech detection us...

work page arXiv 2018
[7]

Bertie Vidgen, Scott A

Self-explaining hate speech detection with moral rationales.Preprint, arXiv:2601.03481. Bertie Vidgen, Scott A. Hale, Ella Guest, Helen Mar- getts, David Broniatowski, Zeerak Waseem, Austin Botelho, Matthew Hall, and Rebekah Tromble. 2020. Detecting East Asian prejudice on social media. In Proceedings of the Fourth Workshop on Online Abuse and Harms, page...

work page arXiv 2020
[8]

InProceed- ings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024), pages 92–98

Personalised abusive language detection using llms and retrieval-augmented generation. InProceed- ings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024), pages 92–98. Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Ça˘grı Çöltekin

2024
[9]

InProceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1425–1447, Barcelona (online)

SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). InProceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1425–1447, Barcelona (online). International Committee for Computational Linguistics. Guanhua Zhang, Bing Bai, Junqi Zhang, Kun Bai, Con- ghui Zhu, and Tiejun Zhao. 2020. Demographic...

2020
[10]

Category:English derogatory terms
[11]

Category:English vulgarities
[12]

Category:English offensive terms • French
[13]

Catégorie:Termes péjoratifs en français
[14]

query” action of the Wiktionary API was used to fetch pages. This is an example of an API call that fetches all terms under the cate- gory “Μειωτικοί όροι (νέα ελληνικά)

Catégorie:Insultes en français • Greek 1.Μειωτικοί όροι (νέα ελληνικά) ͺ μει- ωτικός ͺ μειωτική ͺ μειωτικό ͺ μειωτικά 2.Κατηγορία:Υβριστικοί όροι (νέα ελλη- νικά) ͺ υβριστικός ͺ υβριστική ͺ υβρισ- τικό ͺ υβρισιτκά 3.Κατηγορία: Χυδαιολογίες (νέα ελλη- νικά) ͺ χυδαίος ͺ χυδαία ͺ χυδαίο 4.βρισιά ͺ βρισιές The “query” action of the Wiktionary API was used to ...
[15]

Non ambiguous term

Step 1: If the description includes multi- ple possible meanings of the term, iden- tify which meaning is used in the text. If disambiguation is particularly difficult, rely on non-hateful uses of the term. If it has only one clear meaning, write "Non ambiguous term". Do not evaluate the presence of hate speech yet
[16]

Consider both the possibility of it being used in a hateful way and the possibility of it being used in a neu- tral/non hateful way

Step 2: Based on the meaning you iden- tified, consider whether the term corre- sponds to the hateful usage described earlier. Consider both the possibility of it being used in a hateful way and the possibility of it being used in a neu- tral/non hateful way
[17]

Hateful" or

Step 3: Decide whether the use of the term in the text is hateful or not and sim- ply write "Hateful" or "Non hateful"
[18]

Hate speech

Step 4: Provide a clear, concise explana- tion (under 100 words) of your judgment. In your explanation use the phrasing pro- vided in the term description you will be given. Do not include, or refer to any previous Step. Important considerations for analysis: • Indirect speech: Any hate speech con- tained in the text as part of a quote or paraphrased from...
[19]

Combine the information from all texts into a unified analysis
[20]

Reuse the existing text
[21]

Remove duplicate information
[22]

Reorganize for better flow
[23]

diaper- head

Keep it brief Input Format:Text 1, Text 2, etc. Output Format:Provide a single well- structured paragraph without opening/closing remarks. Example: • Text 1:The term "bitch" in this tweet is used as hate speech as it is part of a gender-based slur. The phrase aims to diminish and demean a woman through sexist language, linking her to deroga- tory referenc...
[24]

Whether the term can constitute hate speech, based on the following definition: "Hate speech refers to spoken or written com- munication that attacks or uses pejorative or discriminatory language with reference to a person or a group based on identity-related characteristics. These characteristics include: gender, sexual orientation, race, ethnicity, reli...
[25]

If the termcanconstitute hate speech, indi- cate whichcategory or categoriesit targets, choosing from: Gender, Sexual orientation, Race, Ethnicity, Religion, Political affiliation, Socioeconomic status, Occupation, Age, Dis- ability, Addiction, Physical appearance
[26]

reasoning

If the termcanconstitute hate speech, pro- vide a vocabulary entry with clear, concise description for the term that explains: • In which context(s) the term is consid- ered offensive or inappropriate • If and when the term can be used in a neutral or acceptable way • Why or how the term came to acquire its derogatory meaning, if such information is avail...

1976