
arxiv: 2511.06391 · v3 · submitted 2025-11-09 · 💻 cs.CL · cs.AI

HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection

Pith reviewed 2026-05-17 23:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords hate speech detection · implicit hate · prototypes · transferable representations · early exiting · content moderation

The pith

Class-level prototypes from hate-optimized models enable cross-task transfer for explicit and implicit hate speech detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions whether new hate speech benchmarks require repeated fine-tuning, especially for implicit hate, which demands deeper semantic processing. It introduces HatePrototypes: class-level vector representations taken from language models already optimized for hate detection. Built from as few as 50 examples per class, the prototypes transfer between explicit and implicit hate tasks and remain interchangeable across benchmarks. They also support parameter-free early exiting that works for both types of hate. If the results hold, moderation systems for offensive content could become more efficient and easier to adapt.

Core claim

HatePrototypes are class-level vector representations derived from language models optimized for hate speech detection; when built from minimal examples they enable cross-task transfer between explicit and implicit hate with interchangeable use across benchmarks and effective parameter-free early exiting for both hate types.

What carries the argument

HatePrototypes, defined as class-level vector representations from hate-optimized language models that serve as interpretable and transferable features for detection.
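A minimal sketch of how such prototypes could be built and used, assuming each prototype is the mean of its class's example embeddings and that classification picks the prototype with the highest cosine similarity. The function names and the toy 2-d vectors are hypothetical; in practice the embeddings would come from a hate-optimized encoder.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_prototypes(embeddings, labels):
    """Group embeddings by label and average each group (assumed aggregation)."""
    by_class = {}
    for vec, label in zip(embeddings, labels):
        by_class.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in by_class.items()}

def classify(embedding, prototypes):
    """Return the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))

# Toy 2-d embeddings standing in for encoder outputs (hypothetical values).
train = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = ["hate", "hate", "non-hate", "non-hate"]
protos = build_prototypes(train, labels)
prediction = classify([0.8, 0.3], protos)
```

Nothing here is trained: swapping in prototypes from another benchmark only means passing a different `prototypes` dictionary to `classify`.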

If this is right

  • Small sets of examples suffice to create effective prototypes for multiple hate detection tasks.
  • Prototypes can be swapped between benchmarks without loss of performance.
  • Early exiting based on prototype similarity reduces computation while preserving accuracy for implicit and explicit hate.
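The early-exit idea in the last point can be sketched as follows, assuming two class prototypes per layer and an exit as soon as the gap between the two similarities exceeds a threshold `delta`. The layer states are precomputed toy vectors here; a real model would compute them one layer at a time and stop early. All names and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def early_exit(layer_states, layer_prototypes, delta):
    """Return (label, exit_layer), with layers counted from 1.

    Exits at the first layer where the similarity gap between the best
    and second-best prototype exceeds delta; otherwise uses the last layer.
    """
    label, layer = None, 0
    for layer, (state, protos) in enumerate(zip(layer_states, layer_prototypes), start=1):
        sims = {name: cosine(state, proto) for name, proto in protos.items()}
        ranked = sorted(sims, key=sims.get, reverse=True)
        label = ranked[0]
        if sims[ranked[0]] - sims[ranked[1]] > delta:
            return label, layer
    return label, layer

# Toy per-layer representations of one input (hypothetical values).
states = [[0.6, 0.5], [0.9, 0.3], [1.0, 0.1]]
per_layer_protos = [{"hate": [1.0, 0.0], "non-hate": [0.0, 1.0]}] * 3

confident = early_exit(states, per_layer_protos, delta=0.3)
cautious = early_exit(states, per_layer_protos, delta=0.9)
```

A larger `delta` trades speed for caution: with the toy values above, the stricter threshold never fires and the input runs through all three layers.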

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar prototype methods might simplify adaptation in other classification domains involving subtle distinctions.
  • Moderation systems could shift toward reusable components rather than full model retraining for each new dataset.
  • The approach may improve explainability by allowing inspection of what each prototype represents in terms of hate features.

Load-bearing premise

That class-level vectors derived from a hate-optimized language model capture the semantic distinctions required for implicit hate without needing full contextual processing or task-specific fine-tuning.

What would settle it

A test where prototypes extracted from an explicit hate dataset are applied to an implicit hate benchmark and fail to achieve comparable accuracy to a fine-tuned model would falsify the transfer claim.
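One way to score such a test is a ratio of macro-F1 values: the score obtained with prototypes from another benchmark divided by the score of the in-domain reference on the same data. The sketch below assumes that setup; the function names and toy labels are hypothetical, and the cutoff for "comparable" is left open because the source does not state one.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def relative_macro_f1(y_true, pred_cross, pred_reference):
    """Share of the reference macro-F1 retained by the cross-benchmark predictions."""
    reference = macro_f1(y_true, pred_reference)
    if reference == 0:
        raise ValueError("reference macro-F1 is zero; ratio undefined")
    return macro_f1(y_true, pred_cross) / reference

# Toy labels: 1 = hate, 0 = non-hate (hypothetical values).
gold = [1, 1, 0, 0]
reference_preds = [1, 1, 0, 0]   # in-domain reference
cross_preds = [1, 0, 0, 0]       # prototypes taken from another benchmark
ratio = relative_macro_f1(gold, cross_preds, reference_preds)
```

A ratio near 1 would support the transfer claim; a ratio well below 1 on an implicit benchmark would count against it.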

Figures

Figures reproduced from arXiv: 2511.06391 by Irina Proskurina, Julien Velcin, Marc-Antoine Carpentier.

Figure 2. F1-scores for BERT and OPT models across four dataset pairs (tuned source-evaluation target) with varying numbers of prototypes per class. To quantify the proportion of in-domain performance that transfers, we report the relative macro-F1 with respect to the fine-tuned (FT) performance on the same data: $\frac{F_1(X \mid \mathrm{proto}(Y))}{F_1(X \mid \mathrm{proto}(X))}$, where $X$ denotes the encoder/evaluation domain and $Y$ the proto…
Figure 3. Prototype selection results: relative cross…
Figure 4. Layer-wise proportion of samples exiting…
Figure 6. Macro-F1 score (%) and average exit layer across similarity gaps for the explicit HX and implicit SBIC benchmarks. Model=OPT. … use of African American English expressions that occur in both hateful and non-hateful contexts. Consequently, only minor semantic differences beyond lexical cues determine the perceived hatefulness of a message. Overall, we find that a simple parameter-free exiting strategy based …
Figure 5. F1-scores vs. speed-up on HateXplain and SBIC for the OPT model, compared to entropy-based (DEEOPT) and patience-based (PABEE) baselines. Prototype-based early exiting achieves speedups comparable to entropy-based baselines while consistently outperforming the patience-based approach. The most reliable gains, with no significant drop in F1-score, are observed for speedups below 1.5× across both benchmark…
Original abstract

Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper, full-model semantic processing. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HatePrototypes as class-level vector representations derived from embeddings of language models optimized for hate speech detection and safety. These prototypes are constructed from as few as 50 examples per class and are claimed to enable cross-task transfer between explicit and implicit hate speech detection tasks, to be interchangeable across different benchmarks, and to support parameter-free early exiting that remains effective for both hate types. The work argues against the necessity of repeated fine-tuning or continued pre-training when adapting to new hate speech benchmarks, particularly those involving implicit hate.

Significance. If the empirical claims are substantiated with rigorous metrics, this would represent a meaningful contribution to efficient and interpretable hate speech detection. The approach could lower the barrier to handling implicit hate (which the paper notes requires deeper semantic processing) by providing reusable, low-data prototypes that reduce reliance on full-model fine-tuning, with potential benefits for real-time moderation systems and cross-benchmark generalization.

major comments (2)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim of interchangeable prototypes enabling cross-task transfer between explicit and implicit hate rests on empirical tests, yet the manuscript provides no quantitative details on how interchangeability is operationalized (e.g., direct substitution accuracy, cosine similarity thresholds, or classification F1 when swapping prototypes from one benchmark to another). Without these metrics or controls for benchmark overlap, it is unclear whether observed transfer reflects genuine semantic capture or dataset artifacts.
  2. [§5.2] §5.2 (Early Exiting Experiments): The parameter-free early exiting is reported as effective for implicit hate, but implicit hate is described in the introduction as depending on subtle comparisons, exclusionary framing, and non-surface semantics. A single aggregated class vector is unlikely to preserve token- or sentence-level relational cues; the paper should include an ablation comparing prototype-based exiting against full contextual processing on implicit-only subsets to test whether performance gains are limited to easier explicit cases.
minor comments (2)
  1. [Abstract] The abstract describes the prototypes as effective but gives no numerical values, baseline comparisons, or statistical significance; it should summarize the key numbers for clarity.
  2. [§3] Clarify the exact aggregation method used to form the class-level vectors (mean pooling, centroid, etc.) and whether any normalization or projection is applied, as this directly affects claims of parameter-freeness and interpretability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below with clarifications drawn from the existing experiments and indicate revisions where they will strengthen the presentation of our claims about prototype interchangeability and early exiting.

  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim of interchangeable prototypes enabling cross-task transfer between explicit and implicit hate rests on empirical tests, yet the manuscript provides no quantitative details on how interchangeability is operationalized (e.g., direct substitution accuracy, cosine similarity thresholds, or classification F1 when swapping prototypes from one benchmark to another). Without these metrics or controls for benchmark overlap, it is unclear whether observed transfer reflects genuine semantic capture or dataset artifacts.

    Authors: We thank the referee for this observation. Interchangeability is operationalized in §5 through cross-benchmark transfer experiments: prototypes derived from one benchmark (e.g., explicit hate) are substituted directly into the classification pipeline for another benchmark (e.g., implicit hate), with performance measured by F1 score on the target task. These results are reported for multiple benchmark pairs and demonstrate effective transfer with as few as 50 examples per class. To make the operationalization more explicit and address potential dataset artifacts, we will add a dedicated paragraph and table in §5 that reports (i) cosine similarity between prototypes across benchmarks and (ii) direct substitution accuracy/F1 when prototypes are swapped, along with controls for lexical overlap between datasets. revision: yes

  2. Referee: [§5.2] §5.2 (Early Exiting Experiments): The parameter-free early exiting is reported as effective for implicit hate, but implicit hate is described in the introduction as depending on subtle comparisons, exclusionary framing, and non-surface semantics. A single aggregated class vector is unlikely to preserve token- or sentence-level relational cues; the paper should include an ablation comparing prototype-based exiting against full contextual processing on implicit-only subsets to test whether performance gains are limited to easier explicit cases.

    Authors: This is a fair critique. While our §5.2 results show that prototype-based early exiting maintains high performance on datasets containing implicit hate (and is compared against full-model baselines), we did not isolate an implicit-only subset for a dedicated ablation. We agree that such an experiment would better test whether the class-level prototype captures the relational and framing cues needed for implicit cases. We will add this ablation to §5.2, reporting accuracy and F1 for prototype early-exit versus full contextual inference restricted to implicit hate instances only. revision: yes
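The cross-benchmark prototype comparison promised in the first response could look like the sketch below. It assumes both prototype sets come from the same encoder and layer, so the vectors share one space; the function names and toy vectors are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cross_benchmark_similarity(protos_a, protos_b):
    """Cosine similarity for every (class in A, class in B) pair."""
    return {
        (class_a, class_b): cosine(vec_a, vec_b)
        for class_a, vec_a in protos_a.items()
        for class_b, vec_b in protos_b.items()
    }

# Toy prototypes for an explicit and an implicit benchmark (hypothetical values).
explicit = {"hate": [1.0, 0.0], "non-hate": [0.0, 1.0]}
implicit = {"hate": [0.8, 0.6], "non-hate": [0.6, 0.8]}
table = cross_benchmark_similarity(explicit, implicit)
```

If same-class pairs score clearly higher than cross-class pairs, the prototypes are plausibly interchangeable. Similar scores everywhere would suggest the class directions do not line up across benchmarks.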

Circularity Check

0 steps flagged

No circularity: claims rest on empirical cross-benchmark tests


The manuscript presents HatePrototypes as class-level vectors derived from a hate-optimized language model and evaluates their use for cross-task transfer and parameter-free early exiting via empirical tests on explicit and implicit hate benchmarks. No equations, derivations, or self-citation chains are shown that reduce the reported transfer performance or interchangeability results to quantities defined by the same fitted parameters or by construction. The central claims are supported by cross-benchmark evaluations rather than self-referential fitting or imported uniqueness theorems, rendering the derivation self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on the empirical effectiveness of prototype vectors extracted from already-optimized hate detection models. No new mathematical axioms or physical entities are introduced; the main added element is the prototype construction procedure itself.

free parameters (1)
  • examples per class
    The paper states prototypes are built from as few as 50 examples per class; this threshold is chosen to demonstrate low-data capability.
axioms (1)
  • domain assumption: Language models already optimized for hate speech detection produce vector representations whose class averages are semantically meaningful for both explicit and implicit hate.
    Invoked when constructing and transferring the prototypes without additional training.
invented entities (1)
  • HatePrototypes (no independent evidence)
    purpose: Class-level vector representations for transferable and interpretable hate detection
    New representational object introduced in the paper; no independent falsifiable evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5482 in / 1402 out tokens · 47444 ms · 2026-05-17T23:24:14.575104+00:00 · methodology


