
arxiv: 2509.15174 · v3 · submitted 2025-09-18 · 💻 cs.CL · cs.AI

SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models

Pith reviewed 2026-05-18 15:48 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords toxicity detection · large language models · preference optimization · explainable AI · data efficiency · hate speech · content moderation · self-augmentation

The pith

LLMs boost toxicity detection accuracy up to 13 percent by aligning on their own synthetic explanations for correct and incorrect labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage framework called SMARTER that lets large language models generate their own explanations for both accurate and mistaken toxicity judgments, then uses preference optimization to favor the better ones with very little human-labeled data. In the second stage, weaker models are trained to match the style and content of stronger models' explanations through cross-model alignment. This approach improves macro-F1 scores on three hate-speech benchmarks while needing only a small fraction of typical training data. A sympathetic reader would care because content moderation systems currently require large amounts of expensive annotated examples and still produce opaque decisions; self-augmentation could make high-quality, explainable moderation feasible in low-resource languages or new platforms.

Core claim

SMARTER is a data-efficient two-stage framework for explainable content moderation. In Stage 1, LLMs produce synthetic explanations for both correct and incorrect labels so that preference optimization can align the model toward higher-quality rationales with minimal human supervision. In Stage 2, cross-model training lets weaker models adopt the stylistic and semantic patterns of stronger models, simultaneously raising classification performance and explanation quality. Experiments on HateXplain, Latent Hate, and Implicit Hate show macro-F1 gains of up to 13 percent over standard few-shot baselines while using far less labeled data than full supervised training.
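To make the Stage 1 idea concrete, here is a minimal sketch of how preference pairs could be assembled from self-generated explanations. It is not the authors' code. The `explain` callable, the `toy_explain` stand-in, and the `prompt`/`chosen`/`rejected` field names (a common convention in preference-tuning libraries) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    examples: List[Dict[str, str]],
    label_space: List[str],
    explain: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Pair the gold-label explanation (chosen) with each wrong-label one (rejected).

    `explain(post, label)` is a hypothetical hook that asks an LLM for an
    explanation conditioned on a given label.
    """
    pairs = []
    for ex in examples:
        post, gold = ex["post"], ex["label"]
        chosen = f"Label: {gold}\nExplanation: {explain(post, gold)}"
        for wrong in label_space:
            if wrong == gold:
                continue  # only incorrect labels produce rejected responses
            rejected = f"Label: {wrong}\nExplanation: {explain(post, wrong)}"
            pairs.append({"prompt": post, "chosen": chosen, "rejected": rejected})
    return pairs


def toy_explain(post: str, label: str) -> str:
    # Stand-in for an LLM call (assumption); returns a fixed template.
    return f"The post reads as {label}."


demo_pairs = build_preference_pairs(
    [{"post": "example post", "label": "normal"}],
    ["normal", "offensive", "hate"],
    toy_explain,
)
print(len(demo_pairs))
```

With a three-way label space, each labeled post yields two pairs, which is one way a small labeled set can be stretched into a larger preference dataset.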

What carries the argument

Two-stage self-augmentation via preference optimization over LLM-generated explanations for both correct and incorrect predictions, followed by cross-model stylistic alignment.
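The Figure 1 caption refers to DPO-augmented variants, so the standard direct preference optimization loss is a reasonable reference point for what "preference optimization" means here. The sketch below computes that loss for a single pair from sequence log-probabilities. The function name and the value `beta=0.1` are assumptions, and the paper's exact training configuration is not given on this page.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is lower when the policy favors the chosen response more than the reference does.
favoring_chosen = dpo_pair_loss(-1.0, -5.0, -3.0, -3.0)
favoring_rejected = dpo_pair_loss(-5.0, -1.0, -3.0, -3.0)
print(favoring_chosen < favoring_rejected)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2, which is the value at the start of training from the reference checkpoint.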

If this is right

  • Classification and explanation quality improve together rather than trading off.
  • The method works with only a small fraction of the labeled data normally required.
  • Weaker models can be upgraded by aligning to stronger models' explanation style.
  • The same pipeline scales to other low-resource moderation settings with only minimal new human annotation.
  • Both classification accuracy and human-interpretable rationales are produced by the same trained model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be tested on non-English toxicity datasets to check whether self-augmentation reduces the need for native-speaker annotators.
  • Preference optimization over self-generated labels might be combined with existing RLHF pipelines to improve safety guardrails more broadly.
  • If explanation quality is measured by human raters after Stage 2, the cross-model alignment step may serve as a cheap proxy for human feedback.
  • The approach suggests a general recipe for any classification task where both the label and a short justification are desired but labeled data are scarce.

Load-bearing premise

The synthetic explanations the LLM produces for both right and wrong labels are high enough quality and free enough of new bias that they can serve as reliable training signals.

What would settle it

Run the full SMARTER pipeline on one of the three benchmarks and measure whether adding the preference-optimization stage produces no gain or a loss in macro-F1 compared with the same base LLM trained only on the few-shot examples without any synthetic explanations.
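The comparison described above comes down to computing macro-F1 for two sets of predictions on the same test set. The sketch below does that in plain Python. The labels and predictions are toy values invented for illustration, not results from the paper, and whether the reported 13 percent is a relative or absolute change is not stated on this page, so the relative-change line is an assumption.

```python
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical), standing in for a benchmark test split.
gold = ["hate", "normal", "offensive", "normal", "hate", "offensive"]
baseline_pred = ["normal", "normal", "offensive", "normal", "hate", "normal"]
augmented_pred = ["hate", "normal", "offensive", "normal", "hate", "normal"]

base = macro_f1(gold, baseline_pred)
aug = macro_f1(gold, augmented_pred)
relative_change_pct = 100.0 * (aug - base) / base  # assumes a relative comparison
print(f"baseline={base:.3f} augmented={aug:.3f} change={relative_change_pct:+.1f}%")
```

A null or negative change on a real benchmark, repeated across runs, would be the outcome that counts against the preference-optimization stage.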

Figures

Figures reproduced from arXiv: 2509.15174 by Advik Sachdeva, Hal Daumé III, Huy Nghiem.

Figure 1: Bar plots for K-shot classification experiments on 3 datasets using Llama and T5 models. Macro F1 scores and percentage change over Baseline are displayed on top. Results for Baselines and DPO-augmented variants for K ∈ {16, 32, 64, 128} are displayed on the left subfigures. Results for K = 256 of Baseline, KTO, DPO-augmented and its other variants on various sub-sampling strategies (section 5.1.1) are sho…
Figure 2: Macro-F1 scores on test portion of the 3 test …
Figure 4: Self-augmenting pipeline: for each post, explanations are conditionally generated for the gold label and …
Figure 5: Prompt template for classification tasks
Figure 6: Prompt template for classification tasks. The definition block should contain the set’s full label space
Figure 7: Prompt template to obtain explanation conditioned on a single label and its definition.
Figure 8: Prompt template for evaluating explanation–label consistency. The model is asked to judge whether a generated explanation is logically consistent with its predicted label
Figure 9: Prompt template for evaluating explanation–definition consistency. The model is asked to assess whether a generated explanation logically aligns with the formal definition of the predicted label
Figure 10: Comparison of Precision, Recall, and Macro F1-scores for the …
Figure 11: Percentage distribution of Llama- and T5-style explanations on test sets by BERT style classifiers, pre (Base) and post cross-model (Xmod) refined
Figure 12: Examples of a post in HateXplain with gold label, along with the explanations of the T5 and Llama self-augmented variants at K=128, and the cross-trained model using all K=256 in total. We observe that models trained with more data (+Xmod) gives the correct classification
Figure 13: Examples of a post in Latent Hate with gold label, along with the explanations of the T5 and Llama self-augmented variants at K=128, and the cross-trained model using all K=256 in total. In this example, we observe that T5 absorbs the verbosity of Llama after cross-training, yet still gets the wrong label. On the other hand, Llama’s explanation is more terse after cross-training, and also classifies the p…
Figure 14: Examples of a post in Implicit Hate with gold label, along with the explanations of the T5 and Llama self-augmented variants at K=128, and the cross-trained model using all K=256 in total. In this example, we also observe the cross-pollination of style in terms of verbosity, albeit to a lesser degree, but all models classify correctly
Figure 15: Examples of explanations that are judged to contains some forms of inconsistency with respect to the …
Original abstract

WARNING: This paper contains examples of offensive materials. To address the proliferation of toxic content on social media, we introduce SMARTER, we introduce SMARTER, a data-efficient two-stage framework for explainable content moderation using Large Language Models (LLMs). In Stage 1, we leverage LLMs' own outputs to generate synthetic explanations for both correct and incorrect labels, enabling alignment via preference optimization with minimal human supervision. In Stage 2, we refine explanation quality through cross-model training, allowing weaker models to align stylistically and semantically with stronger ones. Experiments on three benchmark tasks -- HateXplain, Latent Hate, and Implicit Hate -- demonstrate that SMARTER enables LLMs to achieve up to a 13% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training data. Our framework offers a scalable strategy for low-resource settings by harnessing LLMs' self-improving capabilities for both classification and explanation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces SMARTER, a two-stage framework for data-efficient, explainable toxicity detection using LLMs. Stage 1 generates synthetic explanations for both correct and incorrect labels to enable preference optimization with minimal human supervision; Stage 2 refines explanation quality via cross-model training. Experiments on HateXplain, Latent Hate, and Implicit Hate benchmarks claim up to 13% macro-F1 gains over few-shot baselines while using only a fraction of the full training data.

Significance. If the empirical claims hold after proper validation, the work could offer a scalable approach to low-resource content moderation by leveraging LLM self-augmentation for both classification accuracy and explanation quality, addressing a practical need in subjective toxicity tasks.

major comments (3)
  1. [Abstract] Abstract: the reported 13% macro-F1 improvement is stated without any description of experimental setup, baseline definitions (e.g., exact few-shot prompting details), number of runs, statistical significance tests, or controls for bias introduced by synthetic data generation.
  2. [Stage 1] Stage 1 description: the framework relies on LLM-generated explanations for incorrect labels to form preference pairs, yet provides no explicit mechanism for fidelity checking, quality filtering, or human validation of these synthetic rationales; this is load-bearing for the central claim of net-positive self-improvement.
  3. [Experiments] Experiments section: in subjective tasks such as Implicit Hate and Latent Hate, the absence of analysis on whether synthetic explanations for incorrect labels amplify initial model biases or hallucinations undermines the interpretation of the reported gains as genuine self-improvement rather than optimization artifacts.
minor comments (1)
  1. [Abstract] Abstract contains a clear duplication: 'we introduce SMARTER, we introduce SMARTER'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we agree and the revisions planned for the next version of the manuscript.

point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 13% macro-F1 improvement is stated without any description of experimental setup, baseline definitions (e.g., exact few-shot prompting details), number of runs, statistical significance tests, or controls for bias introduced by synthetic data generation.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we have updated the abstract to briefly note that the reported gains are the maximum observed across the three benchmarks relative to standard few-shot prompting with the identical base LLM, that results are averaged over five independent runs, and that statistical significance was assessed via paired bootstrap tests. Detailed baseline prompting templates, run counts, and synthetic-data ablation controls are now explicitly referenced in the abstract and expanded in Section 4.
    revision: partial

  2. Referee: [Stage 1] Stage 1 description: the framework relies on LLM-generated explanations for incorrect labels to form preference pairs, yet provides no explicit mechanism for fidelity checking, quality filtering, or human validation of these synthetic rationales; this is load-bearing for the central claim of net-positive self-improvement.

    Authors: The referee rightly highlights that explicit safeguards are important for the self-improvement claim. While the original preference-optimization step implicitly down-weights low-quality rationales by contrasting them with correct-label explanations, we have added an automated fidelity filter in Stage 1 that uses a separate NLI model to discard explanations whose entailment score with the assigned label falls below a threshold. We report the filtering rate and include a small-scale human validation study in the appendix to substantiate that the retained synthetic data supports net-positive gains.
    revision: yes

  3. Referee: [Experiments] Experiments section: in subjective tasks such as Implicit Hate and Latent Hate, the absence of analysis on whether synthetic explanations for incorrect labels amplify initial model biases or hallucinations undermines the interpretation of the reported gains as genuine self-improvement rather than optimization artifacts.

    Authors: We concur that bias amplification is a legitimate concern in these subjective domains. In the revised experiments section we have added a targeted analysis that compares hallucination and bias indicators (measured by an independent verifier model) on synthetic explanations before and after SMARTER training. The results indicate that inconsistency rates do not increase and, in several cases, decrease, supporting that the observed macro-F1 gains reflect genuine improvement rather than artifacts. We discuss these findings and provide illustrative examples in the appendix.
    revision: yes
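The first response mentions paired bootstrap tests without describing them. As a hedged illustration, the sketch below shows one common variant: resample test items with replacement, score both systems on the same resample, and count how often the candidate fails to beat the baseline. The function names, resample count, and seed are assumptions, and this is not claimed to be the authors' procedure.

```python
import random
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap(
    gold: List[str],
    pred_base: List[str],
    pred_new: List[str],
    n_resamples: int = 1000,  # assumed setting
    seed: int = 0,            # fixed so the sketch is reproducible
) -> float:
    """Fraction of resamples in which pred_new does not beat pred_base on macro-F1."""
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_resamples):
        # Draw the same item indices for both systems (this is the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        base_score = macro_f1(g, [pred_base[i] for i in idx])
        new_score = macro_f1(g, [pred_new[i] for i in idx])
        if new_score <= base_score:
            not_better += 1
    return not_better / n_resamples


# Toy data (hypothetical), repeated to give the resampler something to work with.
gold = ["hate", "normal", "offensive", "normal", "hate", "offensive"] * 5
pred_base = ["normal", "normal", "offensive", "normal", "hate", "normal"] * 5
pred_new = ["hate", "normal", "offensive", "normal", "hate", "normal"] * 5
p_value = paired_bootstrap(gold, pred_base, pred_new)
print(p_value)
```

A small returned fraction suggests the improvement is unlikely to be an artifact of which test items happened to be sampled. It says nothing about variance across training runs, which would need separate seeds.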

Circularity Check

0 steps flagged

No significant circularity: empirical framework evaluated on external benchmarks

full rationale

The paper presents SMARTER as a two-stage empirical method: LLM-generated synthetic explanations for preference optimization (Stage 1) followed by cross-model refinement (Stage 2). All performance claims (up to 13% macro-F1 gain) are direct measurements on public benchmarks (HateXplain, Latent Hate, Implicit Hate) against few-shot baselines using a fraction of training data. No equations, self-definitional quantities, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation. The central results remain externally falsifiable on standard datasets and do not reduce to the method's own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the premise that LLMs can produce usable synthetic supervision signals; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Large language models can generate synthetic explanations for both correct and incorrect toxicity labels that are useful for preference optimization.
    This premise underpins Stage 1 of the described pipeline.

pith-pipeline@v0.9.0 · 5703 in / 1218 out tokens · 43699 ms · 2026-05-18T15:48:54.468466+00:00 · methodology

