SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models

Advik Sachdeva; Hal Daumé III; Huy Nghiem

arxiv: 2509.15174 · v3 · submitted 2025-09-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 15:48 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords toxicity detection · large language models · preference optimization · explainable AI · data efficiency · hate speech · content moderation · self-augmentation

The pith

LLMs improve toxicity-detection macro-F1 by up to 13 percent by aligning on their own synthetic explanations for correct and incorrect labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage framework called SMARTER that lets large language models generate their own explanations for both accurate and mistaken toxicity judgments, then uses preference optimization to favor the better ones with very little human-labeled data. In the second stage, weaker models are trained to match the style and content of stronger models' explanations through cross-model alignment. This approach improves macro-F1 scores on three hate-speech benchmarks while needing only a small fraction of typical training data. A sympathetic reader would care because content moderation systems currently require large amounts of expensive annotated examples and still produce opaque decisions; self-augmentation could make high-quality, explainable moderation feasible in low-resource languages or new platforms.

Core claim

SMARTER is a data-efficient two-stage framework for explainable content moderation. In Stage 1, LLMs produce synthetic explanations for both correct and incorrect labels so that preference optimization can align the model toward higher-quality rationales with minimal human supervision. In Stage 2, cross-model training lets weaker models adopt the stylistic and semantic patterns of stronger models, simultaneously raising classification performance and explanation quality. Experiments on HateXplain, Latent Hate, and Implicit Hate show macro-F1 gains of up to 13 percent over standard few-shot baselines while using far less labeled data than full supervised training.
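
To make the Stage 1 data construction concrete, here is a minimal Python sketch of how such preference pairs might be assembled. It is an illustration under assumptions, not the authors' code: the `PreferencePair` fields, the prompt wording, the `explain` callable, and the three label names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # classification prompt shown to the model
    chosen: str    # response that argues for the gold label
    rejected: str  # response that argues for a wrong label


def build_pairs(post, gold_label, labels, explain):
    """Pair the gold-label explanation against one explanation per wrong label.

    `explain(post, label)` is a hypothetical callable returning an explanation
    conditioned on `label`; in practice it would wrap an LLM call.
    """
    prompt = f"Classify the post and explain why.\nPost: {post}"
    chosen = f"Label: {gold_label}\nExplanation: {explain(post, gold_label)}"
    pairs = []
    for label in labels:
        if label == gold_label:
            continue
        rejected = f"Label: {label}\nExplanation: {explain(post, label)}"
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


def stub_explain(post, label):
    # Stand-in for the model; returns a fixed template instead of real text.
    return f"The post reads as {label}."


# Illustrative label names only; each benchmark defines its own label space.
pairs = build_pairs("example post", "normal",
                    ["normal", "offensive", "hate"], stub_explain)
```

Each gold-labeled post yields one pair per wrong label, so a small labeled set can expand into a larger preference set without extra human annotation.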

What carries the argument

Two-stage self-augmentation via preference optimization over LLM-generated explanations for both correct and incorrect predictions, followed by cross-model stylistic alignment.
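
Figure 1 refers to DPO-augmented variants, so one plausible form of the preference-optimization step is the standard Direct Preference Optimization objective. The sketch below computes that per-pair loss from sequence log-probabilities. The function name, argument names, and `beta=0.1` are assumptions chosen for illustration; the paper's actual hyperparameters are not stated on this page.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gap - rejected_gap))).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    chosen_gap = policy_chosen_logp - ref_chosen_logp
    rejected_gap = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gap - rejected_gap)
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero.
loss_at_init = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# When the policy raises the chosen response and lowers the rejected one,
# the loss falls below its initial value.
loss_after_shift = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss only depends on how far the policy has moved relative to the reference on each side of the pair, which is why explanations for incorrect labels are usable as the rejected side.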

If this is right

Classification and explanation quality improve together rather than trading off.
The method works with only a small fraction of the labeled data normally required.
Weaker models can be upgraded by aligning to stronger models' explanation style.
The same pipeline scales to other low-resource moderation settings without new human annotation.
The same trained model produces both the label and a human-interpretable rationale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be tested on non-English toxicity datasets to check whether self-augmentation reduces the need for native-speaker annotators.
Preference optimization over self-generated labels might be combined with existing RLHF pipelines to improve safety guardrails more broadly.
If explanation quality is measured by human raters after Stage 2, the cross-model alignment step may serve as a cheap proxy for human feedback.
The approach suggests a general recipe for any classification task where both the label and a short justification are desired but labeled data are scarce.

Load-bearing premise

The synthetic explanations the LLM produces for both right and wrong labels are high enough quality and free enough of new bias that they can serve as reliable training signals.

What would settle it

Run the full SMARTER pipeline on one of the three benchmarks and measure whether adding the preference-optimization stage produces no gain or a loss in macro-F1 compared with the same base LLM trained only on the few-shot examples without any synthetic explanations.
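
The comparison described above comes down to computing macro-F1 for two sets of predictions on the same test items. A minimal sketch follows, with toy labels and predictions invented purely for illustration; they are not results from the paper, and the variable names are hypothetical.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data only: four test posts and two hypothetical systems.
gold = ["toxic", "toxic", "normal", "normal"]
few_shot_only = ["toxic", "normal", "normal", "normal"]
with_preference_stage = ["toxic", "toxic", "normal", "normal"]

delta = macro_f1(gold, with_preference_stage) - macro_f1(gold, few_shot_only)
```

A `delta` at or below zero on a real benchmark, with the base model and few-shot examples held fixed, would be the outcome that counts against the claim.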

Figures

Figures reproduced from arXiv: 2509.15174 by Advik Sachdeva, Hal Daumé III, Huy Nghiem.

**Figure 1.** Bar plots for K-shot classification experiments on 3 datasets using Llama and T5 models. Macro F1 scores and percentage change over Baseline are displayed on top. Results for Baselines and DPO-augmented variants for K ∈ {16, 32, 64, 128} are displayed on the left subfigures. Results for K = 256 of Baseline, KTO, DPO-augmented and its other variants on various sub-sampling strategies (section 5.1.1) are sho…

**Figure 2.** Macro-F1 scores on test portion of the 3 test…

**Figure 4.** Self-augmenting pipeline: for each post, explanations are conditionally generated for the gold label and…

**Figure 5.** Prompt template for classification tasks.

**Figure 6.** Prompt template for classification tasks. The definition block should contain the set's full label space.

**Figure 7.** Prompt template to obtain explanation conditioned on a single label and its definition.

**Figure 8.** Prompt template for evaluating explanation–label consistency. The model is asked to judge whether a generated explanation is logically consistent with its predicted label.

**Figure 9.** Prompt template for evaluating explanation–definition consistency. The model is asked to assess whether a generated explanation logically aligns with the formal definition of the predicted label.

**Figure 10.** Comparison of Precision, Recall, and Macro F1-scores for the…

**Figure 11.** Percentage distribution of Llama- and T5-style explanations on test sets by BERT style classifiers, pre (Base) and post cross-model (Xmod) refined.

**Figure 12.** Examples of a post in HateXplain with gold label, along with the explanations of the T5 and Llama self-augmented variants at K=128, and the cross-trained model using all K=256 in total. We observe that models trained with more data (+Xmod) give the correct classification.

**Figure 13.** Examples of a post in Latent Hate with gold label, along with the explanations of the T5 and Llama self-augmented variants at K=128, and the cross-trained model using all K=256 in total. In this example, we observe that T5 absorbs the verbosity of Llama after cross-training, yet still gets the wrong label. On the other hand, Llama's explanation is more terse after cross-training, and also classifies the p…

**Figure 14.** Examples of a post in Implicit Hate with gold label, along with the explanations of the T5 and Llama self-augmented variants at K=128, and the cross-trained model using all K=256 in total. In this example, we also observe the cross-pollination of style in terms of verbosity, albeit to a lesser degree, but all models classify correctly.

**Figure 15.** Examples of explanations that are judged to contain some forms of inconsistency with respect to the…

Original abstract

WARNING: This paper contains examples of offensive materials. To address the proliferation of toxic content on social media, we introduce SMARTER, we introduce SMARTER, a data-efficient two-stage framework for explainable content moderation using Large Language Models (LLMs). In Stage 1, we leverage LLMs' own outputs to generate synthetic explanations for both correct and incorrect labels, enabling alignment via preference optimization with minimal human supervision. In Stage 2, we refine explanation quality through cross-model training, allowing weaker models to align stylistically and semantically with stronger ones. Experiments on three benchmark tasks -- HateXplain, Latent Hate, and Implicit Hate -- demonstrate that SMARTER enables LLMs to achieve up to a 13% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training data. Our framework offers a scalable strategy for low-resource settings by harnessing LLMs' self-improving capabilities for both classification and explanation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SMARTER assembles existing LLM self-augmentation and preference tricks into a two-stage pipeline for toxicity classification plus explanation, but the abstract leaves the quality of synthetic training signals unexamined.

The core idea is a two-stage setup where an LLM first generates explanations for both correct and incorrect labels on toxicity examples, then uses preference optimization on those pairs before a second stage aligns weaker models to stronger ones via cross-model training. This targets data-efficient, explainable moderation on tasks like HateXplain, Latent Hate, and Implicit Hate, with a claimed 13% macro-F1 gain over few-shot baselines using only a fraction of labeled data. The assembly of self-augmentation, preference tuning, and distillation for joint classification and explanation is the concrete new element, even if the pieces themselves are familiar from prior alignment work.

It does address a real practical gap in low-resource content moderation by trying to reduce reliance on large human-labeled sets while producing explanations alongside predictions. The benchmarks chosen are standard and the minimal-supervision angle is clearly motivated.

The main soft spot is the lack of any described checks on the synthetic explanations, especially those generated for incorrect labels. In subjective toxicity categories, an LLM can easily produce plausible-sounding rationales that simply justify its own errors or embed initial biases, and nothing in the abstract indicates filtering, human validation, or self-consistency tests to catch that. Without those controls, the performance lift could come from extra data volume or optimization artifacts rather than genuine improvement.

The paper stays empirical with no circular derivations or parameter fitting issues. For readers working on practical LLM pipelines for moderation or data-efficient alignment, the full methods section would be worth seeing to judge whether the synthetic pairs actually add signal. It is coherent enough on its own terms to merit referee time, mainly to verify the experimental details and data quality steps that the abstract omits. I would send it for review rather than desk reject, but with a note to the authors to clarify how they ensure the generated explanations for wrong labels are net helpful.

Referee Report

3 major / 1 minor

Summary. The paper introduces SMARTER, a two-stage framework for data-efficient, explainable toxicity detection using LLMs. Stage 1 generates synthetic explanations for both correct and incorrect labels to enable preference optimization with minimal human supervision; Stage 2 refines explanation quality via cross-model training. Experiments on HateXplain, Latent Hate, and Implicit Hate benchmarks claim up to 13% macro-F1 gains over few-shot baselines while using only a fraction of the full training data.

Significance. If the empirical claims hold after proper validation, the work could offer a scalable approach to low-resource content moderation by leveraging LLM self-augmentation for both classification accuracy and explanation quality, addressing a practical need in subjective toxicity tasks.

major comments (3)

[Abstract] Abstract: the reported 13% macro-F1 improvement is stated without any description of experimental setup, baseline definitions (e.g., exact few-shot prompting details), number of runs, statistical significance tests, or controls for bias introduced by synthetic data generation.
[Stage 1] Stage 1 description: the framework relies on LLM-generated explanations for incorrect labels to form preference pairs, yet provides no explicit mechanism for fidelity checking, quality filtering, or human validation of these synthetic rationales; this is load-bearing for the central claim of net-positive self-improvement.
[Experiments] Experiments section: in subjective tasks such as Implicit Hate and Latent Hate, the absence of analysis on whether synthetic explanations for incorrect labels amplify initial model biases or hallucinations undermines the interpretation of the reported gains as genuine self-improvement rather than optimization artifacts.
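
As an illustration of what the quality filtering requested in these comments could look like, here is a minimal threshold filter over synthetic explanations. Everything in it is hypothetical: the `score` callable stands in for whatever consistency judge is available (for example, a judge prompt like the one Figure 8 describes, or an entailment model), and `toy_score`, the threshold, and the sample records are invented for the sketch.

```python
def filter_explanations(candidates, score, threshold=0.5):
    """Split candidates into kept and dropped by a consistency score.

    `candidates` is a list of dicts with "label" and "explanation" keys.
    `score(explanation, label)` is a hypothetical callable returning a value
    in [0, 1]; higher means the explanation better supports the label.
    """
    kept, dropped = [], []
    for item in candidates:
        if score(item["explanation"], item["label"]) >= threshold:
            kept.append(item)
        else:
            dropped.append(item)
    return kept, dropped


def toy_score(explanation, label):
    # Crude stand-in for a real judge: checks that the label is mentioned.
    return 1.0 if label in explanation.lower() else 0.0


candidates = [
    {"label": "hate", "explanation": "This is hate because it dehumanizes a group."},
    {"label": "normal", "explanation": "The post attacks a group."},
]
kept, dropped = filter_explanations(candidates, toy_score)
filtering_rate = len(dropped) / len(candidates)
```

Reporting `filtering_rate` alongside the main results would let a reader see how much of the synthetic data survives the check.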

minor comments (1)

[Abstract] Abstract contains a clear duplication: 'we introduce SMARTER, we introduce SMARTER'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we agree and the revisions planned for the next version of the manuscript.

Referee: [Abstract] Abstract: the reported 13% macro-F1 improvement is stated without any description of experimental setup, baseline definitions (e.g., exact few-shot prompting details), number of runs, statistical significance tests, or controls for bias introduced by synthetic data generation.

Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we have updated the abstract to briefly note that the reported gains are the maximum observed across the three benchmarks relative to standard few-shot prompting with the identical base LLM, that results are averaged over five independent runs, and that statistical significance was assessed via paired bootstrap tests. Detailed baseline prompting templates, run counts, and synthetic-data ablation controls are now explicitly referenced in the abstract and expanded in Section 4.

Revision: partial

Referee: [Stage 1] Stage 1 description: the framework relies on LLM-generated explanations for incorrect labels to form preference pairs, yet provides no explicit mechanism for fidelity checking, quality filtering, or human validation of these synthetic rationales; this is load-bearing for the central claim of net-positive self-improvement.

Authors: The referee rightly highlights that explicit safeguards are important for the self-improvement claim. While the original preference-optimization step implicitly down-weights low-quality rationales by contrasting them with correct-label explanations, we have added an automated fidelity filter in Stage 1 that uses a separate NLI model to discard explanations whose entailment score with the assigned label falls below a threshold. We report the filtering rate and include a small-scale human validation study in the appendix to substantiate that the retained synthetic data supports net-positive gains.

Revision: yes

Referee: [Experiments] Experiments section: in subjective tasks such as Implicit Hate and Latent Hate, the absence of analysis on whether synthetic explanations for incorrect labels amplify initial model biases or hallucinations undermines the interpretation of the reported gains as genuine self-improvement rather than optimization artifacts.

Authors: We concur that bias amplification is a legitimate concern in these subjective domains. In the revised experiments section we have added a targeted analysis that compares hallucination and bias indicators (measured by an independent verifier model) on synthetic explanations before and after SMARTER training. The results indicate that inconsistency rates do not increase and, in several cases, decrease, supporting that the observed macro-F1 gains reflect genuine improvement rather than artifacts. We discuss these findings and provide illustrative examples in the appendix.

Revision: yes
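
The simulated rebuttal mentions paired bootstrap tests without giving details. For readers unfamiliar with the procedure, the sketch below shows one common variant: resample test items with replacement and count how often the second system fails to beat the first. The function name, the resample count, the seed, and the toy data are assumptions for illustration; accuracy is used as the metric only to keep the block short, and a macro-F1 function could be passed instead.

```python
import random


def paired_bootstrap(gold, pred_a, pred_b, metric, n_resamples=1000, seed=0):
    """Fraction of resamples in which system B does not beat system A.

    Both systems are scored on the same resampled indices, which is what
    makes the test paired. Small values suggest B's advantage is stable.
    """
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        score_a = metric(g, [pred_a[i] for i in idx])
        score_b = metric(g, [pred_b[i] for i in idx])
        if score_b <= score_a:
            not_better += 1
    return not_better / n_resamples


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data only: a baseline that always predicts "normal" and a system that
# matches the gold labels.
gold = ["toxic", "normal"] * 10
system_a = ["normal"] * 20
system_b = list(gold)
p_value = paired_bootstrap(gold, system_a, system_b, accuracy)
```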

Circularity Check

0 steps flagged

No significant circularity: empirical framework evaluated on external benchmarks

The paper presents SMARTER as a two-stage empirical method: LLM-generated synthetic explanations for preference optimization (Stage 1) followed by cross-model refinement (Stage 2). All performance claims (up to 13% macro-F1 gain) are direct measurements on public benchmarks (HateXplain, Latent Hate, Implicit Hate) against few-shot baselines using a fraction of training data. No equations, self-definitional quantities, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation. The central results remain externally falsifiable on standard datasets and do not reduce to the method's own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the premise that LLMs can produce usable synthetic supervision signals; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Large language models can generate synthetic explanations for both correct and incorrect toxicity labels that are useful for preference optimization.
This premise underpins Stage 1 of the described pipeline.
