ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

Axel Delaval; Haicheng Wang; Han Qiu; Jialiang Lu; Shujian Yang

arXiv: 2508.11281 · v3 · submitted 2025-08-15 · 💻 cs.CL · cs.AI · cs.CY

Pith reviewed 2026-05-18 23:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY

keywords French toxicity detection · ToxiFrench dataset · chain-of-thought fine-tuning · dynamic weighted loss · small language models · benchmark evaluation · cross-lingual retention · semi-automated annotation

The pith

A 4-billion-parameter model fine-tuned with chain-of-thought steps reaches state-of-the-art results on French toxicity detection and beats larger models while keeping cross-lingual performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToxiFrench, a dataset of 53,622 French online comments built with a semi-automated process that limits human labeling to 10 percent. Benchmark tests reveal that smaller language models often prove more robust than bigger ones on this task. The authors then apply a chain-of-thought fine-tuning method that uses dynamic weighted loss to focus training on the model's final judgment. This produces a 4B model whose balanced accuracy rises 10 percent above its starting point and exceeds the scores of GPT-4o and DeepSeek-R1 on the new benchmark.

Core claim

The central claim is that a Chain-of-Thought fine-tuning strategy paired with Dynamic Weighted Loss, applied to a 4B model on the ToxiFrench benchmark, yields higher balanced accuracy than much larger models while preserving the ability to handle tasks in other languages.

What carries the argument

The Chain-of-Thought fine-tuning strategy with Dynamic Weighted Loss, which progressively increases emphasis on the model's final decision during training.

If this is right

The fine-tuned 4B model raises balanced accuracy by 10 percent over its unfine-tuned baseline.
The same model records higher scores than GPT-4o and DeepSeek-R1 on the ToxiFrench benchmark.
Smaller models can deliver stronger robustness and generalization than larger models for French toxicity detection.
Cross-lingual capabilities remain intact after the targeted fine-tuning step.
The dataset supplies a balanced split that supports systematic evaluation of future models.
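The headline numbers above are stated in balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch using the standard definition (the mean of per-class recall) for binary toxic/non-toxic labels. The function name and the toy labels are illustrative and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = toxic, 0 = non-toxic)."""
    recalls = []
    for cls in (0, 1):
        # Indices of the examples whose true label is this class.
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"no examples of class {cls}")
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
score = balanced_accuracy(y_true, y_pred)
print(f"balanced accuracy = {score:.3f}")
```

On a split with equally sized classes, as in this toy example, balanced accuracy coincides with plain accuracy; the two diverge once the classes are imbalanced.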

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be adapted to build similar benchmarks for toxicity detection in other languages with limited labeled data.
Targeted fine-tuning on culturally specific comments may allow smaller models to handle content moderation tasks without needing the full scale of frontier systems.
If the dynamic weighting approach improves decision faithfulness across tasks, it could be tested on related classification problems such as hate-speech or misinformation detection.
Practical deployment would benefit from checking performance on live French social-media streams rather than static benchmark splits.

Load-bearing premise

The semi-automated annotation pipeline produces toxicity labels that match those from full human annotation closely enough for the benchmark to be trusted.

What would settle it

A side-by-side comparison in which the same set of comments receives fully human labels and shows statistically significant differences from the semi-automated labels would falsify the benchmark's reliability.
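One ingredient of such a comparison is a chance-corrected agreement score between the two label sets. The sketch below computes Cohen's kappa from its standard definition; the function name and the toy labels are illustrative assumptions, not data from the paper. Kappa quantifies agreement only, so a significance claim would additionally need a paired test on the disagreements.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    classes = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only (1 = toxic, 0 = non-toxic).
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
pipeline = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(human, pipeline)
print(f"kappa = {kappa:.2f}")
```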

Original abstract

Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, human-annotated, large-scale datasets. In this work, we release ToxiFrench, a dataset of 53,622 French online comments together with a balanced benchmark split for systematic evaluation. The dataset is constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification, while ensuring statistical alignment with human-only annotation. We then benchmark a broad range of models and uncover a counterintuitive finding: Small Language Models (SLMs) often surpass larger models in robustness and generalization on this task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a Dynamic Weighted Loss (DWL) that progressively emphasizes the model's final decision and significantly improves faithfulness. Our fine-tuned 4B model (Qwen3-4B) achieves state-of-the-art performance on the benchmark. It improves its balanced accuracy by 10% over its baseline and achieves better performance than GPT-4o and DeepSeek-R1 on our benchmark, while successfully retaining cross-lingual capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ToxiFrench, a dataset of 53,622 French online comments for toxicity detection, built via a semi-automated annotation pipeline that uses high-confidence LLM pre-annotation followed by human verification on only 10% of samples while claiming statistical alignment with fully human-annotated data. It benchmarks a range of models and reports the counterintuitive result that smaller language models often outperform larger ones in robustness and generalization. The authors then propose Chain-of-Thought fine-tuning with a Dynamic Weighted Loss (DWL) that progressively emphasizes the final decision, and claim that the resulting fine-tuned Qwen3-4B model achieves state-of-the-art balanced accuracy on the benchmark (10% gain over baseline), surpasses GPT-4o and DeepSeek-R1, and retains cross-lingual capabilities.

Significance. If the annotation alignment holds and the performance deltas are robustly verified, the work would be a useful contribution to multilingual toxicity detection by releasing a culturally relevant French benchmark and by showing that targeted CoT fine-tuning can make small models competitive with much larger ones. The dataset release itself is a clear positive that enables follow-on research.

major comments (3)

[Dataset construction / annotation pipeline] The central claim that the semi-automated pipeline 'ensures statistical alignment with human-only annotation' is load-bearing for every downstream result, yet no quantitative evidence (Cohen's kappa, chi-square test on label distributions, balanced accuracy on a held-out human-verified subset, or p-values) is provided in the dataset section or appendix. Without these metrics the benchmark cannot be treated as reliable ground truth.
[Experimental results] §4 (experimental results) and the abstract report a 10% balanced-accuracy gain and superiority to GPT-4o without error bars, confidence intervals, or any statistical significance test. This makes it impossible to assess whether the reported deltas are distinguishable from noise.
[Method / fine-tuning strategy] The Dynamic Weighted Loss (DWL) is presented as a key innovation that 'progressively emphasizes the model's final decision,' but neither the exact weighting schedule nor its hyper-parameters are defined (no equation or pseudocode). Consequently the ablation-free claim that DWL drives the faithfulness improvement cannot be evaluated or reproduced.
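To make the second comment concrete, the sketch below shows one common way to attach an interval to a balanced-accuracy difference between two models: a paired percentile bootstrap over test examples. It is an illustration under stated assumptions, not the paper's protocol. The function names, the resample count, and the toy predictions are all hypothetical.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels 0 and 1."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / 2


def bootstrap_delta_ci(y_true, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for BA(pred_a) - BA(pred_b), paired by example."""
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    while len(deltas) < n_resamples:
        # Resample example indices with replacement; both models share the sample.
        sample = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in sample]
        if len(set(yt)) < 2:
            continue  # skip resamples that miss a class entirely
        pa = [pred_a[i] for i in sample]
        pb = [pred_b[i] for i in sample]
        deltas.append(balanced_accuracy(yt, pa) - balanced_accuracy(yt, pb))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data, invented for illustration only.
y_true = [1] * 20 + [0] * 20
pred_a = [1] * 18 + [0] * 2 + [0] * 17 + [1] * 3
pred_b = [1] * 14 + [0] * 6 + [0] * 15 + [1] * 5
lo, hi = bootstrap_delta_ci(y_true, pred_a, pred_b)
print(f"95% interval for the balanced-accuracy difference: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would suggest the gap is not explained by test-set sampling alone. It says nothing about variance across training seeds, which needs repeated runs.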

minor comments (2)

[Abstract / experiments] The abstract states that cross-lingual capabilities are retained but gives no evaluation protocol or metrics; a short paragraph or table in the experiments section would clarify this claim.
[Figures and tables] Table or figure captions should explicitly list the exact number of models, seeds, and evaluation splits used so that readers can immediately verify the scope of the benchmarking.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have identified important areas where the manuscript can be strengthened in terms of rigor and reproducibility. We address each major comment below and commit to revisions that directly respond to the concerns.

Referee: [Dataset construction / annotation pipeline] The central claim that the semi-automated pipeline 'ensures statistical alignment with human-only annotation' is load-bearing for every downstream result, yet no quantitative evidence (Cohen's kappa, chi-square test on label distributions, balanced accuracy on a held-out human-verified subset, or p-values) is provided in the dataset section or appendix. Without these metrics the benchmark cannot be treated as reliable ground truth.

Authors: We agree that explicit quantitative validation of the statistical alignment is necessary to support the reliability of the ToxiFrench benchmark. Although the semi-automated pipeline was designed to maintain alignment through high-confidence pre-annotation and targeted human verification, the manuscript does not report the supporting metrics. We will add Cohen's kappa, chi-square test results on label distributions, balanced accuracy on the held-out human-verified subset, and associated p-values to Section 3 and the appendix in the revised manuscript.
Revision: yes

Referee: [Experimental results] §4 (experimental results) and the abstract report a 10% balanced-accuracy gain and superiority to GPT-4o without error bars, confidence intervals, or any statistical significance test. This makes it impossible to assess whether the reported deltas are distinguishable from noise.

Authors: We acknowledge that the lack of error bars, confidence intervals, and statistical significance testing weakens the interpretability of the reported performance improvements. We will revise Section 4 to include results from multiple runs with standard deviations, confidence intervals, and appropriate statistical tests (e.g., paired t-tests) to demonstrate that the observed gains are robust and distinguishable from noise. Corresponding updates will be made to the abstract.
Revision: yes

Referee: [Method / fine-tuning strategy] The Dynamic Weighted Loss (DWL) is presented as a key innovation that 'progressively emphasizes the model's final decision,' but neither the exact weighting schedule nor its hyper-parameters are defined (no equation or pseudocode). Consequently the ablation-free claim that DWL drives the faithfulness improvement cannot be evaluated or reproduced.

Authors: We agree that the current description of the Dynamic Weighted Loss lacks the necessary detail for reproducibility. We will expand the Methods section to include the precise mathematical formulation of the weighting schedule, all hyper-parameter values, and pseudocode for the loss computation. This addition will allow the community to evaluate and replicate the fine-tuning approach.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper constructs a new benchmark via semi-automated LLM pre-annotation plus 10% human verification, asserts statistical alignment with human-only labels, then reports empirical performance of models (including a fine-tuned 4B model) on a held-out split. These performance deltas (+10% balanced accuracy, superiority to GPT-4o) are measured quantities on the external benchmark rather than quantities defined by or equivalent to the annotation pipeline itself. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology; the central claims remain independent empirical results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that LLM pre-annotation plus limited human verification produces labels statistically equivalent to full human annotation, and on the unstated details of how the Dynamic Weighted Loss is implemented and scheduled.

free parameters (1)

Dynamic Weighted Loss schedule parameters
The progressive emphasis on the final decision is controlled by unspecified weighting coefficients that are fitted or chosen during training.
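To illustrate the kind of free parameters at stake, the sketch below shows one hypothetical form such a loss could take: a weighted mean of per-token negative log-likelihoods in which the weight on final-decision tokens grows linearly with training progress. This is not the paper's formulation, which the material reviewed here does not specify. The linear schedule, the endpoints `w_start` and `w_end`, and both function names are assumptions made purely for illustration.

```python
def conclusion_weight(step, total_steps, w_start=1.0, w_end=4.0):
    """Hypothetical linear schedule for the weight on final-decision tokens."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * progress


def dynamic_weighted_loss(token_nll, is_conclusion, step, total_steps):
    """Weighted mean of per-token losses; reasoning tokens keep weight 1.0."""
    w = conclusion_weight(step, total_steps)
    weights = [w if flag else 1.0 for flag in is_conclusion]
    weighted_sum = sum(wt * nll for wt, nll in zip(weights, token_nll))
    return weighted_sum / sum(weights)


# Toy per-token losses: three reasoning tokens, then one final-decision token.
token_nll = [0.5, 0.5, 0.5, 2.0]
is_conclusion = [False, False, False, True]

early = dynamic_weighted_loss(token_nll, is_conclusion, step=0, total_steps=100)
late = dynamic_weighted_loss(token_nll, is_conclusion, step=100, total_steps=100)
print(f"early = {early:.3f}, late = {late:.3f}")
```

Even this toy version exposes three choices (the shape of the schedule and its two endpoints) that would each need to be reported and ablated before the effect of the weighting could be isolated.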

axioms (1)

domain assumption: LLM-based pre-annotation at high confidence produces labels that remain statistically aligned with human-only annotation after human verification of the remaining 10%.
Invoked in the dataset-construction paragraph of the abstract; if false, the benchmark and all downstream performance numbers lose validity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "dynamic weighted loss function that progressively increases the weight on the final conclusion's loss"

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.