arxiv: 2604.08886 · v1 · submitted 2026-04-10 · 💻 cs.SE

Real-Time Toxicity Filtering for Open-Source Code Reviews

Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3

classification 💻 cs.SE
keywords toxicity detection · code reviews · open source · BERT · Llama · detoxification · browser extension · software collaboration

The pith

A browser extension detects toxic code reviews at 97% F1 and rewrites them using fine-tuned language models to reduce harm while keeping technical content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToxiShield as a real-time browser extension that flags and rewrites toxic comments in open-source code reviews. It relies on a BERT classifier for initial detection, prompted language models for classifying types of toxicity, and a fine-tuned Llama model to produce less toxic versions of the comments. The authors test these components on tens of thousands of real code-review texts and run a small validation with developers. If the approach holds, it could turn potentially damaging interactions into constructive ones without requiring reviewers to self-edit.

Core claim

ToxiShield combines three modules: a fine-tuned BERT binary classifier that reaches 97% F1 on 38,761 code-review texts, a prompted Claude 3.5 Sonnet model that achieves 39% MCC for multiclass toxicity typing on 1,200 samples, and a fine-tuned Llama 3.2 model that attains 95.27% style-transfer accuracy, 97.03% fluency, 67.07% content preservation, and 84% J-score. Small-scale testing with ten developers indicates the resulting detoxified reviews support more inclusive collaboration.
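The multiclass result is reported as a Matthews correlation coefficient (MCC), which is less familiar than F1. The sketch below computes the standard multiclass MCC from a confusion matrix so the 39% figure can be read on its scale of -1 to 1, where 0 corresponds to chance-level agreement. The function name `multiclass_mcc` and the example matrix are illustrative assumptions; the numbers are not taken from the paper.

```python
from math import sqrt
from typing import Sequence


def multiclass_mcc(confusion: Sequence[Sequence[int]]) -> float:
    """Multiclass Matthews correlation coefficient from a K x K confusion matrix.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)          # all samples
    correct = sum(confusion[i][i] for i in range(k))    # diagonal entries
    true_counts = [sum(row) for row in confusion]       # row sums
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # column sums

    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Degenerate case (e.g. every prediction is the same class): define MCC as 0.
    return numerator / denominator if denominator else 0.0


# Hypothetical 3-class confusion matrix, NOT data from the paper.
example = [
    [30, 5, 5],
    [10, 20, 10],
    [5, 10, 25],
]
print(round(multiclass_mcc(example), 3))
```

Because MCC accounts for the class distribution of both labels and predictions, it can sit well below raw accuracy on the same confusion matrix.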

What carries the argument

ToxiShield, a three-module pipeline that first identifies toxicity with a fine-tuned BERT classifier, then performs reasoned multiclass classification, and finally detoxifies the text with a fine-tuned Llama 3.2 model.
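The control flow of such a pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_pipeline`, `ReviewResult`, the `threshold` parameter, and the keyword-based `toy_detect` and `toy_detoxify` stand-ins are all hypothetical, and the real system would plug in the fine-tuned BERT classifier, the prompted model, and the fine-tuned Llama 3.2 model instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReviewResult:
    original: str
    is_toxic: bool
    category: Optional[str]   # toxicity type, if a categorizer was supplied
    rewrite: Optional[str]    # detoxified text, only for toxic comments


def run_pipeline(
    comment: str,
    detect: Callable[[str], float],
    detoxify: Callable[[str, Optional[str]], str],
    categorize: Optional[Callable[[str], str]] = None,
    threshold: float = 0.5,   # hypothetical decision threshold
) -> ReviewResult:
    """Detect first; only toxic comments are categorized and rewritten."""
    if detect(comment) < threshold:
        return ReviewResult(comment, False, None, None)
    category = categorize(comment) if categorize else None
    return ReviewResult(comment, True, category, detoxify(comment, category))


# Toy stand-ins so the sketch runs; they are NOT the paper's models.
TOY_INSULTS = {"stupid", "idiot", "garbage"}


def toy_detect(comment: str) -> float:
    words = {w.strip(".,!?").lower() for w in comment.split()}
    return 1.0 if words & TOY_INSULTS else 0.0


def toy_detoxify(comment: str, category: Optional[str]) -> str:
    kept = [w for w in comment.split() if w.strip(".,!?").lower() not in TOY_INSULTS]
    return " ".join(kept)


result = run_pipeline("This stupid loop leaks memory.", toy_detect, toy_detoxify)
print(result.is_toxic, result.rewrite)
```

Making the categorizer an optional argument keeps the detection and rewriting path usable on its own, which matters if the multiclass step is slow or unreliable.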

If this is right

  • High-accuracy real-time detection makes it feasible to filter comments before they reach the reviewer.
  • If content preservation remains adequate, automated rewriting can turn toxic feedback into usable input without losing engineering value.
  • Multiclass classification allows the system to apply different rewriting strategies depending on the type of toxicity present.
  • Successful deployment would reduce the volume of harmful exchanges that currently discourage participation in open-source projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be adapted to flag and soften toxic comments in issue trackers or pull-request discussions beyond formal code reviews.
  • Platform maintainers could embed the detector directly into GitHub or GitLab interfaces so that rewritten text appears as a suggested edit rather than a separate tool.
  • Periodic retraining on new toxicity patterns would be needed because language and norms in open-source communities continue to shift.
  • The modest content-preservation score suggests that critical technical details should still be reviewed by a human before the rewritten comment is posted.

Load-bearing premise

The detoxification step keeps the original technical meaning and intent intact without adding errors, and the performance numbers hold for code reviews outside the training datasets and the small developer test group.

What would settle it

A larger blinded study in which developers compare original and detoxified versions of the same reviews and report that the rewritten text changes the technical point or introduces inaccuracies that affect how the review is acted upon.

Figures

Figures reproduced from arXiv: 2604.08886 by Amiangshu Bosu, Anindya Iqbal, Jaydeb Sarker, Md Awsaf Alam Anindya, Showvik Biswas.

Figure 1: Motivational Workflow of ToxiShield that not only flags toxic code review comments but also provides actionable feedback and suggests civil alternatives. While recent work by Rahman et al. uses a customized T5 model to rephrase uncivil comments [6], it lacks explainability and real-time integration. Therefore, we propose ToxiShield [1], a framework that proactively detects and mitigates toxicity in SE co…
Original abstract

Toxic interactions in open-source software development harm community collaboration. To combat this, we propose ToxiShield, a realtime browser extension that identifies and detoxifies toxic code reviews. The framework comprises three modules: toxicity identification, reasoned multiclass classification, and code review detoxification. Our fine-tuned BERT-based binary classifier achieved a 97% F1-score on 38,761 code review texts. For multiclass classification, Claude 3.5 Sonnet with prompt engineering achieved a 39% MCC and 42% F1 on 1,200 samples. Finally, our fine-tuned Llama 3.2 detoxification model reached 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score. Validation with 10 software developers suggests ToxiShield effectively fosters a more inclusive open-source environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ToxiShield, a real-time browser extension for identifying and detoxifying toxic code reviews in open-source projects. It comprises a fine-tuned BERT binary toxicity classifier (97% F1 on 38,761 texts), a Claude 3.5 Sonnet multiclass classifier (39% MCC, 42% F1 on 1,200 samples), and a fine-tuned Llama 3.2 detoxification model (95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, 84% J-score), with validation via a survey of 10 developers.

Significance. If the detoxification module can be shown to preserve technical intent without introducing errors in code-review suggestions, the work could offer a practical tool for reducing toxicity in OSS communities. The provision of concrete metrics on defined dataset sizes is a positive aspect, as is the use of established models (BERT, Llama) with reported performance numbers rather than purely qualitative claims.

major comments (3)
  1. [Abstract] Abstract (detoxification results): The central claim that ToxiShield produces usable, non-toxic code reviews rests on the detoxification module, yet the reported 67.07% content preservation is low for a domain where semantic drift can invalidate technical suggestions (e.g., altering a specific performance fix). No details are given on the exact metric used for content preservation, whether it was human-evaluated for engineering accuracy, or any error analysis of the 33% altered content.
  2. [Abstract] Multiclass classification results: The 39% MCC for the reasoned multiclass module on 1,200 samples is low enough to question its contribution to the overall framework; the manuscript should clarify whether this module is required for the end-to-end real-time system or if the binary classifier alone suffices for the toxicity filtering claim.
  3. [Validation] Developer validation section: The claim that ToxiShield fosters a more inclusive environment is supported only by a survey of n=10 developers with no reported details on survey design, questions, response rates, or statistical significance, limiting the strength of the utility conclusion.
minor comments (2)
  1. [Abstract] The abstract omits any mention of baselines, inter-annotator agreement for annotations, or how content-preservation failures were quantified, which would strengthen the evaluation description.
  2. [Abstract] Notation for the J-score is introduced without definition or reference to its formula or prior use in style-transfer literature.
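The missing J-score definition matters because different ways of combining the three reported scores give different numbers. The sketch below is a plausibility check under assumptions, not the paper's formula: it compares the reported 84% against the product, geometric mean, and arithmetic mean of the aggregate scores, and also shows one definition used in prior style-transfer work, where J is the per-sentence product of the three scores averaged over the test set. The per-sentence values in `toy_sentences` are hypothetical.

```python
from typing import Iterable, Tuple

# Aggregate scores as reported in the abstract (as fractions).
STA = 0.9527         # style-transfer accuracy
SIM = 0.6707         # content preservation
FL = 0.9703          # fluency
REPORTED_J = 0.84

# Candidate ways to combine the aggregates; which one the paper uses is not stated.
product = STA * SIM * FL
geometric_mean = product ** (1.0 / 3.0)
arithmetic_mean = (STA + SIM + FL) / 3.0


def sentence_level_j(scores: Iterable[Tuple[float, float, float]]) -> float:
    """Mean over sentences of (style accuracy * similarity * fluency)."""
    products = [sta * sim * fl for sta, sim, fl in scores]
    return sum(products) / len(products)


# Hypothetical per-sentence scores, NOT data from the paper.
toy_sentences = [(1.0, 0.9, 1.0), (1.0, 0.5, 1.0), (0.0, 0.8, 1.0)]

print(f"product of aggregates: {product:.3f}")
print(f"geometric mean:        {geometric_mean:.3f}")
print(f"arithmetic mean:       {arithmetic_mean:.3f}")
print(f"reported J:            {REPORTED_J:.3f}")
print(f"toy sentence-level J:  {sentence_level_j(toy_sentences):.3f}")
```

The product of the aggregates is about 0.62, the geometric mean about 0.85, and the arithmetic mean about 0.86, so none of these simple combinations reproduces 84% exactly. A sentence-level J need not equal the product of the aggregates either, which is one more reason the formula should be stated.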

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for clarification and strengthening of our claims. We address each major comment below and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract (detoxification results): The central claim that ToxiShield produces usable, non-toxic code reviews rests on the detoxification module, yet the reported 67.07% content preservation is low for a domain where semantic drift can invalidate technical suggestions (e.g., altering a specific performance fix). No details are given on the exact metric used for content preservation, whether it was human-evaluated for engineering accuracy, or any error analysis of the 33% altered content.

    Authors: We agree that the content preservation score requires additional context and that the lack of details limits interpretability, particularly given the risks of semantic drift in technical code review content. The current manuscript reports the aggregate score without specifying the underlying metric or providing error analysis. In the revision, we will add a dedicated subsection describing the content preservation metric (including how it was computed), discuss the observed trade-offs with fluency and style transfer, include an error analysis of altered cases, and acknowledge that no domain-specific human evaluation of engineering accuracy was performed. We will also temper claims about producing 'usable' reviews to reflect these limitations. revision: yes

  2. Referee: [Abstract] Multiclass classification results: The 39% MCC for the reasoned multiclass module on 1,200 samples is low enough to question its contribution to the overall framework; the manuscript should clarify whether this module is required for the end-to-end real-time system or if the binary classifier alone suffices for the toxicity filtering claim.

    Authors: The binary classifier forms the foundation of the real-time toxicity filtering functionality, while the multiclass module was intended to support more nuanced detoxification by identifying specific toxicity categories. We acknowledge that an MCC of 0.39 on the 1,200-sample evaluation is modest and raises valid questions about its incremental value. In the revised manuscript, we will clarify that the core toxicity filtering claim relies on the binary classifier and that the multiclass component is supplementary rather than required for the end-to-end system. We will also add discussion of the multiclass performance challenges and note that the system can operate effectively without it. revision: yes

  3. Referee: [Validation] Developer validation section: The claim that ToxiShield fosters a more inclusive environment is supported only by a survey of n=10 developers with no reported details on survey design, questions, response rates, or statistical significance, limiting the strength of the utility conclusion.

    Authors: The small sample size and absence of methodological details do limit the strength of the inclusivity claims, and we agree this section is underdeveloped. The survey was intended as preliminary validation rather than conclusive evidence. In the revision, we will expand the section to include full details on survey design, questions, recruitment process, response rates, and any analysis performed. We will also revise the language to present the results as exploratory feedback from a small group of developers, without overstating the implications for fostering inclusivity. revision: yes
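To illustrate why a sample of ten limits the utility conclusion, the sketch below computes a Wilson score interval for a proportion. It assumes, purely for illustration, that each developer gives a yes/no judgment; the counts are hypothetical and the survey's actual design and responses are not reported here. The function name `wilson_interval` is an assumption of this sketch.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts of favorable answers out of 10, NOT survey data from the paper.
for favorable in (7, 9, 10):
    low, high = wilson_interval(favorable, 10)
    print(f"{favorable}/10 favorable: {low:.2f} to {high:.2f}")
```

Even if nine of ten respondents answered favorably, the interval would run from roughly 0.60 to 0.98, which is consistent with presenting the survey as exploratory feedback.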

Circularity Check

0 steps flagged

No significant circularity in empirical ML evaluation

Full rationale

The paper reports standard supervised fine-tuning and evaluation of BERT and Llama models on held-out datasets (38,761 texts for binary classification; separate samples for multiclass and detoxification). No equations, derivations, or self-referential definitions appear. All reported metrics (F1, MCC, style transfer accuracy, content preservation, J-score) are computed on independent test sets rather than being defined in terms of the fitted parameters or prior outputs. No load-bearing step reduces to a self-citation chain or ansatz smuggled via citation. The work is self-contained empirical ML without circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that fine-tuned LLMs can reliably detect and rewrite toxicity in technical text without losing meaning, plus the existence of suitable labeled code-review datasets.

axioms (1)
  • domain assumption: Language models fine-tuned on toxicity data can generalize to code-review comments.
    Invoked by the reported 97% F1 and detoxification scores.
invented entities (1)
  • ToxiShield (no independent evidence)
    purpose: Real-time browser extension combining detection, classification, and detoxification modules
    New named system proposed in the paper.

pith-pipeline@v0.9.0 · 5462 in / 1209 out tokens · 40422 ms · 2026-05-10T17:57:00.151890+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

  1. [1]

    Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, and Amiangshu Bosu. 2026. ToxiShield: Enhancing Developer Collaboration through Real-Time Toxicity Filtering. Proceedings of the ACM on Software Engineering 3, FSE (2026). doi:10.1145/3808130

  2. [2]

    Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, and Amiangshu Bosu. 2026. ToxiShield: Replication Package. https://github.com/WSU-SEAL/ToxiShield

  3. [3]

    Tanni Dev, Sayma Sultana, and Amiangshu Bosu. 2025. Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic on GitHub with Large Language Models. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (Honolulu, Hawaii, USA). 1–12

  4. [4]

    Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32

  5. [5]

    Courtney Miller, Sophie Cohen, Daniel Klug, Bogdan Vasilescu, and Christian Kästner. 2022. "Did you miss my comment or what?" Understanding toxicity in open source discussions. In Proceedings of the 44th International Conference on Software Engineering. 710–722

  6. [6]

    Md Shamimur Rahman, Zadia Codabux, and Chanchal K Roy. 2024. Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1632–1655. doi:10.1145/3660780

  7. [7]

    Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2025. The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub. Proceedings of the ACM on Software Engineering 2, FSE (2025), 623–646. doi:10.1145/3715744

  8. [8]

    Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. 2023. Automated Identification of Toxic Code Reviews Using ToxiCR. ACM Transactions on Software Engineering and Methodology 32, 5 (July 2023), 1–32. doi:10.1145/3583562