Real-Time Toxicity Filtering for Open-Source Code Reviews

Amiangshu Bosu; Anindya Iqbal; Jaydeb Sarker; Md Awsaf Alam Anindya; Showvik Biswas

arxiv: 2604.08886 · v1 · submitted 2026-04-10 · 💻 cs.SE

Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, Amiangshu Bosu

Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3

classification 💻 cs.SE

keywords toxicity detection · code reviews · open source · BERT · Llama · detoxification · browser extension · software collaboration

The pith

A browser extension detects toxic code reviews at 97% F1 and rewrites them using fine-tuned language models to reduce harm while keeping technical content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToxiShield as a real-time browser extension that flags and rewrites toxic comments in open-source code reviews. It relies on a BERT classifier for initial detection, prompted language models for classifying types of toxicity, and a fine-tuned Llama model to produce less toxic versions of the comments. The authors test these components on tens of thousands of real code-review texts and run a small validation with developers. If the approach holds, it could turn potentially damaging interactions into constructive ones without requiring reviewers to self-edit.

Core claim

ToxiShield combines three modules: a fine-tuned BERT binary classifier that reaches 97% F1 on 38,761 code-review texts, a prompted Claude 3.5 Sonnet model that achieves 39% MCC for multiclass toxicity typing on 1,200 samples, and a fine-tuned Llama 3.2 model that attains 95.27% style-transfer accuracy, 97.03% fluency, 67.07% content preservation, and 84% J-score. Small-scale testing with ten developers indicates the resulting detoxified reviews support more inclusive collaboration.

What carries the argument

ToxiShield, a three-module pipeline that first identifies toxicity with a fine-tuned BERT classifier, then performs reasoned multiclass classification, and finally detoxifies the text with a fine-tuned Llama 3.2 model.
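A minimal sketch of the three-stage flow described above, assuming each module is exposed as a plain function. The names `run_pipeline`, `ReviewResult`, and the `toy_*` stand-ins are hypothetical and are not taken from the paper or its replication package; in the real system a fine-tuned BERT classifier, a prompted LLM, and a fine-tuned Llama 3.2 model would sit behind these callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReviewResult:
    original: str
    is_toxic: bool
    category: Optional[str]    # toxicity type, set only for toxic comments
    suggestion: Optional[str]  # civil rewrite, set only for toxic comments


def run_pipeline(
    comment: str,
    detect: Callable[[str], bool],
    categorize: Callable[[str], str],
    detoxify: Callable[[str], str],
) -> ReviewResult:
    """Run the stages in order; civil comments skip stages 2 and 3."""
    if not detect(comment):
        return ReviewResult(comment, False, None, None)
    return ReviewResult(comment, True, categorize(comment), detoxify(comment))


# Toy stand-ins so the sketch runs without any model (hypothetical, not the paper's models).
def toy_detect(text: str) -> bool:
    return "stupid" in text.lower()


def toy_categorize(text: str) -> str:
    return "insult"


def toy_detoxify(text: str) -> str:
    return text.replace("stupid ", "")


result = run_pipeline("This stupid loop is O(n^2).", toy_detect, toy_categorize, toy_detoxify)
print(result.suggestion)
```

Passing the stages in as callables keeps the cheap binary check in front, so the costlier classification and rewriting steps run only on comments already flagged as toxic.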

If this is right

High-accuracy real-time detection makes it feasible to filter comments before they reach the reviewer.
If content preservation remains adequate, automated rewriting can turn toxic feedback into usable input without losing engineering value.
Multiclass classification allows the system to apply different rewriting strategies depending on the type of toxicity present.
Successful deployment would reduce the volume of harmful exchanges that currently discourage participation in open-source projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be adapted to flag and soften toxic comments in issue trackers or pull-request discussions beyond formal code reviews.
Platform maintainers could embed the detector directly into GitHub or GitLab interfaces so that rewritten text appears as a suggested edit rather than a separate tool.
Periodic retraining on new toxicity patterns would be needed because language and norms in open-source communities continue to shift.
The modest content-preservation score suggests that critical technical details should still be reviewed by a human before the rewritten comment is posted.
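The human-review point above could be approximated with a simple gate that holds back rewrites that drift too far from the original. This is only a sketch: token-level Jaccard overlap is a crude stand-in chosen for illustration, since the content-preservation metric used in the paper is not specified here, and the function names and the 0.5 threshold are hypothetical.

```python
import re
from typing import Set


def token_set(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, keeping identifiers such as foo_bar whole."""
    return set(re.findall(r"[A-Za-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct tokens that the two texts have in common."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def needs_human_review(original: str, rewrite: str, threshold: float = 0.5) -> bool:
    """Flag a rewrite whose token overlap with the original falls below the threshold."""
    return jaccard(original, rewrite) < threshold


original = "This stupid loop is O(n^2), use a hash map."
close_rewrite = "This loop is O(n^2), use a hash map."
vague_rewrite = "Please reconsider this."

print(needs_human_review(original, close_rewrite))
print(needs_human_review(original, vague_rewrite))
```

A lexical overlap measure cannot tell whether a technical claim was reversed, so a gate like this would complement human review rather than replace it.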

Load-bearing premise

The detoxification step keeps the original technical meaning and intent intact without adding errors, and the performance numbers hold for code reviews outside the training datasets and the small developer test group.

What would settle it

A larger blinded study in which developers compare original and detoxified versions of the same reviews and report that the rewritten text changes the technical point or introduces inaccuracies that affect how the review is acted upon.

Figures

Figures reproduced from arXiv: 2604.08886 by Amiangshu Bosu, Anindya Iqbal, Jaydeb Sarker, Md Awsaf Alam Anindya, Showvik Biswas.

**Figure 1.** Motivational Workflow of ToxiShield that not only flags toxic code review comments but also provides actionable feedback and suggests civil alternatives. While recent work by Rahman et al. uses a customized T5 model to rephrase uncivil comments [6], it lacks explainability and real-time integration. Therefore, we propose ToxiShield [1], a framework that proactively detects and mitigates toxicity in SE co…

Original abstract

Toxic interactions in open-source software development harm community collaboration. To combat this, we propose ToxiShield, a realtime browser extension that identifies and detoxifies toxic code reviews. The framework comprises three modules: toxicity identification, reasoned multiclass classification, and code review detoxification. Our fine-tuned BERT-based binary classifier achieved a 97% F1-score on 38,761 code review texts. For multiclass classification, Claude 3.5 Sonnet with prompt engineering achieved a 39% MCC and 42% F1 on 1,200 samples. Finally, our fine-tuned Llama 3.2 detoxification model reached 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score. Validation with 10 software developers suggests ToxiShield effectively fosters a more inclusive open-source environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ToxiShield builds a working browser extension that flags toxic code reviews and rewrites them, but the detoxification step only keeps 67% of the original content, which undercuts how useful the cleaned reviews actually are.

The paper's main contribution is a complete real-time system called ToxiShield that runs as a browser extension. It detects toxic comments in code reviews with a fine-tuned BERT model, does some multiclass labeling, and then uses a fine-tuned Llama 3.2 model to rewrite the text in a less toxic way. They report 97% F1 on the binary classifier trained on 38,761 examples, which is a respectable result on a reasonably sized dataset for this domain. The end-to-end integration into something developers could actually install and use is the part that feels new compared to earlier toxicity work in NLP or software engineering.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ToxiShield, a real-time browser extension for identifying and detoxifying toxic code reviews in open-source projects. It comprises a fine-tuned BERT binary toxicity classifier (97% F1 on 38,761 texts), a Claude 3.5 Sonnet multiclass classifier (39% MCC, 42% F1 on 1,200 samples), and a fine-tuned Llama 3.2 detoxification model (95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, 84% J-score), with validation via a survey of 10 developers.

Significance. If the detoxification module can be shown to preserve technical intent without introducing errors in code-review suggestions, the work could offer a practical tool for reducing toxicity in OSS communities. The provision of concrete metrics on defined dataset sizes is a positive aspect, as is the use of established models (BERT, Llama) with reported performance numbers rather than purely qualitative claims.

major comments (3)

[Abstract] Abstract (detoxification results): The central claim that ToxiShield produces usable, non-toxic code reviews rests on the detoxification module, yet the reported 67.07% content preservation is low for a domain where semantic drift can invalidate technical suggestions (e.g., altering a specific performance fix). No details are given on the exact metric used for content preservation, whether it was human-evaluated for engineering accuracy, or any error analysis of the 33% altered content.
[Abstract] Multiclass classification results: The 39% MCC for the reasoned multiclass module on 1,200 samples is low enough to question its contribution to the overall framework; the manuscript should clarify whether this module is required for the end-to-end real-time system or if the binary classifier alone suffices for the toxicity filtering claim.
[Validation] Developer validation section: The claim that ToxiShield fosters a more inclusive environment is supported only by a survey of n=10 developers with no reported details on survey design, questions, response rates, or statistical significance, limiting the strength of the utility conclusion.
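To put the 39% MCC in the second major comment in context, the sketch below computes the multiclass Matthews correlation coefficient from label lists using the standard confusion-matrix generalization. The function name and the toy labels are illustrative assumptions, not data from the paper; a reported 39% corresponds to 0.39 on the coefficient's scale from -1 to 1.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def multiclass_mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multiclass MCC: covariance of true and predicted labels, normalized."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                     # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))     # correctly labeled samples
    t_counts = Counter(y_true)                          # how often each class truly occurs
    p_counts = Counter(y_pred)                          # how often each class is predicted
    labels = set(t_counts) | set(p_counts)
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())
    denom = sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0             # 0.0 when a side is constant


# Toy imbalanced example: always predicting the majority class.
y_true = ["insult"] * 8 + ["threat", "profanity"]
y_pred = ["insult"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, multiclass_mcc(y_true, y_pred))
```

In the toy case accuracy is 0.8 while MCC is 0.0, which is why MCC is a stricter summary than accuracy or F1 when toxicity categories are unevenly represented.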

minor comments (2)

[Abstract] The abstract omits any mention of baselines, inter-annotator agreement for annotations, or how content-preservation failures were quantified, which would strengthen the evaluation description.
[Abstract] Notation for the J-score is introduced without definition or reference to its formula or prior use in style-transfer literature.
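The second minor comment matters because the headline figure depends on how the J-score is defined. The sketch below assumes one definition used in style-transfer evaluation: the mean over samples of the product of per-sample style-transfer accuracy, content similarity, and fluency. That definition is an assumption here, not something the abstract states, and the per-sample scores are made up for illustration.

```python
from statistics import mean
from typing import Sequence


def j_score(sta: Sequence[float], sim: Sequence[float], fl: Sequence[float]) -> float:
    """Mean of per-sample products sta_i * sim_i * fl_i (assumed definition)."""
    if not sta or not (len(sta) == len(sim) == len(fl)):
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(a * s * f for a, s, f in zip(sta, sim, fl))


# Made-up per-sample scores, all in [0, 1].
sta = [1.0, 1.0, 0.0, 1.0]
sim = [0.8, 0.6, 0.9, 0.7]
fl = [1.0, 1.0, 1.0, 0.9]
toy_j = j_score(sta, sim, fl)
print(toy_j, mean(sim))

# Product of the aggregate figures quoted in the abstract.
aggregate_product = 0.9527 * 0.6707 * 0.9703
print(round(aggregate_product, 3))  # 0.62
```

Under this assumed definition, with all scores in [0, 1], the J-score cannot exceed the mean of any single component, so it could not be higher than the 67.07% content preservation. The reported 84% therefore presumably follows a different formula, which is the reason a definition or reference is needed.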

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for clarification and strengthening of our claims. We address each major comment below and indicate where revisions will be made to the manuscript.

Referee: [Abstract] Abstract (detoxification results): The central claim that ToxiShield produces usable, non-toxic code reviews rests on the detoxification module, yet the reported 67.07% content preservation is low for a domain where semantic drift can invalidate technical suggestions (e.g., altering a specific performance fix). No details are given on the exact metric used for content preservation, whether it was human-evaluated for engineering accuracy, or any error analysis of the 33% altered content.

Authors: We agree that the content preservation score requires additional context and that the lack of details limits interpretability, particularly given the risks of semantic drift in technical code review content. The current manuscript reports the aggregate score without specifying the underlying metric or providing error analysis. In the revision, we will add a dedicated subsection describing the content preservation metric (including how it was computed), discuss the observed trade-offs with fluency and style transfer, include an error analysis of altered cases, and acknowledge that no domain-specific human evaluation of engineering accuracy was performed. We will also temper claims about producing 'usable' reviews to reflect these limitations.
Revision: yes

Referee: [Abstract] Multiclass classification results: The 39% MCC for the reasoned multiclass module on 1,200 samples is low enough to question its contribution to the overall framework; the manuscript should clarify whether this module is required for the end-to-end real-time system or if the binary classifier alone suffices for the toxicity filtering claim.

Authors: The binary classifier forms the foundation of the real-time toxicity filtering functionality, while the multiclass module was intended to support more nuanced detoxification by identifying specific toxicity categories. We acknowledge that an MCC of 0.39 on the 1,200-sample evaluation is modest and raises valid questions about its incremental value. In the revised manuscript, we will clarify that the core toxicity filtering claim relies on the binary classifier and that the multiclass component is supplementary rather than required for the end-to-end system. We will also add discussion of the multiclass performance challenges and note that the system can operate effectively without it.
Revision: yes

Referee: [Validation] Developer validation section: The claim that ToxiShield fosters a more inclusive environment is supported only by a survey of n=10 developers with no reported details on survey design, questions, response rates, or statistical significance, limiting the strength of the utility conclusion.

Authors: The small sample size and absence of methodological details do limit the strength of the inclusivity claims, and we agree this section is underdeveloped. The survey was intended as preliminary validation rather than conclusive evidence. In the revision, we will expand the section to include full details on survey design, questions, recruitment process, response rates, and any analysis performed. We will also revise the language to present the results as exploratory feedback from a small group of developers, without overstating the implications for fostering inclusivity.
Revision: yes
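The sample-size concern in the third exchange can be made concrete with a confidence interval. The sketch below computes a Wilson score interval for a proportion; the outcome of 9 favorable responses out of 10 is a hypothetical figure chosen for illustration, since the survey results are not reported here.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical outcome: 9 of 10 developers respond favorably.
low, high = wilson_interval(9, 10)
print(round(low, 2), round(high, 2))
```

Even with this favorable hypothetical outcome the interval runs from roughly 0.6 to 0.98, which illustrates why a ten-person survey is better framed as exploratory feedback.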

Circularity Check

0 steps flagged

No significant circularity in empirical ML evaluation

The paper reports standard supervised fine-tuning and evaluation of BERT and Llama models on held-out datasets (38,761 texts for binary classification; separate samples for multiclass and detoxification). No equations, derivations, or self-referential definitions appear. All reported metrics (F1, MCC, style transfer accuracy, content preservation, J-score) are computed on independent test sets rather than being defined in terms of the fitted parameters or prior outputs. No load-bearing step reduces to a self-citation chain or ansatz smuggled via citation. The work is self-contained empirical ML without circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that fine-tuned LLMs can reliably detect and rewrite toxicity in technical text without losing meaning, plus the existence of suitable labeled code-review datasets.

axioms (1)

Domain assumption: Language models fine-tuned on toxicity data can generalize to code-review comments.
Invoked by the reported 97% F1 and detoxification scores.

invented entities (1)

ToxiShield (no independent evidence)
Purpose: Real-time browser extension combining detection, classification, and detoxification modules
New named system proposed in the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

our fine-tuned Llama 3.2 detoxification model reached 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, and Amiangshu Bosu. 2026. ToxiShield: Enhancing Developer Collaboration through Real-Time Toxicity Filtering. Proceedings of the ACM on Software Engineering 3, FSE (2026). doi:10.1145/3808130

[2] Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, and Amiangshu Bosu. 2026. ToxiShield: Replication Package. https://github.com/WSU-SEAL/ToxiShield

[3] Tanni Dev, Sayma Sultana, and Amiangshu Bosu. 2025. Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic on GitHub with Large Language Models. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (Honolulu, Hawaii, USA). 1–12

[4] Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

[5] Courtney Miller, Sophie Cohen, Daniel Klug, Bogdan Vasilescu, and Christian Kästner. 2022. "Did you miss my comment or what?" understanding toxicity in open source discussions. In Proceedings of the 44th International Conference on Software Engineering. 710–722

[6] Md Shamimur Rahman, Zadia Codabux, and Chanchal K Roy. 2024. Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1632–1655. doi:10.1145/3660780

[7] Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2025. The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub. Proceedings of the ACM on Software Engineering 2, FSE (2025), 623–646. doi:10.1145/3715744

[8] Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. 2023. Automated Identification of Toxic Code Reviews Using ToxiCR. ACM Transactions on Software Engineering and Methodology 32, 5 (July 2023), 1–32. doi:10.1145/3583562
